9618ddefba
* Addition of basic structure
* Reworked categories
* More causal integration additions
* Causal implementation
* Update examples
* delete virtual_speedup files
* Update perfetto submodule to v31.0
* Update dyninst submodule
* Update timemory submodule
* ElfUtils build for libdw
* OMNITRACE_LIKELY and OMNITRACE_UNLIKELY
* Update common lib join
* Examples updates for causal profiling
* config updates with causal options
- OMNITRACE_CAUSAL_FIXED_LINE
- OMNITRACE_CAUSAL_FIXED_SPEEDUP
- OMNITRACE_CAUSAL_FILE
- OMNITRACE_CAUSAL_BINARY_SCOPE
- OMNITRACE_CAUSAL_SOURCE_SCOPE
- version info in banner
- support increments in parse_numeric_range
- fix occasional deadlock in first call to get_config
* PTL general task group
* Always include PID in debug/verbose messages
* Add blocking/unblocking gotchas to runtime init bundle
* CausalState
* thread_data updates
- generic component_bundle_cache
* Improve handling of causal in category_region
* components updates
- backtrace_causal component
- backtrace::get_data member func
- decrease ignore_depth in backtrace::sample(int)
- handle "omnitrace_main" in backtrace::filter_and_patch(...)
- tweak internal thread state scope for pthread_mutex_gotcha wrappers
* simplify tracing get_instrumentation_bundles usage
* sampling updates
- include backtrace_causal component
- disable backtrace_metrics if using causal and not using perfetto
- disable backtrace and backtrace_timestamp when using causal
- post_process_causal
* causal updates
- more checks in blocking_gotcha and unblocking_gotcha start/stop
- miscellaneous overhaul of data
- experiment update
* Remove virtual speedup
* libomnitrace code_object
* causal-profiling test
* libomnitrace library.cpp updates
- handle causal profiling
- fini_bundle
* Disable causal profiling by default
* Updated causal code and example
- example: three execution variants: cpu + rng, cpu, rng
- example: three instrumentation variants: none, omni, coz
- fix blocking gotcha credit
- rework perform_experiment_impl
- get_eligible_address_ranges
- compute_eligible_lines
- support fixed lines/speedups/functions
- update selected_entry to support function mode
- fix causal::delay
- experiment updates
* omnitrace_progress / omnitrace_user_progress
- with accompanying omnitrace_annotated_progress / omnitrace_user_annotated_progress
* Update timemory submodule
* CausalMode
- mode indicates whether causal predictions should be at line-level or function-level
* code_object, config, runtime, sampling, thread_data
- code_object: address_range
- code_object: basic::line_info serialize(), name(), hash()
- config updates
- two signals for causal sampling
- thread_data init fixes
* pthread updates
- pthread_create_gotcha processes delays
- pthread_mutex_gotcha does not wrap pthread_join in causal mode
* backtrace_causal update
- dynamic delay period stats
* main wrapper uses basename of argv[0]
* update elfio submodule
* perf support (currently unused)
* Fix experiment JSON serialization
- static_vector.hpp (unused)
* causal executable + config options updates
- omnitrace-causal exe simplifies running multiple causal configs
- changed the causal config option names
* Support both throughput and latency points
* process-causal-json.py script
- will be used later for testing
* stable_vector
* Rework thread_data
* Improve omnitrace-causal exe
- better verbosity handling
- correct diagnosis of status for child process
- execvpe when only one iteration (debugging)
* Update timemory submodule
* exe --version
- omnitrace, omnitrace-avail, and omnitrace-sample all support --version on command-line
* OMNITRACE_INTERNAL_API + OMNITRACE_{LIKELY,UNLIKELY}
* omnitrace-causal cmake format
* omnitrace config update
- OMNITRACE_CAUSAL_FILE_CLOBBER
* custom exception
- wraps STL exception and gets stacktrace during construction
* exit_gotcha supports _Exit
* use global construct_on_init + max threads
- add some safety when exceeding max # of threads
* update code_object binary filter
- exclude dyninst and tbbmalloc library
* containers: c_array, static_vector, stable_vector
- moved utility::c_array to container::c_array
- created static_vector: std::vector bound to std::array
- created stable_vector: vector with stable references
* grow thread_data when new thread created
* causal updates
- data: improve compute_eligible_lines to ignore lambdas
- data: use new thread_data
- delay: use new thread_data
- experiment: properly support latency points
- experiment: support file clobber
- experiment: ensure non-zero experiment time
- progress_point: use new thread_data
- backtrace_causal: use new thread_data
* Update causal-profiling tests
* fix omnitrace-causal backslash escaping
* process-causal-json script
* restructure causal implementation
- update verbose messages for omnitrace-causal diagnose_status
- migrated causal implementation in sampling.cpp to causal/sampling.cpp
- OMNITRACE_USE_CAUSAL does not require OMNITRACE_USE_SAMPLING
- added Mode::Causal
- causal sampling uses same signals as regular sampling
- moved tracing::thread_init to implementation file
- combined tracing::thread_init and tracing::thread_init_sampling
- added causal/components folder
- pthread_create_gotcha::wrapper_config
- omnitrace_preload checks OMNITRACE_USE_CAUSAL
- updates mode accordingly
* update timemory submodule
* update timemory submodule
* causal example updates
- causal for lulesh
* perf code + utility - helpers
- relocated causal perf code
- placement new when generating unique ptr trait for potentially allocating during sampling
- additions to utility header
- removed previously added helpers.hpp
* update timemory submodule
* Default env variables for omnitrace-causal
- activate OMNITRACE_USE_KOKKOSP, etc.
* update stable_vector and static_vector
- static vector can use atomic for size tracking for thread-safe situations
* update causal example header
- CAUSAL_PROGRESS_NAMED
- use CAUSAL_ prefix for some macros
* Tweak lulesh example
- use CAUSAL_PROGRESS instead of CAUSAL_BEGIN and CAUSAL_END
* omnitrace-sample support for causal mode
- set OMNITRACE_USE_SAMPLING to off when OMNITRACE_MODE=causal
* refactor and cleanup code_object
- scope filter
- fixes to address_range
* overhaul causal data + causal config options
- full support for function and line mode
- support static vector of instruction pointers
- improve line info mapping resolution
- remove thread-locality from miscellaneous functions where unnecessary
- causal options for {binary,source,function,fileline} exclusion
* causal experiment, sampling, and backtrace updates
- is_selected + unwind address array
- experiment warning about progress points
- increased buffer size for backtrace_causal sampler
- backtrace_causal only stores IP addresses instead of full unwind info
* category_region updates
- minor refactor
- local_category_region::mark
* Update causal tests
* Bump version to 1.8.0
* omnitrace-causal args + CLOBBER -> RESET
- renamed OMNITRACE_CAUSAL_FILE_CLOBBER to OMNITRACE_CAUSAL_FILE_RESET
- updated omnitrace-causal exe to support recently added configuration options
- other miscellaneous tweaks to data.cpp, experiment.cpp, and sampling.cpp
* Refactor causal and code_object
- code_object.hpp and code_object.cpp moved into binary folder
- causal components namespaced into omnitrace::causal::component
- moved sample_data out of backtrace_causal and into own file
- renamed backtrace_causal to causal::component::backtrace
* preload omnitrace_init + OMNITRACE_DEBUG_MARK
- env OMNITRACE_DEBUG_MARK
- fix omnitrace_init call when LD_PRELOAD-ing omnitrace
* Fix fileline support + line-info output names + experiment log
- line-info log files are prefixed with experiment name
- don't print experiment duration when E2E
- account for fileline scope in analysis
* KokkosP: OMNITRACE_KOKKOSP_NAME_LENGTH_MAX
- config option to limit the name of kokkos tool callbacks
- remove [kokkos] from KokkosP names
* Update causal example
- minor tweaks to decrease probability of overlapping regions in binary
* omnitrace-causal update
- prefix N / Ntot in environment printout
* Miscellaneous updates
- causal::finish_experimenting()
- OMNITRACE_CAUSAL_RANDOM_SEED
- KokkosP causal updates
- exclude some callbacks, make some callbacks unique, etc.
- address_range::operator+=(address_range)
- combine contiguous ranges in binary/analysis.cpp when file, func, line is same and address range is contiguous
- bfd_line_info reads inline info
- wait for perform_experiment_impl to complete
- causal::delay updates
- delay::process checks if experiment is active
- uses threading::get_id()
- experiment scales duration up for larger speedup experiments
- line info samples includes excluded lines
- sampler uses CLOCK_REALTIME
- blocking_gotcha updates
- is no longer fully static
- adds audit routine which sets the postblock value to zero if try/timed routine fails
- category::host was added to causal_throughput_categories_t
- pthread_create_gotcha sets new threads local parent delay
- was using internal value, now uses sequent value
* Causal improvements to KokkosP
* Updates to experiment time scaling
- use stats instead of just max
* binary/link_map.{hpp,cpp}
* update process-causal-json.py
* Folded fileline scope into source scope
* Update documentation
- Add documentation for causal profiling
- Replace 'Omnitrace' with 'OmniTrace' everywhere
* Update causal-helpers.cmake + omnitrace-testing.cmake
- split tests/CMakeLists.txt partially into omnitrace-testing.cmake
* omnitrace/causal.h
- OMNITRACE_CAUSAL_PROGRESS
- OMNITRACE_CAUSAL_PROGRESS_NAMED
- OMNITRACE_CAUSAL_BEGIN
- OMNITRACE_CAUSAL_END
* selected_entry + remove default filters for lambdas and operator()
- selected entry stores range and binary load address
* update process-causal-json.py
* format examples/lulesh/CMakeLists.txt
* causal-helpers find_package(Threads)
* OMNITRACE_KOKKOSP_KERNEL_LOGGER
- was OMNITRACE_KOKKOS_KERNEL_LOGGER
* quiet find of coz-profiler
* Fix rocm_smi exception handling
* Update timemory submodule (binutils)
- fix binutils compile error on some systems
- bump binutils to v2.40
* Fix miscellaneous tests
* OMNITRACE_KOKKOSP_PREFIX
* revert rocm_smi handling
* ElfUtils updates
- default to download version 0.188
- add -Wno-error=null-dereference due to GCC 12 compiler error
* Update causal example
* Remove OMNITRACE_VERBOSE from global workflow envs
* Reliable causal test
* disable compilation of causal perf files
* Remove set_current_selection with unwind stack
* update timemory submodule
* fix for segfault on bionic
- locking in TLS dtor was causing segfault
* remove experiment::is_selected(unwind_stack_t)
* update default init of selected_entry
* Fix for when IP is not offset by load address
* Update CMakeLists.txt
* Miscellaneous updates
- OMNITRACE_WARNING_OR_CI_THROW
- OMNITRACE_REQUIRE
- OMNITRACE_PREFER
- fixed issues with no ASLR
- added load address variable and ipaddr() func to basic/bfd line info
- removed get_basic() from dwarf_line_info
- TIMEMORY_PREFER -> OMNITRACE_PREFER
- removed previously added binary_address and range variables from selected_entry
* Removed superfluous CausalState
* Additional causal tests (lulesh + kokkos)
* filter, prefer, analysis ASLR handling
- removed default filter on cold functions
- fixed OMNITRACE_PREFER
- fixed analysis ASLR handling
* Tweak line-info output
* Removed some superfluous code
- causal/delay
- causal/selected_entry
* Exclude main.cold in function mode
* Update validate-perfetto-proto.py
- account for occasional http errors
* Add sampling test disabling tmp files
* argparser for process-causal-json
- support validation
- support filtering
* Avoid pthread_{lock,unlock} in sampling offload
- use homemade atomic_mutex/atomic_lock since contention will be low and using pthread tools might trigger our wrappers
* Rename process-causal-json.py
- validate-causal-json.py
* rework omnitrace_add_causal_test
- capable of performing validation
- added validation tests
* Fix kokkosp_begin_deep_copy + causal
* Tweak address range in bfd_line_info::read_pc
* Tweak analysis and data IP handling
- look for gaps
* Disable scaling experiment time by speedup
* Revert change in max threads during CI
* binary updates
- significant overhaul of binary analysis implementation
- removed "basic_line_info" and "bfd_line_info" in lieu of "symbol" class
- symbol class has basic BFD info + vector of inlines + vector of dwarf info
* Updated causal to use new binary analysis
- Fix symbol.cpp includes
* Updated formatting target
- include *.cmake files
* Updated causal tests
- causal tests should be stable now
* Update timemory and dyninst submodules
- TPLs are stripped + built w/o debug info
* Increase tolerance for causal validation speedups
- higher speedups have more variance (increased to +/- 5 from 3)
* Support causal output for MPI
- i.e. tag with MPI rank
* omnitrace-causal launcher argument
* improve experiment sampling output
* causal data updates
- call compute lines once
- fixed filtered cached binary info
- debugging info when experiment fails to start
* Tweaked causal validation tests
* dwarf_entry ranges
* CI updates
- increase max threads to 64
* Tweak causal E2E validation tests
- more threads
- shorter thread runtime
- more iterations
* Fix shadowed variable
* fix symbol read_bfd last PC calculation
* fix maybe-uninitialized warning
* omnitrace-causal launcher update
- only inject "omnitrace-causal --" once
- throw error if no matches found
* Update causal profiling docs for launcher
* fix address range boundaries
# Causal Profiling

```eval_rst
.. toctree::
   :glob:
   :maxdepth: 3
```

## What is "Causal Profiling"?

> ***If you speed up a given block of code by X%, the application will execute Y% faster***

Causal profiling directs parallel application developers to where they should focus their optimization
efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
that *software execution speed is relative*: speeding up a block of code by X% is mathematically equivalent
to that block of code running at its current speed while all the other code runs slower by X%.
Thus, causal profiling works by performing experiments on blocks of code during program execution: each experiment
inserts pauses to slow down all other concurrently running code. During post-processing, these experiments
are translated into calculations for the potential impact of speeding up the selected block of code.

Consider the following C++ code executing `foo` and `bar` concurrently in two different threads
where `foo` is 30% faster than `bar` (ideally):

```cpp
#include <cstddef>
#include <thread>

constexpr size_t FOO_N = 7 * 1000000000UL;
constexpr size_t BAR_N = 10 * 1000000000UL;

void foo()
{
    for(volatile size_t i = 0; i < FOO_N; ++i) {}
}

void bar()
{
    for(volatile size_t i = 0; i < BAR_N; ++i) {}
}

int main()
{
    // a plain array is used because the elements of a braced initializer list
    // are const and std::thread::join() is a non-const member function
    std::thread _threads[] = { std::thread{ foo }, std::thread{ bar } };

    for(auto& itr : _threads)
        itr.join();
}
```

No matter how many optimizations are applied to `foo`, the application will always require the same amount of time
because the end-to-end performance is limited by `bar`. However, a 5% speedup in `bar` will result in the
end-to-end performance improving by 5% and this trend will continue linearly (a 10% speedup in `bar` yields a 10% speedup in
end-to-end performance, and so on) up to a 30% speedup, at which point `bar` executes as fast as `foo`.
Any speedup to `bar` beyond 30% will still only yield an end-to-end performance speedup of 30% since the application
will then be limited by the performance of `foo`, as demonstrated below in the causal profiling visualization:



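The arithmetic behind the `foo`/`bar` example can be sketched in a few lines of C++. The helper below is hypothetical (it is not part of OmniTrace) and assumes an idealized model in which the application finishes when the slower of the two threads finishes:

```cpp
#include <algorithm>

// Hypothetical helper (not part of OmniTrace): predicted end-to-end speedup,
// in percent, when two functions run concurrently in separate threads and one
// of them is virtually sped up.
//
//   t_target : runtime of the function being sped up (arbitrary time units)
//   t_other  : runtime of the function running concurrently
//   speedup  : virtual speedup applied to the target, in percent [0, 100]
double predicted_e2e_speedup(double t_target, double t_other, double speedup)
{
    // the application finishes when the slower of the two threads finishes
    const double baseline  = std::max(t_target, t_other);
    const double optimized = std::max(t_target * (1.0 - speedup / 100.0), t_other);
    return 100.0 * (baseline - optimized) / baseline;
}
```

With `bar` taking 10 time units and `foo` taking 7, speeding up `bar` by 5% predicts a 5% end-to-end gain, whereas any speedup of `bar` at or beyond 30% is capped at roughly 30% and any speedup of `foo` predicts no gain at all.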
The full details of the causal profiling methodology can be found in the paper [Coz: Finding Code that Counts with Causal Profiling](http://arxiv.org/pdf/1608.03676v1.pdf).
The authors' implementation is publicly available on [GitHub](https://github.com/plasma-umass/coz).

## Getting Started

### Progress Points

Causal profiling requires "progress points" to track progress through the code in between samples. Progress points must be triggered deterministically via instrumentation.
This can happen in three different ways:

1. OmniTrace can leverage the callbacks from Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for MPI, NUMA, RCCL, etc. to act as progress points
2. Users can leverage the [runtime instrumentation capabilities](instrumenting.md#runtime-instrumentation) to insert progress points (NOTE: binary rewrite to insert progress points is not supported)
3. Users can leverage the [User API](user_api.md), e.g. `OMNITRACE_CAUSAL_PROGRESS`

Regarding #2, binary rewrite is not supported because, when a rewritten binary is executed, Dyninst translates the instruction pointer addresses in order
to execute the instrumentation. As a result, call-stack samples never return instruction pointer addresses within the ranges defined as valid by OmniTrace. Hopefully, a workaround will
be found in the future.

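As a rough illustration of the third approach, the sketch below marks one progress point per completed unit of work. It assumes that `OMNITRACE_CAUSAL_PROGRESS` is provided by `omnitrace/causal.h` and can be used as a statement without arguments; the `process_items` function is hypothetical, and the no-op fallback exists only so the sketch compiles when OmniTrace is not installed. Consult the User API documentation for the exact usage and link requirements.

```cpp
#include <cstddef>

// Use the real header when it is available; otherwise fall back to a no-op so
// that this sketch still compiles without OmniTrace installed.
#ifdef __has_include
#    if __has_include(<omnitrace/causal.h>)
#        include <omnitrace/causal.h>
#    endif
#endif
#if !defined(OMNITRACE_CAUSAL_PROGRESS)
#    define OMNITRACE_CAUSAL_PROGRESS
#endif

// Hypothetical workload: processes n_items units of work and marks one
// progress point per completed unit.
std::size_t process_items(std::size_t n_items)
{
    std::size_t completed = 0;
    for(std::size_t i = 0; i < n_items; ++i)
    {
        // ... the actual work for item i would go here ...
        ++completed;
        // assumption: the macro can be used as a statement without arguments
        OMNITRACE_CAUSAL_PROGRESS;
    }
    return completed;
}
```

Placing the progress point at the end of each loop iteration means the progress rate tracks the throughput of the loop, which is the quantity the causal experiments try to improve.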
### Key Concepts

| Concept          | Setting                           | Options                          | Description                                                                                                        |
|------------------|-----------------------------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Mode             | `OMNITRACE_CAUSAL_MODE`           | `function`, `line`               | Select entire function or individual line of code for causal experiments                                           |
| End-to-End       | `OMNITRACE_CAUSAL_END_TO_END`     | boolean                          | Perform a single experiment during the entire run (does not require progress points)                               |
| Fixed speedup(s) | `OMNITRACE_CAUSAL_FIXED_SPEEDUP`  | one or more values from [0, 100] | Virtual speedup or pool of virtual speedups to randomly select                                                     |
| Binary scope     | `OMNITRACE_CAUSAL_BINARY_SCOPE`   | regular expression(s)            | Dynamic binaries containing code for experiments                                                                   |
| Source scope     | `OMNITRACE_CAUSAL_SOURCE_SCOPE`   | regular expression(s)            | `<file>` and/or `<file>:<line>` containing code to include in experiments                                          |
| Function scope   | `OMNITRACE_CAUSAL_FUNCTION_SCOPE` | regular expression(s)            | Restricts experiments to matching functions (function mode) or lines of code within matching functions (line mode) |

#### Notes

1. Binary scope defaults to `%MAIN%` (the executable). The scope can be expanded to include linked libraries
2. `<file>` and `<file>:<line>` support requires debug info (i.e. code was compiled with `-g` or, preferably, `-g3`)
3. Function mode does not require debug info but does not support stripped binaries

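To make the scope settings more concrete, here is a minimal sketch of how a list of regular expressions could be applied to a `<file>:<line>` entry or a function name. The `in_scope` helper is hypothetical and is not OmniTrace's implementation. It assumes that a pattern may match anywhere in the entry unless it is anchored, which is consistent with the usage examples in this document (e.g. `lulesh.*` for files containing "lulesh").

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not OmniTrace's implementation): returns true when
// `entry` matches at least one of the regular expressions in `scope`.
// Assumption: a pattern may match anywhere in the entry (regex search)
// unless it is anchored with ^ or $.
// Note: an empty scope matches nothing here; how OmniTrace treats an unset
// scope is not covered by this sketch.
bool in_scope(const std::string& entry, const std::vector<std::string>& scope)
{
    for(const auto& pattern : scope)
    {
        if(std::regex_search(entry, std::regex{ pattern })) return true;
    }
    return false;
}
```

For instance, under these assumptions the source scope `causal\.cpp:(155|165)` selects `causal.cpp:155` but not `causal.cpp:160`, and the pattern `^(Kokkos::|std::enable_if)` matches `Kokkos::parallel_for` but not `cpu_slow_func`.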
### Speedup Prediction Variability and `omnitrace-causal` Executable

Causal profiling typically requires executing the application several times in order to adequately sample all the domains of executing code, experiment speedups, etc. and to resolve statistical fluctuations.
The `omnitrace-causal` executable is designed to simplify this procedure:

```console
$ omnitrace-causal --help
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
                                                   --version (count: 0, dtype: bool)
                                                   --monochrome (max: 1, dtype: bool)
                                                   --debug (max: 1, dtype: bool)
                                                   --verbose (count: 1)
                                                   --config (min: 0, dtype: filepath)
                                                   --launcher (count: 1, dtype: executable)
                                                   --generate-configs (min: 0, dtype: folder)
                                                   --no-defaults (min: 0, dtype: bool)
                                                   --mode (count: 1, dtype: string)
                                                   --output-name (min: 1, dtype: filename)
                                                   --reset (max: 1, dtype: bool)
                                                   --end-to-end (max: 1, dtype: bool)
                                                   --wait (count: 1, dtype: seconds)
                                                   --duration (count: 1, dtype: seconds)
                                                   --iterations (count: 1, dtype: int)
                                                   --speedups (min: 0, dtype: integers)
                                                   --binary-scope (min: 0, dtype: integers)
                                                   --source-scope (min: 0, dtype: integers)
                                                   --function-scope (min: 0, dtype: regex-list)
                                                   --binary-exclude (min: 0, dtype: integers)
                                                   --source-exclude (min: 0, dtype: integers)
                                                   --function-exclude (min: 0, dtype: regex-list)
                                                 ]

    Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
    This executable is designed to streamline that process.
    For example (assume all commands end with '-- <exe> <args>'):

        omnitrace-causal -n 5 -- <exe>                  # runs <exe> 5x with causal profiling enabled

        omnitrace-causal -s 0 5,10,15,20                # runs <exe> 2x with virtual speedups:
                                                        #   - 0
                                                        #   - randomly selected from 5, 10, 15, and 20

        omnitrace-causal -F func_A func_B func_(A|B)    # runs <exe> 3x with the function scope limited to:
                                                        #   1. func_A
                                                        #   2. func_B
                                                        #   3. func_A or func_B
    General tips:
    - Insert progress points at hotspots in your code or use omnitrace's runtime instrumentation
        - Note: binary rewrite will produce a incompatible new binary
    - Run omnitrace-causal in "function" mode first (does not require debug info)
    - Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
        - Preferably, use predictions from the "function" mode to determine which function to target
    - Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
    - Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
        - Note: source scope requires debug info

Options:
    -h, -?, --help             Shows this page
    --version                  Prints the version and exit

    [DEBUG OPTIONS]

    --monochrome               Disable colorized output
    --debug                    Debug output
    -v, --verbose              Verbose output

    [GENERAL OPTIONS]

    -c, --config               Base configuration file
    -l, --launcher             When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
                               before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
                               target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
                               library is LD_PRELOADed on the proper target
    -g, --generate-configs     Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
                               will be placed in ${PWD}/omnitrace-causal-config folder
    --no-defaults              Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
                               and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
                               (OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
                               Activation of OpenMP tools support is similar

    [CAUSAL PROFILING OPTIONS (General)]
                               (These settings will be applied to all causal profiling runs)

    -m, --mode                 [ function (func) | line ]
                               Causal profiling mode
    -o, --output-name          Output filename of causal profiling data w/o extension
    -r, --reset                Overwrite any existing experiment results during the first run
    -e, --end-to-end           Single causal experiment for the entire application runtime
    -w, --wait                 Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
    -d, --duration             Set the length of time (in seconds) to perform causal experimentation after the first experiment is started. Once this
                               amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
                               allowed to finish.
    -n, --iterations           Number of times to repeat the combination of run configurations

    [CAUSAL PROFILING OPTIONS (Combinatorial)]
                               (Each individual argument to these options will multiply the number runs by the number of arguments and the number of
                               iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
                               (MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))

    -s, --speedups             Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
                               be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is '0' and group #2 is '0 10 20 25 30 35 40
                               45 50'
    -B, --binary-scope         Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
                               and multiple scopes can be grouped together with a semi-colon
    -S, --source-scope         Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
                               the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
                               semi-colon
    -F, --function-scope       Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
                               and multiple scopes can be grouped together with a semi-colon
    -BE, --binary-exclude      Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
                               designates a group and multiple excludes can be grouped together with a semi-colon
    -SE, --source-exclude      Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
                               <file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
                               can be grouped together with a semi-colon
    -FE, --function-exclude    Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
                               designates a group and multiple excludes can be grouped together with a semi-colon
```

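The grouping rule described for `--speedups` in the help text can be illustrated with a short sketch. The `expand_speedup_group` helper below is hypothetical and is not the `omnitrace-causal` parser. In particular, the increment of 5 for a range such as `20-50` is an assumption inferred from the help text, which lists the expansion as `20 25 30 35 40 45 50`.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not the omnitrace-causal parser): expands one speedup
// group such as "0,10,20-50" into the individual virtual speedups.
// Assumption: a range "A-B" expands in increments of 5, as suggested by the
// help text ("20-50" -> 20 25 30 35 40 45 50).
std::vector<int> expand_speedup_group(const std::string& group, int increment = 5)
{
    if(increment < 1) increment = 1;  // guard against a non-terminating loop

    std::vector<int>   speedups{};
    std::istringstream stream{ group };
    std::string        token{};
    while(std::getline(stream, token, ','))
    {
        if(token.empty()) continue;
        const auto dash = token.find('-');
        if(dash == std::string::npos)
        {
            speedups.push_back(std::stoi(token));  // single value, e.g. "10"
        }
        else
        {
            // range, e.g. "20-50"
            const int first = std::stoi(token.substr(0, dash));
            const int last  = std::stoi(token.substr(dash + 1));
            for(int value = first; value <= last; value += increment)
                speedups.push_back(value);
        }
    }
    return speedups;
}
```

Each space-separated argument to `-s` would be passed through such a function separately, so `-s 0 0,10,20-50` yields two pools: one containing only 0 and one containing nine values.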
#### Examples

```bash
#!/bin/bash -e

module load omnitrace

N=20
I=3

# when providing speedups to omnitrace-causal, speedup
# groups are separated by a space so "0,10" results in
# one speedup group where omnitrace samples from
# the speedup set of {0, 10}. Passing "0 10" (without
# quotes) to omnitrace-causal multiplies the
# number of runs by 2, where the first half of the
# runs instruct omnitrace to only use 0 as the
# speedup and the second half of the runs instruct
# omnitrace to only use 10 as the speedup.
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
# thus, -s ${SPEEDUPS} only multiplies the number
# of runs by 1 whereas -s ${SPEEDUPS_E2E} multiplies
# the number of runs by 15:
#   - 3 runs with speedup of 0
#   - 1 run for each of the speedups 10, 20, 30, and 40
#   - 2 runs with speedup of 50
#   - 3 runs with speedup of 75
#   - 3 runs with speedup of 90
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed 's/,/ /g')


# 20 iterations in function mode with 1 speedup group
# and source scope set to .cpp files
#
# outputs to files:
#   - causal/experiments.func.coz
#   - causal/experiments.func.json
#
# total executions: 20
#
omnitrace-causal        \
    -n ${N}             \
    -s ${SPEEDUPS}      \
    -m function         \
    -o experiments.func \
    -S ".*\\.cpp"       \
    --                  \
    ./causal-omni-cpu "${@}"


# 20 iterations in line mode with 1 speedup group
# and source scope restricted to lines 155 and 165
# in the causal.cpp file.
#
# outputs to files:
#   - causal/experiments.line.coz
#   - causal/experiments.line.json
#
# total executions: 20
#
omnitrace-causal                    \
    -n ${N}                         \
    -s ${SPEEDUPS}                  \
    -m line                         \
    -o experiments.line             \
    -S "causal\\.cpp:(155|165)"     \
    --                              \
    ./causal-omni-cpu "${@}"


# 3 iterations in function mode of 15 singular speedups
# in end-to-end mode with 2 different function scopes
# where one is restricted to "cpu_slow_func" and
# another is restricted to "cpu_fast_func".
#
# outputs to files:
#   - causal/experiments.func.e2e.coz
#   - causal/experiments.func.e2e.json
#
# total executions: 90
#
omnitrace-causal                \
    -n ${I}                     \
    -s ${SPEEDUPS_E2E}          \
    -m func                     \
    -e                          \
    -o experiments.func.e2e     \
    -F "cpu_slow_func"          \
       "cpu_fast_func"          \
    --                          \
    ./causal-omni-cpu "${@}"

# 3 iterations in line mode of 15 singular speedups
# in end-to-end mode with 2 different source scopes
# where one is restricted to line 155 in causal.cpp
# and another is restricted to line 165 in causal.cpp.
#
# outputs to files:
#   - causal/experiments.line.e2e.coz
#   - causal/experiments.line.e2e.json
#
# total executions: 90
#
omnitrace-causal                \
    -n ${I}                     \
    -s ${SPEEDUPS_E2E}          \
    -m line                     \
    -e                          \
    -o experiments.line.e2e     \
    -S "causal\\.cpp:155"       \
       "causal\\.cpp:165"       \
    --                          \
    ./causal-omni-cpu "${@}"


export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# set number of iterations to 5
N=5

# 5 iterations in function mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "Kokkos::"
# or "std::enable_if".
#
# outputs to files:
#   - causal/experiments.func.coz
#   - causal/experiments.func.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.func.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal                            \
    --reset                                 \
    -n ${N}                                 \
    -s ${SPEEDUPS}                          \
    -m func                                 \
    -o experiments.func                     \
    -S "lulesh.*"                           \
    -FE "^(Kokkos::|std::enable_if)"        \
    --                                      \
    ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p


# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "exec_range"
# or "execute" or which contain either
# "construct_shared_allocation" or "._omp_fn." in
# the function name.
#
# outputs to files:
#   - causal/experiments.line.coz
#   - causal/experiments.line.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal                            \
    --reset                                 \
    -n ${N}                                 \
    -s ${SPEEDUPS}                          \
    -m line                                 \
    -o experiments.line                     \
    -S "lulesh.*"                           \
    -FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
    --                                      \
    ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p


# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files whose basename is "lulesh.cc"
# for 3 different functions:
#   - ApplyMaterialPropertiesForElems
#   - CalcHourglassControlForElems
#   - CalcVolumeForceForElems
#
# outputs to files:
#   - causal/experiments.line.targeted.coz
#   - causal/experiments.line.targeted.json
#
# total executions: 15
#
# First of 15 executions overwrites any
# existing causal/experiments.line.targeted.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal                            \
    --reset                                 \
    -n ${N}                                 \
    -s ${SPEEDUPS}                          \
    -m line                                 \
    -o experiments.line.targeted            \
    -F "ApplyMaterialPropertiesForElems"    \
       "CalcHourglassControlForElems"       \
       "CalcVolumeForceForElems"            \
    -S "lulesh\\.cc"                        \
    --                                      \
    ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
```

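The "total executions" figures quoted in the script comments follow from multiplying the iteration count by the number of groups given to each combinatorial option. A minimal sketch of that arithmetic is shown below; `total_executions` is a hypothetical helper, and treating an unspecified option as a factor of 1 is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: number of application executions launched by one
// omnitrace-causal invocation, given the iteration count (-n) and the number
// of groups passed to each combinatorial option (-s, -B, -S, -F, ...).
// Assumption: an option that is not specified contributes a factor of 1.
std::size_t total_executions(std::size_t iterations, const std::vector<std::size_t>& groups)
{
    std::size_t total = iterations;
    for(auto count : groups)
        total *= (count == 0) ? 1 : count;
    return total;
}
```

For the end-to-end examples, 3 iterations with 15 speedup groups and 2 scopes gives `total_executions(3, { 15, 2 })`, i.e. 90 runs, while the targeted lulesh example with 5 iterations, 1 speedup group, 3 function scopes, and 1 source scope gives 15 runs.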
#### Using `omnitrace-causal` with other launchers (e.g. `mpirun`)

The `omnitrace-causal` executable is intended to assist with application replay and is designed to always be at the start of the command-line (i.e. the primary process).
`omnitrace-causal` typically adds an `LD_PRELOAD` of the OmniTrace libraries into the environment before launching the command in order to inject the functionality
required to start the causal profiling tooling. However, this is problematic when the target application for causal profiling requires another command-line
tool in order to run, e.g. `foo` is the target application but executing `foo` requires `mpirun -n 2 foo`. If one were to simply do `omnitrace-causal -- mpirun -n 2 foo`,
then the causal profiling would be applied to `mpirun` instead of `foo`. `omnitrace-causal` remedies this by providing a command-line option `-l` / `--launcher`
to indicate the target application is using a launcher script/executable. The argument to the command-line option is the name of (or regex for) the target application
on the command-line. When `--launcher` is used, `omnitrace-causal` will generate all the replay configurations and execute them but, instead of adding the `LD_PRELOAD` itself, it
will inject a call to itself into the command-line right before the target application. This recursive call to itself will inherit the configuration from
the parent `omnitrace-causal` executable, insert an `LD_PRELOAD` into the environment, and then invoke `execv` to replace itself with the target
application process.

In other words, the following command:
|
|
|
|
```console
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo
```

effectively results in:

```console
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
```
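
The command-line rewrite described above can be sketched in shell. This is only an illustration of where the nested `omnitrace-causal --` call is injected, not OmniTrace's implementation. `inject_before_target` is a hypothetical helper. It assumes the launcher regex is searched for within each argument and that only the first match is rewritten, and it prints the command instead of executing it.

```shell
# Hypothetical helper: print the given command with "omnitrace-causal --"
# inserted before the first argument that matches the launcher regex ($1).
inject_before_target() {
    target_regex=$1
    shift
    injected=0
    out=""
    for arg in "$@"; do
        # Assumption: a regex search within the argument, first match only.
        if [ "$injected" -eq 0 ] && printf '%s\n' "$arg" | grep -Eq -- "$target_regex"; then
            out="$out omnitrace-causal --"
            injected=1
        fi
        out="$out $arg"
    done
    # Drop the leading space left by the concatenation above.
    printf '%s\n' "${out# }"
}

# Equivalent of "-l foo -n 3": the rewritten command is produced once per replay.
for replay in 1 2 3; do
    inject_before_target 'foo' mpirun -n 2 foo
done
```

Each loop iteration prints `mpirun -n 2 omnitrace-causal -- foo`, mirroring the three commands listed above.
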

### Visualizing the Causal Output

OmniTrace generates a `causal/experiments.json` and `causal/experiments.coz` in `${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}`. A standalone GUI for viewing the causal profiling
results is under development but, until this is available, visit [plasma-umass.org/coz/](https://plasma-umass.org/coz/) and open the `*.coz` file.

## OmniTrace vs. Coz

This section is intended for readers who are familiar with the [Coz profiler](https://github.com/plasma-umass/coz).
OmniTrace provides several additional features and utilities for causal profiling:

|                      | [Coz](https://github.com/plasma-umass/coz)                                  | [OmniTrace](https://github.com/AMDResearch/omnitrace)      | Notes                          |
|----------------------|:---------------------------------------------------------------------------:|:----------------------------------------------------------:|--------------------------------|
| Debug info           | Requires debug info in DWARF v3 format (`-gdwarf-3`)                        | Optional, supports any DWARF format version                | See Note #1 below              |
| Experiment selection | `<file>:<line>`                                                             | `<function>` or `<file>:<line>`                            | See Note #2 below              |
| Experiment speedups  | Randomly samples from 0 to 100 in increments of 5, or uses one fixed speedup | Supports specifying a smaller subset                       | See Note #3 below              |
| Scope options        | Supports binary and source scopes                                           | Supports binary, source, and function scopes               | See Notes #4, #5, and #6 below |
| Scope inclusion      | Uses `%` as wildcard for binary and source scopes                           | Full regex support for binary, source, and function scopes |                                |
| Scope exclusion      | Not supported                                                               | Supports regexes for excluding binary/source/function      | See Note #7 below              |
| Call-stack sampling  | Linux perf                                                                  | libunwind                                                  | See Note #8 below              |

### Notes

1. OmniTrace supports a "function" mode which does not require debug info
2. OmniTrace supports selecting the entire range of instruction pointers for a function instead of the instruction pointer for one line. In large codes, "function" mode
   can resolve in fewer iterations and, once a target function is identified, one can switch to line mode and limit the function scope to the target function
3. OmniTrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }, where 0% is randomly selected 50% of the time and 5% and 10% are each randomly selected 25% of the time
4. OmniTrace and Coz have the same definition for binary scope: the binaries loaded at runtime (e.g. executable and linked libraries)
5. OmniTrace "source scope" supports both `<file>` and `<file>:<line>` formats, in contrast to Coz "source scope", which requires the `<file>:<line>` format
6. OmniTrace supports a "function" scope which narrows the functions/lines which are eligible for causal experiments to those within the matching functions
7. OmniTrace supports a second filter on scopes for removing binaries/sources/functions caught by the inclusive match, e.g. `BINARY_SCOPE=.*` + `BINARY_EXCLUDE=libmpi.*`
   initially includes all binaries but the exclude regex removes the MPI libraries