9618ddefba
* Addition of basic structure
* Reworked categories
* More causal integration additions
* Causal implementation
* Update examples
* delete virtual_speedup files
* Update perfetto submodule to v31.0
* Update dyninst submodule
* Update timemory submodule
* ElfUtils build for libdw
* OMNITRACE_LIKELY and OMNITRACE_UNLIKELY
* Update common lib join
* Examples updates for causal profiling
* config updates with causal options
- OMNITRACE_CAUSAL_FIXED_LINE
- OMNITRACE_CAUSAL_FIXED_SPEEDUP
- OMNITRACE_CAUSAL_FILE
- OMNITRACE_CAUSAL_BINARY_SCOPE
- OMNITRACE_CAUSAL_SOURCE_SCOPE
- version info in banner
- support increments in parse_numeric_range
- fix occasional deadlock in first call to get_config
* PTL general task group
* Always include PID in debug/verbose messages
* Add blocking/unblocking gotchas to runtime init bundle
* CausalState
* thread_data updates
- generic component_bundle_cache
* Improve handling of causal in category_region
* components updates
- backtrace_causal component
- backtrace::get_data member func
- decrease ignore_depth in backtrace::sample(int)
- handle "omnitrace_main" in backtrace::filter_and_patch(...)
- tweak internal thread state scope for pthread_mutex_gotcha wrappers
* simplify tracing get_instrumentation_bundles usage
* sampling updates
- include backtrace_causal component
- disable backtrace_metrics if using causal and not using perfetto
- disable backtrace and backtrace_timestamp when using causal
- post_process_causal
* causal updates
- more checks in blocking_gotcha and unblocking_gotcha start/stop
- miscellaneous overhaul of data
- experiment update
* Remove virtual speedup
* libomnitrace code_object
* causal-profiling test
* libomnitrace library.cpp updates
- handle causal profiling
- fini_bundle
* Disable causal profiling by default
* Updated causal code and example
- example: three execution variants: cpu + rng, cpu, rng
- example: three instrumentation variants: none, omni, coz
- fix blocking gotcha credit
- rework perform_experiment_impl
- get_eligible_address_ranges
- compute_eligible_lines
- support fixed lines/speedups/functions
- update selected_entry to support function mode
- fix causal::delay
- experiment updates
* omnitrace_progress / omnitrace_user_progress
- with accompanying omnitrace_annotated_progress / omnitrace_user_annotated_progress
* Update timemory submodule
* CausalMode
- mode indicates whether causal predictions should be at line-level or function-level
* code_object, config, runtime, sampling, thread_data
- code_object: address_range
- code_object: basic::line_info serialize(), name(), hash()
- config updates
- two signals for causal sampling
- thread_data init fixes
* pthread updates
- pthread_create_gotcha processes delays
- pthread_mutex_gotcha does not wrap pthread_join in causal mode
* backtrace_causal update
- dynamic delay period stats
* main wrapper uses basename of argv[0]
* update elfio submodule
* perf support (currently unused)
* Fix experiment JSON serialization
- static_vector.hpp (unused)
* causal executable + config options updates
- omnitrace-causal exe simplifies running multiple causal configs
- changed the causal config option names
* Support both throughput and latency points
* process-causal-json.py script
- will be used later for testing
* stable_vector
* Rework thread_data
* Improve omnitrace-causal exe
- better verbosity handling
- correct diagnosis of status for child process
- execvpe when only one iteration (debugging)
* Update timemory submodule
* exe --version
- omnitrace, omnitrace-avail, and omnitrace-sample all support --version on command-line
* OMNITRACE_INTERNAL_API + OMNITRACE_{LIKELY,UNLIKELY}
* omnitrace-causal cmake format
* omnitrace config update
- OMNITRACE_CAUSAL_FILE_CLOBBER
* custom exception
- wraps STL exception and gets stacktrace during construction
* exit_gotcha supports _Exit
* use global construct_on_init + max threads
- add some safety when exceeding max # of threads
* update code_object binary filter
- exclude dyninst and tbbmalloc library
* containers: c_array, static_vector, stable_vector
- moved utility::c_array to container::c_array
- created static_vector: std::vector bound to std::array
- created stable_vector: vector with stable references
* grow thread_data when new thread created
* causal updates
- data: improve compute_eligible_lines to ignore lambdas
- data: use new thread_data
- delay: use new thread_data
- experiment: properly support latency points
- experiment: support file clobber
- experiment: ensure non-zero experiment time
- progress_point: use new thread_data
- backtrace_causal: use new thread_data
* Update causal-profiling tests
* fix omnitrace-causal backslash escaping
* process-causal-json script
* restructure causal implementation
- update verbose messages for omnitrace-causal diagnose_status
- migrated causal implementation in sampling.cpp to causal/sampling.cpp
- OMNITRACE_USE_CAUSAL does not require OMNITRACE_USE_SAMPLING
- added Mode::Causal
- causal sampling uses same signals as regular sampling
- moved tracing::thread_init to implementation file
- combined tracing::thread_init and tracing::thread_init_sampling
- added causal/components folder
- pthread_create_gotcha::wrapper_config
- omnitrace_preload checks OMNITRACE_USE_CAUSAL
- updates mode accordingly
* update timemory submodule
* update timemory submodule
* causal example updates
- causal for lulesh
* perf code + utility - helpers
- relocated causal perf code
- placement new when generating unique ptr trait for potentially allocating during sampling
- additions to utility header
- removed previously added helpers.hpp
* update timemory submodule
* Default env variables for omnitrace-causal
- activate OMNITRACE_USE_KOKKOSP, etc.
* update stable_vector and static_vector
- static vector can use atomic for size tracking for thread-safe situations
* update causal example header
- CAUSAL_PROGRESS_NAMED
- use CAUSAL_ prefix for some macros
* Tweak lulesh example
- use CAUSAL_PROGRESS instead of CAUSAL_BEGIN and CAUSAL_END
* omnitrace-sample support for causal mode
- set OMNITRACE_USE_SAMPLING to off when OMNITRACE_MODE=causal
* refactor and cleanup code_object
- scope filter
- fixes to address_range
* overhaul causal data + causal config options
- full support for function and line mode
- support static vector of instruction pointers
- improve line info mapping resolution
- remove thread-locality from miscellaneous functions where unnecessary
- causal options for {binary,source,function,fileline} exclusion
* causal experiment, sampling, and backtrace updates
- is_selected + unwind address array
- experiment warning about progress points
- increased buffer size for backtrace_causal sampler
- backtrace_causal only stores IP addresses instead of full unwind info
* category_region updates
- minor refactor
- local_category_region::mark
* Update causal tests
* Bump version to 1.8.0
* omnitrace-causal args + CLOBBER -> RESET
- renamed OMNITRACE_CAUSAL_FILE_CLOBBER to OMNITRACE_CAUSAL_FILE_RESET
- updated omnitrace-causal exe to support recently added configuration options
- other miscellaneous tweaks to data.cpp, experiment.cpp, and sampling.cpp
* Refactor causal and code_object
- code_object.hpp and code_object.cpp moved into binary folder
- causal components namespaced into omnitrace::causal::component
- moved sample_data out of backtrace_causal and into own file
- renamed backtrace_causal to causal::component::backtrace
* preload omnitrace_init + OMNITRACE_DEBUG_MARK
- env OMNITRACE_DEBUG_MARK
- fix omnitrace_init call when LD_PRELOAD-ing omnitrace
* Fix fileline support + line-info output names + experiment log
- line-info log files are prefixed with experiment name
- don't print experiment duration when E2E
- account for fileline scope in analysis
* KokkosP: OMNITRACE_KOKKOSP_NAME_LENGTH_MAX
- config option to limit the name of kokkos tool callbacks
- remove [kokkos] from KokkosP names
* Update causal example
- minor tweaks to decrease probability of overlapping regions in binary
* omnitrace-causal update
- prefix N / Ntot in environment printout
* Miscellaneous updates
- causal::finish_experimenting()
- OMNITRACE_CAUSAL_RANDOM_SEED
- KokkosP causal updates
- exclude some callbacks, make some callbacks unique, etc.
- address_range::operator+=(address_range)
- combine contiguous ranges in binary/analysis.cpp when file, func, line is same and address range is contiguous
- bfd_line_info reads inline info
- wait for perform_experiment_impl to complete
- causal::delay updates
- delay::process checks if experiment is active
- uses threading::get_id()
- experiment scales duration up for larger speedup experiments
- line info samples includes excluded lines
- sampler uses CLOCK_REALTIME
- blocking_gotcha updates
- is no longer fully static
- adds audit routine which sets the postblock value to zero if try/timed routine fails
- category::host was added to causal_throughput_categories_t
- pthread_create_gotcha sets new threads local parent delay
- was using internal value, now uses sequent value
* Causal improvements to KokkosP
* Updates to experiment time scaling
- use stats instead of just max
* binary/link_map.{hpp,cpp}
* update process-causal-json.py
* Folded fileline scope into source scope
* Update documentation
- Add documentation for causal profiling
- Replace 'Omnitrace' with 'OmniTrace' everywhere
* Update causal-helpers.cmake + omnitrace-testing.cmake
- split tests/CMakeLists.txt partially into omnitrace-testing.cmake
* omnitrace/causal.h
- OMNITRACE_CAUSAL_PROGRESS
- OMNITRACE_CAUSAL_PROGRESS_NAMED
- OMNITRACE_CAUSAL_BEGIN
- OMNITRACE_CAUSAL_END
* selected_entry + remove default filters for lambdas and operator()
- selected entry stores range and binary load address
* update process-causal-json.py
* format examples/lulesh/CMakeLists.txt
* causal-helpers find_package(Threads)
* OMNITRACE_KOKKOSP_KERNEL_LOGGER
- was OMNITRACE_KOKKOS_KERNEL_LOGGER
* quiet find of coz-profiler
* Fix rocm_smi exception handling
* Update timemory submodule (binutils)
- fix binutils compile error on some systems
- bump binutils to v2.40
* Fix miscellaneous tests
* OMNITRACE_KOKKOSP_PREFIX
* revert rocm_smi handling
* ElfUtils updates
- default to download version 0.188
- add -Wno-error=null-dereference due to GCC 12 compiler error
* Update causal example
* Remove OMNITRACE_VERBOSE from global workflow envs
* Reliable causal test
* disable compilation of causal perf files
* Remove set_current_selection with unwind stack
* update timemory submodule
* fix for segfault on bionic
- locking in TLS dtor was causing segfault
* remove experiment::is_selected(unwind_stack_t)
* update default init of selected_entry
* Fix for when IP is not offset by load address
* Update CMakeLists.txt
* Miscellaneous updates
- OMNITRACE_WARNING_OR_CI_THROW
- OMNITRACE_REQUIRE
- OMNITRACE_PREFER
- fixed issues with no ASLR
- added load address variable and ipaddr() func to basic/bfd line info
- removed get_basic() from dwarf_line_info
- TIMEMORY_PREFER -> OMNITRACE_PREFER
- removed previously added binary_address and range variables from selected_entry
* Removed superfluous CausalState
* Additional causal tests (lulesh + kokkos)
* filter, prefer, analysis ASLR handling
- removed default filter on cold functions
- fixed OMNITRACE_PREFER
- fixed analysis ASLR handling
* Tweak line-info output
* Removed some superfluous code
- causal/delay
- causal/selected_entry
* Exclude main.cold in function mode
* Update validate-perfetto-proto.py
- account for occasional http errors
* Add sampling test disabling tmp files
* argparser for process-causal-json
- support validation
- support filtering
* Avoid pthread_{lock,unlock} in sampling offload
- use homemade atomic_mutex/atomic_lock since contention will be low and using pthread tools might trigger our wrappers
* Rename process-causal-json.py
- validate-causal-json.py
* rework omnitrace_add_causal_test
- capable of performing validation
- added validation tests
* Fix kokkosp_begin_deep_copy + causal
* Tweak address range in bfd_line_info::read_pc
* Tweak analysis and data IP handling
- look for gaps
* Disable scaling experiment time by speedup
* Revert change in max threads during CI
* binary updates
- significant overhaul of binary analysis implementation
- removed "basic_line_info" and "bfd_line_info" in lieu of "symbol" class
- symbol class has basic BFD info + vector of inlines + vector of dwarf info
* Updated causal to use new binary analysis
- Fix symbol.cpp includes
* Updated formatting target
- include *.cmake files
* Updated causal tests
- causal tests should be stable now
* Update timemory and dyninst submodules
- TPLs are stripped + built w/o debug info
* Increase tolerance for causal validation speedups
- higher speedups have more variance (increased to +/- 5 from 3)
* Support causal output for MPI
- i.e. tag with MPI rank
* omnitrace-causal launcher argument
* improve experiment sampling output
* causal data updates
- call compute lines once
- fixed filtered cached binary info
- debugging info when experiment fails to start
* Tweaked causal validation tests
* dwarf_entry ranges
* CI updates
- increase max threads to 64
* Tweak causal E2E validation tests
- more threads
- shorter thread runtime
- more iterations
* Fix shadowed variable
* fix symbol read_bfd last PC calculation
* fix maybe-uninitialized warning
* omnitrace-causal launcher update
- only inject "omnitrace-causal --" once
- throw error if no matches found
* Update causal profiling docs for launcher
* fix address range boundaries
358 lines
21 KiB
Markdown
# Call-Stack Sampling

```eval_rst
.. toctree::
   :glob:
   :maxdepth: 4
```

> ***NOTE: Set `OMNITRACE_USE_SAMPLING=ON` to activate call-stack sampling when executing an instrumented binary***

Call-stack sampling can be activated with either a binary instrumented via the `omnitrace` executable or via the `omnitrace-sample` executable.
***Effectively***, all of the commands below are equivalent:

- Binary rewrite with only instrumentation necessary to start/stop sampling

```console
omnitrace -M sampling -o foo.inst -- foo
./foo.inst
```

- Runtime instrumentation with only instrumentation necessary to start/stop sampling

```console
omnitrace -M sampling -- foo
```

- No instrumentation required

```console
omnitrace-sample -- foo
```

All `omnitrace -M sampling` (referred to as "instrumented-sampling" henceforth) does is wrap the `main` of the executable with initialization
before `main` starts and finalization after `main` ends.
This can be easily accomplished without instrumentation via an `LD_PRELOAD` of a library containing a dynamic symbol wrapper around `__libc_start_main`.
Thus, whenever binary instrumentation is unnecessary, using `omnitrace-sample` is recommended over `omnitrace -M sampling` for several reasons:

1. `omnitrace-sample` provides command-line options for controlling features of omnitrace instead of *requiring* configuration files or environment variables
2. Despite the fact that instrumented-sampling only requires inserting snippets around one function (`main`), Dyninst
   does not have a feature for specifying that parsing and processing all the other symbols in the binary is unnecessary.
   Thus, in the best-case scenario, instrumented-sampling has a slightly slower launch time when the target binary is relatively small
   but, in the worst-case scenarios, requires a significant amount of time and memory to launch
3. `omnitrace-sample` is fully compatible with MPI, e.g. `mpirun -n 2 omnitrace-sample -- foo`, whereas `mpirun -n 2 omnitrace -M sampling -- foo`
   is incompatible with some MPI distributions (particularly OpenMPI) because of MPI restrictions against forking within an MPI rank
   - Recall that when MPI and binary instrumentation are involved, two steps are required: (1) do a binary rewrite of the executable
     and (2) use the instrumented executable in lieu of the original executable. `omnitrace-sample` is thus much easier to use with MPI.
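
The difference between the two MPI workflows can be sketched as a pair of shell functions. The omnitrace and `mpirun` command lines are the ones quoted in this section, with `foo` as the placeholder application. The `run` helper and its `DRY_RUN` switch are hypothetical conveniences, not part of omnitrace: by default they only print each command so the sketch can be read without MPI or omnitrace installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of omnitrace): print the command when
# DRY_RUN is unset or 1, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Instrumented-sampling with MPI requires two steps.
mpi_instrumented_sampling() {
    app="$1"
    nranks="$2"
    # (1) binary rewrite of the executable
    run omnitrace -M sampling -o "${app}.inst" -- "${app}" || return 1
    # (2) launch the instrumented executable in lieu of the original
    run mpirun -n "${nranks}" "./${app}.inst"
}

# omnitrace-sample needs a single step.
mpi_sample() {
    app="$1"
    nranks="$2"
    run mpirun -n "${nranks}" omnitrace-sample -- "${app}"
}
```

Calling `mpi_instrumented_sampling foo 2` prints the rewrite command followed by the `mpirun` launch of `./foo.inst`, while `mpi_sample foo 2` prints only one command.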

## omnitrace-sample Executable

View the help menu of `omnitrace-sample` with the `-h` / `--help` option:

```console
$ omnitrace-sample --help
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
                                             --monochrome (max: 1, dtype: bool)
                                             --debug (max: 1, dtype: bool)
                                             --verbose (count: 1)
                                             --config (min: 0, dtype: filepath)
                                             --output (min: 1)
                                             --trace (max: 1, dtype: bool)
                                             --profile (max: 1, dtype: bool)
                                             --flat-profile (max: 1, dtype: bool)
                                             --host (max: 1, dtype: bool)
                                             --device (max: 1, dtype: bool)
                                             --trace-file (count: 1, dtype: filepath)
                                             --trace-buffer-size (count: 1, dtype: KB)
                                             --trace-fill-policy (count: 1)
                                             --profile-format (min: 1)
                                             --profile-diff (min: 1)
                                             --process-freq (count: 1)
                                             --process-wait (count: 1)
                                             --process-duration (count: 1)
                                             --cpus (count: unlimited, dtype: int or range)
                                             --gpus (count: unlimited, dtype: int or range)
                                             --freq (count: 1)
                                             --wait (count: 1)
                                             --duration (count: 1)
                                             --tids (min: 1)
                                             --cputime (min: 0)
                                             --realtime (min: 0)
                                             --include (count: unlimited)
                                             --exclude (count: unlimited)
                                             --cpu-events (count: unlimited)
                                             --gpu-events (count: unlimited)
                                             --inlines (max: 1, dtype: bool)
                                             --hsa-interrupt (count: 1, dtype: int)
                                           ]

Options:
    -h, -?, --help                 Shows this page

    [DEBUG OPTIONS]

    --monochrome                   Disable colorized output
    --debug                        Debug output
    -v, --verbose                  Verbose output

    [GENERAL OPTIONS]

    -c, --config                   Configuration file
    -o, --output                   Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
    -T, --trace                    Generate a detailed trace (perfetto output)
    -P, --profile                  Generate a call-stack-based profile (conflicts with --flat-profile)
    -F, --flat-profile             Generate a flat profile (conflicts with --profile)
    -H, --host                     Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
    -D, --device                   Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.

    [TRACING OPTIONS]

    --trace-file                   Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
    --trace-buffer-size            Size limit for the trace output (in KB)
    --trace-fill-policy [ discard | ring_buffer ]

                                   Policy for new data when the buffer size limit is reached:
                                       - discard     : new data is ignored
                                       - ring_buffer : new data overwrites oldest data

    [PROFILE OPTIONS]

    --profile-format [ console | json | text ]
                                   Data formats for profiling results
    --profile-diff                 Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
                                   corresponding to the input path and the input prefix

    [HOST/DEVICE (PROCESS SAMPLING) OPTIONS]


    --process-freq                 Set the default host/device sampling frequency (number of interrupts per second)
    --process-wait                 Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
    --process-duration             Set the duration of the host/device sampling (in seconds of realtime)
    --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges
    --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges

    [GENERAL SAMPLING OPTIONS]

    -f, --freq                     Set the default sampling frequency (number of interrupts per second)
    -w, --wait                     Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
                                   of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
    -d, --duration                 Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
                                   delay that exceeds the real-time duration... resulting in zero samples being taken
    -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
                                   application is assigned an atomically incrementing value.

    [SAMPLING TIMER OPTIONS]

    --cputime                      Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
                                       0. Enables sampling based on CPU-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
                                       2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
    --realtime                     Sample based on a real-clock timer. Accepts zero or more arguments:
                                       0. Enables sampling based on real-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
                                       2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
                                       When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
                                       to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
                                       whereas the CPU-clock time does not.

    [BACKEND OPTIONS]  (These options control region information captured w/o sampling or instrumentation)

    -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Include data from these backends
    -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Exclude data from these backends

    [HARDWARE COUNTER OPTIONS]

    -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
    -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)

    [MISCELLANEOUS OPTIONS]

    -i, --inlines                  Include inline info in output when available
    --hsa-interrupt [ 0 | 1 ]      Set the value of the HSA_ENABLE_INTERRUPT environment variable.
                                   ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
                                   that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
                                   when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
                                   performance.
                                   Values:
                                       0     avoid triggering the bug, potentially at the cost of reduced performance
                                       1     do not modify how ROCm is notified about kernel completion
```
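
Two conventions in the help text above are easy to misread: how an ID list such as `'0 2-4'` is interpreted, and how "interrupts per second" translates into a sampling period. The sketch below illustrates both. It is not omnitrace's parser; `expand_ids` and `sampling_period_ms` are hypothetical helper names, and only plain integers and inclusive `lo-hi` ranges are handled.

```shell
#!/bin/sh
# Hypothetical helper: expand integers and inclusive ranges,
# e.g. "0 2-4" selects IDs 0, 2, 3 and 4 (ID 1 is skipped).
expand_ids() {
    out=""
    for tok in "$@"; do
        case "$tok" in
            *-*)
                lo="${tok%-*}"   # text before the hyphen
                hi="${tok#*-}"   # text after the hyphen
                i="$lo"
                while [ "$i" -le "$hi" ]; do
                    out="${out:+$out }$i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="${out:+$out }$tok"
                ;;
        esac
    done
    echo "$out"
}

# Hypothetical helper: convert interrupts per second into the
# period between samples in milliseconds (integer division).
sampling_period_ms() {
    echo $((1000 / $1))
}
```

With these definitions, `expand_ids 0 2-4` prints `0 2 3 4` and `sampling_period_ms 100` prints `10`, matching the "100 == sample every 10 milliseconds" note in the help text.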

The general syntax for separating omnitrace command line arguments from the application arguments
is consistent with the LLVM style of using a standalone double-hyphen (`--`). All arguments preceding the double-hyphen
are interpreted as belonging to omnitrace and all arguments following the double-hyphen are interpreted as the
application and its arguments. The double-hyphen is only necessary when passing command line arguments to the target
which also use hyphens. E.g. `omnitrace-sample ls` works but, in order to run `ls -la`, use `omnitrace-sample -- ls -la`.
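
The double-hyphen convention can be illustrated with a small shell function that splits an argument list at the first standalone `--`. This is a sketch of the convention only, not of how `omnitrace-sample` parses its command line; `split_at_double_hyphen` is a hypothetical name, and the sketch does not model the case where the `--` is omitted.

```shell
#!/bin/sh
# Hypothetical helper: everything before the first standalone "--" is
# reported as tool arguments, everything after it as the target command.
split_at_double_hyphen() {
    tool_args=""
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--" ]; then
            shift   # drop the separator itself
            break
        fi
        tool_args="${tool_args:+$tool_args }$1"
        shift
    done
    echo "tool: ${tool_args}"
    echo "target: $*"
}
```

For example, `split_at_double_hyphen -P -T -- ls -la` reports `-P -T` as the tool arguments and `ls -la` as the target command, so the `-la` is never mistaken for a tool option.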

[Configuring OmniTrace Runtime](runtime.md) establishes the precedence of environment variable values over values specified in the configuration files. This enables
the user to configure the omnitrace runtime to their preferred default behavior in a file such as `~/.omnitrace.cfg` and then easily override
those settings via something like `OMNITRACE_ENABLED=OFF omnitrace-sample -- foo`.
Similarly, the command line arguments passed to `omnitrace-sample` take precedence over environment variables.
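
The resulting precedence order (configuration file, then environment variable, then command line) can be summarized in a few lines of shell. This is only a model of the rule stated above, under the assumption that an empty string means "not specified at this level"; `effective_value` is a hypothetical helper and does not read any omnitrace files.

```shell
#!/bin/sh
# Hypothetical helper: return the value from the highest-precedence
# level that specified one. Arguments, lowest to highest precedence:
#   $1 = configuration file value
#   $2 = environment variable value
#   $3 = command-line value
effective_value() {
    cfg_val="$1"
    env_val="$2"
    cli_val="$3"
    if [ -n "$cli_val" ]; then
        echo "$cli_val"
    elif [ -n "$env_val" ]; then
        echo "$env_val"
    else
        echo "$cfg_val"
    fi
}
```

For instance, `effective_value ON OFF ""` prints `OFF`: the environment overrides the configuration file when nothing is given on the command line.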

All of the command-line options above correlate to one or more configuration settings, e.g. `--cpu-events` correlates to the `OMNITRACE_PAPI_EVENTS` configuration variable.
After the command-line arguments to `omnitrace-sample` have been processed but before the target application is executed, `omnitrace-sample` will emit a log
of the environment variables that were set and/or modified.

The snippet below shows the environment updates when `omnitrace-sample` is invoked with no arguments:

```console
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100

HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1

...
```

The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling, and all the available backends:

```console
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100

HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_TRACE_THREAD_LOCKS=true
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
OMNITRACE_USE_KOKKOSP=true
OMNITRACE_USE_MPIP=true
OMNITRACE_USE_OMPT=true
OMNITRACE_USE_PERFETTO=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=true
OMNITRACE_USE_ROCM_SMI=true
OMNITRACE_USE_ROCPROFILER=true
OMNITRACE_USE_ROCTRACER=true
OMNITRACE_USE_ROCTX=true
OMNITRACE_USE_SAMPLING=true
OMNITRACE_USE_TIMEMORY=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1

...
```

The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling,
sets the output path to `omnitrace-output`, the output prefix to `%tag%` and disables all the available backends:

```console
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100

LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_USE_PERFETTO=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
OMNITRACE_USE_TIMEMORY=true

...
```

## omnitrace-sample Example

```console
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100

LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CONFIG_FILE=
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_USE_PERFETTO=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
OMNITRACE_USE_TIMEMORY=true

[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|


[759.689] perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

[parallel-overhead-locks] Threads: 4
[parallel-overhead-locks] Iterations: 100
[parallel-overhead-locks] fibonacci(30)...
[1] number of iterations: 100
[2] number of iterations: 100
[3] number of iterations: 100
[4] number of iterations: 100
[parallel-overhead-locks] fibonacci(30) x 4 = 394644873
[parallel-overhead-locks] number of mutex locks = 400
[omnitrace][107157][0][omnitrace_finalize]
[omnitrace][107157][0][omnitrace_finalize] finalizing...
[omnitrace][107157][0][omnitrace_finalize]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157 : 0.610427 sec wall_clock, 2.248 MB peak_rss, 2.265 MB page_rss, 2.560000 sec cpu_clock, 419.4 % cpu_util [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/0 : 0.608866 sec wall_clock, 0.000677 sec thread_cpu_clock, 0.1 % thread_cpu_util, 2.248 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/1 : 0.608237 sec wall_clock, 0.603553 sec thread_cpu_clock, 99.2 % thread_cpu_util, 2.204 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/2 : 0.601430 sec wall_clock, 0.598378 sec thread_cpu_clock, 99.5 % thread_cpu_util, 1.156 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/3 : 0.570223 sec wall_clock, 0.568713 sec thread_cpu_clock, 99.7 % thread_cpu_util, 0.772 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/4 : 0.557637 sec wall_clock, 0.557198 sec thread_cpu_clock, 99.9 % thread_cpu_util, 0.156 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize]
[omnitrace][107157][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][107157][perfetto]> Outputting '/home/user/data/omnitrace-output/2022-10-19_02.46/parallel-overhead-locksperfetto-trace-107157.proto' (842.90 KB / 0.84 MB / 0.00 GB)... Done
[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.json'
[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.txt'
[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.json'
[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.txt'
[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.json'
[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.txt'
[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.json'
[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.txt'
[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.json'
[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.txt'
[omnitrace][107157][metadata]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksmetadata-107157.json' and 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksfunctions-107157.json'
[omnitrace][107157][0][omnitrace_finalize] Finalized
[761.584] perfetto.cc:57382 Tracing session 1 ended, total sessions:0
```
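
In the log above, every output file name ends with `-<PID>.<extension>` (`107157` in this run). Assuming that naming pattern holds for other runs, the files written by one process can be collected with a short `find` wrapper. `list_outputs_for_pid` is a hypothetical helper name, not an omnitrace command.

```shell
#!/bin/sh
# Hypothetical helper: list the files under an output directory whose
# names end in "-<pid>.<extension>", as seen in the log above.
list_outputs_for_pid() {
    dir="$1"
    pid="$2"
    find "$dir" -type f -name "*-${pid}.*" | sort
}
```

For the run shown above, `list_outputs_for_pid omnitrace-output 107157` would be expected to list the `.proto`, `.json`, and `.txt` files named in the log.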