rocm-systems/source/docs/getting_started.md
Jonathan R. Madsen 9618ddefba Causal profiling (#229)
2023-01-24 18:53:23 -06:00


Getting Started


Nomenclature

The list below is intended to (A) provide a basic glossary for those who are not familiar with binary instrumentation, etc. and (B) clarify terms whose meaning depends on context, e.g., omnitrace's meaning of the term "module" when instrumenting Python.

  • Binary
    • File written in the Executable and Linkable Format (ELF)
    • Standard file format for executable files, shared libraries, etc.
  • Binary Instrumentation
    • Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically
  • Static Binary Instrumentation
    • Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded
    • Applicable to executables and libraries but limited to only the functions defined in the binary
    • Also known as: Binary Rewrite
  • Dynamic Binary Instrumentation
    • Loads an existing binary into memory, inserts instrumentation, executes binary
    • Limited to executables but capable of instrumenting linked libraries
    • Also known as: Runtime Instrumentation
  • Statistical Sampling
    • Also known as (simply) "sampling"
    • At periodic intervals, the application is paused and the current call-stack of the CPU is recorded along with various other metrics
    • Uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system
    • Sampling Rate
      • The frequency at which (A) or (B) are triggered (in units of # interrupts / second)
      • Higher values increase the number of samples
    • Sampling Delay
      • How long to wait before (A) and (B) begin triggering at their designated rate
    • Sampling Duration
      • The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
  • Process Sampling
    • At periodic (realtime) intervals, a background thread records global metrics without interrupting the current process. These metrics include, but are not limited to: CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU Temperature, GPU Power usage, etc.
    • Sampling Rate
      • The realtime frequency for recording metrics (in units of # measurements / second)
      • Higher values increase the number of samples
    • Sampling Delay
      • How long to wait (in realtime) before recording samples
    • Sampling Duration
      • The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
  • Module
    • With respect to binary instrumentation, a module is defined as either the filename (e.g. foo.c) or library name (libfoo.so) which contains the definition of one or more functions
    • With respect to Python instrumentation, a module is defined as the file which contains the definition of one or more functions.
      • The full path to this file typically contains the name of the "Python module"
  • Basic Block
    • Straight-line code sequence with:
      • No branches in (except for the entry)
      • No branches out (except for the exit)
  • Address Range
    • The instructions for a function in a binary start at a certain address within the ELF file and end at a certain address; the range is end - start
    • The address range is a decent approximation for the "cost" of a function, i.e., a larger address range approx. equates to more instructions
  • Instrumentation Traps
    • On the x86 architecture, because instructions are of variable size, the instruction at a point may be too small for Dyninst to replace it with the normal code sequence used to call instrumentation
      • Also, when instrumentation is placed at points other than subroutine entry, exit, or call points, traps may be used to ensure the instrumentation fits
    • By default, omnitrace avoids instrumentation which requires using a trap
  • Overlapping functions
    • Due to language constructs or compiler optimizations, it may be possible for multiple functions to overlap (that is, share part of the same function body) or for a single function to have multiple entry points
    • In practice, it is impossible to determine the difference between multiple overlapping functions and a single function with multiple entry points
    • By default, omnitrace avoids instrumenting overlapping functions

General Tips

  • Use omnitrace-avail to look up configuration settings, hardware counters, and data collection components
    • Use -d flag for descriptions
  • Generate a default configuration with omnitrace-avail -G ${HOME}/.omnitrace.cfg and tweak it to match the desired default behavior
  • Decide whether binary instrumentation, statistical sampling, or both will provide the desired performance data (for non-Python applications)
  • Compile code with optimization enabled (e.g. -O2 or higher), disable asserts (i.e. -DNDEBUG), and include debug info (i.e. -g1 at a minimum)
    • NOTE: compiling with debug info does not slow down the code, it only increases compile time and the size of the binary
    • In CMake, this is generally as easy as setting CMAKE_BUILD_TYPE=RelWithDebInfo or CMAKE_BUILD_TYPE=Release and CMAKE_<LANG>_FLAGS=-g1
  • Use binary instrumentation for characterizing the performance of every invocation of specific functions
  • Use statistical sampling to characterize the performance of the entire application while minimizing overhead
  • Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
  • Use the user API to create custom regions and to enable/disable omnitrace for specific processes, threads, and/or regions
  • Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
    • Dynamic symbol interception and callback APIs are (generally) controlled through OMNITRACE_USE_<API> options, e.g. OMNITRACE_USE_KOKKOSP and OMNITRACE_USE_OMPT enable Kokkos-Tools and OpenMP-Tools callbacks, respectively
  • When generically seeking regions for performance improvement:
    • Start off collecting a flat profile
    • Look for functions with high call counts, large cumulative runtimes/values, and/or large standard deviations
      • When call counts are high, improving the performance of the function or "inlining" it can be a quick and easy performance improvement
      • When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context. In this scenario, consider creating a specialized version of the function for the longer running contexts
    • Collect a hierarchical profile and, keeping the flat-profiling data in mind, verify the functions noted in the flat profile are part of the "critical path" of your application
      • E.g. function(s) with high call counts, etc. which are part of a "setup" or "post-processing" phase that does not consume much time relative to the overall time are, generally, a lower priority for optimization
  • Use the information from the profiles when analyzing detailed traces
  • When using binary instrumentation in the "trace" mode, binary rewrites are preferable to runtime instrumentation.
    • Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation can/will instrument functions defined in the shared libraries which are linked into the target binary
  • When using binary instrumentation with MPI, avoid runtime instrumentation
    • Runtime instrumentation requires a fork + ptrace, which is generally incompatible with how MPI applications spawn their processes
    • Binary rewrite the executable using MPI (and, optionally, libraries used by the executable) and execute the generated instrumented executable instead of the original, e.g. mpirun -n 2 ./myexe becomes mpirun -n 2 ./myexe.inst where myexe.inst is the generated instrumented myexe executable.

Data Collection Mode(s)

OmniTrace supports several modes of recording trace and profiling data for your application:

| Mode | Description |
|------|-------------|
| Binary Instrumentation | Locates functions (and loops, if desired) in binary and inserts snippets at the entry and exit |
| Statistical Sampling | Periodically pauses application at specified intervals and records various metrics for the given call-stack |
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos will make callbacks into omnitrace to provide information about the work the API is performing |
| Dynamic Symbol Interception | Wrap function symbols defined in position independent dynamic library/executable, e.g. pthread_mutex_lock in libpthread.so or MPI_Init in the MPI library |
| User API | User-defined regions and controls for omnitrace |

The two most generic, important modes are binary instrumentation and statistical sampling, and it is important to understand the advantages and disadvantages of each. Both can be performed with the omnitrace executable, but when no binary instrumentation is required/desired, it is highly recommended to use the omnitrace-sample executable for statistical sampling instead. With either tool, the callback APIs and dynamic symbol interception can be utilized.

Binary Instrumentation

Binary instrumentation allows one to deterministically record measurements for every single invocation of a given function. Binary instrumentation effectively adds instructions to the target application to collect the required information and, thus, has the potential to cause performance changes which may, in some cases, lead to inaccurate results. The effect depends on what information is being collected and which features are activated in omnitrace. For example, collecting only the wall-clock timing data will have less effect than collecting the wall-clock timing, cpu-clock timing, memory usage, cache-misses, and number of instructions executed. Similarly, collecting a flat profile will have less overhead than a hierarchical profile and collecting a trace OR a profile will have less overhead than collecting a trace AND a profile.

In omnitrace, the primary heuristic for controlling the overhead with binary instrumentation is the minimum number of instructions for selecting functions for instrumentation.

Statistical Sampling

Statistical call-stack sampling interrupts the application at regular intervals using operating system interrupts. Sampling is typically less numerically accurate and specific, but allows the target program to run at near full speed. In contrast to the data derived from binary instrumentation, the resulting data is not exact but, instead, a statistical approximation. However, sampling often provides a more accurate picture of the application execution because it is less intrusive to the target application and has fewer side effects on memory caches or instruction decoding pipelines. Furthermore, since sampling does not affect the execution speed as significantly, it is relatively immune to over-evaluating the cost of small, frequently called functions or "tight" loops.

In omnitrace, the overhead for statistical sampling is a factor of the sampling rate and whether the samples are taken with respect to the CPU time and/or real time.

Binary Instrumentation vs. Statistical Sampling Example

Consider the following code:

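#include <stdio.h>
#include <stdlib.h>

// naive recursive fibonacci: a very small, very frequently called function
long fib(long n)
{
    if(n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// computes fibonacci(nfib) and reports the result for iteration i
void run(long i, long nfib)
{
    long result = fib(nfib);
    printf("[%li] fibonacci(%li) = %li\n", i, nfib, result);
}

int main(int argc, char** argv)
{
    long nfib = 30;
    long nitr = 10;
    if(argc > 1) nfib = atol(argv[1]);
    if(argc > 2) nitr = atol(argv[2]);

    for(long i = 0; i < nitr; ++i)
        run(i, nfib);

    return 0;
}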

Binary instrumentation of the fib function will record every single invocation of the function, which for a very small function such as fib will result in significant overhead: this simple function tends to be less than 20 or so instructions, whereas the entry and exit snippets are ~1024 instructions. Thus, we generally want to avoid instrumenting functions where the instrumented function has significantly fewer instructions than the entry + exit instrumentation (please note, however, that many of the instructions in the entry/exit functions are either logging functions or depend on the runtime settings and thus may never be executed). Because of the number of potentially executed instructions in the entry/exit snippets, the default behavior of omnitrace is to only instrument functions which contain at least 1024 instructions.

However, recording every single invocation of the function can be extremely useful for detecting anomalies: profiles will show min/max values much smaller/larger than the average and/or a high standard deviation, and traces will allow you to identify exactly when and where those instances deviated from the norm. Consider the level of detail in the following traces where, in the top image, every instance of the fib function was instrumented vs. the bottom image where the fib call-stack was derived via sampling:

Binary Instrumentation of Fibonacci Function

instrumented-fibonnaci-trace

Statistical Sampling of Fibonacci Function

sampled-fibonnaci-trace