## Overview
This PR attempts to increase the stability of binary rewrite and runtime instrumentation.
### Improved protection against self-instrumentation
Using ~~the binary analysis capabilities added from #229~~ the Dyninst SymtabAPI, OmniTrace now does a much better job of avoiding instrumentation of functions which are internally called by OmniTrace:
- The `omnitrace` executable searches for and parses the symbols of various libraries which are known to cause problems when instrumented:
  - GNU libraries which nearly every executable and library depends on, e.g., `"libc.so.6"`, `"libdl.so.2"`, etc., and which are thus outside the scope of the user's optimization efforts
  - Libraries which OmniTrace depends on for functionality, e.g., `"libunwind.so"`, `"libgotcha.so"`, `"libroctracer64.so"`, etc.
- OmniTrace skips instrumenting any `module_function` instance when its `module_name` or `function_name` member matches the library name, source file, or function name found for that symbol (unless the user explicitly requests that it be eligible for instrumentation)
- Note: parsing the "internal" libraries may increase instrumentation time and memory usage. Please file an issue if either is found to be excessive.
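The skip rule above can be sketched in Python. This is an illustration only: OmniTrace itself is C++ and gathers these names with the Dyninst SymtabAPI, and the `InternalSymbols` container, the `should_skip` function, and the sample names are hypothetical. A candidate is skipped when its module name matches an internal library or source file, or its function name matches an internal function, unless a user-supplied include regex opts it back in (standing in for the `--internal-module-include` / `--internal-function-include` options).

```python
import re
from dataclasses import dataclass, field


@dataclass
class InternalSymbols:
    """Hypothetical summary of names parsed from the "internal" libraries."""
    libraries: set = field(default_factory=set)     # e.g. {"libc.so.6"}
    source_files: set = field(default_factory=set)  # files from debug info
    functions: set = field(default_factory=set)     # symbol names


def should_skip(module_name, function_name, internal,
                module_include=None, function_include=None):
    """Return True if this (module, function) pair should not be instrumented.

    module_include / function_include are optional user regexes that opt a
    candidate back in, overriding the automatic exclusion.
    """
    # An explicit user request always wins.
    if module_include and re.search(module_include, module_name):
        return False
    if function_include and re.search(function_include, function_name):
        return False
    # Module name matches an internal library name or source file.
    if module_name in internal.libraries or module_name in internal.source_files:
        return True
    # Function name matches a function found in an internal library.
    return function_name in internal.functions


internal = InternalSymbols(libraries={"libc.so.6"}, functions={"malloc"})
```

With this setup, `should_skip("app", "malloc", internal)` is `True`, while passing `function_include="^malloc$"` makes it `False`.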
### Function filters based on linkage and visibility
Added options to restrict instrumentation to certain linkage types (e.g. avoid instrumenting weak symbols) and visibility types (e.g. avoid instrumenting hidden symbols).
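A minimal Python sketch of how the two filters compose. The `Symbol` type and `filter_symbols` function are hypothetical and are not OmniTrace's API; the type names are the ones accepted by the new `--linkage` and `--visibility` options, and treating "no option given" as "no restriction" is an assumption.

```python
from dataclasses import dataclass

LINKAGES = {"unknown", "global", "local", "weak", "unique"}
VISIBILITIES = {"unknown", "default", "hidden", "protected", "internal"}


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a symbol-table entry."""
    name: str
    linkage: str
    visibility: str


def filter_symbols(symbols, allowed_linkage=None, allowed_visibility=None):
    """Keep symbols whose linkage and visibility are in the allowed sets.

    None means "no restriction" for that attribute (an assumption).
    """
    for allowed, valid in ((allowed_linkage, LINKAGES),
                           (allowed_visibility, VISIBILITIES)):
        if allowed is not None and not set(allowed) <= valid:
            raise ValueError(f"unknown type(s): {set(allowed) - valid}")
    return [
        s for s in symbols
        if (allowed_linkage is None or s.linkage in allowed_linkage)
        and (allowed_visibility is None or s.visibility in allowed_visibility)
    ]
```

For example, restricting to `{"global"}` linkage and `{"default"}` visibility drops both weak and hidden symbols.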
### Function filters based on instructions
In the past, some applications instrumented by Dyninst would fail with a trap signal after instrumentation (e.g., #147). In several cases, this was found to occur whenever certain instructions were present in the function, so an option was added to exclude functions whose instructions match one or more regex patterns.
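A minimal Python sketch of the instruction filter, assuming each function's disassembly is available as a list of instruction strings. OmniTrace obtains instructions through Dyninst; the `excluded_by_instructions` helper, the function names, and the disassembly here are hypothetical, and the pattern is arbitrary rather than a known-problematic instruction. A function is excluded as soon as any instruction matches any pattern.

```python
import re


def excluded_by_instructions(instructions, patterns):
    """Return True if any instruction matches any exclusion regex."""
    compiled = [re.compile(p) for p in patterns]
    return any(rx.search(insn) for insn in instructions for rx in compiled)


# Hypothetical disassembly for two functions.
functions = {
    "plain": ["push rbp", "mov rbp, rsp", "pop rbp", "ret"],
    "uses_xbegin": ["push rbp", "xbegin 0x10", "xend", "ret"],
}

# Arbitrary example pattern: exclude functions containing an `xbegin`.
to_instrument = [name for name, insns in functions.items()
                 if not excluded_by_instructions(insns, [r"^xbegin\b"])]
```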
## Details
- generates a list of "internal" libraries and attempts to find the first match for each via:
  - the library is already open, e.g. `dlopen(<libname>, RTLD_LAZY | RTLD_NOLOAD)`
  - searching for the library in `LD_LIBRARY_PATH`
  - searching for the library in `OMNITRACE_ROCM_PATH`, `ROCM_PATH`
  - searching the folders from `/sbin/ldconfig -p`
  - searching for the library in common places such as `/usr/local/lib`
- provides new `--linkage` command line option to restrict instrumentation to functions with particular type(s) of linkage
  - Linkage types: `unknown`, `global`, `local`, `weak`, `unique`
- provides new `--visibility` command line option to restrict instrumentation to functions with particular type(s) of visibility
  - Visibility types: `unknown`, `default`, `hidden`, `protected`, `internal`
- provides new `--internal-module-include` and `--internal-function-include` command line regex options to bypass automatic exclusion from instrumentation
- provides new `--internal-library-append` command line option to specify that a library should be considered internal
- provides new `--internal-library-remove` command line option to specify that a library should not be considered internal
- provides new `--instruction-exclude` command line regex option to exclude functions which contain matching instructions
- provides new `--internal-library-deps` command line option to treat libraries linked to internal libraries as internal libraries
  - generally, this will only be helpful during runtime instrumentation when OmniTrace is built with an external Dyninst library which is dynamically linked to Boost libraries and the application uses the same Boost libraries
- relaxed restrictions in `module_function::is_module_constrained()`
- relaxed restrictions in `module_function::is_routine_constrained()`
- added a few miscellaneous nullptr checks
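The search order for internal libraries can be approximated in Python. This is a sketch, not OmniTrace's code: the `find_library` name, the `common_dirs` default, and the assumption that ROCm libraries live under `<prefix>/lib` are illustrative. The first step relies on `RTLD_NOLOAD`, which makes `dlopen` succeed only for a library that is already loaded; the later steps consult `/sbin/ldconfig -p` and a few common directories.

```python
import ctypes
import os
import subprocess


def find_library(name, common_dirs=("/usr/local/lib", "/usr/lib", "/lib")):
    """Return a path (or the bare name if already loaded), else None.

    Approximated order: already loaded, LD_LIBRARY_PATH,
    OMNITRACE_ROCM_PATH / ROCM_PATH, ldconfig cache, common directories.
    """
    # 1. Already open? RTLD_NOLOAD makes dlopen fail instead of loading.
    try:
        ctypes.CDLL(name, mode=os.RTLD_LAZY | os.RTLD_NOLOAD)
        return name
    except OSError:
        pass

    def first_existing(dirs):
        for d in dirs:
            candidate = os.path.join(d, name)
            if d and os.path.isfile(candidate):
                return candidate
        return None

    # 2. LD_LIBRARY_PATH entries.
    found = first_existing(os.environ.get("LD_LIBRARY_PATH", "").split(":"))
    # 3. ROCm prefixes (assumed layout: <prefix>/lib).
    if found is None:
        rocm = [os.environ.get(v) for v in ("OMNITRACE_ROCM_PATH", "ROCM_PATH")]
        found = first_existing([os.path.join(p, "lib") for p in rocm if p])
    # 4. Directories of the libraries listed by `/sbin/ldconfig -p`.
    if found is None and os.path.exists("/sbin/ldconfig"):
        out = subprocess.run(["/sbin/ldconfig", "-p"], capture_output=True,
                             text=True).stdout
        dirs = {os.path.dirname(line.split("=>")[-1].strip())
                for line in out.splitlines() if "=>" in line}
        found = first_existing(sorted(dirs))
    # 5. Common install locations.
    return found or first_existing(common_dirs)
```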
## Miscellaneous
- Fix `LD_PRELOAD` + `OMNITRACE_DL_VERBOSE=3` issue
- Adds a sampling offload verbose message
- Improves MPI send-recv.cpp example error message
- Minor tweaks to binary library
  - `binary::get_linked_path` returns `std::optional<string>`
  - renamed `binary::symbol::read_bfd` to `binary::symbol::read_bfd_line_info`
  - `binary::get_binary_info` has param options for reading line info and including undefined symbols
- fixed another edge case instance of resource deadlock during first call to `configure_settings`
- improved the error log printing in `omnitrace` (does not print repeated messages)
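Suppressing repeated log messages can be as simple as remembering which messages were already emitted. The Python sketch below illustrates that idea only; it is not the code used in the `omnitrace` executable, and the `DedupLog` class is hypothetical.

```python
import sys


class DedupLog:
    """Print each distinct message once and count suppressed repeats."""

    def __init__(self, stream=sys.stderr):
        self.stream = stream
        self.seen = set()
        self.suppressed = 0

    def error(self, msg):
        """Emit msg unless it was already printed; return True if printed."""
        if msg in self.seen:
            self.suppressed += 1
            return False
        self.seen.add(msg)
        print(msg, file=self.stream)
        return True
```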
* fix OMNITRACE_DL_VERBOSE=3 + preload issue
- join needs to handle nullptr
* sampling offload verbose message
* mpi-send-recv error message
* binary updates
- get_linked_path returns std::optional<string>
- get_binary_info accepts include_undef flag
- renamed symbol::read_bfd to symbol::read_bfd_line_info
- get_binary_info has param options for reading line info and including undefined symbols
* config updates (initialization)
- fixed another instance of resource deadlock during first call to configure_settings
* Testing fix for HIP w/o rocprofiler support
- disable rocprofiler tests when HIP enabled but OMNITRACE_USE_ROCPROFILER=OFF
* omnitrace exe: insert_instr nullptr check
* omnitrace exe: new method for determining internal constraints
- added internal-libs.cpp
- using binary::get_binary_info on various known libs used by omnitrace
- any matching func/file from symbols found in known internal libs are excluded
- relaxed restrictions in is_module_constrained
- relaxed restrictions in is_routine_constrained
- added a few safety checks
* internal libs append/remove
- options to change which libs are considered internal libraries
* omnitrace exe instruction exclude
- regex option for excluding functions containing specific instructions
* fix is_internal_constrained
* binary link map verbose message
* support constraints on linkage and visibility of symbols
* misc fixes
- fix compiler error for Ubuntu Jammy + GCC 12
- dlopen + libtbbmalloc_proxy appears to be causing issues on OpenSUSE
* Performance details + MT
- multithread processing internal info
- report timing info
* Defer parsing internal data
- wait until after address space is created
* Performance improvement finding for get_symtab_function
* fix data race in get_binary_info
* remove set_default for linkage and visibility argparse
* Parse internal libs with Dyninst::Symtab instead of binary reader
- conflicting versions of libraries for binary analysis cause problems
- expanded whole function restrictions
- expanded module_function::is_routine_constrained regex
* internal lib updates
- include memory usage info
- option to read libraries linked against internal libs: --internal-library-deps
- defer parsing internal libs data to when processing modules
# Omnitrace: Application Profiling, Tracing, and Analysis
Omnitrace is an AMD open source research project and is not supported as part of the ROCm software stack.
## Overview
AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems. If you are familiar with rocprof and/or uProf, you will find many of the capabilities of these tools available via Omnitrace in addition to many new capabilities.
Omnitrace is a comprehensive profiling and tracing tool for parallel applications written in C, C++, Fortran, HIP, OpenCL, and Python which execute on the CPU or CPU+GPU. It is capable of gathering the performance information of functions through any combination of binary instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks. Omnitrace supports interactive visualization of comprehensive traces in the web browser in addition to high-level summary profiles with mean/min/max/stddev statistics. In addition to runtimes, omnitrace supports the collection of system-level metrics such as the CPU frequency, GPU temperature, and GPU utilization, process-level metrics such as the memory usage, page-faults, and context-switches, and thread-level metrics such as memory usage, CPU time, and numerous hardware counters.
### Data Collection Modes
- Dynamic instrumentation
  - Runtime instrumentation
    - Instrument executable and shared libraries at runtime
  - Binary rewriting
    - Generate a new executable and/or library with instrumentation built-in
- Statistical sampling
  - Periodic software interrupts per-thread
- Process-level sampling
  - Background thread records process-, system- and device-level metrics while the application executes
- Causal profiling
  - Quantifies the potential impact of optimizations in parallel codes
- Critical trace generation
### Data Analysis
- High-level summary profiles with mean/min/max/stddev statistics
  - Low overhead, memory efficient
  - Ideal for running at scale
- Comprehensive traces
  - Every individual event/measurement
- Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
- Critical trace analysis (alpha)
### Parallelism API Support
- HIP
- HSA
- Pthreads
- MPI
- Kokkos-Tools (KokkosP)
- OpenMP-Tools (OMPT)
### GPU Metrics
- GPU hardware counters
- HIP API tracing
- HIP kernel tracing
- HSA API tracing
- HSA operation tracing
- System-level sampling (via rocm-smi)
  - Memory usage
  - Power usage
  - Temperature
  - Utilization
### CPU Metrics
- CPU hardware counters sampling and profiles
- CPU frequency sampling
- Various timing metrics
  - Wall time
  - CPU time (process and/or thread)
  - CPU utilization (process and/or thread)
  - User CPU time
  - Kernel CPU time
- Various memory metrics
  - High-water mark (sampling and profiles)
  - Memory page allocation
  - Virtual memory usage
- Network statistics
- I/O metrics
- ... many more
## Documentation
The full documentation for omnitrace is available at amdresearch.github.io/omnitrace. See the Getting Started documentation for general tips and a detailed discussion about sampling vs. binary instrumentation.
## Quick Start
### Installation
- Visit the Releases page
- Select the appropriate installer (recommendation: `.sh` scripts do not require super-user privileges, unlike the DEB/RPM installers)
  - If targeting a ROCm application, find the installer script with the matching ROCm version
  - If you are unsure about your Linux distro, check `/etc/os-release`
  - If no installer script matches your target OS, try one of the Ubuntu 18.04 `*.sh` installers
    - This installation may be built against older library versions supported on your distro via backwards compatibility
### Setup
NOTE: Replace `/opt/omnitrace` below with the installation prefix as necessary.

- Option 1: Source the `setup-env.sh` script

```bash
source /opt/omnitrace/share/omnitrace/setup-env.sh
```

- Option 2: Load the modulefile

```bash
module use /opt/omnitrace/share/modulefiles
module load omnitrace
```

- Option 3: Manual

```bash
export PATH=/opt/omnitrace/bin:${PATH}
export LD_LIBRARY_PATH=/opt/omnitrace/lib:${LD_LIBRARY_PATH}
```
### Omnitrace Settings
Generate an omnitrace configuration file using `omnitrace-avail -G omnitrace.cfg`. Optionally, use `omnitrace-avail -G omnitrace.cfg --all` for a verbose configuration file with descriptions, categories, etc. Modify the configuration file as desired, e.g., enable perfetto, timemory, sampling, and process-level sampling by default and tweak some sampling default values:
```
# ...
OMNITRACE_USE_PERFETTO = true
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_USE_SAMPLING = true
OMNITRACE_USE_PROCESS_SAMPLING = true
# ...
OMNITRACE_SAMPLING_FREQ = 50
OMNITRACE_SAMPLING_CPUS = all
OMNITRACE_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
```
Once the configuration file is adjusted to your preferences, either export the path to this file via `OMNITRACE_CONFIG_FILE=/path/to/omnitrace.cfg` or place this file in `${HOME}/.omnitrace.cfg` to ensure these values are always read as the default. If you wish to change any of these settings, you can override them via environment variables or by specifying an alternative `OMNITRACE_CONFIG_FILE`.
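The lookup order described above (an environment variable wins; otherwise the file named by `OMNITRACE_CONFIG_FILE`, else `${HOME}/.omnitrace.cfg`) can be sketched in Python. This is only an illustration: the `load_config` and `get_setting` helpers are hypothetical, the parser handles just the `KEY = value` and `#` comment lines from the example, and the `$env:NAME` expansion is an assumption modeled on the `OMNITRACE_SAMPLING_GPUS` line, not a description of omnitrace's actual parser.

```python
import os
import re

# Assumed syntax for referencing an environment variable inside a value.
_ENV_REF = re.compile(r"\$env:([A-Za-z_][A-Za-z0-9_]*)")


def load_config(path=None):
    """Parse `KEY = value` lines, ignoring blank lines and `#` comments."""
    if path is None:
        # OMNITRACE_CONFIG_FILE wins; otherwise fall back to ~/.omnitrace.cfg.
        path = os.environ.get("OMNITRACE_CONFIG_FILE",
                              os.path.expanduser("~/.omnitrace.cfg"))
    settings = {}
    if not os.path.isfile(path):
        return settings
    with open(path) as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            # Expand $env:NAME to the variable's value ("" if unset).
            settings[key] = _ENV_REF.sub(
                lambda m: os.environ.get(m.group(1), ""), value)
    return settings


def get_setting(name, config):
    """An environment variable overrides the config-file value."""
    return os.environ.get(name, config.get(name))
```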
### Call-Stack Sampling
The `omnitrace-sample` executable is used to execute call-stack sampling on a target application without binary instrumentation.
Use a double-hyphen (`--`) to separate the command-line arguments for `omnitrace-sample` from the target application and its arguments.
```bash
omnitrace-sample --help
omnitrace-sample <omnitrace-options> -- <exe> <exe-options>
omnitrace-sample -f 1000 -- ls -la
```
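The double-hyphen convention amounts to splitting the argument list at the first `--`: everything before it belongs to the tool, and everything after it is the target command line, passed through untouched. A minimal Python sketch (the `split_command_line` name is hypothetical; this is not the actual argument parser of `omnitrace-sample`):

```python
def split_command_line(argv):
    """Split argv at the first '--' into (tool_args, target_command)."""
    if "--" not in argv:
        return list(argv), []
    i = argv.index("--")
    return list(argv[:i]), list(argv[i + 1:])


tool_args, target = split_command_line(["-f", "1000", "--", "ls", "-la"])
```

Only the first `--` is significant in this sketch, so the target application can still receive a `--` of its own.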
### Binary Instrumentation
The `omnitrace` executable is used to instrument an existing binary. Call-stack sampling can be enabled alongside the execution of an instrumented binary, to help "fill in the gaps" between the instrumentation, by setting the `OMNITRACE_USE_SAMPLING` configuration variable to `ON`.

Similar to `omnitrace-sample`, use a double-hyphen (`--`) to separate the command-line arguments for `omnitrace` from the target application and its arguments.
```bash
omnitrace --help
omnitrace <omnitrace-options> -- <exe-or-library> <exe-options>
```
#### Binary Rewrite
Rewrite the text section of an executable or library with instrumentation:
```bash
omnitrace -o app.inst -- /path/to/app
```
In binary rewrite mode, if you also want instrumentation in the linked libraries, you must also rewrite those libraries.
Example of rewriting the functions starting with "hip" with instrumentation in the amdhip64 library:
```bash
mkdir -p ./lib
omnitrace -R '^hip' -o ./lib/libamdhip64.so.4 -- /opt/rocm/lib/libamdhip64.so.4
export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}
```
Verify via `ldd` that your executable will load the instrumented library. If you built your executable with an RPATH to the original library's directory, then prefixing `LD_LIBRARY_PATH` will have no effect.
Once you have rewritten your executable and/or libraries with instrumentation, you can run the instrumented executable (or the executable which loads the instrumented libraries) as usual, e.g.:
```bash
./app.inst
```
If you want to redefine certain settings to new defaults in a binary rewrite, use the `--env` option. This omnitrace option will set the environment variable to the given value but will not override it if it is already set. For example, the default value of `OMNITRACE_PERFETTO_BUFFER_SIZE_KB` is 1024000 KB (~1 GB):
```bash
# buffer size defaults to 1024000
omnitrace -o app.inst -- /path/to/app
./app.inst
```
Passing `--env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000` will change the default value in `app.inst` to 5120000 KB (~5 GB):
```bash
# defaults to ~5 GB buffer size
omnitrace -o app.inst --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 -- /path/to/app
./app.inst

# override the ~5 GB default buffer size with ~200 MB
export OMNITRACE_PERFETTO_BUFFER_SIZE_KB=200000
./app.inst
```
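The "set, but do not override" behavior of `--env` matches the semantics of Python's `dict.setdefault` applied to the environment. The sketch below models it with a hypothetical `apply_embedded_defaults` helper; how the rewritten binary actually applies its embedded values is not described here and may differ.

```python
import os


def apply_embedded_defaults(defaults, environ=os.environ):
    """Apply baked-in defaults without overriding user-set variables."""
    for key, value in defaults.items():
        environ.setdefault(key, value)  # no-op if the user already set it
    return environ


# The value passed via --env in the example above.
embedded = {"OMNITRACE_PERFETTO_BUFFER_SIZE_KB": "5120000"}
```

Applied to an empty environment, the embedded default is used; with `OMNITRACE_PERFETTO_BUFFER_SIZE_KB=200000` already exported, the exported value is kept.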
#### Runtime Instrumentation
Runtime instrumentation will not only instrument the text section of the executable but also the text sections of the linked libraries. Thus, it may be useful to exclude those libraries via the `-ME` (module exclude) regex option or exclude specific functions with the `-E` regex option.
```bash
omnitrace -- /path/to/app
omnitrace -ME '^(libhsa-runtime64|libz\.so)' -- /path/to/app
omnitrace -E 'rocr::atomic|rocr::core|rocr::HSA' -- /path/to/app
```
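Before a long instrumentation run, it can help to test a `-ME` or `-E` pattern offline against candidate names. This sketch uses Python's `re.search`, which is an assumption: omnitrace's regex dialect is not specified here and may treat some constructs differently. The module and function names are made up for illustration.

```python
import re

module_exclude = r"^(libhsa-runtime64|libz\.so)"
function_exclude = r"rocr::atomic|rocr::core|rocr::HSA"

# Hypothetical module and function names to test the patterns against.
modules = ["libhsa-runtime64.so.1", "libz.so.1", "libzstd.so.1", "app"]
functions = ["rocr::core::Runtime::Load", "rocr::AMD::GpuAgent::Init", "main"]

excluded_modules = [m for m in modules if re.search(module_exclude, m)]
excluded_functions = [f for f in functions if re.search(function_exclude, f)]
```

Here `libzstd.so.1` survives the module pattern because `libz\.so` requires a literal dot right after `libz`.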
### Python Profiling and Tracing
Use the `omnitrace-python` script to profile/trace Python interpreter function calls.
Use a double-hyphen (`--`) to separate the command-line arguments for `omnitrace-python` from the target script and its arguments.
```bash
omnitrace-python --help
omnitrace-python <omnitrace-options> -- <python-script> <script-args>
omnitrace-python -- ./script.py
```
Please note, the first argument after the double-hyphen must be a Python script, e.g. `omnitrace-python -- ./script.py`.

If you need to use a specific Python interpreter version, use `omnitrace-python-X.Y` where `X.Y` is the Python major and minor version:
```bash
omnitrace-python-3.8 -- ./script.py
```
If you need to specify the full path to a Python interpreter, set the `PYTHON_EXECUTABLE` environment variable:

```bash
PYTHON_EXECUTABLE=/opt/conda/bin/python omnitrace-python -- ./script.py
```
If you want to restrict the data collection to specific function(s) and their callees, pass the `-b` / `--builtin` option after decorating the function(s) with `@profile`. Use the `@noprofile` decorator for excluding/ignoring function(s) and their callees:
```python
def foo():
    pass


@noprofile
def bar():
    foo()


@profile
def spam():
    foo()
    bar()
```
Each time spam is called during profiling, the profiling results will include 1 entry for spam and 1 entry
for foo via the direct call within spam. There will be no entries for bar or the foo invocation within it.
### Trace Visualization
- Visit ui.perfetto.dev in the web browser
- Select "Open trace file" from the panel on the left
- Locate the omnitrace perfetto output (extension: `.proto`)
## Using Perfetto tracing with System Backend
Perfetto tracing with the system backend supports multiple processes writing to the same output file. Thus, it is a useful technique if Omnitrace is built with partial MPI support because all the perfetto output will be coalesced into a single file. See the Perfetto documentation for installation instructions.
If you are building omnitrace from source, you can configure CMake with `OMNITRACE_INSTALL_PERFETTO_TOOLS=ON` and the `perfetto` and `traced` applications will be installed as part of the build process. However, to prevent this option from accidentally overwriting an existing perfetto install, all the perfetto executables installed by omnitrace are prefixed with `omnitrace-perfetto-`, except for the `perfetto` executable, which is just renamed `omnitrace-perfetto`.
Enable traced and perfetto in the background:
```bash
pkill traced
traced --background
perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/share/omnitrace.cfg --background
```
NOTE: if the perfetto tools were installed by omnitrace, replace `traced` with `omnitrace-perfetto-traced` and `perfetto` with `omnitrace-perfetto`.
Configure omnitrace to use the perfetto system backend:
```bash
export OMNITRACE_PERFETTO_BACKEND=system
```
And finally, execute your instrumented application. Either the binary rewritten application:
```bash
omnitrace -o ./myapp.inst -- ./myapp
./myapp.inst
```
Or with runtime instrumentation:
```bash
omnitrace -- ./myapp
```



