Jonathan R. Madsen 778af2a760 Sampling support + testing + omnitrace namespace (#19)
* omnitrace namespace

* Kokkos + Lulesh example/tests

* Sampling support + more

- OMNITRACE_BUILD_TESTING option
- sampling support
- pthread_gotcha
- fixes to labels for mpi_gotcha, fork_gotcha, omnitrace_component
- tasking::block_signals, tasking::unblock_signals
- instrumentation mode option in omnitrace exe
- argument option groups in omnitrace exe
- categories in omnitrace settings
- remove TIMEMORY_ prefixed options

* Release workflow updates

* Updated settings printing

* Fixed defaults in README

* Tweak setting defaults in README

* CMake fixes

* cmake-format

* clang-format

* LULESH_USE_MPI OFF

* LULESH_USE_MPI fix

* timemory add_secondary fix

* timemory ambiguous internal namespace fix

* Update timemory submodule

* Handle output path/prefix in omnitrace

- updated timemory
- updated test environment

* sampling + papi fix

* Fix to sampling without PAPI

* Fix for using too many processors in CI

* formatting

* Updated CI

- minor cmake tweaks
- updated timemory submodule

* Updated CI

* Updated CI

* CI + timemory updates

- data race fixes

* CI updates + debug for sampling

* Sampling updates

- moved tasking::{block,unblock}_signals to sampling namespace
- improvements to sampling w.r.t. thread-locality

* Minimum OMNITRACE_THREAD_COUNT of 128

* Handle multiple dims in sampler data

* Configure libunwind support for timemory

* Improved safeguards for sampling

- updated CI
- lulesh runtime-instrument test tweak

* formatting

* CI updates + sampler updates + misc

- fixed stack-buffer-overflow in omnitrace (get_*file_line_info)
- test labels
- steady_clock instead of system_clock in sampler
- update dyninst submodule with upgradePlaceholder fix
- disable OMNITRACE_BUILD_TESTING by default

* Updated timemory submodule

- hidden visibility for timemory
- storage finalizers do not capture this

* Update timemory submodule

- component visibility updates

* Reworked header includes

- use <...> for timemory headers
- always include <library/defines.hpp>

* Rename some config options

* Update PTL submodule

* Update kokkos submodule

* Updated sampling

* Updated CI

* Reworked instrumentation exe

- lowered min-address-range threshold to 256
- extended whole function exclude

* CI fix + timemory submodule update

- TIMEMORY_VISIBLE on component base
- RelWithDebugInfo -> RelWithDebInfo
- Info output for parallel-overhead

* Sampling flags + transpose update + CI update

- disable critical trace for parallel-overhead in CI
- SA_RESTART only in sampler
- reworked transpose example to use fewer threads

* CI update

- removed ubuntu-focal-external-debug
- reduced data artifacts upload

* CI timeouts

- updated timemory submodule
- minor tweaks to omnitrace exe logging

* LICENSE updates (partial)

* CI Test stage timeout extension

* Docker and Packaging updates

* Miscellaneous fixes/tweaks

- gpu.hpp / gpu.cpp
- disable roctracer component if no devices
- re-enable InstrStackFrames by default
- disable sampling by default
- pthread_gotcha::m_enable_sampling is false by default
- timemory submodule update w/ sampler and pop(tid) updates
- fix minor bug in sampler logic
- CMake: OMNITRACE_USE_HIP option
- roctracer + timemory fix

* Replaced OMNITRACE_USE_ROCTRACER with OMNITRACE_USE_HIP where appropriate

* cmake format

* Sampler deadlock fixes

* Removed debug messages from sampler

* Fix for MPI detection + test tweaks + misc

* Sampler deadlock fixes + misc

- removed papi_tot_ins
- pthread_gotcha blocks signals globally until sampler is setup
- metadata specialization for sampling components
- OMNITRACE_INSTRUMENTATION_MODE -> OMNITRACE_MODE
- default sampling delay increased to 0.05 from 1.0e-6
- removed {block,unblock}_signals from critical_trace and ptl
    - no longer necessary to use
- sampling delay minimum is 1.0e-3
- OMNITRACE_BUILD_HIDDEN_VISIBILITY

* omnitrace-avail + libunwind update + restructure

- restructured omnitrace components
- build custom omnitrace-avail executable
- updated libunwind to avoid malloc in get_unw_backtrace

* Fix remaining reorganization issues

- removed some duplicate code
- fixed some trait specializations after implicit instantiation
- formatting

* ensure_storage fix + avail improvements

- fix ensure_storage when component not avail
- suppress irrelevant info in omnitrace-avail

* Delay settings initialization

- slight tweak to tests w/ MPI

* Disable OpenMPI testing w/ ubuntu-bionic

- MPI testing is hanging bc of network interface issue on system:

> [[20462,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
> Module: OpenFabrics (openib)
>   Host: fv-az19-371
> Another transport will be used instead, although this may result in
> lower performance.
> NOTE: You can disable this warning by setting the MCA parameter
> btl_base_warn_component_unused to 0.
2022-01-24 20:49:17 -06:00

omnitrace: application tracing with static/dynamic binary instrumentation

Dependencies

  • DynInst for dynamic or static instrumentation
  • Julia for merging perfetto traces

Installing DynInst

The easiest way to install Dyninst is via spack:

git clone https://github.com/spack/spack.git
source ./spack/share/spack/setup-env.sh
spack compiler find
spack external find
spack install dyninst
spack load -r dyninst

Installing Julia

Julia is available via Linux package managers or may be available via a module. Debian-based distributions such as Ubuntu can run (as a super-user):

apt-get install julia

Once Julia is installed, install the necessary packages (this operation only needs to be performed once):

julia -e 'using Pkg; for name in ["JSON", "DataFrames", "Dates", "CSV", "Chain", "PrettyTables"]; Pkg.add(name); end'

Installing omnitrace

OMNITRACE_ROOT=${HOME}/sw/omnitrace
git clone https://github.com/AARInternal/omnitrace-dyninst.git
cmake -B build-omnitrace -DOMNITRACE_USE_MPI=ON -DCMAKE_INSTALL_PREFIX=${OMNITRACE_ROOT} omnitrace-dyninst
cmake --build build-omnitrace --target all --parallel 8
cmake --build build-omnitrace --target install
export PATH=${OMNITRACE_ROOT}/bin:${PATH}
export LD_LIBRARY_PATH=${OMNITRACE_ROOT}/lib64:${OMNITRACE_ROOT}/lib:${LD_LIBRARY_PATH}
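
After updating PATH, a quick check along the following lines can confirm that the shell resolves the omnitrace executable. This is only a sketch: it assumes the install prefix used above and falls back to that same path when OMNITRACE_ROOT is not set in the current shell. The install_status variable is an illustrative name, not part of omnitrace.

```shell
# Assumes the install prefix from the steps above; falls back to the
# same default path if OMNITRACE_ROOT is not set in this shell.
OMNITRACE_ROOT="${OMNITRACE_ROOT:-${HOME}/sw/omnitrace}"

if command -v omnitrace >/dev/null 2>&1; then
    # The executable is reachable through PATH
    echo "omnitrace found at: $(command -v omnitrace)"
    install_status="found"
else
    # PATH was not updated, or the install step did not complete
    echo "omnitrace not found on PATH; expected it under ${OMNITRACE_ROOT}/bin"
    install_status="missing"
fi
```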

Using Omnitrace Executable

omnitrace --help
omnitrace <omnitrace-options> -- <exe-or-library> <exe-options>

Omnitrace Library Environment Settings

| Environment Variable | Default Value | Description |
|---|---|---|
| OMNITRACE_USE_PERFETTO | false | Enable perfetto backend |
| OMNITRACE_USE_PID | true | Enable tagging filenames with process identifier (either MPI rank or pid) |
| OMNITRACE_USE_ROCTRACER | true | Enable ROCM tracing |
| OMNITRACE_USE_SAMPLING | true | Enable statistical sampling of call-stack |
| OMNITRACE_USE_TIMEMORY | false | Enable timemory backend |
| OMNITRACE_BACKEND | inprocess | Specify the perfetto backend to activate. Options are: 'inprocess', 'system', or 'all' |
| OMNITRACE_BUFFER_SIZE_KB | 1024000 | Size of perfetto buffer (in KB) |
| OMNITRACE_COUT_OUTPUT | false | Write output to stdout |
| OMNITRACE_CRITICAL_TRACE | false | Enable generation of the critical trace |
| OMNITRACE_CRITICAL_TRACE_BUFFER_COUNT | 2000 | Number of critical trace records to store in thread-local memory before submitting to shared buffer |
| OMNITRACE_CRITICAL_TRACE_COUNT | 0 | Number of critical traces to export (0 == all) |
| OMNITRACE_CRITICAL_TRACE_DEBUG | false | Enable debugging for critical trace |
| OMNITRACE_CRITICAL_TRACE_NUM_THREADS | 8 | Number of threads to use when generating the critical trace |
| OMNITRACE_CRITICAL_TRACE_PER_ROW | 0 | How many critical traces per row in perfetto (0 == all in one row) |
| OMNITRACE_CRITICAL_TRACE_SERIALIZE_NAMES | false | Include names in serialization of critical trace (mainly for debugging) |
| OMNITRACE_DIFF_OUTPUT | false | Generate a difference output vs. a pre-existing output (see also: TIMEMORY_INPUT_PATH and TIMEMORY_INPUT_PREFIX) |
| OMNITRACE_FLAT_SAMPLING | false | Ignore hierarchy in all statistical sampling entries |
| OMNITRACE_INSTRUMENTATION_INTERVAL | 1 | Instrumentation only takes measurements once every N function calls (not statistical) |
| OMNITRACE_JSON_OUTPUT | true | Write json output files |
| OMNITRACE_MEMORY_PRECISION | -1 | Set the precision for components with 'is_memory_category' type-trait |
| OMNITRACE_MEMORY_SCIENTIFIC | false | Set the numerical reporting format for components with 'is_memory_category' type-trait |
| OMNITRACE_MEMORY_UNITS | "" | Set the units for components with 'uses_memory_units' type-trait |
| OMNITRACE_OUTPUT_FILE | "" | Perfetto filename |
| OMNITRACE_OUTPUT_PATH | omnitrace-{EXE}-output | Explicitly specify the output folder for results |
| OMNITRACE_OUTPUT_PREFIX | "" | Explicitly specify a prefix for all output files |
| OMNITRACE_PRECISION | -1 | Set the global output precision for components |
| OMNITRACE_ROCTRACER_FLAT_PROFILE | false | Ignore hierarchy in all kernel entries with timemory backend |
| OMNITRACE_ROCTRACER_HSA_ACTIVITY | false | Enable HSA activity tracing support |
| OMNITRACE_ROCTRACER_HSA_API | false | Enable HSA API tracing support |
| OMNITRACE_ROCTRACER_HSA_API_TYPES | "" | HSA API type to collect |
| OMNITRACE_ROCTRACER_TIMELINE_PROFILE | false | Create unique entries for every kernel with timemory backend |
| OMNITRACE_SAMPLING_DELAY | 1e-06 | Number of seconds to delay activating the statistical sampling |
| OMNITRACE_SAMPLING_FREQ | 10 | Number of software interrupts per second when OMNITRACE_USE_SAMPLING=ON |
| OMNITRACE_SCIENTIFIC | false | Set the global numerical reporting to scientific format |
| OMNITRACE_SETTINGS_DESC | false | Provide descriptions when printing settings |
| OMNITRACE_SHMEM_SIZE_HINT_KB | 40960 | Hint for shared-memory buffer size in perfetto (in KB) |
| OMNITRACE_TEXT_OUTPUT | true | Write text output files |
| OMNITRACE_TIMELINE_SAMPLING | false | Create unique entries for every sample when statistical sampling is enabled |
| OMNITRACE_TIMEMORY_COMPONENTS | wall_clock | List of components to collect via timemory (see timemory-avail) |
| OMNITRACE_TIME_FORMAT | %F_%I.%M_%p | Customize the folder generation when OMNITRACE_TIME_OUTPUT is enabled (see also: strftime) |
| OMNITRACE_TIME_OUTPUT | true | Output data to subfolder w/ a timestamp (see also: OMNITRACE_TIME_FORMAT) |
| OMNITRACE_TIMING_PRECISION | 6 | Set the precision for components with 'is_timing_category' type-trait |
| OMNITRACE_TIMING_SCIENTIFIC | false | Set the numerical reporting format for components with 'is_timing_category' type-trait |
| OMNITRACE_TIMING_UNITS | "" | Set the units for components with 'uses_timing_units' type-trait |
| OMNITRACE_TREE_OUTPUT | true | Write hierarchical json output files |

Example Omnitrace Instrumentation

Binary Rewrite

Rewrite the text section of an executable or library with instrumentation:

omnitrace -o app.inst -- /path/to/app

In binary rewrite mode, if you also want instrumentation in the linked libraries, you must also rewrite those libraries. Example of rewriting the functions starting with "hip" with instrumentation in the amdhip64 library:

mkdir -p ./lib
omnitrace -R '^hip' -o ./lib/libamdhip64.so.4 -- /opt/rocm/lib/libamdhip64.so.4
export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}

NOTE: Verify via ldd that your executable will load the instrumented library -- if you built your executable with an RPATH to the original library's directory, then prefixing LD_LIBRARY_PATH will have no effect.
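
The ldd check mentioned in the note can be wrapped in a small helper such as the one below. This is a sketch, not part of omnitrace: resolved_lib is a hypothetical function name, and it relies on the usual "name => path (address)" layout of ldd output.

```shell
# resolved_lib <binary> <library-name-pattern>
# Prints the path that the dynamic loader resolves for each matching library.
resolved_lib() {
    ldd "$1" | awk -v pat="$2" '$1 ~ pat && $2 == "=>" { print $3 }'
}

# Hypothetical usage with the names from the example above:
#   resolved_lib ./app.inst libamdhip64
# If the instrumented copy is picked up, the printed path begins with ${PWD}/lib.
```

If the printed path still points into /opt/rocm/lib, the RPATH of the executable is likely taking precedence over LD_LIBRARY_PATH.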

Once you have rewritten your executable and/or libraries with instrumentation, run the instrumented executable (or the executable that loads the instrumented libraries) as usual, e.g.:

./app.inst

If you want to redefine the default value of certain settings in a binary rewrite, use the --env option. This omnitrace option embeds the given value as the default for the environment variable, but a value already set in the environment at run time still takes precedence. E.g. the default value of OMNITRACE_BUFFER_SIZE_KB is 1024000 KB (approximately 1 GB):

# buffer size defaults to 1024000
omnitrace -o app.inst -- /path/to/app
./app.inst

Passing --env OMNITRACE_BUFFER_SIZE_KB=5120000 will change the default value in app.inst to 5120000 KB (approximately 5 GB):

# defaults to a buffer size of approximately 5 GB
omnitrace -o app.inst --env OMNITRACE_BUFFER_SIZE_KB=5120000 -- /path/to/app
./app.inst
# override the embedded default with a 200 MB buffer size
export OMNITRACE_BUFFER_SIZE_KB=200000
./app.inst
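
The precedence between an embedded --env default and an exported variable resembles the shell's own default-value expansion. The sketch below only illustrates that precedence; it is not how omnitrace implements it, and effective_buffer_size is a hypothetical helper.

```shell
# Illustration only: "use the embedded default unless the environment
# already defines the variable".
effective_buffer_size() {
    # $1 stands in for the default embedded via --env
    echo "${OMNITRACE_BUFFER_SIZE_KB:-$1}"
}

unset OMNITRACE_BUFFER_SIZE_KB
default_case=$(effective_buffer_size 5120000)    # nothing set in the environment

OMNITRACE_BUFFER_SIZE_KB=200000
override_case=$(effective_buffer_size 5120000)   # the environment value wins

echo "without override: ${default_case}"
echo "with override:    ${override_case}"
```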

Runtime Instrumentation

Runtime instrumentation will not only instrument the text section of the executable but also the text sections of the linked libraries. Thus, it may be useful to exclude those libraries via the -ME (module exclude) regex option.

omnitrace -- /path/to/app
omnitrace -ME '^(libhsa-runtime64|libz\.so)' -- /path/to/app
omnitrace -E 'rocr::atomic|rocr::core|rocr::HSA' -- /path/to/app
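
Before passing a module-exclude pattern to omnitrace, it can help to try it against a list of library names. The sketch below uses grep -E as a stand-in, with the dot escaped by a single backslash; the regex engine inside omnitrace may differ in details, and the module names are made up for illustration.

```shell
# Candidate module names (illustrative only)
modules="libhsa-runtime64.so.1
libz.so.1
libm.so.6
libamdhip64.so.4"

# Keep only the modules that the exclude pattern does NOT match,
# i.e. the ones that would still be instrumented.
kept=$(printf '%s\n' "$modules" | grep -Ev '^(libhsa-runtime64|libz\.so)')
echo "$kept"
```

Here libm.so.6 and libamdhip64.so.4 remain, since neither name starts with one of the excluded prefixes.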

Miscellaneous Features and Caveats

  • You may need to increase the default perfetto buffer size (approximately 1 GB) to capture all the information
    • E.g. export OMNITRACE_BUFFER_SIZE_KB=10240000 increases the buffer size to approximately 10 GB
  • The omnitrace library has various settings which can be configured via environment variables. You can give these settings custom defaults with the omnitrace command-line tool via the --env option
    • E.g. to default to a buffer size of approximately 5 GB, use --env OMNITRACE_BUFFER_SIZE_KB=5120000
    • This is particularly useful in binary rewrite mode
  • Perfetto tooling is enabled by default
  • Timemory tooling is disabled by default
  • Enabling or disabling one of the aforementioned tools without explicitly setting the other causes the other to assume the inverse state, e.g.
    • OMNITRACE_USE_PERFETTO=OFF yields the same result as OMNITRACE_USE_TIMEMORY=ON
    • OMNITRACE_USE_PERFETTO=ON yields the same result as OMNITRACE_USE_TIMEMORY=OFF
    • In order to enable both timemory and perfetto, set both OMNITRACE_USE_TIMEMORY=ON and OMNITRACE_USE_PERFETTO=ON
    • Setting OMNITRACE_USE_TIMEMORY=OFF and OMNITRACE_USE_PERFETTO=OFF will disable all instrumentation
  • Use timemory-avail -S to view the various settings for timemory
  • Set OMNITRACE_TIMEMORY_COMPONENTS="<comma-delimited-list-of-component-name>" to control which components timemory collects
    • The list of components and their descriptions can be viewed via timemory-avail -Cd
    • The list of components and their string identifiers can be viewed via timemory-avail -Cbs
  • You can filter any timemory-avail results via -r <regex> -hl

Omnitrace Output

omnitrace will create an output directory named omnitrace-<EXE_NAME>-output, e.g. if your executable is named app.inst, the output directory will be omnitrace-app.inst-output. If OMNITRACE_TIME_OUTPUT=ON (the default when perfetto is enabled), the results are placed in a subdirectory named with the date and time, e.g. 2021-09-02_01.03_PM. Within this directory, perfetto files are named perfetto-trace.<PID>.proto or, when omnitrace was built with MPI support (OMNITRACE_USE_MPI=ON), perfetto-trace.<RANK>.proto.
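
To predict where the results of a run will land, the directory name can be reproduced in the shell. The sketch below assumes the default time format listed in the settings table (%F_%I.%M_%p) and an executable named app.inst; the variable names are illustrative.

```shell
exe_name="app.inst"                   # example executable name from the text
stamp=$(date '+%F_%I.%M_%p')          # same layout as 2021-09-02_01.03_PM
output_dir="omnitrace-${exe_name}-output/${stamp}"
echo "${output_dir}"
```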

You can explicitly control the output path and naming scheme of the files via the OMNITRACE_OUTPUT_FILE environment variable. The special character sequences %pid% and %rank% will be replaced with the PID or MPI rank, respectively.
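
The placeholder substitution described above can be previewed with sed. This is only a sketch of the described replacement, not omnitrace's implementation; expand_output_file, the template, and the PID and rank values are hypothetical.

```shell
# expand_output_file <template> <pid> <rank>
# Replaces every %pid% and %rank% in the template with the given values.
expand_output_file() {
    printf '%s\n' "$1" | sed -e "s/%pid%/$2/g" -e "s/%rank%/$3/g"
}

# Hypothetical template and values for illustration
expanded=$(expand_output_file "traces/run-%pid%.proto" 12345 0)
echo "$expanded"
```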

Merging the traces from rocprof and omnitrace

NOTE: Using rocprof externally for tracing is deprecated. The current version has built-in support for recording the GPU activity and HIP API calls. If you want to use an external rocprof, either configure CMake with -DOMNITRACE_USE_ROCTRACER=OFF or explicitly set TIMEMORY_ROCTRACER_ENABLED=OFF in the environment.

Use the omnitrace-merge.jl Julia script to merge rocprof and perfetto traces.

export TIMEMORY_ROCTRACER_ENABLED=OFF
rocprof --hip-trace --roctx-trace --stats ./app.inst
omnitrace-merge.jl results.json omnitrace-app.inst-output/2021-09-02_01.03_PM/*.proto

Use Perfetto tracing with System Backend

In a separate window run:

pkill traced
traced --background
perfetto --out ./htrace.out --txt -c ${OMNITRACE_ROOT}/share/roctrace.cfg

Then, in the window running the application, configure the omnitrace instrumentation to use the system backend:

export OMNITRACE_BACKEND=system

For the merge, use htrace.out:

omnitrace-merge.jl results.json htrace.out