Jonathan R. Madsen 9de3a6b0b4 Linux Perf Support + Causal Profiling Updates (#276 )

* causal backtrace updates

- fix initial causal sampling period value

* causal delay updates

- tweak handling of sleep_for_overhead

* Fix experiment global scaling for prog pts

- results in drastically improved predictions

* pthread_mutex_gotcha updates

- disable all wrappers during causal profiling

* validate-causal-json.py updates

- support decimal stddev
- fix setting stddev from command-line

* causal perform_experiment_impl update

- handle start failing because finalizing

* deprecate causal::component::sample_rate

- appears to not help at all

* Rework sample info

* Increase causal unwind_depth

- use OMNITRACE_MAX_UNWIND_DEPTH

* validate-causal-json updates

- min experiments
  - exclude reporting predictions with less than X experiments at a given speedup
- percent samples
  - only print samples within X% of the peak (default: 95%)

* Update timemory submodule

- extensions to sampling for signals delivered via non-timer method
  - e.g. via HW counter overflow

* dwarf_entry::operator< updates

- sort via file

* causal profiling docs updates

- info about backends
- info about installing/enabling perf

* config updates: causal backend

- CausalBackend enum
- OMNITRACE_CAUSAL_BACKEND: perf, timer, auto
- omnitrace-causal option: --backend

* debug update

- use spin_mutex instead of std::mutex

* address_range::contains update

- range from 0-100 contains range from 10-100 but was returning false because high was == 100 not < 100

* symbol::operator< update

- handle load address differences

* sampling updates (non-causal)

- update get_timer to get_trigger + dynamic_cast

* container::static_vector updates

- support construction from container::c_array
- update_size private member func for handling atomic m_size

* Move perf files

- moved library/causal/perf.{hpp,cpp} to library/perf.{hpp,cpp}

* causal example update

- created impl.hpp (forward decls)
- renamed {cpu,rng}_func_impl to {cpu,rng}_impl_func
- only create two threads which run N iterations instead of two threads each iteration

* Update timemory submodule

- updates to unwind::processed_entry
- updates to procfs::maps

* Updated causal documentation

- fixed line numbers changed by modifications to causal example

* omnitrace-causal exe updates

- set OMNITRACE_THREAD_POOL_SIZE to zero by default

* core/containers updates

- static_vector: provide data() member function
- c_array pop_front() and pop_back() member functions

* core: config and argparse updates + perf

- core/perf.{hpp,cpp}
  - forward decl of enums
  - config-related capabilities
- argparse: --sample-overflow
- renamed some config functions
  - e.g. get_sampling_cpu_freq -> get_sampling_cputime_freq
- added config settings related to overflow sampling via perf
- added timer_sampling and overflow_sampling categories

* Update timemory submodule

- sampling allocator flushing

* binary updates

- lookup_ipaddr_entry
- use bfd_find_nearest_line instead of bfd_find_nearest_line_discriminator
  - discriminators are not used
- explicit instantiations of inlined_symbol::serialize

* Bump VERSION to 1.10.0

* sampling and perf updates

- support overflow sampling via Linux Perf
- update perf namespace
- update perf::perf_event
  - update record ctor: pointer instead of const ref
  - update open member func: return optional string
  - add m_batch_size member variable
- sampling updates
  - support overflow sampling
  - flush allocators
  - increase buffer size from 1024 to 2048
  - restructure post-processing in light of perf overflow supports
  - improve offload memory usage only load buffers for thread
  - load_offload_buffer(tid) uses thread-specific filepos
- component updates
  - backtrace_metrics::operator-=
  - backtrace_metrics::operator-
  - backtrace::sample does not record for overflow signal
  - callchain: perf overflow sample

* core updates

- component::sampling_percent does not report self + uses_percent_units

* causal updates

- tweak get_line_info
- overloads for set_current_selection (uint64_t, c_array, std::array)
- delay
  - use sampling::pause/sampling::resume
- experiment
  - experiment::sample derives from unwind::processed_entry
  - experiment::samples is vector instead of set
  - fixed samples
  - overloads for is_selected (uint64_t, c_array, std::array)
  - scaling factor defaults to 100 instead of 50
  - serialize updates follow change to experiment::sample
  - modify algorithm for increasing/decreasing experiment length
- sample_data
  - use map<uintptr, uint64_t> instead of set<sample_data>
  - get_samples returns vector<sample_data> instead of set<sample_data>
- sampling
  - support overflow via Linux Perf
  - update causal_offload_buffer
  - flush sampling allocator
- backtrace
  - overflow component

* libomnitrace-dl updates

- handle dl::InstrumentMode::PythonProfile

* testing updates (causal)

- causal line 155 -> causal line 100
- causal line 165 -> causal line 110

* formatting

* exit_gotcha updates

- exit_info for abort()
- message about non-zero exit code

* testing updates

- fail regex for causal tests
- validate-causal-json: >= min_experiments instead of > min_experiments
- handle OMNITRACE_DEBUG_SETTINGS in omnitrace_write_test_config

* causal sampling updates

- add new lines where appropriate

* causal data updates

- reorder diagnostic info when experiment fails to start

* binary updates

- symbol address range from address to address + symsize + 1
  - add 1 based on debug info

* causal data updates

- sample_selection wait_ns defaults to 1,000 instead of 10,000
- sample_selection wait scaled by iteration number
- save_line_info_impl verbosity
- print latest_eligible_pc when experiment does not start

* causal sampling + component updates

- perf backend disables component::backtrace
- ensure get_sampling_(realtime|cputime|overflow)_signal do not malloc

* causal: remove period stats

* validate-causal-json update

- fix --help

* causal data updates

- improve eligible pc history reporting when experiment fails to start

* causal data updates

- fix compute_eligible_lines_impl
  - eligible address ranges returning too many ranges
  - occasionally, overwrite all *true* eligible address ranges

* causal data updates

- reduce scoped ranges to symbol ranges
- is_eligible_address() returns true contains (not just coarse)
- revert some sample_selection behavior

* binary address_multirange updates

- make coarse_range private
- fix operator+=(pair<coarse, uintptr_t>)

* causal example update

- fix nsync to default to once per iteration

* binary analysis updates

- tweak header file includes

* causal updates

- remove factoring in sleep_for_overhead
- invoke delay::process() even if experiment is not active

* causal data updates

- update latest_eligible_pc structure

* update omnitrace-install.py.in

- fix support for fedora
  - /etc/os-release does not have ID_LIKE
  - fallback to RHEL 8.7 if version not specified

* update omnitrace-install.py.in

- fix support for debian
  - /etc/os-release does not have ID_LIKE
  - version mapping

* Update documentation

- update docs on installation

* causal data and experiment updates

- data: reset_sample_selection

* causal set_current_selection debugging

- debug messages for failed e2e runs

* causal data and backtrace component updates

- data: set_current_selection returns the number of eligible addresses added
- backtrace: if cputime signal has selected zero IPs > 5x, then realtime signal starts contributing call-stacks

* core library updates

- move config::parse_numeric_range to utility namespace
- add core/utility.cpp
- support range:increment, e.g. 5-25:10 expands to '5 15 25' instead of '5 10 15 20 25'

* omnitrace-causal update

- end-to-end expands all speedups
- support range:increment in speedups

* causal backtrace updates

- remove select_ival (realtime signal always contributes when select_count == 0)

* containers: static_vector update

- explicit c_array constructor
- explicit std::array constructor

* causal data updates

- remove set_current_selection(uint64_t)
- remove set_current_selection(std::array)
- sample_selection increase default wait time
- report eligible PC candidates
- move reset_sample_selection to perform_experiment_impl
- decrease latest_eligible_pc array size
- set_current_selection does not guard for experiment::active

* core debug updates

- OMNITRACE_PRINT_COLOR macros

* causal data updates

- tweak to experiment never started message

* causal gotcha updates

- remove unused code

* critical trace updates

- remove unused code

* omnitrace-causal

- OMNITRACE_LAUNCHER

* causal data updates

- don't fail on end-to-end + omnitrace-causal

* causal backtrace updates

- reintroduce select_ival behavior

* causal data updates

- tweak verbose messages about number of PC candidates

* core mproc updates

- utilities for waiting on child PID and diagnosing status
  - omnitrace::mproc::wait_pid
  - omnitrace::mproc::diagnose_status

* omnitrace-run updates

- support --fork argument for executing via fork in current process + execvpe on child instead of execvpe in current process

* omnitrace-causal updates

- wait_pid and diagnose_status just call equivalent functions in omnitrace::mproc

* ubuntu-focal workflow update

- attempt to launch ubuntu-focal-codecov job with CAP_SYS_ADMIN and use perf backend

* tests reorg and updates

- remove binary-rewrite-sampling and runtime-instrument-sampling tests
- rename *-preload tests (which use omnitrace-sample exe) to *-sampling
- split tests/CMakeLists.txt into several tests/omnitrace-<category>-tests.cmake files
- tweak to causal-both-omni-func test
  - add args: -n 2 -b timer

* update validate-causal-json.py

- better reasoning info for adjusting tolerance
- always apply tolerance adjustments in CI mode

* causal e2e tests update

- add label "causal-e2e" label
- tweak params
  - old: 80 12 432525 500000000
  - new: 80 50 432525 100000000
- disable processor affinity for slow-func/line-100 tests
  - artificially inflates some speedups with perf

* unblocking_gotcha updates

- overload operator() according to gotcha function index

* blocking_gotcha updates

- overload operator() according to gotcha function index
- fix bug where potentially post block functors (e.g. pthread_mutex_trylock) throw error if lock is not acquired.

* parse_numeric_range update

- support unordered_set

* config update

- OMNITRACE_DEBUG_{TIDS,PIDS} use parse_numeric_range

2023-04-13 02:14:35 -05:00

.github/workflows

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

cmake

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

docker

Update entrypoint-rhel.sh (#255 )

2023-03-08 01:54:06 -06:00

examples

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

external

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

scripts

omnitrace-run executable - required for running binary writes (#257 )

2023-03-14 19:48:29 -05:00

source

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

tests

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

.clang-format

cmake-format + miscellaneous tweaks (#13 )

2021-09-20 11:12:06 -05:00

.clang-tidy

Reorganization and critical trace support (#17 )

2021-11-23 02:53:14 -06:00

.cmake-format.yaml

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

.gitignore

Casual Profiling GUI (#265 )

2023-04-11 23:36:24 -05:00

.gitmodules

Multiple python versions (#42 )

2022-04-21 21:36:07 -05:00

CMakeLists.txt

Address and thread sanitizer fixes (#250 )

2023-02-27 12:09:03 -06:00

LICENSE

Sampling support + testing + omnitrace namespace (#19 )

2022-01-24 20:49:17 -06:00

omnitrace.cfg

Fix for empty perfetto output (#7 )

2022-05-25 00:35:02 -05:00

pyproject.toml

Resolve warnings/errors with extra warnings (#171 )

2022-09-28 14:28:32 -05:00

README.md

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

VERSION

Linux Perf Support + Causal Profiling Updates (#276 )

2023-04-13 02:14:35 -05:00

README.md

Omnitrace: Application Profiling, Tracing, and Analysis

Omnitrace is an AMD open source research project and is not supported as part of the ROCm software stack.

Overview

AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems. If you are familiar with rocprof and/or uProf, you will find many of the capabilities of these tools available via Omnitrace in addition to many new capabilities.

Omnitrace is a comprehensive profiling and tracing tool for parallel applications written in C, C++, Fortran, HIP, OpenCL, and Python which execute on the CPU or CPU+GPU. It is capable of gathering the performance information of functions through any combination of binary instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks. Omnitrace supports interactive visualization of comprehensive traces in the web browser in addition to high-level summary profiles with mean/min/max/stddev statistics. In addition to runtimes, omnitrace supports the collection of system-level metrics such as the CPU frequency, GPU temperature, and GPU utilization, process-level metrics such as the memory usage, page-faults, and context-switches, and thread-level metrics such as memory usage, CPU time, and numerous hardware counters.

Data Collection Modes

Dynamic instrumentation
- Runtime instrumentation
  - Instrument executable and shared libraries at runtime
- Binary rewriting
  - Generate a new executable and/or library with instrumentation built-in
Statistical sampling
- Periodic software interrupts per-thread
Process-level sampling
- Background thread records process-, system- and device-level metrics while the application executes
Causal profiling
- Quantifies the potential impact of optimizations in parallel codes
Critical trace generation

Data Analysis

High-level summary profiles with mean/min/max/stddev statistics
- Low overhead, memory efficient
- Ideal for running at scale
Comprehensive traces
- Every individual event/measurement
Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
Critical trace analysis (alpha)
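
As an illustration of why summary profiles can stay memory efficient, the sketch below keeps only a running count, mean, min, max, and variance accumulator per region instead of storing every measurement. It uses Welford's online algorithm, which is an assumption chosen for illustration: the text above does not say how Omnitrace computes these statistics, and the `RunningStats` class is hypothetical.

```python
import math


class RunningStats:
    """Hypothetical accumulator: constant memory per profiled region."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, value):
        # Welford's online update: no individual sample is retained.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def stddev(self):
        # Sample standard deviation; defined as 0.0 for fewer than 2 samples.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


stats = RunningStats()
for runtime_ms in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.add(runtime_ms)
print(stats.count, stats.mean, stats.min, stats.max, round(stats.stddev, 3))
```

A comprehensive trace, by contrast, would need to retain all eight measurements individually.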

Parallelism API Support

HIP
HSA
Pthreads
MPI
Kokkos-Tools (KokkosP)
OpenMP-Tools (OMPT)

GPU Metrics

GPU hardware counters
HIP API tracing
HIP kernel tracing
HSA API tracing
HSA operation tracing
System-level sampling (via rocm-smi)
- Memory usage
- Power usage
- Temperature
- Utilization

CPU Metrics

CPU hardware counters sampling and profiles
CPU frequency sampling
Various timing metrics
- Wall time
- CPU time (process and/or thread)
- CPU utilization (process and/or thread)
- User CPU time
- Kernel CPU time
Various memory metrics
- High-water mark (sampling and profiles)
- Memory page allocation
- Virtual memory usage
Network statistics
I/O metrics
... many more

Documentation

The full documentation for omnitrace is available at amdresearch.github.io/omnitrace. See the Getting Started documentation for general tips and a detailed discussion about sampling vs. binary instrumentation.

Quick Start

Installation

Visit Releases page
Select the appropriate installer (recommendation: the .sh scripts do not require super-user privileges, unlike the DEB/RPM installers)
- If targeting a ROCm application, find the installer script with the matching ROCm version
- If you are unsure about your Linux distro, check /etc/os-release or use the omnitrace-install.py script

If the above recommendation is not desired, download the omnitrace-install.py and specify --prefix <install-directory> when executing it. This script will attempt to auto-detect a compatible OS distribution and version. If ROCm support is desired, specify --rocm X.Y where X is the ROCm major version and Y is the ROCm minor version, e.g. --rocm 5.4.

wget https://github.com/AMDResearch/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4

See the Installation Documentation for detailed information.
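
If you are unsure which installer matches your system, the fields in /etc/os-release can be read programmatically. The sketch below is a minimal illustration and is not the detection logic used by omnitrace-install.py; `parse_os_release` is a hypothetical helper and the sample text is made up.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        info[key.strip()] = value.strip().strip("\"'")
    return info


# Made-up sample; on a real system read the file instead:
#   text = open("/etc/os-release").read()
sample = 'NAME="Ubuntu"\nID=ubuntu\nID_LIKE=debian\nVERSION_ID="20.04"\n'
info = parse_os_release(sample)
# ID_LIKE is optional, so fall back to an empty string when it is absent.
print(info["ID"], info["VERSION_ID"], info.get("ID_LIKE", ""))
```

Some distributions omit ID_LIKE entirely, so any lookup of that key should use a fallback as shown.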

Setup

NOTE: Replace /opt/omnitrace below with your installation prefix as necessary.

Option 1: Source setup-env.sh script

source /opt/omnitrace/share/omnitrace/setup-env.sh

Option 2: Load modulefile

module use /opt/omnitrace/share/modulefiles
module load omnitrace

Option 3: Manual

export PATH=/opt/omnitrace/bin:${PATH}
export LD_LIBRARY_PATH=/opt/omnitrace/lib:${LD_LIBRARY_PATH}

Omnitrace Settings

Generate an omnitrace configuration file using omnitrace-avail -G omnitrace.cfg. Optionally, use omnitrace-avail -G omnitrace.cfg --all for a verbose configuration file with descriptions, categories, etc. Modify the configuration file as desired, e.g. enable perfetto, timemory, sampling, and process-level sampling by default and tweak some sampling default values:

# ...
OMNITRACE_USE_PERFETTO         = true
OMNITRACE_USE_TIMEMORY         = true
OMNITRACE_USE_SAMPLING         = true
OMNITRACE_USE_PROCESS_SAMPLING = true
# ...
OMNITRACE_SAMPLING_FREQ        = 50
OMNITRACE_SAMPLING_CPUS        = all
OMNITRACE_SAMPLING_GPUS        = $env:HIP_VISIBLE_DEVICES

Once the configuration file is adjusted to your preferences, either export the path to this file via OMNITRACE_CONFIG_FILE=/path/to/omnitrace.cfg or place this file in ${HOME}/.omnitrace.cfg to ensure these values are always read as the default. If you wish to change any of these settings, you can override them via environment variables or by specifying an alternative OMNITRACE_CONFIG_FILE.
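
The precedence described above (configuration file values act as defaults, environment variables override them) can be modeled with a short sketch. This is a simplified illustration, not Omnitrace's parser: `parse_config` and `effective_settings` are hypothetical helpers, and special syntax such as `$env:HIP_VISIBLE_DEVICES` is not handled.

```python
def parse_config(text):
    """Parse 'KEY = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def effective_settings(config_text, environ):
    """Config-file values act as defaults; environment variables win."""
    settings = parse_config(config_text)
    for key in settings:
        if key in environ:
            settings[key] = environ[key]
    return settings


config_text = """
# ...
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 50
"""
# Pass os.environ in real use; a plain dict keeps the example deterministic.
result = effective_settings(config_text, {"OMNITRACE_SAMPLING_FREQ": "100"})
print(result)
```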

Call-Stack Sampling

The omnitrace-sample executable is used to execute call-stack sampling on a target application without binary instrumentation. Use a double-hyphen (--) to separate the command-line arguments for omnitrace-sample from the target application and its arguments.

omnitrace-sample --help
omnitrace-sample <omnitrace-options> -- <exe> <exe-options>
omnitrace-sample -f 1000 -- ls -la
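
The double-hyphen convention can be sketched as follows. This is only an illustration of the general idea, not the argument parser used by omnitrace-sample; `split_command_line` is a hypothetical helper.

```python
def split_command_line(argv):
    """Split argv at the first '--' into (tool_options, target_command)."""
    if "--" not in argv:
        return list(argv), []
    idx = argv.index("--")
    return list(argv[:idx]), list(argv[idx + 1:])


# Mirrors: omnitrace-sample -f 1000 -- ls -la
tool_opts, target = split_command_line(["-f", "1000", "--", "ls", "-la"])
print(tool_opts, target)
```

Everything before the separator is treated as tool options, and everything after it is passed through untouched, so the target's own flags (such as `-la`) are never interpreted by the tool.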

Binary Instrumentation

The omnitrace-instrument executable is used to instrument an existing binary. Call-stack sampling can be enabled alongside the execution of an instrumented binary to help "fill in the gaps" between the instrumentation by setting the OMNITRACE_USE_SAMPLING configuration variable to ON. Similar to omnitrace-sample, use a double-hyphen (--) to separate the command-line arguments for omnitrace-instrument from the target application and its arguments.

omnitrace-instrument --help
omnitrace-instrument <omnitrace-options> -- <exe-or-library> <exe-options>

Binary Rewrite

Rewrite the text section of an executable or library with instrumentation:

omnitrace-instrument -o app.inst -- /path/to/app

In binary rewrite mode, if you also want instrumentation in the linked libraries, you must also rewrite those libraries. Example of rewriting the functions starting with "hip" with instrumentation in the amdhip64 library:

mkdir -p ./lib
omnitrace-instrument -R '^hip' -o ./lib/libamdhip64.so.4 -- /opt/rocm/lib/libamdhip64.so.4
export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}

Verify via ldd that your executable will load the instrumented library -- if you built your executable with an RPATH to the original library's directory, then prefixing LD_LIBRARY_PATH will have no effect.
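
One way to automate this check is to parse the output of ldd and confirm that the library resolves to the instrumented copy. The sketch below assumes the common `name => path (address)` output format of ldd on Linux; the helper names and the sample output are hypothetical.

```python
import subprocess


def resolved_libraries(ldd_output):
    """Map library names to resolved paths from 'name => path (addr)' lines."""
    resolved = {}
    for line in ldd_output.splitlines():
        name, sep, rest = line.strip().partition(" => ")
        if not sep:
            continue  # e.g. linux-vdso or the dynamic loader line
        # Unresolved libraries keep the literal text "not found" as the path.
        resolved[name] = rest.split(" (")[0].strip()
    return resolved


def run_ldd(executable):
    """Run ldd on an executable and return its text output."""
    return subprocess.run(
        ["ldd", executable], check=True, capture_output=True, text=True
    ).stdout


# Illustrative output; in practice use run_ldd("./app").
sample = (
    "\tlinux-vdso.so.1 (0x00007ffc1a5f2000)\n"
    "\tlibamdhip64.so.4 => /home/user/lib/libamdhip64.so.4 (0x00007f3b6c000000)\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3b6bc00000)\n"
)
libs = resolved_libraries(sample)
print(libs["libamdhip64.so.4"])
```

If the printed path points at the original library directory rather than the directory containing the rewritten library, the executable will not load the instrumented version.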

Once you have rewritten your executable and/or libraries with instrumentation, you can run the instrumented executable, or the executable which loads the instrumented libraries, as you normally would, e.g.:

omnitrace-run -- ./app.inst

If you want to redefine certain settings to a new default in a binary rewrite, use the --env option. This omnitrace-instrument option sets the environment variable to the given value only when it is not already defined, so it will not override a value set by the user. E.g. the default value of OMNITRACE_PERFETTO_BUFFER_SIZE_KB is 1024000 KB (1 GiB):

# buffer size defaults to 1024000
omnitrace-instrument -o app.inst -- /path/to/app
omnitrace-run -- ./app.inst

Passing --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 will change the default value in app.inst to 5120000 KB (5 GiB):

# defaults to 5 GiB buffer size
omnitrace-instrument -o app.inst --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 -- /path/to/app
omnitrace-run -- ./app.inst

# override default 5 GiB buffer size to 200 MB via command-line
omnitrace-run --trace-buffer-size=200000 -- ./app.inst
# override default 5 GiB buffer size to 200 MB via environment
export OMNITRACE_PERFETTO_BUFFER_SIZE_KB=200000
omnitrace-run -- ./app.inst
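
The "set a default but do not override" behavior of --env can be modeled with a short sketch. This is an assumption about the semantics based on the description above, not Omnitrace's implementation; `apply_embedded_defaults` is a hypothetical helper.

```python
def apply_embedded_defaults(environ, defaults):
    """Apply defaults without overriding values the user already set."""
    for key, value in defaults.items():
        environ.setdefault(key, value)
    return environ


embedded = {"OMNITRACE_PERFETTO_BUFFER_SIZE_KB": "5120000"}

# No user setting: the embedded default applies.
unset_env = apply_embedded_defaults({}, embedded)
# User exported a value before running: the user's value is kept.
user_env = apply_embedded_defaults(
    {"OMNITRACE_PERFETTO_BUFFER_SIZE_KB": "200000"}, embedded
)
print(unset_env, user_env)
```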

Runtime Instrumentation

Runtime instrumentation will not only instrument the text section of the executable but also the text sections of the linked libraries. Thus, it may be useful to exclude those libraries via the -ME (module exclude) regex option or exclude specific functions with the -E regex option.

omnitrace-instrument -- /path/to/app
omnitrace-instrument -ME '^(libhsa-runtime64|libz\.so)' -- /path/to/app
omnitrace-instrument -E 'rocr::atomic|rocr::core|rocr::HSA' -- /path/to/app
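
Before running the instrumenter, it can help to check which names an exclude pattern would match. The sketch below uses Python's `re` module as an approximation; the regex engine and matching mode used by omnitrace-instrument may differ, and the module and function names are illustrative. The module pattern is written with a single backslash so the dot is matched literally.

```python
import re

# Patterns modeled on the -ME and -E examples above.
module_exclude = re.compile(r"^(libhsa-runtime64|libz\.so)")
function_exclude = re.compile(r"rocr::atomic|rocr::core|rocr::HSA")

# Illustrative names only.
modules = ["libhsa-runtime64.so.1", "libz.so.1", "libamdhip64.so.4"]
functions = ["rocr::core::Runtime::Acquire", "hipMalloc"]

# Keep only the names that do NOT match the exclude patterns.
kept_modules = [m for m in modules if not module_exclude.search(m)]
kept_functions = [f for f in functions if not function_exclude.search(f)]
print(kept_modules, kept_functions)
```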

Python Profiling and Tracing

Use the omnitrace-python script to profile/trace Python interpreter function calls. Use a double-hyphen (--) to separate the command-line arguments for omnitrace-python from the target script and its arguments.

omnitrace-python --help
omnitrace-python <omnitrace-options> -- <python-script> <script-args>
omnitrace-python -- ./script.py

Please note, the first argument after the double-hyphen must be a Python script, e.g. omnitrace-python -- ./script.py.

If you need to specify a specific python interpreter version, use omnitrace-python-X.Y where X.Y is the Python major and minor version:

omnitrace-python-3.8 -- ./script.py

If you need to specify the full path to a Python interpreter, set the PYTHON_EXECUTABLE environment variable:

PYTHON_EXECUTABLE=/opt/conda/bin/python omnitrace-python -- ./script.py

If you want to restrict the data collection to specific function(s) and their callees, pass the -b / --builtin option after decorating the function(s) with @profile. Use the @noprofile decorator to exclude function(s) and their callees:

def foo():
    pass

@noprofile
def bar():
    foo()

@profile
def spam():
    foo()
    bar()

Each time spam is called during profiling, the profiling results will include 1 entry for spam and 1 entry for foo via the direct call within spam. There will be no entries for bar or the foo invocation within it.

Trace Visualization

Visit ui.perfetto.dev in a web browser
Select "Open trace file" from panel on the left
Locate the omnitrace perfetto output (extension: .proto)

Using Perfetto tracing with System Backend

Perfetto tracing with the system backend supports multiple processes writing to the same output file. Thus, it is a useful technique if Omnitrace is built with partial MPI support because all the perfetto output will be coalesced into a single file. The installation docs for perfetto can be found here. If you are building omnitrace from source, you can configure CMake with OMNITRACE_INSTALL_PERFETTO_TOOLS=ON and the perfetto and traced applications will be installed as part of the build process. However, it should be noted that to prevent this option from accidentally overwriting an existing perfetto install, all the perfetto executables installed by omnitrace are prefixed with omnitrace-perfetto-, except for the perfetto executable, which is just renamed omnitrace-perfetto.

Enable traced and perfetto in the background:

pkill traced
traced --background
perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/share/omnitrace.cfg --background

NOTE: if the perfetto tools were installed by omnitrace, replace traced with omnitrace-perfetto-traced and perfetto with omnitrace-perfetto.

Configure omnitrace to use the perfetto system backend via the --perfetto-backend option of omnitrace-run:

# enable sampling on the uninstrumented binary
omnitrace-run --sample --trace --perfetto-backend=system -- ./myapp
# instrument the binary, then trace the instrumented binary
omnitrace-instrument -o ./myapp.inst -- ./myapp
omnitrace-run --trace --perfetto-backend=system -- ./myapp.inst

or via the --env option of omnitrace-instrument + runtime instrumentation:

omnitrace-instrument --env OMNITRACE_PERFETTO_BACKEND=system -- ./myapp

Languages

C++ 67.5%

C 20.6%

Python 6.6%

CMake 3.4%

Shell 0.6%

Other 1.1%