Dateien
rocm-systems/source/lib/omnitrace/library/sampling.cpp
T
Jonathan R. Madsen 0da62c980e Binary instrumentation: more robust exclusion of functions used internally (#238)
## Overview

This PR attempts to increase the stability of binary rewrite and runtime instrumentation.

### Improved protection against self-instrumentation

Using ~~the binary analysis capabilities added from #229~~ the Dyninst SymtabAPI, OmniTrace now does a much better job of avoiding instrumentation of functions which are internally called by OmniTrace:

- The `omnitrace` executable searches for and parses the symbols of various libraries which are known to cause problems when instrumented
  - GNU libraries which are common to nearly every library, e.g., `"libc.so.6"`, `"libdl.so.2"`, etc., and thus are outside the scope of the users optimizations efforts
  - Libraries which OmniTrace depends on for functionality, e.g. `"libunwind.so"`, `"libgotcha.so"`, `"libroctracer64.so"`, etc.
    - OmniTrace skips instrumenting any `module_function` instance when it's member `module_name` or `function_name` variable matches the library name, source file, or function name found for that symbol (unless the user explicitly requests that it be eligible for instrumentation)
- Note: the parsing of the "internal" libraries may result in longer instrumentation time and higher memory usage. Please file an issue if either of these is found to be excessive.

### Function filters based on linkage and visibility

Added options to restrict instrumentation to certain linkage types (e.g. avoid instrumenting weak symbols) and visibility types (e.g. avoid instrumenting hidden symbols).

### Function filters based on instructions

In the past, after instrumentation, some applications instrumented by Dyninst would fail with a trap signal (e.g. #147). In several cases, it was found that this occurred whenever certain instructions were present in the function so an option was added to exclude functions based on one or more regex patterns was added. 

## Details

- generates list of "internal" libraries and attempts to find the first match via:
  - the library is already open, e.g. `dlopen(<libname>, RTLD_LAZY | RTLD_NOLOAD)`
  - searching for the library in `LD_LIBRARY_PATH`
  - searching for the library in `OMNITRACE_ROCM_PATH`, `ROCM_PATH`
  - searching the folders from `/sbin/ldconfig -p`
  - searching for the library in common places such as `/usr/local/lib`
- provides new `--linkage` command line option to restrict instrumentation to functions with particular type(s) of linkage
  -  Linkage types: `unknown`, `global`, `local`, `weak`, `unique`
- provides new `--visibility` command line option to restrict instrumentation to functions with particular type(s) of visibility 
  - Visibility types: `unknown`, `default`, `hidden`, `protected`, `internal` 
- provides new `--internal-module-include` and `--internal-function-include` command line regex options to bypass automatic exclusion from instrumentation
- provides new `--internal-library-append` command line option to specify a library should be considered internal
- provides new `--internal-library-remove` command line option to specify a library should not be considered internal
- provides new `--instruction-exclude` command line regex option to exclude functions which contain matching instructions
- provides new `--internal-library-deps` command line option to treat libraries linked to internal libraries as internal libraries
  - generally, this will only be helpful during runtime instrumentation when OmniTrace is built with an external dyninst library which is dynamically linked to boost libraries and the application is using the same boost libraries
- relaxed restrictions in `module_function::is_module_constrained()`
- relaxed restrictions in `module_function::is_routine_constrained()`
- added a few miscellaneous nullptr checks

## Miscellaneous

- Fix `LD_PRELOAD` + `OMNITRACE_DL_VERBOSE=3` issue
- Adds a sampling offload verbose message
- Improves MPI send-recv.cpp example error message
- Minor tweaks to binary library
  - `binary::get_linked_path` returns `std::optional<string>`
  - renamed `binary::symbol::read_bfd` to `binary::symbol::read_bfd_line_info`
  - `binary::get_binary_info` has param options for reading line info and included undefined symbols
- fixed another edge case instance of resource deadlock during first call to configure_settings
- improved the error log printing in `omnitrace` (does not print repeated messages)

* fix OMNITRACE_DL_VERBOSE=3 + preload issue

- join needs to handle nullptr

* sampling offload verbose message

* mpi-send-recv error message

* binary updates

- get_linked_path returns std::optional<string>
- get_binary_info accepts include_undef flag
- renamed symbol::read_bfd to symbol::read_bfd_line_info
- get_binary_info has param options for reading line info and included undefined symbols

* config updates (initialization)

- fixed another instance of resource deadlock during first call to configure_settings

* Testing fix for HIP w/o rocprofiler support

- disable rocprofiler tests when HIP enabled but OMNITRACE_USE_ROCPROFILER=OFF

* omnitrace exe: insert_instr nullptr check

* omnitrace exe: new method for determining internal constraints

- added internal-libs.cpp
- using binary::get_binary_info on various known libs used by omnitrace
- any matching func/file from symbols found in known internal libs are excluded
- relaxed restrictions in is_module_constrained
- relaxed restrictions in is_routine_constrained
- added a few safety checks

* internal libs append/remove

- options to change which libs are considered internal libraries

* omnitrace exe instruction exclude

- regex option for excluding functions containing specific instructions

* fix is_internal_constrained

* binary link map verbose message

* support constraints on linkage and visibility of symbols

* misc fixes

- fix compiler error for Ubuntu Jammy + GCC 12
- dlopen + libtbbmalloc_proxy appears to be causing issues on OpenSUSE

* Performance details + MT

- multithread processing internal info
- report timing info

* Defer parsing internal data

- wait until after address space is created

* Performance improvement finding for get_symtab_function

* fix data race in get_binary_info

* remove set_default for linkage and visibility argparse

* Parse internal libs with Dyninst::Symtab instead of binary reader

- conflicting versions of libraries for binary analysis causes problems
- expanded whole function restrictions
- expanded module_function::is_routine_constrained regex

* internal lib updates

- include memory usage info
- option to read libraries linked against internal libs: --internal-library-deps
- defer parsing internal libs data to when processing modules
2023-02-07 03:39:10 -06:00

1240 Zeilen
43 KiB
C++

// MIT License
//
// Copyright (c) 2022 Advanced Micro Devices, Inc. All Rights Reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "library/sampling.hpp"
#include "core/common.hpp"
#include "core/components/fwd.hpp"
#include "core/config.hpp"
#include "core/debug.hpp"
#include "core/locking.hpp"
#include "core/state.hpp"
#include "core/utility.hpp"
#include "library/components/backtrace.hpp"
#include "library/components/backtrace_metrics.hpp"
#include "library/components/backtrace_timestamp.hpp"
#include "library/ptl.hpp"
#include "library/runtime.hpp"
#include "library/thread_data.hpp"
#include "library/thread_info.hpp"
#include "library/tracing.hpp"
#include "library/tracing/annotation.hpp"
#include <timemory/backends/papi.hpp>
#include <timemory/backends/threading.hpp>
#include <timemory/components/data_tracker/components.hpp>
#include <timemory/components/macros.hpp>
#include <timemory/components/papi/extern.hpp>
#include <timemory/components/papi/papi_array.hpp>
#include <timemory/components/papi/papi_vector.hpp>
#include <timemory/components/timing/backends.hpp>
#include <timemory/components/trip_count/extern.hpp>
#include <timemory/macros.hpp>
#include <timemory/math.hpp>
#include <timemory/mpl.hpp>
#include <timemory/mpl/quirks.hpp>
#include <timemory/mpl/type_traits.hpp>
#include <timemory/operations.hpp>
#include <timemory/sampling/allocator.hpp>
#include <timemory/sampling/sampler.hpp>
#include <timemory/storage.hpp>
#include <timemory/units.hpp>
#include <timemory/utility/backtrace.hpp>
#include <timemory/utility/demangle.hpp>
#include <timemory/utility/procfs/maps.hpp>
#include <timemory/utility/types.hpp>
#include <timemory/variadic.hpp>
#include <array>
#include <chrono>
#include <condition_variable>
#include <csignal>
#include <cstring>
#include <ctime>
#include <initializer_list>
#include <mutex>
#include <regex>
#include <sstream>
#include <string>
#include <type_traits>
#include <pthread.h>
#include <signal.h>
namespace tim
{
namespace math
{
template <typename Tp, typename Up>
TIMEMORY_INLINE Tp
plus(Tp&& _lhs, const Up& _rhs)
{
Tp _v = _lhs;
plus(_v, _rhs);
return _v;
}
} // namespace math
} // namespace tim
namespace omnitrace
{
namespace sampling
{
using ::tim::sampling::dynamic;
using tim::sampling::timer;
using hw_counters = typename component::backtrace_metrics::hw_counters;
using signal_type_instances = thread_data<std::set<int>, category::sampling>;
using sampler_running_instances = thread_data<bool, category::sampling>;
using bundle_t =
tim::lightweight_tuple<component::backtrace_timestamp, component::backtrace,
component::backtrace_metrics>;
using sampler_t = tim::sampling::sampler<bundle_t, dynamic>;
using sampler_instances = thread_data<sampler_t, category::sampling>;
using sampler_init_instances = thread_data<bundle_t, category::sampling>;
} // namespace sampling
} // namespace omnitrace
OMNITRACE_DEFINE_CONCRETE_TRAIT(prevent_reentry, sampling::sampler_t, std::true_type)
OMNITRACE_DEFINE_CONCRETE_TRAIT(provide_backtrace, sampling::sampler_t, std::false_type)
OMNITRACE_DEFINE_CONCRETE_TRAIT(buffer_size, sampling::sampler_t,
TIMEMORY_ESC(std::integral_constant<size_t, 1024>))
namespace omnitrace
{
namespace sampling
{
namespace
{
using sampler_allocator_t = typename sampler_t::allocator_t;
auto&
get_sampler_allocators()
{
static auto _v = std::vector<std::shared_ptr<sampler_allocator_t>>{};
return _v;
}
std::set<int>
configure(bool _setup, int64_t _tid = threading::get_id());
void
configure_sampler_allocator(std::shared_ptr<sampler_allocator_t>& _v)
{
if(_v) return;
OMNITRACE_SCOPED_SAMPLING_ON_CHILD_THREADS(false);
OMNITRACE_SCOPED_THREAD_STATE(ThreadState::Internal);
_v = std::make_shared<sampler_allocator_t>();
_v->reserve(config::get_sampling_allocator_size());
}
void
configure_sampler_allocators()
{
auto& _allocators = get_sampler_allocators();
if(_allocators.empty())
{
// avoid lock until necessary
auto_lock_t _alloc_lk{ type_mutex<decltype(_allocators)>() };
if(_allocators.empty())
{
_allocators.resize(std::ceil(config::get_num_threads_hint() /
config::get_sampling_allocator_size()));
for(auto& itr : _allocators)
configure_sampler_allocator(itr);
}
}
}
std::shared_ptr<sampler_allocator_t>
get_sampler_allocator()
{
configure_sampler_allocators();
auto& _allocators = get_sampler_allocators();
OMNITRACE_SCOPED_THREAD_STATE(ThreadState::Internal);
auto_lock_t _lk{ type_mutex<sampler_allocator_t>() };
for(auto& itr : _allocators)
{
if(!itr) configure_sampler_allocator(itr);
if(itr->size() < config::get_sampling_allocator_size()) return itr;
}
auto& _v = _allocators.emplace_back();
configure_sampler_allocator(_v);
return _v;
}
template <typename... Args>
void
thread_sigmask(Args... _args)
{
auto _err = pthread_sigmask(_args...);
if(_err != 0)
{
errno = _err;
perror("pthread_sigmask");
exit(EXIT_FAILURE);
}
}
template <typename Tp>
sigset_t
get_signal_set(Tp&& _v)
{
sigset_t _sigset;
sigemptyset(&_sigset);
for(auto itr : _v)
sigaddset(&_sigset, itr);
return _sigset;
}
template <typename Tp>
std::string
get_signal_names(Tp&& _v)
{
std::string _sig_names{};
for(auto&& itr : _v)
_sig_names += std::get<0>(tim::signals::signal_settings::get_info(
static_cast<tim::signals::sys_signal>(itr))) +
" ";
return (_sig_names.empty()) ? _sig_names
: _sig_names.substr(0, _sig_names.length() - 1);
}
unique_ptr_t<sampler_t>&
get_sampler(int64_t _tid = threading::get_id())
{
static auto& _v = sampler_instances::instances();
return _v.at(_tid);
}
unique_ptr_t<bundle_t>&
get_sampler_init(int64_t _tid = threading::get_id())
{
static auto& _v = sampler_init_instances::instances();
if(!_v.at(_tid)) _v.at(_tid) = unique_ptr_t<bundle_t>{ new bundle_t{} };
return _v.at(_tid);
}
unique_ptr_t<bool>&
get_sampler_running(int64_t _tid)
{
static auto& _v = sampler_running_instances::instances(construct_on_init{}, false);
return _v.at(_tid);
}
auto&
get_duration_disabled()
{
static auto _v = std::atomic<bool>{ false };
return _v;
}
auto&
get_is_duration_thread()
{
static thread_local auto _v = false;
return _v;
}
auto&
get_duration_cv()
{
static auto _v = std::condition_variable{};
return _v;
}
auto&
get_duration_mutex()
{
static auto _v = std::mutex{};
return _v;
}
auto&
get_duration_thread()
{
static auto _v = std::unique_ptr<std::thread>{};
return _v;
}
auto
notify_duration_thread()
{
if(get_duration_thread() && !get_is_duration_thread())
{
std::unique_lock<std::mutex> _lk{ get_duration_mutex(), std::defer_lock };
if(!_lk.owns_lock()) _lk.lock();
get_duration_cv().notify_all();
}
}
void
stop_duration_thread()
{
if(get_duration_thread() && !get_is_duration_thread())
{
notify_duration_thread();
get_duration_thread()->join();
get_duration_thread().reset();
}
}
void
start_duration_thread()
{
static std::mutex _start_mutex{};
std::unique_lock<std::mutex> _start_lk{ _start_mutex, std::defer_lock };
if(!_start_lk.owns_lock()) _start_lk.lock();
if(!get_duration_thread() && config::get_sampling_duration() > 0.0)
{
// we may need to protect against recursion bc of pthread wrapper
static bool _protect = false;
if(_protect) return;
_protect = true;
auto _now = std::chrono::steady_clock::now();
auto _end = _now + std::chrono::nanoseconds{ static_cast<uint64_t>(
config::get_sampling_duration() * units::sec) };
auto _func = [_end]() {
thread_info::init(true);
threading::set_thread_name("omni.samp.dur");
get_is_duration_thread() = true;
bool _wait = true;
while(_wait)
{
_wait = false;
std::unique_lock<std::mutex> _lk{ get_duration_mutex(), std::defer_lock };
if(!_lk.owns_lock()) _lk.lock();
get_duration_cv().wait_until(_lk, _end);
auto _premature = (std::chrono::steady_clock::now() < _end);
auto _finalized = (get_state() == State::Finalized);
if(_premature && !_finalized)
{
// protect against spurious wakeups
OMNITRACE_VERBOSE(
2, "%sSpurious wakeup of sampling duration thread...\n",
tim::log::color::warning());
_wait = true;
}
else if(_finalized)
{
break;
}
else
{
get_duration_disabled().store(true);
OMNITRACE_VERBOSE(1,
"Sampling duration of %f seconds has elapsed. "
"Shutting down sampling...\n",
config::get_sampling_duration());
configure(false, 0);
}
}
};
OMNITRACE_VERBOSE(1, "Sampling will be disabled after %f seconds...\n",
config::get_sampling_duration());
OMNITRACE_SCOPED_SAMPLING_ON_CHILD_THREADS(false);
get_duration_thread() = std::make_unique<std::thread>(_func);
_protect = false;
}
}
auto&
get_offload_file()
{
static auto _v = config::get_tmp_file("sampling");
return _v;
}
locking::atomic_mutex&
get_offload_mutex()
{
static auto _v = locking::atomic_mutex{};
return _v;
}
using sampler_bundle_t = typename sampler_t::bundle_type;
using sampler_buffer_t = tim::data_storage::ring_buffer<sampler_bundle_t>;
void
offload_buffer(int64_t _seq, sampler_buffer_t&& _buf)
{
OMNITRACE_REQUIRE(get_use_tmp_files())
<< "Error! sampling allocator tries to offload buffer of samples but "
"omnitrace was configured to not use temporary files\n";
// use homemade atomic_mutex/atomic_lock since contention will be low
// and using pthread_lock might trigger our wrappers
auto _lk = locking::atomic_lock{ get_offload_mutex() };
auto& _file = get_offload_file();
OMNITRACE_REQUIRE(_file)
<< "Error! sampling allocator tried to offload buffer of samples but the "
"offload file does not exist\n";
OMNITRACE_VERBOSE_F(2, "Offloading %zu samples for thread %li to %s...\n",
_buf.count(), _seq, _file->filename.c_str());
auto& _fs = _file->stream;
OMNITRACE_REQUIRE(_fs.good())
<< "Error! temporary file for offloading buffer is in an invalid state\n";
_fs.write(reinterpret_cast<char*>(&_seq), sizeof(_seq));
auto _data = std::move(_buf);
_data.save(_fs);
_data.destroy();
_buf.destroy();
}
auto
load_offload_buffer()
{
auto _data = std::map<int64_t, std::vector<sampler_buffer_t>>{};
if(!get_use_tmp_files())
{
OMNITRACE_WARNING_F(
2, "[sampling] returning no data because using temporary files is disabled");
return _data;
}
// use homemade atomic_mutex/atomic_lock since contention will be low
// and using pthread_lock might trigger our wrappers
auto _lk = locking::atomic_lock{ get_offload_mutex() };
auto& _file = get_offload_file();
if(!_file)
{
OMNITRACE_WARNING_F(
0, "[sampling] returning no data because the offload file no longer exists");
return _data;
}
auto& _fs = _file->stream;
_fs.close();
_file->open(std::ios::binary | std::ios::in);
if(!_fs)
{
OMNITRACE_WARNING_F(0, "[sampling] %s failed to open", _file->filename.c_str());
}
while(!_fs.eof())
{
int64_t _seq = 0;
_fs.read(reinterpret_cast<char*>(&_seq), sizeof(_seq));
if(_fs.eof())
{
OMNITRACE_VERBOSE_F(2, "[sampling] No more samples found in file...\n");
break;
}
sampler_buffer_t _buffer{};
_buffer.load(_fs);
OMNITRACE_VERBOSE_F(2, "[sampling] Loading %zu samples for thread %li...\n",
_buffer.count(), _seq);
_data[_seq].emplace_back(std::move(_buffer));
}
_file.reset();
return _data;
}
std::set<int>
configure(bool _setup, int64_t _tid)
{
const auto& _info = thread_info::get(_tid, SequentTID);
auto& _sampler = sampling::get_sampler(_tid);
auto& _running = get_sampler_running(_tid);
bool _is_running = (!_running) ? false : *_running;
auto& _signal_types = sampling::get_signal_types(_tid);
OMNITRACE_CONDITIONAL_THROW(get_use_causal(),
"Internal error! configuring sampling not permitted when "
"causal profiling is enabled");
OMNITRACE_SCOPED_SAMPLING_ON_CHILD_THREADS(false);
auto&& _cpu_tids = get_sampling_cpu_tids();
auto&& _real_tids = get_sampling_real_tids();
auto _erase_tid_signal = [_tid, &_signal_types](auto& _tids, int _signum) {
if(!_tids.empty())
{
if(_tids.count(_tid) == 0)
{
OMNITRACE_VERBOSE(3, "Disabling SIG%i from thread %li\n", _signum, _tid);
_signal_types->erase(_signum);
}
}
};
_erase_tid_signal(_cpu_tids, get_cputime_signal());
_erase_tid_signal(_real_tids, get_realtime_signal());
if(_setup && !_sampler && !_is_running && !_signal_types->empty())
{
if(get_duration_disabled()) return std::set<int>{};
// if this thread has an offset ID, that means it was created internally
// and is probably here bc it called a function which was instrumented.
// thus we should not start a sampler for it
if(_tid > 0 && _info && _info->is_offset) return std::set<int>{};
// if the thread state is disabled or completed, return
if(_info && _info->index_data->sequent_value == _tid &&
get_thread_state() == ThreadState::Disabled)
return std::set<int>{};
(void) get_debug_sampling(); // make sure query in sampler does not allocate
assert(_tid == threading::get_id());
if(trait::runtime_enabled<backtrace_metrics>::get())
backtrace_metrics::configure(_setup, _tid);
// NOTE: signals need to be unblocked by calling function
sampling::block_signals(*_signal_types);
auto _verbose = std::min<int>(get_verbose() - 2, 2);
if(get_debug_sampling()) _verbose = 2;
OMNITRACE_DEBUG("Requesting allocator for sampler on thread %lu...\n", _tid);
auto _alloc = get_sampler_allocator();
OMNITRACE_DEBUG("Configuring sampler for thread %lu...\n", _tid);
sampling::sampler_instances::construct(construct_on_thread{ _tid }, _alloc,
"omnitrace", _tid, _verbose);
_sampler->set_flags(SA_RESTART);
_sampler->set_verbose(_verbose);
if(_signal_types->count(get_realtime_signal()) > 0)
{
_sampler->configure(timer{ get_realtime_signal(), CLOCK_REALTIME,
SIGEV_THREAD_ID, get_sampling_real_freq(),
get_sampling_real_delay(), _tid,
threading::get_sys_tid() });
}
if(_signal_types->count(get_cputime_signal()) > 0)
{
_sampler->configure(timer{ get_cputime_signal(), CLOCK_THREAD_CPUTIME_ID,
SIGEV_THREAD_ID, get_sampling_cpu_freq(),
get_sampling_cpu_delay(), _tid,
threading::get_sys_tid() });
}
if(get_use_tmp_files())
{
auto _file = get_offload_file();
if(_file && *_file) _sampler->set_offload(&offload_buffer);
}
static_assert(tim::trait::buffer_size<sampling::sampler_t>::value > 0,
"Error! Zero buffer size");
OMNITRACE_CONDITIONAL_THROW(
_sampler->get_buffer_size() !=
tim::trait::buffer_size<sampling::sampler_t>::value,
"dynamic sampler has a buffer size different from static trait: %zu instead "
"of %zu",
_sampler->get_buffer_size(),
tim::trait::buffer_size<sampling::sampler_t>::value);
OMNITRACE_CONDITIONAL_THROW(
_sampler->get_buffer_size() <= 0,
"dynamic sampler requires a positive buffer size: %zu",
_sampler->get_buffer_size());
for(auto itr : *_signal_types)
{
const char* _type = (itr == get_realtime_signal()) ? "wall" : "CPU";
const auto* _timer = _sampler->get_timer(itr);
if(_timer)
{
OMNITRACE_VERBOSE(
2,
"[SIG%i] Sampler for thread %lu will be triggered %.1fx per "
"second of %s-time (every %.3e milliseconds)...\n",
itr, _tid, _timer->get_frequency(units::sec), _type,
_timer->get_period(units::msec));
}
}
*_running = true;
sampling::get_sampler_init(_tid)->sample();
start_duration_thread();
_sampler->start();
}
else if(!_setup && _sampler && _is_running)
{
OMNITRACE_DEBUG("Destroying sampler for thread %lu...\n", _tid);
*_running = false;
if(_tid == threading::get_id() && !_signal_types->empty())
{
sampling::block_signals(*_signal_types);
}
notify_duration_thread();
if(_tid == 0)
{
block_samples();
// this propagates to all threads
_sampler->ignore(*_signal_types);
// wait for the samples to finish
auto _freq =
std::max<double>(get_sampling_cpu_freq(), get_sampling_real_freq());
auto _period = (1.0 / _freq) * units::sec;
_period = std::min<double>(_period, 1.0e9); // max of 1 second
std::this_thread::sleep_for(
std::chrono::nanoseconds{ static_cast<int64_t>(_period) });
for(int64_t i = 1; i < OMNITRACE_MAX_THREADS; ++i)
{
if(sampling::get_sampler(i)) sampling::get_sampler(i)->stop();
}
for(int64_t i = 1; i < OMNITRACE_MAX_THREADS; ++i)
{
if(sampling::get_sampler(i))
{
sampling::get_sampler(i)->reset();
*get_sampler_running(i) = false;
}
}
stop_duration_thread();
}
_sampler->stop();
if(trait::runtime_enabled<backtrace_metrics>::get())
backtrace_metrics::configure(_setup, _tid);
OMNITRACE_DEBUG("Sampler destroyed for thread %lu\n", _tid);
}
return (_signal_types) ? *_signal_types : std::set<int>{};
}
void
post_process_perfetto(int64_t _tid, const bundle_t* _init,
const std::vector<bundle_t*>& _data);
void
post_process_timemory(int64_t _tid, const bundle_t* _init,
const std::vector<bundle_t*>& _data);
} // namespace
unique_ptr_t<std::set<int>>&
get_signal_types(int64_t _tid)
{
static auto& _v = signal_type_instances::instances();
signal_type_instances::construct(construct_on_thread{ _tid },
omnitrace::get_sampling_signals(_tid));
return _v.at(_tid);
}
std::set<int>
setup()
{
if(!get_use_sampling()) return std::set<int>{};
return configure(true);
}
std::set<int>
shutdown()
{
auto _v = configure(false);
if(utility::get_thread_index() == 0) stop_duration_thread();
return _v;
}
void
block_samples()
{
trait::runtime_enabled<sampler_t>::set(false);
}
void
unblock_samples()
{
trait::runtime_enabled<sampler_t>::set(true);
}
void
block_signals(std::set<int> _signals)
{
if(_signals.empty()) _signals = *get_signal_types(threading::get_id());
if(_signals.empty())
{
OMNITRACE_PRINT("No signals to block...\n");
return;
}
OMNITRACE_DEBUG("Blocking signals [%s] on thread #%lu...\n",
get_signal_names(_signals).c_str(), threading::get_id());
sigset_t _v = get_signal_set(_signals);
thread_sigmask(SIG_BLOCK, &_v, nullptr);
}
void
unblock_signals(std::set<int> _signals)
{
if(_signals.empty()) _signals = *get_signal_types(threading::get_id());
if(_signals.empty())
{
OMNITRACE_PRINT("No signals to unblock...\n");
return;
}
OMNITRACE_DEBUG("Unblocking signals [%s] on thread #%lu...\n",
get_signal_names(_signals).c_str(), threading::get_id());
sigset_t _v = get_signal_set(_signals);
thread_sigmask(SIG_UNBLOCK, &_v, nullptr);
}
void
post_process()
{
OMNITRACE_SCOPED_THREAD_STATE(ThreadState::Internal);
size_t _total_data = 0;
size_t _total_threads = 0;
auto _external_samples = std::atomic<size_t>{ 0 };
auto _internal_samples = std::atomic<size_t>{ 0 };
OMNITRACE_VERBOSE(2 || get_debug_sampling(), "Stopping sampling components...\n");
omnitrace::component::backtrace::stop();
for(size_t i = 0; i < max_supported_threads; ++i)
{
backtrace_metrics::configure(false, i);
auto& _sampler = get_sampler(i);
if(_sampler)
{
_sampler->set_offload(nullptr);
_sampler->stop();
}
}
auto _loaded_data = load_offload_buffer();
for(size_t i = 0; i < max_supported_threads; ++i)
{
auto& _sampler = get_sampler(i);
if(!_sampler)
{
// this should be relatively common
OMNITRACE_CONDITIONAL_PRINT(
get_debug() && get_verbose() >= 2,
"Post-processing sampling entries for thread %lu skipped (no sampler)\n",
i);
continue;
}
auto* _init = get_sampler_init(i).get();
if(!_init)
{
// this is not common
OMNITRACE_PRINT("Post-processing sampling entries for thread %lu skipped "
"(not initialized)\n",
i);
continue;
}
const auto& _thread_info = thread_info::get(i, SequentTID);
OMNITRACE_VERBOSE(3 || get_debug_sampling(),
"Getting sampler data for thread %lu...\n", i);
auto _raw_data = _sampler->get_data();
for(auto litr : _loaded_data[i])
{
while(!litr.is_empty())
{
auto _v = sampler_bundle_t{};
litr.read(&_v);
_raw_data.emplace_back(std::move(_v));
}
litr.destroy();
}
OMNITRACE_VERBOSE(2 || get_debug_sampling(),
"Sampler data for thread %lu has %zu initial entries...\n", i,
_raw_data.size());
OMNITRACE_CI_THROW(
_sampler->get_sample_count() != _raw_data.size(),
"Error! sampler recorded %zu samples but %zu samples were returned\n",
_sampler->get_sample_count(), _raw_data.size());
// single sample that is useless (backtrace to unblocking signals)
if(_raw_data.size() == 1 && _raw_data.front().size() <= 1) _raw_data.clear();
std::vector<sampling::bundle_t*> _data{};
for(auto& itr : _raw_data)
{
auto* _bt = itr.get<backtrace>();
auto* _ts = itr.get<backtrace_timestamp>();
if(_thread_info && _bt && !_bt->empty() && _ts &&
_thread_info->is_valid_time(_ts->get_timestamp()))
{
_data.emplace_back(&itr);
}
}
_total_data += _data.size();
_total_threads += (!_data.empty()) ? 1 : 0;
if(!_data.empty())
{
OMNITRACE_VERBOSE(2 || get_debug_sampling(),
"Sampler data for thread %lu has %zu valid entries...\n", i,
_data.size());
if(get_use_perfetto()) post_process_perfetto(i, _init, _data);
if(get_use_timemory()) post_process_timemory(i, _init, _data);
}
else
{
OMNITRACE_VERBOSE(2 || get_debug_sampling(),
"Sampler data for thread %lu has zero valid entries out of "
"%zu... (skipped)\n",
i, _raw_data.size());
}
}
OMNITRACE_VERBOSE(3 || get_debug_sampling(),
"Destroying samplers and allocators...\n");
for(size_t i = 0; i < max_supported_threads; ++i)
get_sampler(i).reset();
for(auto& itr : get_sampler_allocators())
{
if(itr) itr.reset();
}
if(get_offload_file())
{
get_offload_file()->remove();
get_offload_file().reset();
}
OMNITRACE_VERBOSE(1 || get_debug_sampling(),
"Collected %zu samples from %zu threads... %zu samples out of %zu "
"were taken while within instrumented routines\n",
_total_data, _total_threads, _internal_samples.load(),
(_internal_samples + _external_samples));
}
namespace
{
void
post_process_perfetto(int64_t _tid, const bundle_t* _init,
const std::vector<bundle_t*>& _data)
{
auto _valid_metrics = backtrace_metrics::valid_array_t{};
for(const auto& itr : _data)
{
const auto* _bt_mt = itr->get<backtrace_metrics>();
if(_bt_mt) _valid_metrics |= _bt_mt->get_valid();
}
if(trait::runtime_enabled<backtrace_metrics>::get())
{
OMNITRACE_VERBOSE(3 || get_debug_sampling(),
"[%li] Post-processing metrics for perfetto...\n", _tid);
backtrace_metrics::init_perfetto(_tid, _valid_metrics);
for(const auto& itr : _data)
{
const auto* _bt_metrics = itr->get<backtrace_metrics>();
const auto* _bt_time = itr->get<backtrace_timestamp>();
if(!_bt_metrics || !_bt_time) continue;
if(_bt_time->get_tid() != _tid) continue;
_bt_metrics->post_process_perfetto(_tid, _bt_time->get_timestamp());
}
backtrace_metrics::fini_perfetto(_tid, _valid_metrics);
}
OMNITRACE_VERBOSE(3 || get_debug_sampling(),
"[%li] Post-processing backtraces for perfetto...\n", _tid);
const auto& _thread_info = thread_info::get(_tid, SequentTID);
OMNITRACE_CI_THROW(!_thread_info, "No valid thread info for tid=%li\n", _tid);
if(!_thread_info) return;
uint64_t _beg_ns = _thread_info->get_start();
uint64_t _end_ns = _thread_info->get_stop();
uint64_t _last_ts =
std::max<uint64_t>(_init->get<backtrace_timestamp>()->get_timestamp(), _beg_ns);
auto _track = tracing::get_perfetto_track(
category::sampling{},
[](auto _seq_id, auto _sys_id) {
return TIMEMORY_JOIN(" ", "Thread", _seq_id, "(S)", _sys_id);
},
_thread_info->index_data->sequent_value, _thread_info->index_data->system_value);
tracing::push_perfetto_track(category::sampling{}, "samples [omnitrace]", _track,
_beg_ns, [&](perfetto::EventContext ctx) {
if(config::get_perfetto_annotations())
{
tracing::add_perfetto_annotation(ctx, "begin_ns",
_beg_ns);
}
});
auto _as_hex = [](auto _v) { return JOIN("", "0x", std::hex, _v); };
auto _labels = backtrace_metrics::get_hw_counter_labels(_tid);
for(const auto& itr : _data)
{
const auto* _bt_ts = itr->get<backtrace_timestamp>();
const auto* _bt_cs = itr->get<backtrace>();
const auto* _bt_mt = itr->get<backtrace_metrics>();
if(!_bt_ts || !_bt_cs || !_bt_mt) continue;
if(_bt_ts->get_tid() != _tid) continue;
static std::set<std::string> _static_strings{};
const auto* _last = _init;
auto _patched_data = backtrace::filter_and_patch(_bt_cs->get());
size_t _ncount = 0;
for(const auto& iitr : _patched_data)
{
uint64_t _beg = _last_ts;
uint64_t _end = _bt_ts->get_timestamp();
if(!_thread_info->is_valid_lifetime({ _beg, _end })) continue;
auto _ncur = _ncount++;
// the begin/end + HW counters will be same for entire call-stack so only
// annotate
// the top and the bottom functons to keep the data consumption low
bool _include_common = (_ncur == 0 || _ncur + 1 == _patched_data.size());
// Only annotate HW counters when:
// 1. when we can compute a difference from the last sample
// 2. when the number of HW counters b/t this sample and last are the same
bool _include_hw =
_include_common && (_last != nullptr) &&
_bt_mt->get_hw_counters().size() ==
_last->get<backtrace_metrics>()->get_hw_counters().size();
auto _hw_counters_enabled = [](const auto* _bt_v) {
return (_bt_v != nullptr) &&
(*_bt_v)(type_list<backtrace_metrics::hw_counters>{}) &&
(*_bt_v)(category::thread_hardware_counter{});
};
// annotations common to both modes
auto _common_annotate = [&](::perfetto::EventContext& ctx, bool _is_last) {
if(_include_common && _is_last)
{
tracing::add_perfetto_annotation(ctx, "begin_ns", _beg);
tracing::add_perfetto_annotation(ctx, "end_ns", _end);
}
if(_include_hw && _is_last && _last &&
_hw_counters_enabled(_last->get<backtrace_metrics>()) &&
_hw_counters_enabled(_bt_mt))
{
// current values when read
auto _hw_cnt_vals = _bt_mt->get_hw_counters();
// compute difference from last sample to provide the HW counters for
// this sample
tim::math::minus(_hw_cnt_vals,
_last->get<backtrace_metrics>()->get_hw_counters());
for(size_t i = 0; i < _labels.size(); ++i)
tracing::add_perfetto_annotation(ctx, _labels.at(i),
_hw_cnt_vals.at(i));
}
};
if(get_sampling_include_inlines() && iitr.lineinfo)
{
auto _lines = iitr.lineinfo.lines;
std::reverse(_lines.begin(), _lines.end());
size_t _n = 0;
for(const auto& litr : _lines)
{
const auto* _name =
_static_strings.emplace(demangle(litr.name)).first->c_str();
auto _info = JOIN(':', litr.location, litr.line);
tracing::push_perfetto_track(
category::sampling{}, _name, _track, _beg,
[&](perfetto::EventContext ctx) {
if(config::get_perfetto_annotations())
{
_common_annotate(ctx, (_n == 0 && _ncur == 0) ||
(_n + 1 == _lines.size()));
tracing::add_perfetto_annotation(ctx, "file",
iitr.location);
tracing::add_perfetto_annotation(ctx, "lineinfo", _info);
tracing::add_perfetto_annotation(ctx, "inlined",
(_n++ > 0));
}
});
tracing::pop_perfetto_track(category::sampling{}, _name, _track,
_end);
}
}
else
{
const auto* _name = _static_strings.emplace(iitr.name).first->c_str();
tracing::push_perfetto_track(
category::sampling{}, _name, _track, _beg,
[&](perfetto::EventContext ctx) {
if(config::get_perfetto_annotations())
{
_common_annotate(ctx, true);
tracing::add_perfetto_annotation(ctx, "file", iitr.location);
tracing::add_perfetto_annotation(ctx, "pc",
_as_hex(iitr.address));
tracing::add_perfetto_annotation(ctx, "line_address",
_as_hex(iitr.line_address));
if(iitr.lineinfo)
{
auto _lines = iitr.lineinfo.lines;
std::reverse(_lines.begin(), _lines.end());
size_t _n = 0;
for(const auto& litr : _lines)
{
auto _label = JOIN('-', "lineinfo", _n++);
tracing::add_perfetto_annotation(
ctx, _label.c_str(),
JOIN('@', demangle(litr.name),
JOIN(':', litr.location, litr.line)));
}
}
}
});
tracing::pop_perfetto_track(category::sampling{}, _name, _track, _end);
}
}
_last_ts = _bt_ts->get_timestamp();
_last = itr;
}
tracing::pop_perfetto_track(category::sampling{}, "samples [omnitrace]", _track,
_end_ns, [&](perfetto::EventContext ctx) {
if(config::get_perfetto_annotations())
{
tracing::add_perfetto_annotation(ctx, "end_ns",
_end_ns);
}
});
}
void
post_process_timemory(int64_t _tid, const bundle_t* _init,
const std::vector<bundle_t*>& _data)
{
auto _depth_sum = std::map<int64_t, std::map<int64_t, int64_t>>{};
OMNITRACE_VERBOSE(3 || get_debug_sampling(),
"[%li] Post-processing data for timemory...\n", _tid);
const auto* _last = _init;
for(const auto& itr : _data)
{
using bundle_t = tim::lightweight_tuple<comp::trip_count, sampling_wall_clock,
sampling_cpu_clock, hw_counters>;
auto* _bt_data = itr->get<backtrace>();
auto* _bt_time = itr->get<backtrace_timestamp>();
auto* _bt_metrics = itr->get<backtrace_metrics>();
const auto* _last_metrics = _last->get<backtrace_metrics>();
if(!_bt_data || !_bt_time) continue;
double _elapsed_wc = (_bt_time->get_timestamp() -
_last->get<backtrace_timestamp>()->get_timestamp());
std::vector<bundle_t> _tc{};
_tc.reserve(_bt_data->size());
// generate the instances of the tuple of components and start them
for(const auto& iitr : backtrace::filter_and_patch(_bt_data->get()))
{
_tc.emplace_back(tim::string_view_t{ iitr.name });
_tc.back().push(_bt_time->get_tid());
_tc.back().start();
}
// stop the instances and update the values as needed
for(size_t i = 0; i < _tc.size(); ++i)
{
auto& iitr = _tc.at(_tc.size() - i - 1);
size_t _depth = 0;
_depth_sum[_bt_time->get_tid()][_depth] += 1;
iitr.stop();
if constexpr(tim::trait::is_available<sampling_wall_clock>::value)
{
auto* _sc = iitr.get<sampling_wall_clock>();
if(_sc)
{
auto _value = _elapsed_wc / sampling_wall_clock::get_unit();
_sc->set_value(_value);
_sc->set_accum(_value);
}
}
if constexpr(tim::trait::is_available<sampling_cpu_clock>::value)
{
auto* _cc = iitr.get<sampling_cpu_clock>();
if(_cc && _bt_metrics && _last_metrics &&
(*_bt_metrics)(category::thread_cpu_time{}) &&
(*_last_metrics)(category::thread_cpu_time{}))
{
double _elapsed_cc = (_bt_metrics->get_cpu_timestamp() -
_last_metrics->get_cpu_timestamp());
_cc->set_value(_elapsed_cc / sampling_cpu_clock::get_unit());
_cc->set_accum(_elapsed_cc / sampling_cpu_clock::get_unit());
}
}
if constexpr(tim::trait::is_available<hw_counters>::value)
{
auto _hw_counters_enabled = [](const auto* _bt_v) {
return (_bt_v != nullptr) &&
(*_bt_v)(type_list<backtrace_metrics::hw_counters>{}) &&
(*_bt_v)(category::thread_hardware_counter{});
};
if(_bt_metrics && _last_metrics && _hw_counters_enabled(_bt_metrics) &&
_hw_counters_enabled(_last_metrics))
{
auto _hw_cnt_vals = _bt_metrics->get_hw_counters();
if(_bt_metrics->get_hw_counters().size() ==
_last_metrics->get_hw_counters().size())
{
for(size_t k = 0; k < _bt_metrics->get_hw_counters().size(); ++k)
{
if(_last_metrics->get_hw_counters()[k] > _hw_cnt_vals[k])
_hw_cnt_vals[k] -= _last_metrics->get_hw_counters()[k];
}
}
auto* _hw_counter = iitr.get<hw_counters>();
if(_hw_counter)
{
_hw_counter->set_value(_hw_cnt_vals);
_hw_counter->set_accum(_hw_cnt_vals);
}
}
}
iitr.pop();
}
_last = itr;
}
for(auto&& itr : _data)
{
using bundle_t =
tim::lightweight_tuple<sampling_percent, quirk::config<quirk::tree_scope>>;
auto* _bt_data = itr->get<backtrace>();
auto* _bt_time = itr->get<backtrace_timestamp>();
if(!_bt_time || !_bt_data) continue;
if(_depth_sum.find(_bt_time->get_tid()) == _depth_sum.end()) continue;
std::vector<bundle_t> _tc{};
_tc.reserve(_bt_data->size());
// generate the instances of the tuple of components and start them
for(const auto& iitr : backtrace::filter_and_patch(_bt_data->get()))
{
_tc.emplace_back(tim::string_view_t{ iitr.name });
_tc.back().push(_bt_time->get_tid());
_tc.back().start();
}
// stop the instances and update the values as needed
for(size_t i = 0; i < _tc.size(); ++i)
{
auto& iitr = _tc.at(_tc.size() - i - 1);
size_t _depth = 0;
double _value = (1.0 / _depth_sum[_bt_time->get_tid()][_depth]) * 100.0;
iitr.store(std::plus<double>{}, _value);
iitr.stop();
iitr.pop();
}
}
}
struct sampling_initialization
{
static void preinit()
{
sampling_wall_clock::label() = "sampling_wall_clock";
sampling_wall_clock::description() = "Wall clock time (via sampling)";
sampling_cpu_clock::label() = "sampling_cpu_clock";
sampling_cpu_clock::description() = "CPU clock time (via sampling)";
sampling_percent::label() = "sampling_percent";
sampling_percent::description() = "Percentage of samples";
sampling_gpu_busy::label() = "sampling_gpu_busy_percent";
sampling_gpu_busy::description() = "Utilization of GPU(s)";
sampling_gpu_busy::set_precision(0);
sampling_gpu_busy::set_format_flags(sampling_gpu_busy::get_format_flags() &
std::ios_base::showpoint);
sampling_gpu_memory::label() = "sampling_gpu_memory_usage";
sampling_gpu_memory::description() = "Memory usage of GPU(s)";
sampling_gpu_power::label() = "sampling_gpu_power";
sampling_gpu_power::description() = "Power usage of GPU(s)";
sampling_gpu_power::unit() = units::watt;
sampling_gpu_power::display_unit() = "watts";
sampling_gpu_power::set_precision(2);
sampling_gpu_power::set_format_flags(sampling_gpu_power::get_format_flags());
sampling_gpu_temp::label() = "sampling_gpu_temperature";
sampling_gpu_temp::description() = "Temperature of GPU(s)";
sampling_gpu_temp::unit() = 1;
sampling_gpu_temp::display_unit() = "degC";
sampling_gpu_temp::set_precision(1);
sampling_gpu_temp::set_format_flags(sampling_gpu_temp::get_format_flags());
}
};
} // namespace
} // namespace sampling
} // namespace omnitrace
TIMEMORY_INVOKE_PREINIT(omnitrace::sampling::sampling_initialization)