8feb6bf8b65edaa1dd7ece4ffa4343d73b275aa0
5 İşleme
| Yazar | SHA1 | Mesaj | Tarih | |
|---|---|---|---|---|
|
|
8feb6bf8b6 |
Global trace delay and duration (#235)
- The primary feature of this PR is the **addition of support for scoping the collection of tracing/profiling data into one or more time-based windows** - Closes #222 - Closes #207 - Support for a real-clock time delay and/or a duration for tracing/profiling was added, *resembling the support for this feature during sampling and process-sampling* - However, above paradigm was enhanced for tracing - Instead of one delay and/or one duration based on real time, ***tracing supports periodic and varying delays and durations and these delay+duration sets can be controlled with different clocks*** - At some point, this capability will be extended to sampling and process-sampling - A secondary feature of this PR are the improvements to the handling of categories (by-product of the primary feature) - For example, previously setting `OMNITRACE_ENABLE_CATEGORIES` to a specific set of categories only eliminated the disabled categories from the perfetto trace, now these are applied to timemory profiles too - A new configuration variable `OMNITRACE_DISABLE_CATEGORIES` was added for when disabling only a handful of categories is easier - There are quite a few miscellaneous modifications which pollute this PR a bit ## Multiple Tracing Windows As noted above, tracing now supports specifying multiple delays and durations _and_ with different clocks. Consider the configuration below with two entries in the format `<DELAY>:<DURATION>:<REPEAT>:<CLOCK_TYPE>`: ```console OMNITRACE_TRACE_PERIODS = 0.5:1.0:2:realtime 10.0:5.0:3:cputime ``` The above configuration defines: 1. `0.5:1.0:2:realtime` - A delay of 0.5 seconds (real-time) - Followed by a data collection duration of 1 second (real-time) - This delay + duration is repeated 2x - Summary: tracing data is collected for 2 out of the first 3 seconds of the application's execution 2. `10.0:5.0:3:cputime` - A delay of 10 seconds (process _CPU-time_) - Followed by a data collection duration of 5 seconds (process _CPU-time_) - This delay + duration is repeated 3x - Summary: tracing data is collected for a total of 15 seconds of process CPU-time in the ensuing 75 seconds of CPU-time during the application execution. - Note: the elapsed CPU-time is the aggregate of the CPU-time consumed by all the threads in the process and should be scaled accordingly, e.g., 4 threads running constantly for 1 second of real-time is ~4 seconds of CPU time. ## `omnitrace-sample` Changes Formerly, `--wait` and `--duration` command-line options only applied to sampling delay and duration. The value of these options are now applied to the tracing delay and duration. To retain the ability to control sampling delay/duration without setting tracing delay/duration or vice versa, `--sampling-wait`, `--sampling-duration`, `--trace-wait`, and `--trace-duration` options were added. `omnitrace-sample` also has new options for most of the new configuration options detailed below. ## New configuration options | Option | Description | | ------- | ----------- | | `OMNITRACE_DISABLE_CATEGORIES` | inverse behavior from `OMNITRACE_ENABLE_CATEGORIES` -- populates list of all available categories and then removes the specified ones. | | `OMNITRACE_TRACE_DELAY` | Single floating-point number specifying time to wait before starting data collection. Analagous to `OMNITRACE_SAMPLING_DELAY` and `OMNITRACE_PROCESS_SAMPLING_DELAY` | | `OMNITRACE_TRACE_DURATION` | Single floating-point number specifying data collection duration. Analagous to `OMNITRACE_SAMPLING_DURATION` and `OMNITRACE_PROCESS_SAMPLING_DURATION` | | `OMNITRACE_TRACE_PERIOD_CLOCK_ID` | Sets the default clock-type for tracing delay/duration. Always applied to above two options, can be overridden in below option. Accepts `CLOCK_REALTIME`, `CLOCK_MONOTONIC`, `CLOCK_PROCESS_CPUTIME_ID`, `CLOCK_MONOTONIC_RAW`, `CLOCK_REALTIME_COARSE`, `CLOCK_MONOTONIC_COARSE`, `CLOCK_BOOTTIME`. See `man 2 clock_gettime` for details on differences. | | `OMNITRACE_TRACE_PERIODS` | More powerful version for specifying delay + duration. Supports formats: `<DELAY>`, `<DELAY>:<DURATION>`, `<DELAY>:<DURATION>:<REPEAT>`, and `<DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>`. | ## Miscellaneous Changes - Expanded `critical_trace_categories_t` to include tracing data from MPI, pthread, HIP, HSA, RCCL, NUMA, and Python. - Added categories `thread_wall_time` and `thread_cpu_time` (derived from sampling) - Read DWARF info for breakpoints - Relocated some source code - Reason: necessary to make `libomnitrace` a bit more modular. Eventually, a large chunk will be separated into `libomnitrace-core`, `libomnitrace-binary`, etc. in order to facilitate re-usability - Relocated some functionality from `runtime.cpp` to `config.cpp` - Relocated code using rocm-smi library to query number of devices to `gpu.cpp` (where the code for using HIP to query number of devices is) - Relocated code for perfetto config and perfetto session out of tracing namespace to reside with other perfetto code - `OMNITRACE_COLORIZED_LOG` configuration option renamed to `OMNITRACE_MONOCHROME` - Backwards compatibility via a deprecated option was not retained here since the logic changed (i.e. true in former means false in latter) - Replaced `TIMEMORY_DEFAULT_OBJECT` macro with `OMNITRACE_DEFAULT_OBJECT` macro - Updated some code in roctracer to use `component::category_region` instead of explicitly using `tracing::` functions - Updated `backtrace_metrics` to better support controlling their presence in the traces/profiles via categories - Added support for `--print` in `validate-timemory-json.py` - Generic `OMNITRACE_ADD_VALIDATION_TEST` CMake function ## Git Log * OMNITRACE_DEFAULT_OBJECT - replace TIMEMORY_DEFAULT_OBJECT with TIMEMORY_DEFAULT_OBJECT * trace-time-window example + tests - adds cmake OMNITRACE_ADD_VALIDATION_TEST function for testing - validate-timemory-json.py now supports printing (-p) - update to OMNITRACE_STRIP_TARGET * Update timemory submodule - detailed backtrace print /proc/<PID>/maps - operation::push_node verbosity change - storage::insert_hierarchy use emplace + at instead of operator[] - concepts::is_type_listing - argparse updates for start/end group - argparse color fixes * perfetto updates - Remove OMNITRACE_CUSTOM_DATA_SOURCE CMake option - move tracing::get_perfetto_config and tracing::get_perfetto_session to perfetto.cpp * config and runtime updates - OMNITRACE_DISABLE_CATEGORIES option - get_enabled_categories() + get_disabled_categories() - config impl handles populating them - OMNITRACE_TRACE_DELAY option - OMNITRACE_TRACE_DURATION option - OMNITRACE_TRACE_PERIODS option - {get,set}_signal_handler - removes config.cpp link dependency for omnitrace_finalize - get_realtime_signal() + get_cputime_signal() + get_sampling_signals() - moved from runtime.cpp to config.cpp * utility::convert - helper function for converting string to a type * pthread_create_gotcha + thread_info updates - thread_index_data::as_string() - tweak printing info about new thread / exited thread * binary updates - get_binary_info has arg to disable dwarf parsing - binary_info contains vector of breakpoint addresses - binary_info:filename() function - binary::get_linked_path - binary::get_link_map has args for dlopen mode - symbol::read_dwarf -> symbol::read_dwarf_entries - symbol::read_dwarf_breakpoints * library updates + categories impl - implement config::set_signal_handler - categories.cpp for handling trace delays - implement trace delay/duration/periods * concepts + debug + defines - tuple_element in concepts - removed runtime header from debug header - OMNITRACE_DEFAULT_COPY_MOVE * gpu + rocm_smi - moved rsmi_num_monitor_devices call to gpu.cpp - gpu::rsmi_device_count() * roctracer updates - roctracer_bundle_t -> roctracer_hip_bundle_t - use category_region instead of explicit tracing push/pop calls * sampling + backtrace_metrics - rework backtrace_metrics to support categories * tracing updates - category stack counters (i.e. push vs. pop counter) for profiling and tracing - push_timemory and pop_timemory accept string_view instead of const char* - tweaked the pop_timemory hash search - {push,pop}_perfetto theoretically supports same invocations as for {push,pop}_perfetto_ts and {push,pop}_perfetto_track - mark_perfetto, mark_perfetto_ts, mark_perfetto_track * category_region update - expanded the critical trace categories - use category_push_disabled - use category_pop_disabled - use category_mark_disabled * constraint implementation - This provides generic functionality for constraining data collection within a windows of time. - E.g., delay, delay + duration, (delay + duration) * nrepeat * COLORIZED_LOG -> MONOCHROME * constraint + omnitrace-causal + omnitrace-sample updates - support for using different clock IDs for constraints - OMNITRACE_TRACE_PERIOD_CLOCK_ID option - tweak to trace-time-window example - tweak to trace-time-window tests * Fix formatting * Update time-window tests - Fix detection of validation support for perfetto - Using the --caller-include feature + runtime instrumentation on Ubuntu 18.04 and OpenSUSE 15.2 results in a segfault in the internals of Dyninst. - For now, mark that these tests will fail - Later, determine if updating Dyninst submodule fixes this problem * Fix OMNITRACE_OUTPUT_PATH for all tests - Provide absolute path instead of relative * Tweak lambda for checking whether HW counters are enabled - causing strange build errors on older GCC compilers * Update dyninst submodule - fix issues with using --caller-include for Ubuntu 18.04, OpenSUSE 15.x * cmake formatting * fix sampling compiler issue for GCC 8 * Tweak thread create message * Increase causal validation iterations |
||
|
|
15e6e6d979 |
Fix setup-env + hsa/rocm/ompt serialization + testing + misc (#156)
- Fix setup-env.sh - Closes #149 - omnitrace exe color - test-install.sh script - if config variable is updated in config or env, include in generated config - metadata for hsa, rocm, and ompt - Closes #153 - Closes #154 |
||
|
|
808ea7dfa7 |
Rework sampling and colorized logs (#140)
## Overview
This is a significant PR which has 3 very notable characteristics:
1. Omnitrace colorizes most of it's logging
2. Completely reworked the sampling
- Samples now record the current instruction pointers instead of strings
- This _dramatically_ decreases the overhead of taking a sample
- The collection of metrics during a sample are split out into another component, enabling that data collection to be disabled -- which decreases the sampling overhead even further
- When both `OMNITRACE_SAMPLING_CPUTIME` and `OMNITRACE_SAMPLING_REALTIME` are ON:
- `OMNITRACE_SAMPLING_CPUTIME_FREQ` and `OMNITRACE_SAMPLING_REALTIME_FREQ` can be used to individually control the sampling frequency
- `OMNITRACE_SAMPLING_CPUTIME_DELAY` and `OMNITRACE_SAMPLING_REALTIME_DELAY` can be used to individually control the delay time before starting
- Now, omnitrace does not start a real-time sampler on the main thread unless `OMNITRACE_SAMPLING_REALTIME` is ON
- In the future, an `OMNITRACE_SAMPLING_TIDS` (and real-time, cpu-time variants) configuration variable(s) will allow you to select which threads will be sampled
3. Files produced by `omnitrace` exe -- `available-instr.txt`, `instrumented-instr.txt`, etc. -- now no longer has `-instr` suffix and are placed in `instrumentation/` subfolder, i.e. `available-instr.txt` -> instrumentation/available.txt`
- This helped de-clutter the output folder
Most of the other edits were reorganization (e.g. internal namespace changes), cleanup, and splitting up functionality.
## Bug Fixes
There is a bug fix with respect to the HSA callbacks which disabled sampling on child threads when an HSA API call was made
## Details
- created thread_info struct for mapping different thread IDs
- reorganized file structure significantly
- added categories.hpp, concepts.hpp
- moved around name trait definitions
- moved all omnitrace components into `omnitrace::component` namespace
- there was a lot of inconsistency b/t using `tim::component` in some places and `omnitrace::component`
- added macros like OMNITRACE_DECLARE_COMPONENT in lieu of TIMEMORY_DECLARE_COMPONENT
- OMNITRACE_CRITICAL_TRACE_NUM_THREADS -> OMNITRACE_THREAD_POOL_SIZE
- roctracer and critical_trace use same thread pool
- critical_trace functions do not lock anymore bc of thread-local TaskGroup
- added `component::local_category_region` to support using `component::category_region` without explicitly passing in name
- removed `component::omnitrace` (unused)
- migrated KokkosP and OMPT to use `component::local_category_region`
- removed `component::user_region` as a result
- migrated omnitrace_{push,pop}_{trace,region}_hidden to use component::category_region
- removed `component::functors` as a result
- migrated some ppdefs
- `api::omnitrace` -> `project::omnitrace`
- `api::(...)` -> `category::(...)`
- improved recording the execution time of threads
- migrated this functionality out of pthread_create_gotcha and into thread_info
- moved mpi_gotcha, fork_gotcha, exit_gotcha, rcclp into omnitrace::component namespace
- split backtrace up into backtrace, backtrace_metrics, backtrace_timestamp components
- sampling.cpp handles setup and post-processing that was formerly in backtrace
- updated logging to use colors
- `OMNITRACE_COLORIZED_LOG` config variable
- updated docs on JSON output from timemory
- instrumentation info in instrumentation subfolder
- added testing for KokkosP entries
- added testing for ompt entries
- add_critical_trace function defined in critical_trace.hpp
- disable push_thread_state and pop_thread_state when thread state is Disabled or Completed
- add comp::page_rss to main bundle
- thread_data supports std::optional instead of std::unique_ptr
- thread_data supports tim::identity<T> to avoid unique_ptr or optional
- tracing::record_thread_start_time()
- tracing::push_timemory and tracing::pop_timemory are templated on CategoryT
- removed anonymous namespace from omnitrace::utility
- sampling backtrace stores instruction pointers instead of strings
- component::category_region updates
- handle disabled thread state
- handle finalized state
- fewer debug messages
- invoke thread_init()
- invoke thread_init_sampling()
- handle push/pop count based on category
- push/pop count only modified when used
- component::cpu_freq
- components/ensure_storage.hpp
- reworked the pthread_create replacement function
- updated parallel-overhead example to report # of times locked
- OMNITRACE_MAX_UNWIND_DEPTH build option
- update timemory submodule
|
||
|
|
4208b5654c |
GPU HW Counters via rocprofiler (#84)
* Initial support for GPU hardware counters
* Update find modules for roctracer and rocprofiler
- /opt/rocm/{rocprofiler,roctracer} path is deprecated so tweak search procedure
* Improve ConfigCPack for MPI
* Update rocprofiler
- rocm_metrics()
- minor cleanup
* Update rocm find modules
* declare rocm_metrics + call in omnitrace-avail
* relocate omnitrace-launch-compiler
* REALPATH and find_modules
* Examples cmake (may drop)
* omnitrace-avail
- hw_counter categories
- init rocm
* setenv updates for rocprofiler in library.cpp and dl.cpp
* get_rocm_events config
* gpu::hip_device_count()
* rocm_metrics returns hardware_counters::info
* - relocated library/components/roctracer_callbacks.* to library/roctracer.*
- relocated library/components/rocprofiler.* to library/rocprofiler.*
- cleaned up rocprofiler.hpp
- added perfetto output of rocprofiler
- added timemory output of rocprofiler
- renamed omni.roctracer thread to roctracer.hip
- added roctracer.hsa thread name
- updated timemory submodule to support std::variant
- updated timemory submodule to support = in config value
- updated timemory submodule to support standalone storage
- updated timemory submodule to support new hw counter apis
- updated timemory submodule to prevent label/description caching in data_tracker
* update omnitrace-avail info_type generation
* Update timemory submodule
* rocprofiler component
* cmake formatting
* omnitrace-avail handle no GPUs
- Add -c command-line option for --categories
- support verbosity
* hsa_rsrc_factory throws exceptions
- throw exceptions to avoid aborting on HSA_STATUS_ERROR_NOT_INITIALIZED when advantageous
- removed duplicate specialization of is_available for component::rocprofiler
* rocprofiler symbols for when disabled
* Fix warning in omnitrace-avail
- std::stringstream from initializer list would use explicit constructor
* Fix finalization after settings are deleted
* Reorganized rocprofiler source
* Updated formatting
* Miscellaneous tweaks
- added using statements from timemory
- tweaked the main and thread bundle names
- fixed timemory header includes
|
||
|
|
1f66e23fdd |
Reorganize source/lib/omnitrace (#51)
- Got rid of `source/lib/omnitrace/include` and `source/lib/omnitrace/src` and merged into `source/lib/omnitrace` - Updated perfetto submodule to v25.0 - Updated papi submodule |