## Motivation
- Structured logging with proper log levels (TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Better performance through compile-time formatting
- Consistent formatting using fmt library
- Runtime log level control via arguments and environment variables
- Easier maintenance and debugging capabilities
## Technical Details
- Added spdlog as a submodule and integrated it into CMake build system
- Created new `rocprofiler-systems-logger` library wrapping spdlog functionality
- Replaced custom logging macros (`ROCPROFSYS_VERBOSE`, `ROCPROFSYS_DEBUG`, `ROCPROFSYS_FATAL`, `ROCPROFSYS_REQUIRE`, `ROCPROFSYS_CI_THROW`, etc.) with spdlog equivalents (`LOG_DEBUG`, `LOG_WARNING`, `LOG_CRITICAL`, etc.)
- Implemented log level control through command-line arguments and environment variables
- Converted assertion macros to proper error handling with exceptions and std::abort()
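
A minimal sketch of how runtime level selection could work, using only the standard library. The helper names, the precedence (command-line value, then environment, then a default), and the variable name `EXAMPLE_LOG_LEVEL` are assumptions for illustration; they are not taken from the logger library added in this PR.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>

// Levels named in the description above, ordered from most to least verbose.
enum class log_level { trace, debug, info, warning, error, critical };

// Hypothetical helper: map a user-supplied string to a level, ignoring case.
inline std::optional<log_level> parse_log_level(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if(name == "trace") return log_level::trace;
    if(name == "debug") return log_level::debug;
    if(name == "info") return log_level::info;
    if(name == "warning") return log_level::warning;
    if(name == "error") return log_level::error;
    if(name == "critical") return log_level::critical;
    return std::nullopt;
}

// Assumed precedence: command-line value, then environment, then the fallback.
// "EXAMPLE_LOG_LEVEL" is a placeholder, not the variable used by the project.
inline log_level resolve_log_level(const std::optional<std::string>& cli_value,
                                   log_level fallback = log_level::info)
{
    if(cli_value)
    {
        if(auto lvl = parse_log_level(*cli_value)) return *lvl;
    }
    if(const char* env = std::getenv("EXAMPLE_LOG_LEVEL"))
    {
        if(auto lvl = parse_log_level(env)) return *lvl;
    }
    return fallback;
}
```

Unrecognized strings fall through to the next source rather than aborting, which is one possible policy among several.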
## Motivation
The `rocprof-sys-avail -H -c GPU` command returns blank output; it is expected to display a list of available GPU hardware counters.
The `rocprof-sys-sample` and `rocprof-sys-run` commands are missing the `--gpu-events` option for specifying GPU counter events during profiling.
## Technical Details
The initialize_event_info() function had a logic bug where it only called set_agents() if the agent_manager was empty, but the actual issue was that the gpu_agents and cpu_agents vectors were empty even when agents were discovered.
Fixed the conditional logic to properly call set_agents() when gpu_agents and cpu_agents are empty, regardless of the agent_manager state.
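
The difference between the two conditions can be sketched as follows. The types and the even/odd classification are hypothetical stand-ins; only the control flow mirrors the description above.

```cpp
#include <vector>

// Hypothetical stand-ins for the real types; only the control flow matters here.
struct agent
{
    int id;
};

struct event_info_state
{
    std::vector<agent> manager_agents;  // what the agent manager has discovered
    std::vector<agent> gpu_agents;      // per-type views consumed by counter queries
    std::vector<agent> cpu_agents;
    int                set_agents_calls = 0;

    void set_agents()
    {
        ++set_agents_calls;
        gpu_agents.clear();
        cpu_agents.clear();
        // Placeholder classification: even ids are GPUs, odd ids are CPUs.
        for(const auto& a : manager_agents)
            (a.id % 2 == 0 ? gpu_agents : cpu_agents).push_back(a);
    }

    // Behaviour described as buggy: keyed on the manager being empty.
    void initialize_old()
    {
        if(manager_agents.empty()) set_agents();
    }

    // Behaviour described as fixed: keyed on the derived vectors being empty.
    void initialize_new()
    {
        if(gpu_agents.empty() && cpu_agents.empty()) set_agents();
    }
};
```

With agents already discovered, `initialize_old()` leaves both vectors empty, while `initialize_new()` populates them once and is a no-op afterwards.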
Added the `--gpu-events (-G)` option which sets the `ROCPROFSYS_ROCM_EVENTS` environment variable to the specified values.
Fixes an issue where unsupported GPU/APU arch is being skipped gracefully - more details about this issue in the below comment.
* Put cached perfetto traces as default one
* Improve cached data and perfetto traces in order to be more aligned with E2E tests
* Addressing PR comments and findings
* Force early instrumentation bundle instantiation
* Sync up instrumented containers with thread growth data
* Revert ompvv number of host threads to default 8
* Fixed counter track namings for amd-smi
* AIPROFSYST-34 [rocprof-sys] Update documentation describing newly introduced changes to default tracing mechanism
* refactor: centralize update_env across binaries with unit test added for testing
* removed unused includes suggested by clangd and small cleanup
* use centralized update_env in argparse as well
* review comments incorporated
* move update_env tests closer to common library
* fix: missing common:: prefix in rocprof-sys-sample
* cmake formatting
## Motivation
The idea is to unify how and where we store our traces. The current implementation uses `trace_cache` for rocpd traces, but Perfetto handling is inlined inside each module. This change gives us a single point in the code where we collect data, process it, and store it in the desired format. This means that we can declutter the code further and have a single point of responsibility and a single point of failure.
## Technical Details
A new `processor` (perfetto_post_processing.cpp) is added to the `trace_cache`; its purpose is to use the cached data to populate Perfetto tracks. The cache manager owns the instance of this processor and is responsible for its lifetime.
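
A rough sketch of the ownership model described above, assuming a simple virtual interface. All type and member names here are hypothetical; the real `trace_cache` layout and the Perfetto processor are not shown in this description.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cached record.
struct cached_sample
{
    std::string   track;
    std::uint64_t timestamp_ns;
    double        value;
};

// Interface each output backend would implement.
struct processor
{
    virtual ~processor()                      = default;
    virtual void handle(const cached_sample&) = 0;
};

// Stand-in for a Perfetto-oriented processor: it only counts samples here.
struct counting_processor : processor
{
    std::size_t seen = 0;
    void        handle(const cached_sample&) override { ++seen; }
};

// The manager owns the processors, so their lifetime ends with the manager.
class cache_manager
{
public:
    processor& add(std::unique_ptr<processor> p)
    {
        m_processors.push_back(std::move(p));
        return *m_processors.back();
    }

    void store(cached_sample s) { m_cache.push_back(std::move(s)); }

    // Single place where cached data is replayed into every registered backend.
    void post_process() const
    {
        for(const auto& s : m_cache)
            for(const auto& p : m_processors)
                p->handle(s);
    }

private:
    std::vector<cached_sample>              m_cache;
    std::vector<std::unique_ptr<processor>> m_processors;
};
```

Adding another output format under this layout would mean registering one more processor rather than touching each module.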
* Enable HOST ompvv runtime-instrumentation ctests
* Fix rocprofiler-systems-avail-regex-negation test failure
* Exclude problematic function from instrumentation
* Make push pop skip an env option for ctests
* Remove SKIP_PUSH_POP_CHECK from argument parse
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* Refactor papi enumeration to fix a hang on Intel systems
- Add an exclude argument to available_events_info() to skip
perf_event_uncore, which caused a hang-like stall on Intel systems with a
large number of uncore events.
- Enumerate papi available events only when papi events are specified by
users inside early initialization logic
- Move papi available event query for ROCPROFSYS_SAMPLING_OVERFLOW_EVENT
config setting to the avail component, to move the heavy logic outside
initialization.
- Make category option for rocprof-sys-avail -H -c case insensitive
- Provide new option to query available overflow events that can be
specified for ROCPROFSYS_SAMPLING_OVERFLOW_EVENT using new command
option rocprof-sys-avail -H -c overflow
* Update projects/rocprofiler-systems/source/bin/rocprof-sys-avail/common.cpp
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
* Update timemory submodule pointer
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Fix errors on compile
* Change 1: Optimization for the category matching lambda
Optimization changes.
* Modify the rocprof-sys-avail -c option for overflow
Overflow should not be displayed as a device in rocprof-sys-avail -H -c CPU
Users can instead do regex on summary where overflow is appended in description
User can do rocprof-sys-avail -H -c CPU -d -r overflow
* Revert change to column width
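
The case-insensitive category matching mentioned above could look roughly like this. `category_matches` is a hypothetical name; the actual lambda in rocprof-sys-avail is not reproduced here.

```cpp
#include <algorithm>
#include <cctype>
#include <string_view>

// Hypothetical helper: compare two category names without regard to case.
inline bool category_matches(std::string_view lhs, std::string_view rhs)
{
    return lhs.size() == rhs.size() &&
           std::equal(lhs.begin(), lhs.end(), rhs.begin(),
                      [](unsigned char a, unsigned char b) {
                          return std::tolower(a) == std::tolower(b);
                      });
}
```

Comparing character by character avoids allocating lowercase copies of both strings for every candidate category.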
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* refactor: duplicated path helpers into common/path.hpp
* update rocprof-sys-instrument to use shared path utility
* Add path::realpath(std::string[, std::string*]) helper function in common/path.hpp for binaries
* common: centralize remove_env implementation in environment.hpp
* remove unused includes from rocprof-sys binaries and argparse
* changing set to unordered_set wherever sorting is not required and additional cleanup
* review comment incorporated
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* copilot review for remove_env incorporated
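
A possible shape for the single-argument form of the path helper, built on POSIX `realpath`. The namespace name and the fall-back-to-input behaviour are assumptions, and the optional `std::string*` parameter mentioned above is not modeled because its semantics are not described here.

```cpp
#include <stdlib.h>

#include <cstdlib>
#include <string>

namespace example_path
{
// Resolve symlinks and relative components.
// Assumption: return the input unchanged if resolution fails.
inline std::string realpath(const std::string& relpath)
{
    // Passing nullptr asks realpath to allocate the result buffer (POSIX.1-2008).
    char* resolved = ::realpath(relpath.c_str(), nullptr);
    if(resolved == nullptr) return relpath;
    std::string out{ resolved };
    std::free(resolved);
    return out;
}
}  // namespace example_path
```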
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Consolidate CTests to tests/ folder
* Remove comment
* Separate source code and test code for thread-limit into appropriate folders
* Remove sleeper.cpp and instead use linux sleep cmd
* Merge python-console tests into python-tests
This PR fixes a segmentation fault seen when running rocprof-sys-sample with multi-process OpenMP/HIP applications.
The crash was caused by missing libomptarget.so on the runtime loader path or incorrect LD_PRELOAD settings.
Fixes SWDEV-552804
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* Add ROCPROFSYS_ROOT to the env for sample
* Add env for causal
* Add env for instrument
* Check for null and address memory leak
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Adjusted the regex to filter out new "PAGE*" domains added by the
SDK. This was causing the passing regex to fail.
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: de6120daf9]
* Fix to find MPI symbols from undefined symbols
* Moved condition checks before
* Fixing format
---------
Co-authored-by: Anuj Shukla <anujshuk@amd.com>
[ROCm/rocprofiler-systems commit: 67ec52b523]
Update Dyninst submodule
Refactoring of build scripts to build TBB, Boost, ElfUtils, and LibIberty, since Dyninst build scripts no longer do.
Workflows are now building Dyninst and its dependencies.
---------
Co-authored-by: marantic-amd <marantic@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: 96df9b6d3e]
- Add support for RCCL API tracing through rocprofiler-sdk.
- Refactored the comm_data code to use the SDK RCCL_API callbacks.
- Add a runtime version check for SDK to gate callback enablement, rather than just the compile-time check.
- Fixed: SAMPLING_TIMEOUT was not being handled correctly in add_test.
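
To illustrate why a runtime check complements the compile-time one, here is a sketch under assumed names. The version struct, the integer encoding, and the `EXAMPLE_SDK_COMPILED_VERSION` macro are placeholders; the real SDK exposes its version through its own API.

```cpp
#include <cstdint>

// Hypothetical version triple reported by the loaded library at runtime.
struct sdk_version
{
    std::uint32_t ver_major;
    std::uint32_t ver_minor;
    std::uint32_t ver_patch;
};

// Pack a version into one comparable integer (assumed encoding).
constexpr std::uint32_t encode(sdk_version v)
{
    return v.ver_major * 10000 + v.ver_minor * 100 + v.ver_patch;
}

// Placeholder for the version the headers were built against.
#ifndef EXAMPLE_SDK_COMPILED_VERSION
#    define EXAMPLE_SDK_COMPILED_VERSION 10000
#endif

// Enable the callbacks only if both the headers and the loaded library are new enough.
constexpr bool rccl_callbacks_enabled(sdk_version runtime, std::uint32_t required)
{
    return EXAMPLE_SDK_COMPILED_VERSION >= required && encode(runtime) >= required;
}
```

A binary built against newer headers can still be run against an older installed library, which is the case the runtime half of the check guards against.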
[ROCm/rocprofiler-systems commit: af77d93f75]
- Move the MPI gotcha functionality from Timemory to the repo.
- Add the PMPI Fortran MPI functions to the existing mpi gotcha handle.
[ROCm/rocprofiler-systems commit: 4fcd8cc78d]
- Path to merge script not found unless user explicitly sources "share/rocprofiler-systems/setup-env.sh" to setup PATHs.
- Instead, let's derive the path when the application loads and use it when executing the helper script
- Rename script to rocprof-sys-merge-output.sh.
- Change install folder to <prefix>/libexec/rocprofiler-systems based on dev-ops feedback.
- Updated PATH variable in the modulefile and source script.
- For SWDEV-528101
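
One way to derive the script location from the path of a loaded binary or library, assuming the object lives one directory below the install prefix (for example `<prefix>/bin` or `<prefix>/lib`). The function name is hypothetical, and how the loaded object's path is obtained is not shown in this description.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Given the path of an object installed one level below <prefix>,
// build the path of the helper script under <prefix>/libexec/rocprofiler-systems.
inline fs::path merge_script_path(const fs::path& loaded_object)
{
    const fs::path prefix = loaded_object.parent_path().parent_path();
    return prefix / "libexec" / "rocprofiler-systems" / "rocprof-sys-merge-output.sh";
}
```

Because the prefix is computed from the running installation, the lookup does not depend on the user's PATH.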
[ROCm/rocprofiler-systems commit: adc66956b0]
- Addresses concern that device metric tracks are still shown in Perfetto trace file even when only -H is specified to rocprof-sys-sample (and vice versa).
- Update sampling call-stack docs.
[ROCm/rocprofiler-systems commit: 8ae6651357]
* Change terminal color back to normal after printing usage
* Format source
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: ee11f5b206]
- Updated Timemory module.
- Fixes a crash when running rocprof-sys-avail -G without explicitly providing -F <format>. The default value of "txt" was not being used.
- Define "choices" before "default" when defining the "--config-format" argument in the parser.
[ROCm/rocprofiler-systems commit: 3833c8d162]
- VA API tracing using Timemory gotcha wrappers.
- rocDecode API tracing integration using callback to ROCPROFILER_CALLBACK_TRACING_ROCDECODE_API
- Updated videodecode ctest to validate rocDecode APIs in perfetto trace.
[ROCm/rocprofiler-systems commit: 697d1ac02f]
* Integrating amd-smi into rocprofiler-systems due to rocm-smi deprecation.
* No functionality changes to users other than naming conventions.
* New tracks available in Perfetto: the GPU busy percentage metric now splits gfx busy into separate gfx, umc, and mm engine measurements.
---------
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: 0c32dfd6bc]
- Renames the CMake option "ROCPROFSYS_USE_HIP" to "ROCPROFSYS_USE_ROCM"
- Remove the "ROCPROFSYS_USE_ROCM_SMI" option. Controlled with the "ROCPROFSYS_USE_ROCM" option, instead.
- Runtime configuration can still toggle ROCPROFSYS_USE_ROCM_SMI to disable the sampling.
- Rename ROCPROFSYS_HIP_VERSION macro to ROCPROFSYS_ROCM_VERSION and remove blocks for `ROCPROFSYS_ROCM_VERSION < 60000`
- Remove ROCPROFSYS_USE_ROCTRACER and ROCPROFSYS_USE_ROCPROFILER
- Update test cases
- Update docker files and workflows to install cmake 3.21, which is required for the rocprofiler-sdk findPackage script.
- Removed rocm-6.2 from workflows due to a rocprofiler-sdk API change.
[ROCm/rocprofiler-systems commit: 88aa2d3cbe]
The Omnitrace program is being renamed.
Full name: "ROCm Systems Profiler"
Package name: "rocprofiler-systems"
Binary / Library names: "rocprof-sys-*"
---------
Co-authored-by: Xuan Chen <xuchen@amd.com>
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: d07bf508a9]
- omnitrace-run, omnitrace-sample, and omnitrace-causal now automatically append the LD_LIBRARY_PATH with the directory containing the omnitrace libraries
- this helps ensure that binary rewritten exes can resolve omnitrace-rt library location
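
A small sketch of appending a directory to a colon-separated search path such as LD_LIBRARY_PATH. The function name is hypothetical, and skipping a directory that is already listed is an assumption, not something stated above.

```cpp
#include <sstream>
#include <string>

// Append `dir` to a colon-separated search path unless it is already listed.
inline std::string append_search_path(const std::string& current, const std::string& dir)
{
    std::istringstream entries{ current };
    for(std::string entry; std::getline(entries, entry, ':');)
    {
        if(entry == dir) return current;
    }
    return current.empty() ? dir : current + ":" + dir;
}
```

The result would then be exported (for example with `setenv`) before launching the target executable.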
[ROCm/rocprofiler-systems commit: 18833a0a5e]
* Build omnitrace-rt library
- Explicitly build dyninstAPI_RT as omnitrace-rt so that the SONAME in the ELF is omnitrace-rt instead of dyninstAPI_RT
- Create symbolic link lib/omnitrace/libdyninstAPI_RT.so which points to lib/libomnitrace-rt.so
- Simplify build tree location of libomnitrace-rt.so since it is ../lib from the bin directory even in the build tree
- Update dyninst submodule with minor tweaks to dyninstAPI_RT/CMakeLists.txt
* Update source/lib/omnitrace-rt/cmake/platform.cmake
* Use ftpmirror.gnu.org instead of ftp.gnu.org
- in timemory and dyninst submodules
- minor .clang-tidy tweak
[ROCm/rocprofiler-systems commit: 0cf017251e]
* Fix omnitrace-avail component list
- remove omnitrace components from `omnitrace-avail -C` since these are no-ops in OMNITRACE_TIMEMORY_COMPONENTS
* Fix omnitrace-avail-filter-wall-clock-available test
[ROCm/rocprofiler-systems commit: 77d52814e9]
* Fix roctracer_flush_activity
- invoke roctracer_flush_activity() before disabling domains
* create comp::roctracer::flush()
- real issue was the global state when roctracer_flush_activity() was called
* formatting
* Update lib/omnitrace/library/components/roctracer.hpp
- provide definition of comp::roctracer::flush when OMNITRACE_USE_ROCTRACER is not defined
* omnitrace.cfg -> perfetto.cfg
- rename provided perfetto config file (omnitrace.cfg) to perfetto.cfg to avoid confusion
* Update lib/core
- gpu.hpp: defines for OMNITRACE_USE_{HIP,ROCTRACER,ROCPROFILER,ROCM_SMI}
- gpu.cpp
- include core/hip_runtime.hpp
- fix serialization of hipDeviceProp_t
- add hip_runtime.hpp
- ensure proper inclusion of hip_runtime.h
- add rccl.hpp
- ensure proper inclusion of rccl.h
* Update lib/omnitrace/library
- rcclp.cpp
- update includes for rccl
- roctracer.hpp
- update includes for hip_runtime
- components/comm_data.hpp
- update includes for rccl
- components/rcclp.hpp
- update includes for rccl
* Update bin/omnitrace-avail/avail.cpp
- update includes for hip_runtime
* Update examples/rccl/CMakeLists.txt
- fix find_package for rccl when CI enabled
* Update CMakeLists.txt
- set cmake policy CMP0135 to NEW for cmake >= 3.24
- Enable DOWNLOAD_EXTRACT_TIMESTAMP with ExternalProject_Add + URL download method
* Update timemory submodule
* Update pybind11 submodule
* Update lib/core/rccl.hpp
- include rccl.h only if OMNITRACE_USE_RCCL > 0
* Update lib/core/{gpu,hip_runtime}.hpp
* Update lib/core/gpu.cpp
- reintroduce some ppdefs
* Update lib/core/gpu.cpp
- fix ifdef on OMNITRACE_HIP_VERSION
* Update lib/core/gpu.cpp
- fix static assert for OMNITRACE_HIP_VERSION_MINOR when HIP version 4.x or older (unreliable minor versions)
* Update lib/core/gpu.cpp
- fix ifdef on OMNITRACE_HIP_VERSION
* Update lib/core/config.cpp
- disable OMNITRACE_PERFETTO_COMBINE_TRACES by default
* Update lib/core/perfetto.cpp
- if unable to open perfetto temp file, return the ReadTraceBlocking()
* Update lib/core/config.*
- flush tmpfile before closing
[ROCm/rocprofiler-systems commit: 7bc50f5a0a]
* Tests for exceeding OMNITRACE_MAX_THREADS
- tests which exceed the OMNITRACE_MAX_THREADS value for thread creation
* CMake Formatting.cmake update
- include source files in /tests/source directory
* Add unknown-hash= to OMNITRACE_ABORT_FAIL_REGEX
- fail if a timemory hash is not resolved to a name
* Tests for exceeding OMNITRACE_MAX_THREADS
- update
* omnitrace-sample update
- remove env disabling of critical-trace and process-sampling
* core library update
- make_unique in concepts.hpp
- add OMNITRACE_USE_ROCM_SMI to "process_sampling" category
- remove forced disabling of critical-trace in sampling mode
- parentheses for OMNITRACE_PREFER
- use tim::get_hash_id instead of tim::get_combined_hash_id
* core library update (containers)
- added aligned_static_vector.hpp
- similar to static_vector.hpp but attempts to align to cache line size
- alignment template parameter for stable_vector
- added missing aliases in static_vector
- consistent with aligned_static_vector aliases
* thread_info update
- track the peak number of threads created
- thread_info::get_peak_num_threads() returns the peak number of threads
* thread_data update
- generic thread_data inherits from base_thread_data
- thread_data reworked to support dynamic expansion
- base_thread_data updated to invoke private_instance() function
- thread_data<optional<T>> uses stable_vector aligned to cache line width
- thread_data<identity<T>> uses stable_vector aligned to cache line width
- thread_data for optional and identity provide private private_instance function + friend to base_thread_data
- component_bundle_cache<T> is now thread_data<component_bundle_cache_impl<T>>
* causal update
- thread_data<T>::instances -> thread_data<T>::instance(construct_on_thread{ ... })
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
- tim::get_combined_hash_id -> tim::get_hash_id
- update progress_bundle usage to new thread_data API
* backtrace/backtrace_metrics component update
- backtrace_metrics update
- update to new thread_data API
- add thread CPU time row in perfetto
- fix potential bug when rusage categories are disabled
- fix bug in operator-= not subtracting cpu time of rhs
- backtrace update
- skip all child call-stack below 'tim::openmp::' if sampling_keep_internal = false
* pthread_gotcha component update
- pthread_gotcha::shutdown() invokes pthread_create_gotcha::shutdown()
* pthread_create_gotcha component update
- minor tweak to {start,stop}_bundle functions: pass in thread id
- update to new thread_data API
- track native handles of internal threads
- implement system with pthread_kill to stop dangling bundles
* rocprofiler/roctracer component update
- update to new thread_data API
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
* critical trace (library) update
- update to new thread_data API
- tim::get_combined_hash_id -> tim::get_hash_id
* coverage update
- update to new thread_data API
* tasking update
- update to new thread_data API
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
* roctracer update
- update to new thread_data API
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
* rocm_smi update
- update to new thread_data API
* runtime.cpp update
- update to new thread_data API
* sampling.cpp update
- update to new thread_data API
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
* ompt.cpp update
- invoke pthread_gotcha::shutdown before invoking OMPT finalize function
- this prevents signals from being delivered to OpenMP threads
* tracing.hpp and tracing.cpp update
- replace get_timemory_hash_{ids,aliases} functions with copy_timemory_hash_ids function
- update to new thread_data API
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
- tim::get_combined_hash_id -> tim::get_hash_id
- improvements to + error checking in thread_init function
* library.cpp update
- move copying timemory hash id/aliases to tracing.cpp
- update to new thread_data API
- loop over max_supported_threads (constexpr) -> loop over thread_info::get_peak_num_threads()
* Update BuildSettings.cmake
- add -Wno-interference-size to suppress warning about use of std::hardware_destructive_interference_size
* Update fork example
- improve scheme for waiting on child processes via waitpid instead of wait
- support running main routine multiple times
- push/pop regions in child process
* Update lib/common/defines.h.in
- allow user to specify misc values via -D <name>=<value>
- OMNITRACE_CACHELINE_SIZE
- OMNITRACE_CACHELINE_SIZE_MIN
- OMNITRACE_ROCM_MAX_COUNTERS
- remove unused defines
- OMNITRACE_ROCM_LOOK_AHEAD
- OMNITRACE_MAX_ROCM_QUEUES
* Update rocprofiler.hpp
- OMNITRACE_MAX_ROCM_COUNTERS -> OMNITRACE_ROCM_MAX_COUNTERS
* Update aligned_static_vector
- set cacheline_align_v from max of OMNITRACE_CACHELINE_SIZE and OMNITRACE_CACHELINE_SIZE_MIN
* Update tracing.cpp
- acquire locks for updating main hash ids/aliases
- only propagate ids/aliases when finalizing
* Update pthread_create_gotcha.cpp
- make sure hash for "start_thread" exists on main thread
* Update causal end to end tests
- if OMNITRACE_BUILD_NUMBER is 1, set OMNITRACE_VERBOSE=0
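
The recurring change above (looping over the peak thread count instead of the static maximum, with per-thread storage aligned to a cache line) can be sketched as follows. The constants, `slot`, `peak_tracker`, and `sum_used` are illustrative placeholders and do not reproduce the `thread_info` or `thread_data` APIs.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Placeholder values; the project configures these through its own defines.
constexpr std::size_t max_supported_threads = 128;
constexpr std::size_t cacheline_size        = 64;

// One slot per thread, aligned to a cache line to avoid false sharing.
struct alignas(cacheline_size) slot
{
    long value = 0;
};

// Records the highest number of threads observed so far.
class peak_tracker
{
public:
    void record(std::size_t active_threads)
    {
        std::size_t seen = m_peak.load();
        // Retry until the stored peak is at least active_threads.
        while(seen < active_threads && !m_peak.compare_exchange_weak(seen, active_threads))
        {
        }
    }

    std::size_t peak() const { return m_peak.load(); }

private:
    std::atomic<std::size_t> m_peak{ 0 };
};

// Visit only the slots that were ever used instead of the full static capacity.
inline long sum_used(const std::array<slot, max_supported_threads>& data,
                     const peak_tracker&                            tracker)
{
    long total = 0;
    for(std::size_t i = 0; i < tracker.peak() && i < data.size(); ++i)
        total += data[i].value;
    return total;
}
```

When only a handful of threads are created, the loop touches a few cache lines rather than the whole array.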
[ROCm/rocprofiler-systems commit: 518c83e0f9]