## Motivation
With the introduction of the new logging system based on the `spdlog` library, there is an opportunity to replace the `timemory`-dependent JOIN implementation with the `format` and `join` APIs of the `fmt` library, which ships as part of `spdlog`.
## Technical Details
Use `fmt` provided APIs to properly format and package strings.
## Motivation
Enable UCX communication tracing and communication metadata
## Technical Details
Implement UCX API wrappers to trace transport-layer communication. This adds communication data tracking and exposes “UCX Comm Send/Recv” timelines, enabling detailed analysis of MPI, OpenSHMEM, and other UCX-based runtime communication patterns.
- Implemented function interception for UCX functions across multiple categories using the gotcha component.
- Extended the comm_data component to track UCX send/recv operations.
- Added ucx_send and ucx_recv labels for Perfetto counter tracks.
- Integrated UCX data tracking with the existing MPI/RCCL tracking infrastructure.
- Added the ROCPROFSYS_USE_UCX configuration option (enabled by default).
- Created a FindUCX.cmake module for UCX header detection. It falls back to the internal UCX headers if system headers are not found.
- Updated all Dockerfiles to include UCX dependencies.
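The wrapper idea described above can be illustrated with a small sketch. This is not the gotcha-based implementation: `comm_counters`, `make_traced`, and the `(buffer, length)` call shape are hypothetical stand-ins chosen for illustration, and only the `ucx_send`/`ucx_recv` labels come from the description. It shows one way a wrapper might record the byte count of each call before forwarding to the original function.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical accumulator standing in for the comm_data component.
struct comm_counters
{
    std::map<std::string, std::uint64_t> bytes;

    void add(const std::string& label, std::size_t count) { bytes[label] += count; }
};

// Wrap any callable of the (assumed) form result(const void*, std::size_t) so
// that each call is recorded under `label` and then forwarded unchanged.
template <typename Fn>
auto make_traced(Fn fn, comm_counters& counters, std::string label)
{
    return [fn = std::move(fn), &counters, label = std::move(label)](
               const void* buffer, std::size_t length) {
        counters.add(label, length);  // record first, then call the original
        return fn(buffer, length);
    };
}
```

A real interception layer would also have to decide whether to count bytes only for calls that complete successfully; the sketch counts every call.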
This reverts commit 7b00d3a89b.
The workaround is no longer needed - root cause fixed in:
- rocm-smi-lib (PR #2531): Made devInfoTypesStrings file-local static
- amdsmi (PR #2575): Added visibility("hidden") attribute
## Motivation
- Structured logging with proper log levels (TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Better performance through compile-time formatting
- Consistent formatting using fmt library
- Runtime log level control via arguments and environment variables
- Easier maintenance and debugging capabilities
## Technical Details
- Added spdlog as a submodule and integrated it into CMake build system
- Created new `rocprofiler-systems-logger` library wrapping spdlog functionality
- Replaced custom logging macros (`ROCPROFSYS_VERBOSE`, `ROCPROFSYS_DEBUG`, `ROCPROFSYS_FATAL`, `ROCPROFSYS_REQUIRE`, `ROCPROFSYS_CI_THROW`, etc.) with spdlog equivalents (`LOG_DEBUG`, `LOG_WARNING`, `LOG_CRITICAL`, etc.)
- Implemented log level control through command-line arguments and environment variables
- Converted assertion macros to proper error handling with exceptions and std::abort()
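A minimal sketch of how a runtime log level could be resolved from an environment variable, assuming the six levels listed above. The `log_level` enum, `parse_log_level`, `resolve_log_level`, and any environment variable name passed in are hypothetical; the sketch uses only the standard library and does not reproduce spdlog's API or the actual `rocprofiler-systems-logger` interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical enum mirroring the levels named in the motivation.
enum class log_level { trace, debug, info, warning, error, critical };

// Parse a level name case-insensitively; unknown names yield std::nullopt.
inline std::optional<log_level> parse_log_level(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if(name == "trace") return log_level::trace;
    if(name == "debug") return log_level::debug;
    if(name == "info") return log_level::info;
    if(name == "warning") return log_level::warning;
    if(name == "error") return log_level::error;
    if(name == "critical") return log_level::critical;
    return std::nullopt;
}

// Use the environment value when it is set and valid, otherwise the fallback.
inline log_level resolve_log_level(const char* env_name, log_level fallback)
{
    assert(env_name != nullptr);
    const char* value = std::getenv(env_name);
    if(value == nullptr) return fallback;
    return parse_log_level(value).value_or(fallback);
}
```

A command-line argument could be layered on top by passing its parsed value as the `fallback`, or by checking it first, depending on which source should take precedence.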
## Motivation
The `rocprof-sys-avail -H -c GPU` command returns blank output when it is expected to display the list of available GPU hardware counters.
The `rocprof-sys-sample` and `rocprof-sys-run` commands are missing the `--gpu-events` option for specifying GPU counter events during profiling.
## Technical Details
The initialize_event_info() function had a logic bug: it only called set_agents() if the agent_manager was empty, whereas the actual problem was that the gpu_agents and cpu_agents vectors could be empty even when agents had been discovered.
Fixed the conditional logic to properly call set_agents() when gpu_agents and cpu_agents are empty, regardless of the agent_manager state.
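The corrected condition can be sketched as follows. The types here (`agent`, `agent_state`) are simplified, hypothetical stand-ins for the real agent manager, and the function bodies are illustrative only; the point is that the check looks at the `gpu_agents` and `cpu_agents` vectors themselves rather than at whether agents were discovered.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified agent record.
struct agent
{
    int  id;
    bool is_gpu;
};

// Hypothetical state: `discovered` stands in for the agent_manager contents.
struct agent_state
{
    std::vector<agent> discovered;
    std::vector<agent> gpu_agents;
    std::vector<agent> cpu_agents;
};

// Split the discovered agents into the GPU and CPU vectors.
inline void set_agents(agent_state& state)
{
    state.gpu_agents.clear();
    state.cpu_agents.clear();
    for(const auto& a : state.discovered)
        (a.is_gpu ? state.gpu_agents : state.cpu_agents).push_back(a);
    assert(state.gpu_agents.size() + state.cpu_agents.size() == state.discovered.size());
}

// Returns true when set_agents() had to be called.
inline bool initialize_event_info(agent_state& state)
{
    // The buggy variant effectively tested `state.discovered.empty()` here.
    if(state.gpu_agents.empty() && state.cpu_agents.empty())
    {
        set_agents(state);
        return true;
    }
    return false;
}
```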
Added the `--gpu-events (-G)` option which sets the `ROCPROFSYS_ROCM_EVENTS` environment variable to the specified values.
Fixes an issue where unsupported GPU/APU arch is being skipped gracefully - more details about this issue in the below comment.
## Motivation
When profiling multi-process applications where a parent process sends SIGKILL to child processes, the termination can occur before the profiler has a chance to flush collected data. This PR introduces a configurable delay before SIGKILL signals are forwarded, allowing profiling data to be captured before process termination. This is a workaround.
## Technical Details
- Added new configuration setting `ROCPROFSYS_KILL_DELAY` (default: 0 seconds) to specify a delay before SIGKILL signals are forwarded to other processes
- Implemented `kill_gotcha` component that intercepts the `kill()` system call
- The gotcha only delays SIGKILL signals sent to external processes (pid > 0 and not self)
- Integrated `kill_gotcha_t` into the `preinit_bundle_t` for early initialization
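The filtering rule above (delay only SIGKILL, and only when the target is an external process) can be sketched as a small pure function. The name `kill_delay_for` and its signature are hypothetical; the actual `kill_gotcha` component presumably sleeps for the returned duration and then forwards the call to the real `kill()`.

```cpp
#include <cassert>
#include <chrono>
#include <csignal>
#include <sys/types.h>

// Return how long to wait before forwarding a kill(target, sig) call.
// `configured` plays the role of the ROCPROFSYS_KILL_DELAY setting.
inline std::chrono::seconds
kill_delay_for(pid_t target, int sig, pid_t self, std::chrono::seconds configured)
{
    assert(configured.count() >= 0);
    // pid <= 0 addresses process groups or "all processes", not one external process.
    const bool external = target > 0 && target != self;
    return (sig == SIGKILL && external) ? configured : std::chrono::seconds{ 0 };
}
```

With the default delay of 0 seconds the function always returns zero, so the interception has no effect unless the setting is changed.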
* Put cached perfetto traces as default one
* Improve cached data and perfetto traces in order to be more aligned with E2E tests
* Addressing PR comments and findings
* Force early instrumentation bundle instantiation
* Sync-up instrumented containers with thread growth data
* Revert ompvv number of host threads to default 8
* Fixed counter track namings for amd-smi
* AIPROFSYST-34 [rocprof-sys] Update documentation describing newly introduced changes to default tracing mechanism
* refactor: centralize update_env across binaries with unit test added for testing
* removed unused includes suggested by clangd and small cleanup
* use centralized update_env in argparse as well
* review comments incorporated
* move update_env tests closer to common library
* fix: missing common:: prefix in rocprof-sys-sample
* cmake formatting
## Motivation
The idea is to unify how and where we store our traces. The current implementation uses `trace_cache` for rocpd traces, but perfetto is inlined inside each module. This change gives us a single point in the code where we collect data, process it, and store it in the desired format. This means that we can declutter the code further and have a single point of responsibility and a single point of failure.
## Technical Details
A new `processor` (perfetto_post_processing.cpp) is added to the `trace_cache`; its purpose is to use the cached data to populate perfetto tracks. The cache manager owns the instance of this processor and is responsible for its lifetime.
While working on this ticket, I also noticed that the program would SEGFAULT when ROCPROFSYS_ROCM_DOMAINS=roctx is set, even though the docs say this is supported. Went ahead and fixed that.
Also noticed that timemory push/pop in rocprofiler-sdk.cpp was always using category::rocm_marker_api instead of CategoryT. Fixed that as well.
* Enable HOST ompvv runtime-instrumentation ctests
* Fix rocprofiler-systems-avail-regex-negation test failure
* Exclude problematic function from instrumentation
* Make push pop skip an env option for ctests
* Remove SKIP_PUSH_POP_CHECK from argument parse
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
## Motivation
Resolved: SWDEV-566226
The current implementation of agents in rocprof-systems keeps only the minimal set of information required to populate the `info_agent` table in the rocpd database. A significant amount of data is left out of the database, so this change stores the additional agent information as an `extdata` row in the `info_agent` table.
## Technical Details
This PR introduces an additional field in the `agent` structure that holds a JSON-formatted string of all the additional information we can acquire about a particular agent. This data is processed and added during the initial fetching of agents, and afterwards pushed into the database.
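As an illustration of the kind of string such a field might hold, the sketch below serializes a flat set of key/value pairs into a JSON object. The helper names `escape_json` and `to_extdata_json` and any field names passed in are hypothetical; the real code may well build the string with a JSON library instead. The escaping here covers only quotes and backslashes, which is enough for a sketch but not for arbitrary input.

```cpp
#include <map>
#include <string>

// Escape the two characters that would break a JSON string literal.
// Control characters are deliberately not handled in this sketch.
inline std::string escape_json(const std::string& text)
{
    std::string out;
    for(char c : text)
    {
        if(c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Serialize key/value pairs as a compact JSON object with string values.
// std::map keeps the keys sorted, so the output is deterministic.
inline std::string to_extdata_json(const std::map<std::string, std::string>& fields)
{
    std::string json  = "{";
    bool        first = true;
    for(const auto& [key, value] : fields)
    {
        if(!first) json += ',';
        first = false;
        json += '"' + escape_json(key) + "\":\"" + escape_json(value) + '"';
    }
    return json + "}";
}
```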
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* Refactor papi enumeration to fix a hang on Intel systems
- Add an exclude argument to available_events_info() for
perf_event_uncore, which causes a hang-like condition on Intel systems
with a large number of uncore events.
- Enumerate papi available events only when papi events are specified by
users inside early initialization logic
- Move papi available event query for ROCPROFSYS_SAMPLING_OVERFLOW_EVENT
config setting to the avail component, to move the heavy logic outside
initialization.
- Make category option for rocprof-sys-avail -H -c case insensitive
- Provide new option to query available overflow events that can be
specified for ROCPROFSYS_SAMPLING_OVERFLOW_EVENT using new command
option rocprof-sys-avail -H -c overflow
* Update projects/rocprofiler-systems/source/bin/rocprof-sys-avail/common.cpp
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
* Update timemory submodule pointer
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Fix errors on compile
* Change 1: Optimization for the category matching lambda
Optimization changes.
* Modify the rocprof-sys-avail -c option for overflow
Overflow should not be displayed as a device in rocprof-sys-avail -H -c CPU
Users can instead do regex on summary where overflow is appended in description
User can do rocprof-sys-avail -H -c CPU -d -r overflow
* Revert change to column width
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
- Redesigned buffer_storage with a flush_worker pattern for better thread management and resource cleanup
- Introduced type-safe abstractions through new components: cacheable.hpp, cache_type_traits.hpp, sample_processor.hpp, and type_registry.hpp
- Optimized type erasure implementation in sample processor to reduce runtime overhead
- Renamed rocpd_post_processing to rocpd_processor and restructured the processing pipeline
- Removed storage_parser.cpp and integrated functionality into header-based template implementation
- Enhanced cache_manager with improved processing workflow and better separation of concerns
* refactor: duplicated path helpers into common/path.hpp
* update rocprof-sys-instrument to use shared path utility
* Add path::realpath(std::string[, std::string*]) helper function in common/path.hpp for binaries
* common: centralize remove_env implementation in environment.hpp
* remove unused includes from rocprof-sys binaries and argparse
* changing set to unordered_set wherever sorting is not required and additional cleanup
* review comment incorporated
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* copilot review for remove_env incorporated
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Add XGMI and PCIe metrics to the profiling data
Add support for AMD XGMI (GPU-to-GPU interconnect) and PCIe
metrics:
* XGMI link width in bits
* XGMI link speed in GT/s
* Per-link read bandwidth (KB)
* Per-link write bandwidth (KB)
- Add new categories for PCIe metrics:
* PCIe link width
* PCIe link speed in GT/s
* Accumulated bandwidth (MB)
* Instantaneous bandwidth (MB/s)
* Fix VCN/JPEG insert logic
* Modify the gpu_metrics struct to accommodate XCP structure
* Add ctest automation for gpu interconnect metrics
* Refactor to move gpu_metrics struct and serialization to another file
* Possible fix for timeout in CI
Fix redundant skip check in ctest
Add xgmi and pcie option in rocprof-sys-avail.
* Change2: Address review comments
Change ctest sampling to avoid timeout
Change variable name and code structuring
* Add option in ctest to run rocprof-sys-run without rewrite
Run transferbench with rocprof-sys-run without sampling
* Change3: Fix sample insert bug and address review comments
xgmi and pci support check
renaming variables
additional hip_api validation in rocpd
* Reduce the load from the TransferBench sample
The CI builds were timing out when flushing a big temporary file to the
DB: (2720824.23 KB / 2720.82 MB / 2.72 GB)...
* Add clean up of buffered_storage files
* Add step to workflows to test for remaining temp files after tests
* Applied suggestions from code review
* add deletion of all cache files
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
- Integrate rocprofiler-systems with rocprofiler-sdk-rocpd to fetch schema
- If rocprofiler-sdk-rocpd is not available, use embedded schema files. With this we provide rocpd format support even if ROCm is not available
- Include detection in CMake if rocprofiler-sdk-rocpd package is available (and valid), and build database class upon that
- Update embedded schema that is used as a fallback.
- Update some validation tests to account for schema changes.
* Change how cache manager handles child process trace cache
* Sampling and backtrace metrics to cache
* Apply cmake formatting
* Fix parsing of metadata json
* Code clean up
* Fix build nlohmann json from source
* Fix storage parsed finished callback
* Revert sampling for child process
* Change cache file name generating
* Fix thread start stop
* Fix process start end timestamp
* Applied suggestions from code review
* Try with late start of flushing task thread
* Change dockerfiles for ci
* Revert changes on github workflows
* Remove json_fwd.hpp include
* fix dump
* Build nlohmann/json by default
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Update location of build artifacts for nlohmann/json
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Revert use_output_suffix
* Remove unused logs
* Fix cache store inside counter due to structure change
* Remove decode tests from debian ci
* Fix issue where all databases have the same UUID (#1499)
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Removing the cpack and install steps to save space
* Revert "Remove decode tests from debian ci"
This reverts commit ddabf6dd142dcf438e6b8997b8abe86f2c868468.
* Revert "Removing the cpack and install steps to save space"
This reverts commit 973da3a1ba99d99d529af5269d30e177092f9bfa.
* Add prepare-runner job as dependency to clean up the space
* Fix formatting
* Free up even more space
* Remove verbose for workflows
* remove hw_counters from ext_data
* move space clean up inside container
* try to remove external folder to free up space
* Check space
* Refactor Cleanup to its own step
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Djordjevic <aleksandar.djordjevic@amd.com>
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Add OMPT to ROCpd
* Use correct category
* Added wrapper functions for future control
* Formatting
* Fix naming
* Comment change
* Remove ompt_get_cb_args
* Switched to using region_sample for OMPT
* Remove relic function
* Remove get_use_rocpd that was used in this pr (one still remains)
* Rename ompt_get_args_string and reuse in tool_tracing_callback_stop
* Make lock init and destroy cb instant
* [Prototype] ROCPD Name fix
* [Prototype] ROCPD Name fix P1
* [Prototype] ROCPD Name fix P2
* ROCPD Name fix
* Var name changes
* Rewrite cb overwrite to single function
* [Important] Use parallel_data as key for parallel callback map
* Fix workflow failure
* Make cpp USE_ROCM consistent with hpp and use default constructor if USE_ROCM = 0
* Add missing ROCPROFILER_VERSION check
* Improve readability
* Make ompt storage maps thread local
* Part 1: Variable name fix, memory cleanup, and fixed asserts
* Part 2: Add comments
* Part 3: Add CI_THROW
* Part 4: Formatting
* Part 5: Move #include to cpp
This PR fixes a segmentation fault seen when running rocprof-sys-sample with multi-process OpenMP/HIP applications.
The crash was caused by missing libomptarget.so on the runtime loader path or incorrect LD_PRELOAD settings.
Fixes SWDEV-552804
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* Add ROCPROFSYS_ROOT to the env for sample
* Add env for causal
* Add env for instrument
* Check for null and address memory leak
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Move amd-smi to use caching mechanism
* Add VCN and JPEG activity to rocpd
* Switch cpu_freq to use caching mechanism
* Different approach with xcp activity & applied suggestions from code review
* Applied suggestions from code review
* Fix shadowing
* Applied suggestions from code review
* Rocpd part 2, caching
* Fix shadowed variables
* backward compatibility
* Fixed designated initializers
* Fix timemory include
* Remove benchmark & Fix build issues for rhel
* Add missing bracket
* Fix shadowing and pedantic
* Fix pedantic pt2
* Fix duplicated SDK calls
* Add decay in get_size_impl
* Rename sample cache to trace cache
* Add cache storage supported types
* Resolving track naming in sampling module
* fix sampling of flushing thread
* fix sampling of flushing thread 2
* throw exception upon store while buffer storage is not running
* Prevent fork crashing
* Fix rebase issue
* Applied suggestions from code review
* Change flushing thread to use PTL
* Fix agent creation order
* Fix stream id ci throw
* Remove force setup of rocprofiler-sdk
* Code cleanup
* Change initialization for agent
* Add missing namespace
* Fix the mismatch within the tool_agent->device_id
* Switch from using handle to use agent type index
* Fix pmc info comparator in metadata registry
---------
Co-authored-by: Aleksandar <aleksandar.djordjevic@amd.com>
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
Co-authored-by: Marjan Antic <marantic@amd.com>
- Correlate memory_copy and kernel_dispatch events with their HIP stream_id and add stream_id as an annotation in Perfetto.
- By default, group memory_copy and kernel_dispatch events in Perfetto output by their stream_id.
- Add option, with the configuration setting ROCPROFSYS_ROCM_GROUP_BY_QUEUE, to group by HSA queue instead.
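The grouping choice above can be sketched as a function that picks the track an event lands on. The `gpu_event` struct, the `track_name` helper, and the "Stream N" / "Queue N" naming are hypothetical; only the default (group by stream) and the existence of a group-by-queue switch come from the description.

```cpp
#include <cstdint>
#include <string>

// Hypothetical event record carrying both identifiers.
struct gpu_event
{
    std::uint64_t stream_id;  // HIP stream the operation was issued on
    std::uint64_t queue_id;   // HSA queue that executed it
};

// `group_by_queue` plays the role of ROCPROFSYS_ROCM_GROUP_BY_QUEUE;
// when false (the default), events are grouped by HIP stream.
inline std::string track_name(const gpu_event& event, bool group_by_queue)
{
    return group_by_queue ? "Queue " + std::to_string(event.queue_id)
                          : "Stream " + std::to_string(event.stream_id);
}
```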
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: 4b4a846b58]
* Conditionally include backtraces in ROCPROFSYS_THROW based on verbosity
Modify ROCPROFSYS_THROW to only include backtraces when:
debug mode is enabled, OR
verbose level is >= 2, OR
running in CI environment
* Fix formatting errors
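The three conditions listed above combine with a logical OR. A minimal sketch, assuming a hypothetical `throw_context` struct that carries the relevant settings (the real macro reads them from the tool's configuration):

```cpp
#include <string>

// Hypothetical snapshot of the settings that the macro consults.
struct throw_context
{
    bool debug   = false;  // debug mode enabled
    int  verbose = 0;      // verbosity level
    bool ci      = false;  // running in a CI environment
};

// Backtraces are included if any one of the three conditions holds.
constexpr bool include_backtrace(const throw_context& ctx)
{
    return ctx.debug || ctx.verbose >= 2 || ctx.ci;
}

// Build the exception text, appending the backtrace only when requested.
inline std::string make_throw_message(const std::string& what,
                                      const std::string& backtrace,
                                      const throw_context& ctx)
{
    return include_backtrace(ctx) ? what + "\n" + backtrace : what;
}
```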
[ROCm/rocprofiler-systems commit: b0ff07b4fe]
With AMD-SMI in ROCm 7.0, vcn_activity and jpeg_activity are not reported when the XCP (partition) stats, vcn_busy and jpeg_busy, are available. This causes the activity tracking to fail. The fix is to read the busy values when activity values are not supported.
For issue: SWDEV-536439
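The fallback described above can be sketched as follows. The `engine_usage` struct and `select_vcn_values` are hypothetical and do not mirror the AMD SMI data structures; `std::optional` is used here to model "not supported". The same pattern would apply to the JPEG values.

```cpp
#include <optional>
#include <vector>

// Hypothetical view of one sample: either field may be unsupported.
struct engine_usage
{
    std::optional<std::vector<unsigned>> vcn_activity;  // device-level values
    std::optional<std::vector<unsigned>> vcn_busy;      // XCP (partition) values
};

// Prefer the activity values; fall back to the busy values when the
// activity values are not reported. Returns an empty vector if neither is.
inline std::vector<unsigned> select_vcn_values(const engine_usage& usage)
{
    if(usage.vcn_activity) return *usage.vcn_activity;
    if(usage.vcn_busy) return *usage.vcn_busy;
    return {};
}
```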
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
[ROCm/rocprofiler-systems commit: e3741f678b]