* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements
Buffer Pool Design
------------------
Replace the fixed array-based double buffer with a dynamic pool design to
fix race conditions that caused "internal correlation id was retired
prematurely" errors.
The original design had a race where flush callbacks could be delivered
out-of-order: when buffer 0 fills and begins flushing, writes go to
buffer 1. If buffer 1 fills before buffer 0's flush completes, the
buffer index wraps back to 0 (which may still be flushing). Independent
flush tasks submitted to the thread pool can complete out of order.
The new pool design:
- Uses a std::deque of buffer instances that grows as needed
- Allocates buffers from the pool when the current buffer needs to flush
- Serializes flushes with a mutex to ensure FIFO callback ordering
- Returns buffers to the pool after flush completion
- Eliminates the race between buffer selection and write operations
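The pool idea listed above can be sketched roughly as follows. This is a minimal illustration, not the rocprofiler-sdk implementation: `pooled_buffer`, `record_t`, and `callback_t` are hypothetical names, and the ticket counter used to enforce FIFO delivery is an assumption added here on top of the mutex described in the commit.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the rocprofiler-sdk buffer API.
using record_t   = int;
using callback_t = std::function<void(const std::vector<record_t>&)>;

class pooled_buffer
{
public:
    pooled_buffer(std::size_t capacity, callback_t cb)
    : m_capacity{capacity}
    , m_callback{std::move(cb)}
    {
        assert(m_capacity > 0);
    }

    // Append a record. When the active buffer fills, swap in a pool buffer
    // and flush the full one; the ticket records the order of the swaps.
    void emplace(record_t rec)
    {
        std::vector<record_t> full{};
        std::size_t           ticket = 0;
        {
            std::lock_guard<std::mutex> lk{m_write_mutex};
            m_active.push_back(rec);
            if(m_active.size() < m_capacity) return;
            full   = std::exchange(m_active, acquire());
            ticket = m_next_ticket++;
        }
        flush(std::move(full), ticket);
    }

private:
    // Reuse an idle buffer if one exists; otherwise the pool grows by one.
    std::vector<record_t> acquire()
    {
        std::lock_guard<std::mutex> lk{m_pool_mutex};
        if(m_pool.empty()) return {};
        auto buf = std::move(m_pool.front());
        m_pool.pop_front();
        return buf;
    }

    // Deliver callbacks strictly in ticket (FIFO) order, then recycle.
    void flush(std::vector<record_t> full, std::size_t ticket)
    {
        {
            std::unique_lock<std::mutex> lk{m_flush_mutex};
            m_flush_cv.wait(lk, [&] { return m_serving == ticket; });
            m_callback(full);
            ++m_serving;
        }
        m_flush_cv.notify_all();
        full.clear();
        std::lock_guard<std::mutex> lk{m_pool_mutex};
        m_pool.push_back(std::move(full));
    }

    std::size_t                       m_capacity;
    callback_t                        m_callback;
    std::mutex                        m_write_mutex, m_pool_mutex, m_flush_mutex;
    std::condition_variable           m_flush_cv;
    std::vector<record_t>             m_active;
    std::deque<std::vector<record_t>> m_pool;
    std::size_t                       m_next_ticket = 0;
    std::size_t                       m_serving     = 0;
};
```

In this sketch a callback that calls `emplace()` on the same buffer could block on the flush mutex it already holds, so callbacks are assumed not to re-enter.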
New Unit Tests
--------------
- buffer_correlation_ordering.cpp: Tests that API records are always
delivered before their corresponding retirement records
- buffer_ordering_stress.cpp: Stress tests buffer flush ordering under
high contention with multiple threads rapidly filling buffers
HSA Tool Hooks
--------------
Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that
waits for pending flush tasks before tool finalization, preventing
"retired prematurely" errors during HSA shutdown.
Sanitizer Improvements
----------------------
- LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder
- LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup)
- TSAN: Added suppression for false positive on C++11 thread-safe static
initialization in create_write_functor
- ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto
- Disabled attachment tests for sanitizers due to library preloading issues
Other Fixes
-----------
- Thread-trace agent test: Use heap-allocated callback state
- Correlation ID: Refactored reference counting and finalization ordering
* [rocprofiler-sdk] Revert buffer pool design changes
Revert buffer.cpp and buffer.hpp to the original double-buffer
design from develop branch. The pool-based redesign introduced
concerns about:
- Signal safety (mutex vs atomic_flag)
- API changes (flush() return type)
- Complexity of the new design
This revert removes:
- Dynamic buffer pool with std::deque
- std::mutex/condition_variable synchronization
- buffer_correlation_ordering.cpp test
- buffer_ordering_stress.cpp test
The underlying buffer flush ordering issue will need to be
addressed with a different approach that preserves the original
API and synchronization characteristics.
* [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization
- Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks
- Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning
- Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp)
- Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior:
- hsa/queue.cpp (lines 105, 210)
- hsa/async_copy.cpp (line 344)
- hsa/hsa_barrier.cpp (line 43)
- buffer.cpp (lines 107, 138, 185)
This ensures no correlation IDs are created once finalization starts (fini_status != 0),
preventing races between finalization and ongoing tracing operations.
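A minimal sketch of the gating described in this commit, under stated assumptions: everything in the `sketch` namespace is a hypothetical stand-in rather than the SDK's actual types, and the idea that the status can hold a non-zero value that is not positive is an assumption used only to show why `!= 0` and `> 0` can behave differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

namespace sketch
{
// 0 means "running"; any non-zero value means finalization has started.
std::atomic<int> fini_status{0};

int get_fini_status() { return fini_status.load(); }

struct correlation_id
{
    std::uint64_t internal = 0;
};

// Refuse to create correlation IDs once finalization has started.
std::unique_ptr<correlation_id> construct()
{
    if(get_fini_status() != 0) return nullptr;
    static std::atomic<std::uint64_t> next_id{1};
    auto corr_id      = std::make_unique<correlation_id>();
    corr_id->internal = next_id.fetch_add(1);
    return corr_id;
}

// A call site: bail out instead of dereferencing a null correlation ID.
bool trace_operation()
{
    auto corr_id = construct();
    if(corr_id == nullptr) return false;
    assert(corr_id->internal != 0);
    return true;
}
}  // namespace sketch
```

With a `> 0` check, the illustrative value `-1` would still allow construction; with `!= 0` it does not.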
* [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation
Buffer records are not guaranteed to arrive in any specific order. Tests and
samples should use timestamps for temporal ordering validation instead.
Changes:
- samples/external_correlation_id_request: Replace 'retired prematurely' arrival
order check with timestamp-based validation that retirement timestamp >=
max(end_timestamps) for records with the same correlation ID
- tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
Correlation IDs are not guaranteed to be monotonically increasing when records
are sorted by timestamp. Temporal ordering should be validated using the
timestamp fields in each record.
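The timestamp-based check described above might look like the following sketch. The record structs and the function name are hypothetical simplifications, not the SDK's record types; only the rule itself (retirement timestamp >= max end timestamp per correlation ID) comes from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified records; the real SDK records carry more fields.
struct api_record
{
    std::uint64_t corr_id;
    std::uint64_t start_timestamp;
    std::uint64_t end_timestamp;
};

struct retirement_record
{
    std::uint64_t corr_id;
    std::uint64_t timestamp;
};

// True if every retirement timestamp is >= the latest end timestamp of the
// records sharing its correlation ID. Arrival order is never consulted.
bool retired_after_all_records(const std::vector<api_record>&        records,
                               const std::vector<retirement_record>& retirements)
{
    std::unordered_map<std::uint64_t, std::uint64_t> max_end{};
    for(const auto& rec : records)
    {
        assert(rec.end_timestamp >= rec.start_timestamp);  // sanity check only
        auto& latest = max_end[rec.corr_id];
        latest       = std::max(latest, rec.end_timestamp);
    }

    for(const auto& ret : retirements)
    {
        auto itr = max_end.find(ret.corr_id);
        if(itr != max_end.end() && ret.timestamp < itr->second) return false;
    }
    return true;
}
```

Because the function first collects the per-ID maximum and only then compares, the result is the same for any ordering of the input vectors.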
* [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal
Restore the SYSTEM keyword to target_include_directories for
rocprofiler-sdk-fmt to match develop branch.
* [rccl] Remove orphaned rocSHMEM gitlink
Remove orphaned submodule reference that was introduced during a merge
but never had a corresponding .gitmodules entry, causing CI failures
with "fatal: no submodule mapping found in .gitmodules".
* [rocprofiler-sdk] Add HSA ABI version 0x09 support
Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which
introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release
functions (added in rocr-runtime SWDEV-561708).
* [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations
This commit consolidates fixes for handling the finalization status during
buffer flush operations across the SDK.
Changes:
- Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully
when flushing buffers, as this indicates buffers were already flushed
during finalization (not an error condition)
- HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check
for fini_status to allow operations during finalization process
- buffer.cpp: Revert fini_status checks to use > 0 for consistency
- correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging
to prevent correlation ID creation after finalization starts
Files modified:
- source/lib/rocprofiler-sdk-tool/tool.cpp
- tests/tools/json-tool.cpp
- source/lib/rocprofiler-sdk/tests/registration.cpp
- source/lib/rocprofiler-sdk/tests/roctx.cpp
- samples/api_buffered_tracing/client.cpp
- samples/counter_collection/buffered_client.cpp
- samples/counter_collection/device_counting_async_client.cpp
- samples/external_correlation_id_request/client.cpp
- samples/pc_sampling/client.cpp
- source/lib/rocprofiler-sdk/buffer.cpp
- source/lib/rocprofiler-sdk/context/correlation_id.cpp
- source/lib/rocprofiler-sdk/hsa/queue.cpp
- source/lib/rocprofiler-sdk/hsa/async_copy.cpp
- source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp
* [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling
Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls
in samples and tools. The ERROR_FINALIZED handling was overly complex
and the hsa_tool_hooks OnUnload synchronization is no longer needed.
Changes:
- Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code
- Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL
- Simplify buffer flush in tool.cpp and json-tool.cpp
- Remove ERROR_FINALIZED special handling from test files
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Fix output_stream move semantics to null source pointers
The default move constructor and move assignment operator for
output_stream did not null out the source's pointers after the move.
This caused double-close when the moved-from temporary was destroyed,
leading to use-after-free crashes (SIGSEGV in std::ostream::sentry).
Co-Authored-By: Claude <noreply@anthropic.com>
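The general pattern behind this fix can be sketched as below. `owned_stream` is a hypothetical class, not the SDK's `output_stream`, and it assumes the stream is heap-allocated and owned through a raw pointer; the point is only that `std::exchange` leaves the moved-from object holding a null pointer, so its destructor has nothing to close.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <utility>

// Hypothetical owner of a heap-allocated stream, for illustration only.
class owned_stream
{
public:
    owned_stream() = default;
    explicit owned_stream(std::ostream* stream)
    : m_stream{stream}
    {}
    ~owned_stream() { close(); }

    owned_stream(const owned_stream&)            = delete;
    owned_stream& operator=(const owned_stream&) = delete;

    // Take the pointer and leave the source null so it is not closed twice.
    owned_stream(owned_stream&& rhs) noexcept
    : m_stream{std::exchange(rhs.m_stream, nullptr)}
    {}

    owned_stream& operator=(owned_stream&& rhs) noexcept
    {
        if(this != &rhs)
        {
            close();  // release whatever this object currently owns
            m_stream = std::exchange(rhs.m_stream, nullptr);
        }
        return *this;
    }

    void close()
    {
        delete m_stream;
        m_stream = nullptr;
    }

    bool is_open() const { return m_stream != nullptr; }

    std::ostream& ref() const
    {
        assert(m_stream != nullptr);
        return *m_stream;
    }

private:
    std::ostream* m_stream = nullptr;
};
```

A defaulted move constructor would copy the raw pointer and leave both objects pointing at the same stream, which is the double-close scenario the commit describes.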
* [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration
- generatePerfetto.cpp: Move output_stream into shared_state to prevent
use-after-free race conditions during Perfetto callback execution
- run-ci.py: Simplify and consolidate sanitizer environment variable
configuration for better maintainability
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions
The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required
for CTest to properly pass suppression files to the sanitizers during
memcheck runs.
Co-Authored-By: Claude <noreply@anthropic.com>
* Revert "[rccl] Remove orphaned rocSHMEM gitlink"
This reverts commit 1ad21003941355658fff8114fa27768f11a948f7.
* [rocprofiler-sdk] Revert registration.cpp changes
Revert changes to registration.cpp to match develop branch.
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Remove suppression file content printing from run-ci.py
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix output_stream move ctor/assignment operator
* Fix erroneous revert of registration.cpp
* Fix handling of fini status in correlation ID construction
* [rocprofiler-sdk] Fix OMPT segfault during finalization
Add nullptr checks in OMPT tracing code to handle the case where
correlation_tracing_service::construct() returns nullptr during
finalization. This fixes segfaults in openmp-target-sample and
tests.integration.execute.openmp-tools.
The correlation ID construction now returns nullptr when fini_status > 0,
but the OMPT callbacks were not checking for this, causing crashes when
dereferencing the null pointer during OpenMP runtime shutdown.
Changes:
- event_common(): Return nullptr early if correlation ID is null
- event(): Check for nullptr before calling sub_ref_count()
- ompt_task_create_callback(): Return early if correlation ID is null
- ompt_task_schedule_callback(): Return early if correlation ID is null
* [rocprofiler-sdk] Fix HSA API tracing segfault during finalization
Add nullptr check in hsa_api_impl::functor after correlation ID
construction. During finalization, correlation_service::construct()
returns nullptr, and without this check the code would dereference
the null pointer when accessing corr_id->internal.
This fixes the SEGV at address 0x000000000008 (null + 8 byte offset)
that occurs when HSA async event threads call hsa_signal_destroy
during runtime shutdown after finalization has started.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
* attach: Formalize ROCAttach API
- Make ROCAttach public with public headers
- Change detach to take a PID
- attach and detach are now reentrant
- Cleanup of states and signal handling in ptrace session
- Fixes mixed up definition of ROCPROF_ATTACH_TOOL_LIBRARY
- ROCPROF_ATTACH_TOOL_LIBRARY now always means the tool library loaded by the attachment target
- ROCPROF_ATTACH_LIBRARY refers to the library used to perform attachment
- Add direct call of rocprof-attach
- Fix python library call of rocprof-attach
- Function now named attach(), changed from main()
* attach: rocprof-compute ROCAttach updates
- Update to new library names
- Correct usage of C lib detach
* attach: add test for rocattach
- Disable ASan, TSan, and UBSan for the new parallel-attach test
- Lower log level for LSan tests, existing behavior from other tests
---------
Co-authored-by: Ammar ELWazir <aelwazir@amd.com>
* Fix buffer tracing synchronization lock
- PR #529 (in rocprofiler-sdk-internal) introduced waiting on the syncer flag when emplacing into a buffer, to prevent overwriting buffer records that are still being processed in a buffer flush callback
- That fix blocked both buffers while a buffer flush callback was executing, instead of blocking only the buffer being flushed.
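The difference between blocking both buffers and blocking only the one being flushed can be illustrated with one flag per buffer. This is a hedged sketch: `double_buffer_flags` and its members are hypothetical and do not mirror the SDK's syncer flag.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical per-buffer "flush in progress" flags for a double buffer.
class double_buffer_flags
{
public:
    void begin_flush(std::size_t idx) { at(idx).store(true, std::memory_order_release); }
    void end_flush(std::size_t idx) { at(idx).store(false, std::memory_order_release); }

    // A writer only has to wait if the buffer it targets is being flushed.
    bool can_emplace(std::size_t idx) { return !at(idx).load(std::memory_order_acquire); }

private:
    std::atomic<bool>& at(std::size_t idx)
    {
        assert(idx < m_flushing.size());
        return m_flushing[idx];
    }

    std::array<std::atomic<bool>, 2> m_flushing{};
};
```

With a single shared flag, `can_emplace(1)` would also report false while buffer 0 is flushing, which is the over-blocking described above.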
* Add rocpd tests for duplicate records
* Address code review comments
* adding ROCpd database merge
* adding ROCpd database merge concatenating all tables
* update merge script
- copy all tables from files
* fix merge format
* Add package submodule, initial POC. Need to refine
* Minor fixes and clean up duplicated code in package.py
* Revamp metadata layout, add wildcard and .rpdb parsing
* Add auto merge & package when > 5 DBs, add examples, don't use auto_merge when using sub-commands merge & package
* - Extend package/yaml inputs to all rocpd modules
- Improve handling of more corner cases for bad input files when parsing input parameters (bad yaml files, bad .rpdb folder, folders as input)
- Changed to use UUID in merged filename instead of the time, in auto-merge algorithm
* Minor text fixes for consistency between modules
* Add more wildcard support and add package, merge tests
* Make changes based on review suggestions
* Move parsing packages into importer.py, simplified adding required params to a function
* fix package test by flattening input list before processing
* Integrate merge.py changes from Jonathan to add name-collision checks, recreating indexes, foreign key check (disabled for now, due to processing time)
* Rework rocpd.<submodule>.{add_args,process_args}
- add_args function returns a functor which accepts input and args
- time_window functor returned from add_args automatically applies time windowing of input
* change merge&package limit to 1, merge should create data views
* Move files by default instead of making copies
- copying can be enabled by passing "copy=True" or --copy cmdline argument
* refactor package to make the logic cleaner, set merge limit back to 5
* Allow automerge-limit param to override limit, change default back to 1. Tests updated to use query, much quicker
* Update --help instructions for package
---------
Co-authored-by: acanadas <acanadas@amd.com>
Co-authored-by: a-canadasruiz <Araceli.CanadasRuiz@amd.com>
Co-authored-by: Young Hui <young.hui@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
* attach: fix test permissions
- Test is now skipped if insufficient permissions detected
- Should fix test (for now) in Azure CI pipeline
- Add more extensive permission checking for the tests
- Add default parameters to prevent running rm -rf on a root directory
- Add use for unused LOG_LEVEL parameter
* Fix dimension mismatch for multi-GPU systems with identical architectures
This change addresses an issue where counter dimensions were incorrectly
shared across all GPU agents with the same architecture name, even when
those agents had different hardware configurations (e.g., different CU counts).
Changes:
- Updated getBlockDimensions() to accept agent ID instead of architecture name
- Made dimension cache agent-specific instead of architecture-specific
- Updated set_dimensions() in AST evaluation to use specific agent ID
- Modified all API functions to handle agent-specific dimension lookups
- Updated tests to work with agent-specific dimensions
This fix ensures that dimensions accurately reflect the actual hardware
configuration of each individual GPU agent, preventing dimension mismatches
in multi-GPU systems where GPUs share the same architecture but have
different physical configurations.
Counter ID Representation Changes:
- Modified counter_id encoding to include agent information in bits 37-32
- Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable
- Counter records internally store only 16-bit base metric IDs (bits 15-0)
- Tool reconstructs agent-encoded counter IDs from base metric ID & agent info
- Instance record counter_id field uses bitwise AND mask to extract base metric ID
(counter_id.handle & 0xFFFF) to fit in 16-bit storage
- Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency
- Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed
- All counter lookups now properly handle agent-encoded vs base metric IDs
This ensures counter IDs are consistent between metadata and output records while
maintaining compact storage in instance records.
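The bit layout described above can be sketched as follows. The function and constant names are hypothetical; only the layout (agent in bits 37-32 stored as `logical_node_id + 1`, base metric ID in bits 15-0) is taken from the commit text.

```cpp
#include <cassert>
#include <cstdint>

// Layout from the commit text; names are illustrative, not SDK identifiers.
constexpr std::uint64_t base_metric_mask = 0xFFFF;  // bits 15-0
constexpr unsigned      agent_shift      = 32;      // agent field starts at bit 32
constexpr std::uint64_t agent_field_mask = 0x3F;    // 6 bits: 37-32

// Store (logical_node_id + 1) so that agent 0 is distinguishable from "no agent".
inline std::uint64_t
encode_counter_id(std::uint16_t base_metric_id, std::uint32_t logical_node_id)
{
    assert(logical_node_id < agent_field_mask);  // value + 1 must fit in 6 bits
    const std::uint64_t agent_field = logical_node_id + 1;
    return (agent_field << agent_shift) | base_metric_id;
}

constexpr bool
has_agent(std::uint64_t handle)
{
    return ((handle >> agent_shift) & agent_field_mask) != 0;
}

inline std::uint32_t
agent_logical_node_id(std::uint64_t handle)
{
    assert(has_agent(handle));
    return static_cast<std::uint32_t>((handle >> agent_shift) & agent_field_mask) - 1;
}

// What an instance record keeps: only the 16-bit base metric ID.
constexpr std::uint16_t
base_metric_id(std::uint64_t handle)
{
    return static_cast<std::uint16_t>(handle & base_metric_mask);
}
```

For example, base metric `0x1234` on agent 0 encodes to `0x100001234`, and masking with `0xFFFF` recovers `0x1234`.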
* Changed stream error warning, remove regex search from attach execute test
* Formatting
* Revert accidental change
* Fix stream hang error due to grabbing same lock twice
* Updated add stream code, need to update tests
* Update attachment tests to use streams, threads, and multiple devices
* Update tests and fix stream issues
* Updated error messages to be more explicit, updated json to csv code in conftest to include streams and threads
* Formatting
* Add attachment label to attachment tests and update validation to fix errors
* Fix attach twice conftest
* Disabled thread san tests for attachment since they no longer work with bin file changes
* Updated for comment
* Added null check for getting attach status
* Initial consecutive kernel WIP
* Updated logic after discussion, create context only when needed, change set of captured ids to dispatch_id_t type
* Updated to fix concurrency issues and revert kernel_iterations
* Add captured id in first lock capture
* Updated code to use wlock, added comments, removed some unnecessary atomics
* Cleaned up, need to add test
* Add test to check that generated stats csv file is not empty
* Updated test to check if vector-ops kernels are being used
* Fix phase bug
* Updated for comments
* Flattened ATT logic a bit
* Fix incorrect if-statement
* Fix merge conflict
The GFX12 host-trap PC sampling support in SDK and V3.
Introducing parser tests specific to GFX12.
Co-authored-by: vlaindic_amdeng <vladimir.indic@amd.com>
* Write agent info to CSV
* Write kernel to CSV
* Write memory copy to CSV
* Write memory allocation to CSV
* Write hip api to CSV
* Write hsa api to CSV
* Write marker api to CSV
* Write counters to CSV
* Write scratch memory to CSV
* Write rccl api to CSV
* Write rocdecode api to CSV
* Write rocjpeg api to CSV
* Remove info_process joins
* Format agent id
* Compose full file name in sql writer function
* Add missing fields to kernel traces csv
* Rename vgpr_count to arch_vgpr_count
* Fix kernel name
* Skip empty query results
* Format csv.py
* Delete c++ CSV writer
* Add CSV header comparison test
* Fix comment spacing in csv.py
* Change ALLOC to ALLOCATE in memory allocation writer
* Do not append trace to agent info file name
* Revert changes for VGPR_Count
* Fix csv validation test
* Add sorting by guid
* Use EXISTS to check query results are not empty
* Merge API-specific queries
* Optimize regions query
* Column name mapping for agent info
* Pass config to sql writer
* Move agent id string building to a separate function
* add titled_headers argument
* Remove titled-columns argument
* Improvements for regions csv
* fix CSV validation test
* improve CSV validation test
* remove roctxMarkA from csv validation test
* fix capability field titles in agent info
* remove filter.py from query as that is still experimental
* Remove some aliases, now that query will auto-title the column headers
---------
Co-authored-by: Aleksei Tumakaev <atumakae@amd.com>
Co-authored-by: Young Hui <young.hui@amd.com>
Co-authored-by: Young Hui - AMD <145490163+yhuiYH@users.noreply.github.com>
* attach: milestone: API tracing
- This pairs with another commit in rocprofiler-sdk to fully
function
- Add ptrace entry points for tool attachment
- API tracing works at this commit
- Queue tracing not supported yet
* attach: cleanup
- Remove hardcode for loading of tool library
- Make invoke registration functions public again
* attach: proxy queue first draft
- Adds ability to trace with queues during attachment
- Must be paired with updated rocprofiler-sdk
* attach: prestore overhaul
- Must be paired with commit in rocprofiler-sdk
* attach: add dispatch table rework
- Register will load the prestore library and provide entrypoints to sdk
* attach: formatting and cleanup
* attach: revise dispatch table scheme
* attach: formatting
* attach: milestone: API tracing
- This change must be paired with a change in rocprofiler-register to
fully function.
- API tracing works at this commit
- Queue tracing not supported yet
* attach: cleanup and comments
* attach: Formatting and crash fixes
* attach: add attach duration
- Add option attach-duration-msec for attachment
* Formatting + sglang hang fix via signal handling
* Changed FATAL_IF to DFATAL_IF for scratch_memory due to persistent crash when iterating queues
* attach: proxy queue first draft
- Adds ability to trace with queues during attachment
- Must be paired with updated rocprofiler-register
* Allow null agents for scratch output
* attach: improve queue library interface
- Significant changes to force exported interfaces back to C
- Fixes bug with unknown agents at attachment
- Code objects' names may still be incorrect
* attach: add code_object support
- Kernel traces will now have names and all other information for launches
- Add capture of hsa_executable to the queue library
- Various logging improvements
* attach: rename queue library to prestore
* attach: prestore overhaul
- Must be paired with commit from rocprofiler-register
- Massive overhaul of code organization in prestore library
- Separates registrations for different object types
- Sets up future changes for initialization
* attach: add prestore dispatch table
- Removes linkage to prestore library from sdk
* attach: cleanup
* attach: formatting
* attach: fix input prompt not appearing
* attach: fix component name in cmake
* attach: revert change to export level
* Make prestore API public
* attach: update sdk attachment library WIP
- This commit is NONFUNCTIONAL
- Changes around structure to remove classes
- Separate C linkage where needed
- Still needs updates to register for correct usage
* attach: update register with dispatch table WIP
- This commit is NONFUNCTIONAL
- Changes rocprofiler_register to handle dispatch table from attach
library.
- Still needs changes in SDK with dispatch table usage
* attach: dispatch table wip
- This commit is NONFUNCTIONAL
* attach: move attach component into core
* attach: rename to rocprofv3-attach
* attach: add callbacks for new queues and code objects
* attach: finish dispatch table implementation
- Fixes kernel tracing
* attach: add cmake variable for attachment support
* feat: Add --attach alias for rocprofv3 with comprehensive attachment tests
- Add `--attach` as an alias to existing `-p/--pid` functionality in rocprofv3.py
- Create comprehensive attachment test suite with CSV and JSON output validation:
- New attachment-test application for testing dynamic profiling scenarios
- Unified test script supporting both CSV and JSON output formats
- Pytest-based validation for kernel traces, memory copies, HSA API calls, and agent info
- Add CMake integration for automated attachment testing
- Support parameterized output directory and filename specification
- Implement proper environment setup for attachment queue registration
Tests verify successful attachment to running processes and capture of:
- Kernel dispatch traces with workgroup/grid dimensions
- Memory copy operations (H2D/D2H) with size validation
- HSA API call traces across multiple domains
- GPU/CPU agent information and capabilities
* Documentation Update
* attach: make attach script callable
* Added ROCPROFILER_REGISTER_ATTACHMENT_TOOL_LIB to remove hardcoded name
* attach: revert metrics library path changes
* Generic Attachment in Register (#942)
Remove tool references in register
* Add second param to attach call in rocprof register
* Add experimental reattachment support for ROCprofiler-SDK
This commit introduces experimental reattachment functionality allowing tools
to dynamically reattach to running processes with comprehensive design changes
to support multiple attach/detach cycles:
**Core Reattachment API:**
- Add rocprofiler_tool_configure_result_experimental_t with tool_reattach/tool_detach callbacks
- Add rocprofiler_call_client_reattach and rocprofiler_call_client_detach C exports
- Implement reattachment tracking in rocprofiler_register_attach to differentiate
initial attachment from reattachment cycles
- Add rocprofiler_register_invoke_reattach for handling reattachment requests
**Design Changes - Registration System Flow:**
The registration system now supports a dual-path initialization:
1. Initial Attachment Flow:
- rocprofiler_register_attach() -> rocprofiler_register_invoke_all_registrations()
- Full tool initialization with complete context setup
- Sets prev_attached atomic flag to track state
2. Reattachment Flow:
- rocprofiler_register_attach() detects prev_attached=true -> rocprofiler_register_invoke_reattach()
- Bypasses full re-initialization, calls client reattach callbacks instead
- Preserves existing contexts and buffers, only reactivates profiling services
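The dual-path decision above can be sketched with a single atomic flag. `registration_state` and `attach_path` are hypothetical; the counters merely stand in for the calls to `rocprofiler_register_invoke_all_registrations()` and `rocprofiler_register_invoke_reattach()` named in the commit.

```cpp
#include <atomic>
#include <cassert>

enum class attach_path
{
    initial,
    reattach
};

// Hypothetical state holder; the counters stand in for the real work.
struct registration_state
{
    std::atomic<bool> prev_attached{false};
    std::atomic<int>  full_inits{0};
    std::atomic<int>  reattaches{0};

    attach_path attach()
    {
        // exchange() returns the previous value, so exactly one caller
        // ever observes "false" and takes the full-initialization path.
        if(!prev_attached.exchange(true))
        {
            const int count = ++full_inits;
            assert(count == 1);
            return attach_path::initial;
        }
        ++reattaches;
        return attach_path::reattach;
    }
};
```

Using `exchange` rather than a separate load and store keeps the check and the update in one atomic step, so two concurrent attach requests cannot both run the full initialization in this sketch.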
**Design Changes - Tool Library Loading:**
Enhanced rocprofiler-register library loading with function pointer resolution:
- Extended rocp_set_api_table_data_t tuple to include reattach/detach function pointers
- Automatic symbol resolution for rocprofiler_call_client_reattach/detach functions
- Support for both LD_PRELOAD and dlopen scenarios with consistent callback availability
**Design Changes - Context Management:**
Introduced dual context systems for attachment scenarios:
- get_contexts() - Original contexts for standard tool initialization
- get_attach_contexts() - Separate context map for attachment-specific lifecycle
- attach_init() - Creates contexts for ALL buffer tracing services using existing buffers
- attach_start() - Selectively starts contexts based on configuration options
- attach_detach() - Cleanly stops and destroys attachment contexts
**Design Changes - Buffer Management:**
Added reset_tmp_file_buffer() template for clean reattachment state:
- Properly closes and removes old temporary files
- Deletes existing file_buffer instances to prevent stale file position tracking
- Creates fresh file_buffer instances for clean reattachment cycles
- Addresses core issue where file position metadata becomes stale between cycles
**Design Changes - Environment Variable Injection:**
Added ROCP_REGISTERED_TOOL_ATTACH environment variable:
- Distinguishes attachment-loaded tools from LD_PRELOAD scenarios
- Enables registration system to apply attachment-specific logic
- Helps tools adapt behavior for attachment vs standard initialization
**Attachment Context Management:**
- Add attach_init/attach_start/attach_detach functions for dynamic context lifecycle
- Add reset_tmp_file_buffer template for clean reattachment state management
- Implement get_attach_contexts() for tracking active attachment contexts
**Test Infrastructure:**
- Add projects/rocprofiler-sdk/tests/rocprofv3/reattach/ comprehensive test suite
- Include reattachment test scripts with unified attachment/detachment cycles
- Add validate.py with trace data validation for kernel, memory copy, HSA API, and agent info
- Add conftest.py for JSON and CSV data loading utilities
**Configuration Updates:**
- Update CMakeLists.txt to include reattachment tests in build system
- Add environment variable ROCP_REGISTERED_TOOL_ATTACH for attachment state tracking
- Enhance rocprofiler-register library loading with reattach/detach function resolution
**Flow Impact Analysis:**
This design enables robust multi-cycle attachment by:
1. Preventing duplicate initialization on reattachment
2. Maintaining separate context lifecycles for attachment vs standard operation
3. Ensuring clean temporary file state between attachment cycles
4. Providing tools with explicit reattach/detach callback hooks
5. Supporting both programmatic and environment-based tool configuration
The experimental nature allows for iteration on the API while establishing
the foundation for production-ready dynamic profiling capabilities.
* Fix misc clang-tidy warnings/errors
* CMake Option and Environment Variable Updates
- CMake: ROCPROFILER_REGISTER_ALWAYS_SUPPORT_ATTACH -> ROCPROFILER_REGISTER_BUILD_DEFAULT_ATTACHMENT
- Env: ROCPROFILER_REGISTER_ATTACHMENT_ENABLED ->
* Source reorganization
* Formatting + new lines at EOF
* Fix flake8 F841: local variable is assigned to but never used
* Update attachment test
- get rid of 5 second start delay
- add roctx
* Rework implementation
- Remove rocprofiler_tool_configure_result_experimental_t in lieu of rocprofiler_configure_attach
- Add <rocprofiler-sdk/experimental/registration.h>
- TODO: Update process_attachment.rst
* Handle re-attachment options
- inherit options from previous attachment
- check previous options do not modify data collection services
* Fix support for tools w/o rocprofiler_configure_attach
- fix segfault when rocprofiler_configure_attach does not exist
- fix naming convention for functions accepting attach dispatch table
- cleanup rocprofiler_configure_attach implementation in rocprofv3 tool
* attach: remove unknown agent handling
- Change was from earlier commit, no longer needed
* attach: add error for attaching without library loaded
* attach: revise version numbering
* attach: register header revisions
* attach: clang format register
* attach: formatting
* attach: fix build failure
- Remove cross dependency into rocprofiler-sdk, fixes build on some systems
* attach: revise register library detection
* Update rocprofiler-register and attach library
- formatting
- proper signature of register_functor for rocprofiler-sdk-attach library callback
- remove get_dispatch_registration_table()
* Bump rocprofiler-register version to 0.6.0 + AnyNewerVersion
* Fix output support for rocprofiler-sdk-tool
* Fix formatting
* Fix clang tidy errors
* Misc rocprofiler-sdk-attach fixes
* attach: add sigint handling to attach python
* tool README.md formatting
Co-authored-by: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
* Fix buffered output issue
* attach: add errors for tool attach
* CI Fixes
* Rework tests
* attach: improve library loading in rocprofv3 attach
* formatting
* Update tests to use pytest framework
* Fix test_attachment_hsa_api_trace
* attach: catch ctypes exceptions
* attach: fix leak in registration
* attach: fix sanitizer tests
* attach: fix sanitizer tests further
* attach: disable attach asan tests
* attach: disable ubsan test
* attach: fix permissions in installed test package
* attach: formatting
---------
Co-authored-by: Ian Trowbridge <Ian.Trowbridge@amd.com>
Co-authored-by: Tim Gu <Tim.Gu@amd.com>
Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Benjamin Welton <bwelton@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
Co-authored-by: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Co-authored-by: Benjamin Welton <bewelton@amd.com>
* Remove config checks for stream and kernel rename data collection
* Updated csv generation to check if kernel rename is on before calling get_kernel_name
* Update metadata to use kernel_rename bool argument
* Formatting + unconditionally store kernel name in rocpd
* Readded kernel rename parameter after rebase
* Fixed rebase conflicts
* Updated comment in line with github comments
* Added check in rocpd csv.cpp to output kernel name if region name is empty
* Add test for kernel rename
---------
Co-authored-by: Ian Trowbridge <Ian.Trowbridge@amd.com>
* Increase rocDecode code coverage and add version check
* Update rocJPEG tests
* Fix rocJPEG tests
* Enable building tests/samples in rocm release compat workflow
* Readded rocJPEG test skips
* formatting
* Adding ROCm libraries for the code-coverage job
* Added return value check for error message and updated compatibility to enable tests
* Disable rocm_release_compatibility samples and tests until openmp issue is resolved
---------
Co-authored-by: Ian Trowbridge <Ian.Trowbridge@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
Co-authored-by: Jonathan R. Madsen <Jonathan.Madsen@amd.com>
* Updated stream code to handle special cases when stream value is 0x01 or 0x02
* Removed extra definitions and updated tests to account for special case
* Modified stream.cpp so that each thread is assigned a unique stream ID when hipStreamPerThread is used as stream value. Modified tests to check that threads are assigned unique, repeated values when hipStreamPerThread is called
* Updated idx_offset, stream_map, and thread counter to be in one struct.
* Update stream.cpp to only use add_stream() and update tests to add a separate unit test for hipStreamPerThread
* Remove unnecessary comment
* Removed unnecessary line
* Updated tests and stream.cpp to update stream ID correctly
* Updated test structure
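The per-thread stream ID behavior described in the stream commits above might be sketched like this. `stream_id_table` is hypothetical and is not the code in stream.cpp; the sentinel value `0x02` for the per-thread stream and the caller-supplied integer thread key are both assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Assumption: handle value 0x02 is the per-thread stream sentinel.
constexpr std::uintptr_t per_thread_stream_handle = 0x02;

// Hypothetical table mapping stream handles to stable, non-zero IDs.
class stream_id_table
{
public:
    // thread_key is any caller-supplied value that is unique per thread.
    std::uint64_t get_id(std::uintptr_t handle, std::uint64_t thread_key)
    {
        std::lock_guard<std::mutex> lk{m_mutex};
        if(handle == per_thread_stream_handle)
            return lookup(m_per_thread, thread_key);  // one ID per thread
        return lookup(m_streams, handle);             // one ID per handle
    }

private:
    template <typename MapT, typename KeyT>
    std::uint64_t lookup(MapT& map, KeyT key)
    {
        auto itr = map.find(key);
        if(itr == map.end()) itr = map.emplace(key, m_next++).first;
        assert(itr->second > 0);  // IDs start at 1
        return itr->second;
    }

    std::mutex                                        m_mutex;
    std::unordered_map<std::uint64_t, std::uint64_t>  m_per_thread;
    std::unordered_map<std::uintptr_t, std::uint64_t> m_streams;
    std::uint64_t                                     m_next = 1;
};
```

Repeated lookups from the same thread return the same ID, while a different thread using the same sentinel handle receives a different one.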
* Adding new trace decoder record types and new ATT parameters
* Add compatibility with decoder 0.1.2
* Added RT
* Format
* Add logging to sdata values
* Review comment
* Review comments
* Update projects/rocprofiler-sdk/source/include/rocprofiler-sdk/experimental/thread-trace/trace_decoder_types.h
Problem with original test:
- Created circular dependencies between queues:
* Queue1: Kernel A → Barrier(waits for signal_2) → Kernel C
* Queue2: Barrier(waits for signal_1) → Kernel B → sets signal_2
- With strict "one kernel at a time" serialization, this created deadlock:
* Queue1 executed Kernel A, then blocked on barrier waiting for signal_2
* Serializer switched to Queue2, but Queue2 was blocked waiting for signal_1
* Neither queue could proceed: Queue1 needed Queue2's Kernel B to complete,
but Queue2 couldn't start until Queue1 finished completely
- Test would hang indefinitely at hsa_signal_wait_relaxed() for signal_2
Solution implemented:
- Reordered packet submission to eliminate circular dependencies
- Ensured signal producers execute before consumers need them:
* Kernel A produces signal_1 before Queue2's barrier needs it
* Kernel B produces signal_2 before Queue1's continuation needs it
- Dependencies now flow forward without cycles, allowing serializer progress
Refactoring changes:
- Extract common functionality into helper functions:
* create_completion_signal() for signal creation
* create_queue() for queue creation
* submit_kernel_packet() for kernel dispatch packets
* submit_barrier_packet() for barrier packets
- Add comprehensive documentation explaining expected execution pattern
- Simplify main() function making the dependency flow more readable
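The difference between the original and reordered submissions can be pictured as a wait-for graph. The Python sketch below is only an illustration under assumed node names (`queue1`, `kernel_A`, `barrier_q1`, and so on); it is not the test code, and the edges are a simplified reading of the description above.

```python
def find_cycle(waits_for):
    """Return a list of nodes forming a cycle, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in waits_for.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: a cycle was found
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Original test under strict serialization (assumed simplification):
# Queue1 needs Queue2's Kernel B, while Queue2 cannot start until Queue1 is done.
original = {"queue1": ["queue2"], "queue2": ["queue1"]}

# Reordered test: every packet waits only on something submitted earlier.
reordered = {
    "barrier_q2": ["kernel_A"],    # waits for signal_1 from Kernel A
    "kernel_B": ["barrier_q2"],
    "barrier_q1": ["kernel_B"],    # waits for signal_2 from Kernel B
    "kernel_C": ["barrier_q1"],
}
```

Here `find_cycle(original)` reports a cycle between the two queues, while `find_cycle(reordered)` returns `None`, which matches the claim that dependencies now flow forward.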
Co-authored-by: Benjamin Welton <bewelton@amd.com>
[ROCm/rocprofiler-sdk commit: b5e1645a14]
* Adding GPU index as a parameter for ATT
* Tidy fix
* Using tokenize
* Update tests/rocprofv3/advanced-thread-trace/CMakeLists.txt
Co-authored-by: Indic, Vladimir <Vladimir.Indic@amd.com>
* Update tests/rocprofv3/advanced-thread-trace/CMakeLists.txt
* Adding error logging. Using idx instead of id.
---------
Co-authored-by: Giovanni <gbaraldi@amd.com>
Co-authored-by: Indic, Vladimir <Vladimir.Indic@amd.com>
[ROCm/rocprofiler-sdk commit: fd6f96ffb5]
* Reverted header and field location for csv memory allocation and updated tests
* Updated example csv file and made small update
[ROCm/rocprofiler-sdk commit: 533a8329d8]
* adding summary.py to generate tmp <category_region>_summary views
* migrating CSV summary to SDK method of writing CSVs
- Add domain_view to summary.py
- omit the C++ code for writing CSV because it gets reverted later anyway
* Add summary subparser and write_sql_view_to_csv function
* adding all <>_summary views generation to summary.py
* add summary_per_rank feature
* add --summary-per-rank
* reconstruct generate_summary_view and create_domain_view
- introduce by_rank
* remove sqr and variance in summary views
* use RocpdImportData instead of connection
* two fixes on summary.py
- modify the generate_summary_view function to return a tuple with view name and sql code
- add if_not_exits parameter to generate_summary_view
* Refactor summary.py to allow output path and filename args, and apply time_window
- clean up summary table column headers
- only generate by-rank views if that param is specified
* Add ProcessID to Hostname output and csv, so users can identify the system in the by-rank summaries
* Summary.py, just add hostname to by-rank summaries, instead of creating mapping table
* Summary - migrate csv writer to pandas, for more future flexibility
* Adding a few simple tests for summary.py
* Linting fixes
* add region_categories to summary options
- Automatically retrieve region categories from the database if argument is None
* add backticks for view_names
* fix tests after rebase
* Made code review changes
- fixed whitespace in CMakelists.txt
- adding query.py module & subparser in __main__.py
- refactor summary function to return query
- used query.py to output csv
- used query.py to also output summary to console
- provided new command line options to select summary output to csv or console
* Made fix to jinja template in query.py, as suggested by copilot
* Consolidated output calls to query in export_view function based on feedback
- refactored: helpers, query functions, create view functions
- extended formats to include what query supports (md, html, pdf, json)
- added json format to query, and changed orient=records
- adding jinja2 and reportlab to requirements.txt
* Add version_info for rocpd and roctx
* Add rocpd commandline tool
* Add executable permissions to source/bin/rocpd.py
* Removed rocpd2query, and cleaned up --help examples
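As a rough illustration of the summary flow described in this commit (build a view, then export it), here is a small Python sketch using only the standard library. The `regions` table, its columns, and the function bodies are assumptions; the real summary.py and query.py work on the rocpd schema and, per the notes above, use pandas and jinja2. The guard flag is spelled `if_not_exists` in this sketch.

```python
import csv
import io
import sqlite3


def generate_summary_view(category, if_not_exists=True):
    """Return (view_name, sql) for a hypothetical per-category summary view."""
    if not category.isidentifier():
        raise ValueError(f"unexpected category name: {category!r}")
    view_name = f"{category}_summary"
    guard = "IF NOT EXISTS " if if_not_exists else ""
    sql = (
        f"CREATE TEMP VIEW {guard}`{view_name}` AS "
        "SELECT name, COUNT(*) AS calls, SUM(duration) AS total_ns, "
        "AVG(duration) AS avg_ns, MIN(duration) AS min_ns, "
        "MAX(duration) AS max_ns "
        f"FROM regions WHERE category = '{category}' GROUP BY name"
    )
    return view_name, sql


def write_sql_view_to_csv(conn, view_name, out):
    """Write every row of the view to a file-like object as CSV."""
    cur = conn.execute(f"SELECT * FROM `{view_name}` ORDER BY total_ns DESC")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE regions (category TEXT, name TEXT, duration INTEGER)"
    )
    conn.executemany(
        "INSERT INTO regions VALUES (?, ?, ?)",
        [
            ("HIP", "hipMalloc", 100),
            ("HIP", "hipMalloc", 300),
            ("HIP", "hipFree", 50),
            ("HSA", "hsa_init", 10),
        ],
    )
    view_name, sql = generate_summary_view("HIP")
    conn.execute(sql)
    conn.execute(sql)  # repeating is harmless because of IF NOT EXISTS
    out = io.StringIO()
    write_sql_view_to_csv(conn, view_name, out)
    return out.getvalue()
```

Returning the view name together with the SQL lets the caller decide whether to create the view, print the query, or hand it to another exporter.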
---------
Co-authored-by: acanadas <acanadas@amd.com>
Co-authored-by: Jin Tao <jintao12@amd.com>
Co-authored-by: a-canadasruiz <Araceli.CanadasRuiz@amd.com>
Co-authored-by: Jonathan R. Madsen <Jonathan.Madsen@amd.com>
[ROCm/rocprofiler-sdk commit: 3954cedd25]
* Fix null handle
- use .handle=0, not .handle=numeric_limits<>::max()
* Update lib.common.hasher
* Fix ROCPROFILER_CONTEXT_NONE
* Use context operator==
* Update CHANGELOG
* Updated null handle for scratch memory and changed allocation test so that free ops account for null agent
---------
Co-authored-by: Ian Trowbridge <Ian.Trowbridge@amd.com>
[ROCm/rocprofiler-sdk commit: 4d6a61f5e5]
* addressing issues
* doc fix
* test fix
* fix
* fix formatting issue and doc update
* fix column size
* fix
* fix formatting in output
* tests fix
* test fix
* add new line
* add new line
* fix new line
* fixing typo in using-rocprofv3-avail.rst
[ROCm/rocprofiler-sdk commit: 3aaffc42da]
* [SWDEV-516561][1/2] Add MARKER_RANGE_EXTENT to capture ROCTX ranges
Adds a range extent that captures all work between roctxpush/pop operations. The entry callback takes place during roctxpush and the exit callback takes place in roctxpop. This is primarily to allow us to keep an ancestor id on the ancestor stack so that all operations that take place within the push/pop context can be annotated as being a part of this range. With the current setup (where push and pop are two separate operations that need to be combined externally), we cannot keep an ancestor id on the stack and thus cannot tie tracing events to particular ranges.
Correlation id information is inherited from the push operation. Ancestor id needs to be added in a future commit that also outputs this ancestor to CSV.
Output:
```
[ctest] {'size': 64, 'kind': 7, 'operation': 1, 'correlation_id': {'internal': 1525, 'external': 0, 'ancestor': 1524}, 'start_timestamp': 2932551479402642, 'end_timestamp': 2932551491178449, 'thread_id': 3254861}
[ctest] {'size': 64, 'kind': 8, 'operation': 2, 'correlation_id': {'internal': 1525, 'external': 0, 'ancestor': 1524}, 'start_timestamp': 2932551479405878, 'end_timestamp': 2932551491181214, 'thread_id': 3254861}
```
Note: Kind 8 = range extent op.
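To make the motivation concrete, the following Python sketch models a single thread's push/pop ranges with an explicit stack, so that events issued inside a range can carry that range's id as their ancestor. `RangeTracker` and its methods are hypothetical names, and the record fields only loosely mirror the output above; this is not the SDK implementation.

```python
import itertools


class RangeTracker:
    """Hypothetical single-thread tracker pairing push/pop into one record."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._open = []    # stack of (correlation_id, name, start_timestamp)
        self.records = []  # nested events and completed range-extent records

    def current_ancestor(self):
        # The innermost open range, or 0 when no range is active.
        return self._open[-1][0] if self._open else 0

    def push(self, name, timestamp):
        # Entry callback: the range inherits its correlation id from the push.
        corr_id = next(self._ids)
        self._open.append((corr_id, name, timestamp))
        return corr_id

    def trace_event(self, name, timestamp):
        # Any operation inside a push/pop pair is tagged with the open range.
        self.records.append({
            "name": name,
            "internal": next(self._ids),
            "ancestor": self.current_ancestor(),
            "timestamp": timestamp,
        })

    def pop(self, timestamp):
        # Exit callback: emit one record spanning the whole range.
        corr_id, name, start = self._open.pop()
        self.records.append({
            "name": name,
            "internal": corr_id,
            "ancestor": self.current_ancestor(),
            "start_timestamp": start,
            "end_timestamp": timestamp,
        })
```

With separate push and pop records there is no single open entry to consult, which is why the combined range extent is needed before an ancestor id can be attached.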
* Merge fix
Revert several changes
source/lib/rocprofiler-sdk/marker/range_marker.*
- separate out range marker implementation from standard marker implementation
Update public API with marker core range
Support marker core range in sdk (source/lib/rocprofiler-sdk)
Transition rocprofiler-sdk-tool and output lib to use marker core range
Misc fixes for tests
Fix logic in lib/output/generate{CSV,Stats}.cpp
Update tests/rocprofv3/tracing-hip-in-libraries (marker validation)
Fix test_otf2_data
* Test fixes
---------
Co-authored-by: Benjamin Welton <bewelton@amd.com>
[ROCm/rocprofiler-sdk commit: 2c4e20b951]
* Disable other unstable tests
* Disable validating test_total_runtime in kernel-tracing
* The disabled tests will be stabilized and re-enabled by ROCm 7.0.1 or ROCm 7.1
[ROCm/rocprofiler-sdk commit: 69f71b8097]
* [CI] Testing Stability
- CMake option ROCPROFILER_DISABLE_UNSTABLE_CTESTS
- used for tests which periodically fail around 1 out of every 10 runs
- set to ON while instability remains; this needs to be set to OFF in ROCm 7.1 or, ideally, ROCm 7.0.1
- Use FIXTURES_SETUP and FIXTURES_REQUIRED for some tests
- replace "threw an exception" with "${ROCPROFILER_DEFAULT_FAIL_REGEX}" for misc FAIL_REGULAR_EXPRESSIONS
* Remove contents of all EXCLUDE_{TESTS,LABEL}_REGEX from CI workflow
* Disable patch git step in code-coverage run
* Tweak spin time of reproducible runtime
* Removed patch git step in code-coverage run
* Update ROCPROFILER_DEFAULT_FAIL_REGEX
* Mark test-counter-collection tests as unstable
- add fixtures setup/required
* Remove ATTACHED_FILES_ON_FAIL
- CDash doesn't store or enable downloading these properly anyway
* Relax collection-period fuzzing window
* Disable unstable collection-period test
- too unstable
* formatting
* Disable unstable device_counting_service_test.async_counters
* Suppress perfetto internal data race errors
* Switch code-coverage CI jobs to mi300 runner
* Timeout increases
* rocprofv3-test-rocpd updates
- add fixtures
- switch executable
- redefine input/output paths
* Revert code-coverage job to mi300a runner
* Update rocprofv3-test-rocpd-execute-multiproc
- reduce problem size
* disable multiproc rocpd
* Split code-coverage into separate workflow
- network issues cause this job to fail frequently
- when in a separate workflow, it can be restarted easily
* Fixtures for rocprofv3-test-trace-hip-in-libraries
* Disable unstable device_counting_service_test.sync_counters
* Potential fix for code scanning alert no. 171: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Switch code-coverage to run on rocprof-azure
- mi300a EMU runner set is unstable (network issues)
* tests/rocprofv3/pc-sampling SKIP_REGULAR_EXPRESSION
* Update rocprofv3-test-list-avail-trace-execute
- reduce log level and increase timeout
* rocprofv3: Prevent recursive call to rocprofv3_error_signal_handler + log chaining
* rocprofv3: Use ROCP_ERROR + std::exit instead of ROCP_FATAL
- should help with SKIP_REGULAR_EXPRESSION
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
[ROCm/rocprofiler-sdk commit: 640ca55ac0]