* [rocprofiler-sdk] Fix domain_ops_padding for 515+ HIP operations
The HIP runtime API now has 515+ operations (as of ROCm 7.x), but
domain_ops_padding was set to 512. This caused std::out_of_range
exceptions when checking operations >= 512 via std::bitset::test().
Changes:
- Increase domain_ops_padding from 512 to 1024
- Add compile-time static_assert to validate padding is sufficient
for all API domains (HIP, HSA, marker, RCCL, rocDecode, rocJPEG)
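
A minimal sketch of the failure mode and the compile-time guard described above. The names `hip_runtime_api_last` and `test_throws` and the value 515 are illustrative assumptions, not the SDK's actual identifiers; only `domain_ops_padding` and the 512/1024 values come from this commit.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the number of HIP runtime API operations.
constexpr std::size_t hip_runtime_api_last = 515;
// New padding value from this commit.
constexpr std::size_t domain_ops_padding = 1024;

// Fails the build, instead of throwing at runtime, if a domain outgrows the padding.
static_assert(hip_runtime_api_last <= domain_ops_padding,
              "domain_ops_padding is too small for the HIP runtime API domain");

// Returns true if std::bitset<N>::test(op) throws std::out_of_range.
template <std::size_t N>
bool test_throws(std::size_t op)
{
    std::bitset<N> ops{};
    try
    {
        (void) ops.test(op);
    } catch(const std::out_of_range&)
    {
        return true;
    }
    return false;
}

// Usage: operation 514 is out of range for 512 bits but fits in 1024 bits.
inline void demo_padding()
{
    assert(test_throws<512>(514));
    assert(!test_throws<1024>(514));
}
```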
Co-Authored-By: Claude (claude-opus-4.5) <noreply@anthropic.com>
* Update projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/context/domain.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [rocprofiler-sdk] Apply clang-format-11 to domain.cpp
Co-Authored-By: Claude (claude-opus-4.5) <noreply@anthropic.com>
* Rework implementation to ensure coverage of all operation enums
* Fix compiler error in unit test for enum_string.cpp
* Fix data types of domain_ops_padding values
* Revert some changes in domain.cpp
---------
Co-authored-by: Claude (claude-opus-4.5) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
* SWDEV-561708 Counted queue size from env var
* use counted_queue_size for test
* remove rocrtst changes; add a const for default queue size
* Remove env var from test; use queue->size
* Improve env var documentation
* Correct type
- Fix ROCM_VERSION guard used for the scratch_memory_record structure
- This fixes a rocm/7.0.2 build failure
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Adding support of AQL Profiler for GFX 11.5
* Removing hard coded value for sa_number
* Adding instance count for WGP block, removing hard coded values.
* Fixed SQ counter block and TD counter block instances
Mode-1 GPU reset affects entire XGMI hive. Added
xgmi_hive_id check to reset only once for same-hive
GPUs while preserving separate resets for different
hives or no hives.
- Example:
`sudo amd-smi reset -G` or `sudo amd-smi reset -G -g 0`
on MI300 will reset all GPUs only once.
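
The once-per-hive selection can be sketched roughly as follows. This is not the amd-smi implementation: `gpu_info` and `select_reset_targets` are hypothetical, and the sketch assumes a hive ID of 0 means the GPU is not part of any hive.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical per-GPU description used only for this sketch.
struct gpu_info
{
    int           index;
    std::uint64_t xgmi_hive_id;  // assumption: 0 means "not in a hive"
};

// Returns the indices of the GPUs on which a reset should actually be issued.
std::vector<int> select_reset_targets(const std::vector<gpu_info>& gpus)
{
    std::set<std::uint64_t> seen_hives;
    std::vector<int>        targets;
    for(const auto& gpu : gpus)
    {
        if(gpu.xgmi_hive_id == 0)
        {
            // No hive: each such GPU keeps its own reset.
            targets.push_back(gpu.index);
        }
        else if(seen_hives.insert(gpu.xgmi_hive_id).second)
        {
            // First GPU seen in this hive: one reset covers the whole hive.
            targets.push_back(gpu.index);
        }
    }
    return targets;
}

// Usage: two GPUs share hive 0xA, one is in hive 0xB, two have no hive.
inline void demo_hive_reset()
{
    const std::vector<gpu_info> gpus = {{0, 0xA}, {1, 0xA}, {2, 0xB}, {3, 0}, {4, 0}};
    assert((select_reset_targets(gpus) == std::vector<int>{0, 2, 3, 4}));
}
```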
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Show "N/A" for ASICs without fan support
`amd-smi set -h` fan help text will be dynamic instead of "0-255 or 0-100%"
Signed-off-by: Sumanth Gavini <sumanth.gavini@amd.com>
* Update Python docs for process memory usage
* Added comment for processes total memory usage
---------
Signed-off-by: yalmusaf <Yazen.ALMusaffar@amd.com>
* Add common module
* Added information to help with unknowns
* Allow paring of cmds
* change cmd print default
* Reduce cmds to be tested
---------
Signed-off-by: amd-josnarlo <joseph.narlo@amd.com>
Co-authored-by: amd-josnarlo <joseph.narlo@amd.com>
* Fix exception handling in power profile commands
* Update CHANGELOG.md
* Update amdsmi_parser.py to use -o as the single-character argument for --profile
---------
Co-authored-by: Koushik Billakanti <Koushik.Billakanti@amd.com>
Co-authored-by: gabrpham <Gabriel.Pham@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
This change resolves some of the warnings generated during clr builds.
Quiet regular output of doxygen.
Disable non-documented warnings of doxygen.
Signed-off-by: Sebastian Luzynski <Sebastian.Luzynski@amd.com>
* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements
Buffer Pool Design
------------------
Replace the fixed array-based double buffer with a dynamic pool design to
fix race conditions that caused "internal correlation id was retired
prematurely" errors.
The original design had a race where flush callbacks could be delivered
out-of-order: when buffer 0 fills and begins flushing, writes go to
buffer 1. If buffer 1 fills before buffer 0's flush completes, the
buffer index wraps back to 0 (which may still be flushing). Independent
flush tasks submitted to the thread pool can complete out of order.
The new pool design:
- Uses a std::deque of buffer instances that grows as needed
- Allocates buffers from the pool when the current buffer needs to flush
- Serializes flushes with a mutex to ensure FIFO callback ordering
- Returns buffers to the pool after flush completion
- Eliminates the race between buffer selection and write operations
New Unit Tests
--------------
- buffer_correlation_ordering.cpp: Tests that API records are always
delivered before their corresponding retirement records
- buffer_ordering_stress.cpp: Stress tests buffer flush ordering under
high contention with multiple threads rapidly filling buffers
HSA Tool Hooks
--------------
Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that
waits for pending flush tasks before tool finalization, preventing
"retired prematurely" errors during HSA shutdown.
Sanitizer Improvements
----------------------
- LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder
- LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup)
- TSAN: Added suppression for false positive on C++11 thread-safe static
initialization in create_write_functor
- ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto
- Disabled attachment tests for sanitizers due to library preloading issues
Other Fixes
-----------
- Thread-trace agent test: Use heap-allocated callback state
- Correlation ID: Refactored reference counting and finalization ordering
* [rocprofiler-sdk] Revert buffer pool design changes
Revert buffer.cpp and buffer.hpp to the original double-buffer
design from develop branch. The pool-based redesign introduced
concerns about:
- Signal safety (mutex vs atomic_flag)
- API changes (flush() return type)
- Complexity of the new design
This revert removes:
- Dynamic buffer pool with std::deque
- std::mutex/condition_variable synchronization
- buffer_correlation_ordering.cpp test
- buffer_ordering_stress.cpp test
The underlying buffer flush ordering issue will need to be
addressed with a different approach that preserves the original
API and synchronization characteristics.
* [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization
- Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks
- Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning
- Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp)
- Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior:
  - hsa/queue.cpp (lines 105, 210)
  - hsa/async_copy.cpp (line 344)
  - hsa/hsa_barrier.cpp (line 43)
  - buffer.cpp (lines 107, 138, 185)
This ensures no correlation IDs are created once finalization starts (fini_status != 0),
preventing races between finalization and ongoing tracing operations.
* [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation
Buffer records are not guaranteed to arrive in any specific order. Tests and
samples should use timestamps for temporal ordering validation instead.
Changes:
- samples/external_correlation_id_request: Replace 'retired prematurely' arrival
order check with timestamp-based validation that retirement timestamp >=
max(end_timestamps) for records with the same correlation ID
- tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
Correlation IDs are not guaranteed to be monotonically increasing when records
are sorted by timestamp. Temporal ordering should be validated using the
timestamp fields in each record.
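
A minimal sketch of the timestamp-based check described above, assuming simplified record types. `api_record`, `retirement_record`, and `retirements_are_temporally_valid` are hypothetical and do not mirror the rocprofiler-sdk record layouts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified record types for illustration only.
struct api_record
{
    std::uint64_t correlation_id;
    std::uint64_t end_timestamp;
};

struct retirement_record
{
    std::uint64_t correlation_id;
    std::uint64_t timestamp;
};

// Checks that every retirement timestamp is >= the maximum end timestamp of the
// records sharing its correlation ID. Arrival order of the records is ignored.
bool retirements_are_temporally_valid(const std::vector<api_record>&        records,
                                      const std::vector<retirement_record>& retirements)
{
    std::map<std::uint64_t, std::uint64_t> max_end;
    for(const auto& rec : records)
    {
        auto& value = max_end[rec.correlation_id];
        value       = std::max(value, rec.end_timestamp);
    }

    for(const auto& ret : retirements)
    {
        auto itr = max_end.find(ret.correlation_id);
        if(itr != max_end.end() && ret.timestamp < itr->second) return false;
    }
    return true;
}

// Usage: the retirement for ID 1 arrives "early" in the vector but is still valid.
inline void demo_temporal_validation()
{
    const std::vector<api_record> records = {{1, 100}, {2, 90}, {1, 150}};
    assert(retirements_are_temporally_valid(records, {{1, 150}, {2, 95}}));
    assert(!retirements_are_temporally_valid(records, {{1, 140}}));
}
```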
* [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal
Restore the SYSTEM keyword to target_include_directories for
rocprofiler-sdk-fmt to match develop branch.
* [rccl] Remove orphaned rocSHMEM gitlink
Remove orphaned submodule reference that was introduced during a merge
but never had a corresponding .gitmodules entry, causing CI failures
with "fatal: no submodule mapping found in .gitmodules".
* [rocprofiler-sdk] Add HSA ABI version 0x09 support
Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which
introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release
functions (added in rocr-runtime SWDEV-561708).
* [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations
This commit consolidates fixes for handling the finalization status during
buffer flush operations across the SDK.
Changes:
- Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully
when flushing buffers, as this indicates buffers were already flushed
during finalization (not an error condition)
- HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check
for fini_status to allow operations during finalization process
- buffer.cpp: Revert fini_status checks to use > 0 for consistency
- correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging
to prevent correlation ID creation after finalization starts
Files modified:
- source/lib/rocprofiler-sdk-tool/tool.cpp
- tests/tools/json-tool.cpp
- source/lib/rocprofiler-sdk/tests/registration.cpp
- source/lib/rocprofiler-sdk/tests/roctx.cpp
- samples/api_buffered_tracing/client.cpp
- samples/counter_collection/buffered_client.cpp
- samples/counter_collection/device_counting_async_client.cpp
- samples/external_correlation_id_request/client.cpp
- samples/pc_sampling/client.cpp
- source/lib/rocprofiler-sdk/buffer.cpp
- source/lib/rocprofiler-sdk/context/correlation_id.cpp
- source/lib/rocprofiler-sdk/hsa/queue.cpp
- source/lib/rocprofiler-sdk/hsa/async_copy.cpp
- source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp
* [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling
Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls
in samples and tools. The ERROR_FINALIZED handling was overly complex
and the hsa_tool_hooks OnUnload synchronization is no longer needed.
Changes:
- Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code
- Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL
- Simplify buffer flush in tool.cpp and json-tool.cpp
- Remove ERROR_FINALIZED special handling from test files
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Fix output_stream move semantics to null source pointers
The default move constructor and move assignment operator for
output_stream did not null out the source's pointers after the move.
This caused double-close when the moved-from temporary was destroyed,
leading to use-after-free crashes (SIGSEGV in std::ostream::sentry).
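
The general shape of such a fix can be sketched as below. `output_stream_sketch` is a hypothetical stand-in with a single owning raw pointer; the real output_stream members are not shown in this log. `close_count()` exists only so the single-close behavior can be observed.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <utility>

// Counts how many streams were actually closed (for demonstration only).
inline int& close_count()
{
    static int value = 0;
    return value;
}

// Hypothetical stand-in for a stream wrapper that owns a raw pointer.
struct output_stream_sketch
{
    std::ostream* stream = nullptr;

    output_stream_sketch() = default;
    explicit output_stream_sketch(std::ostream* s)
    : stream{s}
    {}

    output_stream_sketch(const output_stream_sketch&) = delete;
    output_stream_sketch& operator=(const output_stream_sketch&) = delete;

    // Take ownership and null out the source so its destructor does nothing.
    output_stream_sketch(output_stream_sketch&& rhs) noexcept
    : stream{std::exchange(rhs.stream, nullptr)}
    {}

    output_stream_sketch& operator=(output_stream_sketch&& rhs) noexcept
    {
        if(this != &rhs)
        {
            close();  // release whatever this object currently owns
            stream = std::exchange(rhs.stream, nullptr);
        }
        return *this;
    }

    ~output_stream_sketch() { close(); }

    void close()
    {
        if(stream == nullptr) return;
        stream->flush();
        delete stream;
        stream = nullptr;
        ++close_count();
    }
};
```

With a defaulted move constructor, both the moved-from and moved-to objects would hold the same pointer and each destructor would close it; with the version above, only the final owner does.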
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration
- generatePerfetto.cpp: Move output_stream into shared_state to prevent
use-after-free race conditions during Perfetto callback execution
- run-ci.py: Simplify and consolidate sanitizer environment variable
configuration for better maintainability
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions
The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required
for CTest to properly pass suppression files to the sanitizers during
memcheck runs.
Co-Authored-By: Claude <noreply@anthropic.com>
* Revert "[rccl] Remove orphaned rocSHMEM gitlink"
This reverts commit 1ad21003941355658fff8114fa27768f11a948f7.
* [rocprofiler-sdk] Revert registration.cpp changes
Revert changes to registration.cpp to match develop branch.
Co-Authored-By: Claude <noreply@anthropic.com>
* [rocprofiler-sdk] Remove suppression file content printing from run-ci.py
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix output_stream move ctor/assignment operator
* Fix erroneous revert of registration.cpp
* Fix handling of fini status in correlation ID construction
* [rocprofiler-sdk] Fix OMPT segfault during finalization
Add nullptr checks in OMPT tracing code to handle the case where
correlation_tracing_service::construct() returns nullptr during
finalization. This fixes segfaults in openmp-target-sample and
tests.integration.execute.openmp-tools.
The correlation ID construction now returns nullptr when fini_status > 0,
but the OMPT callbacks were not checking for this, causing crashes when
dereferencing the null pointer during OpenMP runtime shutdown.
Changes:
- event_common(): Return nullptr early if correlation ID is null
- event(): Check for nullptr before calling sub_ref_count()
- ompt_task_create_callback(): Return early if correlation ID is null
- ompt_task_schedule_callback(): Return early if correlation ID is null
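
The caller-side pattern above can be sketched as follows, under simplifying assumptions. `correlation_id_sketch`, `fini_status()`, `construct_correlation_id()`, and `on_event()` are hypothetical names for illustration; the SDK's actual types, ownership model, and reference counting are not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical, simplified correlation ID.
struct correlation_id_sketch
{
    std::uint64_t internal = 0;
};

// Hypothetical finalization status: 0 = running, > 0 = finalization started.
inline std::atomic<int>& fini_status()
{
    static std::atomic<int> value{0};
    return value;
}

// Returns nullptr once finalization has started, mirroring the behavior described above.
std::unique_ptr<correlation_id_sketch> construct_correlation_id(std::uint64_t internal)
{
    if(fini_status().load() > 0) return nullptr;
    auto corr_id      = std::make_unique<correlation_id_sketch>();
    corr_id->internal = internal;
    return corr_id;
}

// Callback-style caller: returns early instead of dereferencing a null pointer.
// Returns true if the event was processed.
bool on_event(std::uint64_t internal)
{
    auto corr_id = construct_correlation_id(internal);
    if(!corr_id) return false;  // finalization in progress: skip tracing
    return corr_id->internal == internal;
}

// Usage: events are processed before finalization and skipped afterwards.
inline void demo_fini_guard()
{
    fini_status().store(0);
    assert(on_event(1));
    fini_status().store(1);
    assert(!on_event(2));
}
```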
* [rocprofiler-sdk] Fix HSA API tracing segfault during finalization
Add nullptr check in hsa_api_impl::functor after correlation ID
construction. During finalization, correlation_service::construct()
returns nullptr, and without this check the code would dereference
the null pointer when accessing corr_id->internal.
This fixes the SEGV at address 0x000000000008 (null + 8 byte offset)
that occurs when HSA async event threads call hsa_signal_destroy
during runtime shutdown after finalization has started.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
* SWDEV-558836, SWDEV-558837 - Add hipMemSetMemPool and hipMemGetMemPool implementation
* Add managed allocation type for mem pools
* Update rocprofiler-sdk with API declarations
* SWDEV-558848 - Move DRM calls to thunk for better abstraction
* Use thunk device handle instead of drm inside agent
* Update IPC functions with new thunk calls
* create hsaKmtHandleImport interface to support ipc
* Reset metadata inside hsaKmtMemHandleFree
* remove whitespaces and NULL usage
* Add thunk apis to libhsakmt.ver
* Add comments to new structs in thunk
* Minor fixes to declarations
* resolve merge conflicts in amd_kfd_driver
---------
Co-authored-by: Rahul Manocha <rmanocha@amd.com>
* Adding --torch-operators option in rocprof-compute. Creates a CSV file for
each operator that has GPU activity, showing the mapping from operator to
counter values.
* --torch-operators flag added to rocprofiler-sdk
* Adding ctest for --torch-operators.
* Adding pytest markers.
* Corrections in ctest and message logging.
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Adding a check for pytorch installation only when --torch-operators is passed.
* moving inject_roctx.py into src/utils.
* rebase
* Updating docs and changelog.
* Update projects/rocprofiler-compute/src/argparser.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Removing special characters.
* Minor corrections.
* Setting default value for torch_operators_enabled.
* Updating the number of files according to the number of passes.
* Adding rocpd support.
* Adding a warning message to be shown when profiling a non-python workload.
* copilot suggestions, rocpd+native tool fix
* Fixed the incorrect usage of dispatch_id as event_id in the function update_rocpd_pmc_events()
* ruff format fix
* ruff formatting
* Deleting torch_trace.csvs after consolidating the operator data.
* Removing checks since *torch_trace.csv files are deleted.
* Fixing file deletion.
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/utils/utils.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/tests/test_profile_general.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Using default options in the testcase.
* Adding test for overhead measurement.
* Corrections in docs.
* doc updates.
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Handling potential empty frames.
* Corrected the test cases.
* Changing the flag to --torch-trace
* Fixed helper_app path issues
* Path issues
* process_torch_trace_output() now takes csv file paths as input + allows default usage.
* Replaced pandas with sqlite3
* Adding marker_trace extraction to rocpd_data.py
* Allowing all workloads to use --torch-trace option. Assuming the workload is user verified.
* Modified help section for the flag.
* Added the difference in runtimes of the longest running kernels in each profiling run to the overhead measurements.
* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Removed the accesses to the tables.
* Ruff fixes.
* ruff
* Ruff Fixes
* Adding getattr for args.torch_trace to handle mock args.
* Fix for 'Missing guid in counter collection data - in csv mode'
* Sending output_format to process_torch_trace_output
* Warning for self contained binaries.
* Ruff
* Ruff
* Measuring longest_running_kernel_baseline instead of worst_kernel_increase, because very small kernel runtimes blow up the worst_kernel_increase metric.
* Minor fixes in input arguments
* Ruff
* Logging PyTorch version
* Fix ruff formatting for PyTorch version logging
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Reworked Unit_hipLaunchCooperativeKernel_Basic and Unit_hipLaunchCooperativeKernelMultiDevice_Basic
* Introduce reduction_factor for coop groups tests. Fix Unit_Coalesced_Group_Tiled_Partition_Sync_Positive_Basic
* Fix always false requirement by adding a cast
* Change data type to unsigned long long to align with cuda
* Change literal type to double to ensure proper type casting
* Remove formatting comments
* Initial refactoring work, including using build targets, and settable MSCCLPP_ROOT, MSCCLPP_SOURCE, MSCCLPP_APPLY_PATCHES.
* Another large refactor of MSCCLPP cmake to make all portions targets with appropriate dependencies. This should include all paths to the final target: starting with a full mscclpp install, starting with custom mscclpp and/or json source code, or from submodules + optional patches.
* Update whitespace Findmscclpp_nccl_static.cmake
---------
Co-authored-by: Corey Derochie <corey.derochie@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
Convert a subset of the ctest to pytest to be used in TheRock CI.
Create a new cmake flag `ROCPROFSYS_INSTALL_TESTING` to control test suite installation.
- pytest package will be installed to share/rocprofiler-systems/tests
- all compiled examples are put in share/rocprofiler-systems/examples
- all test relevant scripts are put in share/rocprofiler-systems/tests
- see README.md in share/rocprofiler-systems/tests
* Pin versions in requirements-test.txt
- Validated compatibility to version pins in requirements.txt
- Validated compatibility with pytest, ctest, automatic test suite
- Validated compatibility with Python 3.9, 3.10, 3.11, and 3.12.
* Remove unused mock dependency
* Initial cleanup of compute workflows and skeleton of ghcr workflow
* Add containers-ci.yml, update opensuse and rhel dockerfiles
* rename id in rocprofiler-compute-ghcr.yml
* Add new line to end of containers-ci.yml
* Update action versions for rocprofiler-compute-ghcr.yml
* Switch back to SHA for action versions
* Add conda set solver classic fix to compute CI dockerfiles
* Update conda install for compute Dockerfiles
* Change opensuse version to 15.6 in containers-ci.yml
* Add fix for ubuntu noble to compute Dockerfile.ubuntu.ci
* Add default distro and version to Dockerfile.ubuntu.ci
* Updated regex for tarball version
* Remove Python3.8 from compute CI Dockerfiles
* Change RHEL 9.4 to 9, add retry for compute workflow
* Revert name change for compute rhel workflow
* update path naming
* Remove binutils-gold from Dockerfile.opensuse.ci
* Remove conda python installs from Dockerfile.ci files in compute
* Change CMake version to 3.21 in compute Dockerfile.ci files
* Update checkout actions from v4 to v5