Problem with original test:
- Created circular dependencies between queues:
* Queue1: Kernel A → Barrier(waits for signal_2) → Kernel C
* Queue2: Barrier(waits for signal_1) → Kernel B → sets signal_2
- With strict "one kernel at a time" serialization, this created a deadlock:
* Queue1 executed Kernel A, then blocked on barrier waiting for signal_2
* Serializer switched to Queue2, but Queue2 was blocked waiting for signal_1
* Neither queue could proceed: Queue1 needed Queue2's Kernel B to complete,
but Queue2 couldn't start until Queue1 finished completely
- Test would hang indefinitely at hsa_signal_wait_relaxed() for signal_2
Solution implemented:
- Reordered packet submission to eliminate circular dependencies
- Ensured signal producers execute before consumers need them:
* Kernel A produces signal_1 before Queue2's barrier needs it
* Kernel B produces signal_2 before Queue1's continuation needs it
- Dependencies now flow forward without cycles, allowing serializer progress
Refactoring changes:
- Extract common functionality into helper functions:
* create_completion_signal() for signal creation
* create_queue() for queue creation
* submit_kernel_packet() for kernel dispatch packets
* submit_barrier_packet() for barrier packets
- Add comprehensive documentation explaining expected execution pattern
- Simplify main() function making the dependency flow more readable
Co-authored-by: Benjamin Welton <bewelton@amd.com>
ROCprofiler-SDK: Application Profiling, Tracing, and Performance Analysis
Important
We are phasing out development and support for ROCTracer, ROCprofiler, rocprof, and rocprofv2 in favour of ROCprofiler-SDK and rocprofv3 in upcoming ROCm releases. Starting with the ROCm 6.4 release, only critical defect fixes will be addressed for older versions of the profiling tools and libraries. We encourage all users to upgrade to the latest version of the ROCprofiler-SDK library and the rocprofv3 tool to ensure continued support and access to new features.
Please note that we anticipate the end of life for ROCprofiler V1/V2 and ROCTracer within nine months after the ROCm 7.0 release, aligning with Q1 2026.
Overview
ROCprofiler-SDK is AMD’s new and improved tooling infrastructure, providing a hardware-specific low-level performance analysis interface for profiling and tracing GPU compute applications. To see what's changed, Click Here
Note
The published documentation is available at ROCprofiler-SDK documentation in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the rocprofiler-sdk/source/docs folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see Contribute to ROCm documentation.
GPU Metrics
- GPU hardware counters
- Dispatch Counter Collection
- Device Counter Collection
- PC Sampling (Host Trap)
- Thread trace and ROCprof trace decoder (SQTT, ATT)
API Trace Support
- HIP API tracing
- HSA API tracing
- Marker (ROCTx) tracing
- Memory copy tracing
- Memory allocation tracing
- Page Migration Event tracing
- Scratch Memory tracing
- RCCL API tracing
- rocDecode API tracing
- rocJPEG API tracing
Parallelism API Support
- HIP
- HSA
- MPI
- Kokkos-Tools (KokkosP)
- OpenMP-Tools (OMPT)
Tool Support
rocprofv3 is the command-line tool built on the rocprofiler-sdk library and shipped with the ROCm stack. For details on the rocprofv3 command-line options, see the rocprofv3 user guide.
Documentation
We use Doxygen to generate API documentation automatically. The generated documentation can be found at the following path:
<ROCM_PATH>/share/html/rocprofiler-sdk
ROCM_PATH is /opt/rocm by default. It can be set to a different location if needed.
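As a minimal sketch of the path rule above, the following shell function resolves the documentation directory. It assumes ROCM_PATH is available as a shell or environment variable (the text above uses it only as a placeholder), and the function name rocprofiler_docs_dir is hypothetical, not part of ROCm.

```shell
# Hypothetical helper: print the directory holding the generated API docs.
# Assumption: ROCM_PATH may be set in the environment; if it is unset or
# empty, fall back to the default prefix /opt/rocm.
rocprofiler_docs_dir() {
    printf '%s\n' "${ROCM_PATH:-/opt/rocm}/share/html/rocprofiler-sdk"
}

# Print the resolved location for the current environment.
rocprofiler_docs_dir
```

Running the script with ROCM_PATH set to a custom install prefix prints the corresponding directory under that prefix instead of the default.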
Build and Installation
git clone https://github.com/ROCm/rocprofiler-sdk.git rocprofiler-sdk-source
cmake \
-B rocprofiler-sdk-build \
-DCMAKE_INSTALL_PREFIX=/opt/rocm \
-DCMAKE_PREFIX_PATH=/opt/rocm \
rocprofiler-sdk-source
cmake --build rocprofiler-sdk-build --target all --parallel $(nproc)
To install ROCprofiler, run:
cmake --build rocprofiler-sdk-build --target install
For more detail, see the build and installation section of the documentation.
Support
Please report issues on GitHub or send an email to dl.ROCm-Profiler.support@amd.com
Limitations
- Individual XCC mode is not supported.
- By default, the PC sampling API is disabled. Setting the ROCPROFILER_PC_SAMPLING_BETA_ENABLED environment variable grants access to the experimental PC sampling beta feature. This feature is still under development and may not be completely stable.
  - Risk Acknowledgment: By activating this environment variable, you acknowledge and accept the following potential risks:
    - Hardware Freeze: This beta feature could cause your hardware to freeze unexpectedly.
    - Need for Cold Restart: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations.
    Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.
  - At this point, we do not recommend stress-testing the beta implementation.
  - Correlation IDs provided by the PC sampling service are verified only for HIP API calls.
  - Timestamps in PC sampling records might not be 100% accurate.
  - For short PC sampling intervals (< 65k cycles), many error samples might be delivered. We're working on optimizing this to allow higher sampling frequencies.
- gfx11 and gfx12 architectures require a stable power state for counter collection. This includes AMD Radeon RX 7000 series GPUs and newer.

      # For device <N>. Use 'rocm-smi' or 'amd-smi monitor' to see device number.
      sudo amd-smi set -g <N> -l stable_std
      # After profiling, set power state back to 'auto'
      sudo amd-smi set -g <N> -l auto

  The gfx version can be found via amd-smi static --asic -g <N> in the TARGET_GRAPHICS_VERSION field:

      $ amd-smi static -a -g 2
      GPU: 2
          ASIC:
              MARKET_NAME: Navi 33 [Radeon Pro W7500]
              VENDOR_ID: 0x1002
              VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI]
              SUBVENDOR_ID: 0x1002
              DEVICE_ID: 0x7489
              SUBSYSTEM_ID: 0x0e0d
              REV_ID: 0x00
              ASIC_SERIAL: N/A
              OAM_ID: N/A
              NUM_COMPUTE_UNITS: 28
              TARGET_GRAPHICS_VERSION: gfx1102
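
The gfx11/gfx12 check described above could be scripted along the following lines. This is a sketch under stated assumptions: the helper names gfx_version_from_amd_smi and needs_stable_power_state are hypothetical, and matching on the gfx11 and gfx12 prefixes assumes every target in those families needs the stable power state. The amd-smi commands themselves are the ones shown above and are left commented out because they need a GPU and sudo.

```shell
# Hypothetical helper: read `amd-smi static --asic` output on stdin and print
# the value of the first TARGET_GRAPHICS_VERSION field (for example gfx1102).
gfx_version_from_amd_smi() {
    sed -n 's/^[[:space:]]*TARGET_GRAPHICS_VERSION:[[:space:]]*//p' | head -n 1
}

# Hypothetical helper: succeed (exit status 0) when the given gfx target
# belongs to the gfx11 or gfx12 family, based on a simple prefix match.
needs_stable_power_state() {
    case "$1" in
        gfx11*|gfx12*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use for device 0 (commented out: requires amd-smi, a GPU, and sudo):
# gfx=$(amd-smi static --asic -g 0 | gfx_version_from_amd_smi)
# if needs_stable_power_state "$gfx"; then
#     sudo amd-smi set -g 0 -l stable_std
#     # ... run counter collection here ...
#     sudo amd-smi set -g 0 -l auto
# fi
```

Keeping the parsing and the decision in separate functions lets each be checked against captured amd-smi output without touching the hardware.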
Warning
To use ROCprofiler-SDK, obtain the latest mainline version of AQLprofile from here.