rocm-systems/projects/rocprofiler-sdk/tests/thread-trace/agent.cpp
Benjamin Welton 1517a398bf [rocprofiler-sdk] Buffer finalization fixes and HSA ABI 0x09 support (#2318)
* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements

Buffer Pool Design
------------------
Replace the fixed array-based double buffer with a dynamic pool design to
fix race conditions that caused "internal correlation id was retired
prematurely" errors.

The original design had a race where flush callbacks could be delivered
out-of-order: when buffer 0 fills and begins flushing, writes go to
buffer 1. If buffer 1 fills before buffer 0's flush completes, the
buffer index wraps back to 0 (which may still be flushing). Independent
flush tasks submitted to the thread pool can complete out of order.

The new pool design:
- Uses a std::deque of buffer instances that grows as needed
- Allocates buffers from the pool when the current buffer needs to flush
- Serializes flushes with a mutex to ensure FIFO callback ordering
- Returns buffers to the pool after flush completion
- Eliminates the race between buffer selection and write operations
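
The pool design listed above can be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the code from this change
(which was later reverted): buffer_pool, its members, and the int record type
are hypothetical names chosen for the example. Full buffers are queued under
one mutex and delivered by a single drainer under a second mutex, which is one
way to get FIFO callback ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of a pooled buffer; names are illustrative, not SDK API.
class buffer_pool
{
public:
    using records_t  = std::vector<int>;
    using callback_t = std::function<void(const records_t&)>;

    buffer_pool(std::size_t capacity, callback_t callback)
    : m_capacity{capacity}
    , m_callback{std::move(callback)}
    {
        m_current = acquire_locked();  // no other thread can see the pool yet
    }

    // Append one record; a full buffer is queued for flushing and replaced from the pool.
    void emplace(int value)
    {
        {
            std::lock_guard<std::mutex> lk{m_state_mutex};
            m_current->push_back(value);
            if(m_current->size() < m_capacity) return;
            m_pending.push_back(std::exchange(m_current, acquire_locked()));
        }
        drain();
    }

    // Queue the partially filled buffer (if any) and deliver everything pending.
    void flush()
    {
        {
            std::lock_guard<std::mutex> lk{m_state_mutex};
            if(!m_current->empty())
                m_pending.push_back(std::exchange(m_current, acquire_locked()));
        }
        drain();
    }

private:
    using buffer_ptr = std::unique_ptr<records_t>;

    // Caller must hold m_state_mutex (or be the constructor).
    buffer_ptr acquire_locked()
    {
        if(m_free.empty()) return std::make_unique<records_t>();  // pool grows on demand
        auto buf = std::move(m_free.back());
        m_free.pop_back();
        return buf;
    }

    // One drainer at a time, always popping from the front, so callbacks are FIFO.
    void drain()
    {
        std::lock_guard<std::mutex> flush_lk{m_flush_mutex};
        for(;;)
        {
            buffer_ptr buf{};
            {
                std::lock_guard<std::mutex> lk{m_state_mutex};
                if(m_pending.empty()) return;
                buf = std::move(m_pending.front());
                m_pending.pop_front();
            }
            m_callback(*buf);
            buf->clear();
            std::lock_guard<std::mutex> lk{m_state_mutex};
            m_free.push_back(std::move(buf));  // return the buffer to the pool
        }
    }

    std::size_t            m_capacity;
    callback_t             m_callback;
    std::mutex             m_state_mutex{};
    std::mutex             m_flush_mutex{};
    std::deque<buffer_ptr> m_free{};
    std::deque<buffer_ptr> m_pending{};
    buffer_ptr             m_current{};
};
```

In this sketch the callback must not write to the same pool, because the flush
mutex is not recursive. The use of std::mutex is also what raised the
signal-safety concern cited in the revert further down.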

New Unit Tests
--------------
- buffer_correlation_ordering.cpp: Tests that API records are always
  delivered before their corresponding retirement records
- buffer_ordering_stress.cpp: Stress tests buffer flush ordering under
  high contention with multiple threads rapidly filling buffers

HSA Tool Hooks
--------------
Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that
waits for pending flush tasks before tool finalization, preventing
"retired prematurely" errors during HSA shutdown.

Sanitizer Improvements
----------------------
- LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder
- LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup)
- TSAN: Added suppression for false positive on C++11 thread-safe static
  initialization in create_write_functor
- ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto
- Disabled attachment tests for sanitizers due to library preloading issues

Other Fixes
-----------
- Thread-trace agent test: Use heap-allocated callback state
- Correlation ID: Refactored reference counting and finalization ordering

* [rocprofiler-sdk] Revert buffer pool design changes

Revert buffer.cpp and buffer.hpp to the original double-buffer
design from develop branch. The pool-based redesign introduced
concerns about:
- Signal safety (mutex vs atomic_flag)
- API changes (flush() return type)
- Complexity of the new design

This revert removes:
- Dynamic buffer pool with std::deque
- std::mutex/condition_variable synchronization
- buffer_correlation_ordering.cpp test
- buffer_ordering_stress.cpp test

The underlying buffer flush ordering issue will need to be
addressed with a different approach that preserves the original
API and synchronization characteristics.

* [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization

- Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks
- Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning
- Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp)
- Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior:
  - hsa/queue.cpp (lines 105, 210)
  - hsa/async_copy.cpp (line 344)
  - hsa/hsa_barrier.cpp (line 43)
  - buffer.cpp (lines 107, 138, 185)

This ensures no correlation IDs are created once finalization starts (fini_status != 0),
preventing races between finalization and ongoing tracing operations.
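
As an illustration of the guard described above, here is a minimal sketch with
hypothetical stand-ins for the SDK internals. The sketch namespace, the
unique_ptr return type, and trace_call() are assumptions made for the example;
the real correlation IDs are reference counted and are not modelled here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for the internals named in the commit message.
namespace sketch
{
std::atomic<int> fini_status{0};  // 0 = running, nonzero = finalization started

struct correlation_id
{
    uint64_t internal = 0;
};

// Returns nullptr once finalization has started, so no new IDs are created.
inline std::unique_ptr<correlation_id>
construct()
{
    static std::atomic<uint64_t> next_id{1};
    if(fini_status.load() != 0) return nullptr;
    auto corr_id      = std::make_unique<correlation_id>();
    corr_id->internal = next_id.fetch_add(1);
    return corr_id;
}

// Call site: bail out instead of dereferencing a null correlation ID.
inline uint64_t
trace_call()
{
    auto corr_id = construct();
    if(!corr_id) return 0;  // tracing is skipped during finalization
    return corr_id->internal;
}
}  // namespace sketch
```

The point of the pattern is that the check lives in one place (construct) and
every caller only has to handle the nullptr result.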

* [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation

Buffer records are not guaranteed to arrive in any specific order. Tests and
samples should use timestamps for temporal ordering validation instead.

Changes:
- samples/external_correlation_id_request: Replace 'retired prematurely' arrival
  order check with timestamp-based validation that retirement timestamp >=
  max(end_timestamps) for records with the same correlation ID
- tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check

Correlation IDs are not guaranteed to be monotonically increasing when records
are sorted by timestamp. Temporal ordering should be validated using the
timestamp fields in each record.
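
A minimal sketch of the timestamp-based check, assuming simplified record
types: api_record, retirement_record, and the function name are hypothetical
and only carry the fields needed for the comparison. Arrival order plays no
role; only the timestamps for a given correlation ID are compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, reduced record types for illustration only.
struct api_record
{
    uint64_t correlation_id;
    uint64_t start_timestamp;
    uint64_t end_timestamp;
};

struct retirement_record
{
    uint64_t correlation_id;
    uint64_t timestamp;
};

// True when every retirement timestamp is >= the latest end_timestamp of the
// API records that share its correlation ID, regardless of arrival order.
inline bool
retirements_are_temporally_valid(const std::vector<api_record>&        api,
                                 const std::vector<retirement_record>& retired)
{
    std::unordered_map<uint64_t, uint64_t> max_end{};
    for(const auto& rec : api)
    {
        auto& latest = max_end[rec.correlation_id];
        latest       = std::max(latest, rec.end_timestamp);
    }

    for(const auto& rec : retired)
    {
        auto itr = max_end.find(rec.correlation_id);
        if(itr != max_end.end() && rec.timestamp < itr->second) return false;
    }
    return true;
}
```

Retirement records whose correlation ID has no API record are accepted in this
sketch; a stricter test could treat them as failures.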

* [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal

Restore the SYSTEM keyword to target_include_directories for
rocprofiler-sdk-fmt to match develop branch.

* [rccl] Remove orphaned rocSHMEM gitlink

Remove orphaned submodule reference that was introduced during a merge
but never had a corresponding .gitmodules entry, causing CI failures
with "fatal: no submodule mapping found in .gitmodules".

* [rocprofiler-sdk] Add HSA ABI version 0x09 support

Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which
introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release
functions (added in rocr-runtime SWDEV-561708).

* [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations

This commit consolidates fixes for handling the finalization status during
buffer flush operations across the SDK.

Changes:
- Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully
  when flushing buffers, as this indicates buffers were already flushed
  during finalization (not an error condition)
- HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check
  for fini_status to allow operations during finalization process
- buffer.cpp: Revert fini_status checks to use > 0 for consistency
- correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging
  to prevent correlation ID creation after finalization starts

Files modified:
- source/lib/rocprofiler-sdk-tool/tool.cpp
- tests/tools/json-tool.cpp
- source/lib/rocprofiler-sdk/tests/registration.cpp
- source/lib/rocprofiler-sdk/tests/roctx.cpp
- samples/api_buffered_tracing/client.cpp
- samples/counter_collection/buffered_client.cpp
- samples/counter_collection/device_counting_async_client.cpp
- samples/external_correlation_id_request/client.cpp
- samples/pc_sampling/client.cpp
- source/lib/rocprofiler-sdk/buffer.cpp
- source/lib/rocprofiler-sdk/context/correlation_id.cpp
- source/lib/rocprofiler-sdk/hsa/queue.cpp
- source/lib/rocprofiler-sdk/hsa/async_copy.cpp
- source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp

* [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling

Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls
in samples and tools. The ERROR_FINALIZED handling was overly complex
and the hsa_tool_hooks OnUnload synchronization is no longer needed.

Changes:
- Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code
- Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL
- Simplify buffer flush in tool.cpp and json-tool.cpp
- Remove ERROR_FINALIZED special handling from test files

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Fix output_stream move semantics to null source pointers

The default move constructor and move assignment operator for
output_stream did not null out the source's pointers after the move.
This caused double-close when the moved-from temporary was destroyed,
leading to use-after-free crashes (SIGSEGV in std::ostream::sentry).
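
The failure mode and the fix can be sketched with a simplified owning wrapper.
owned_stream below is a hypothetical type, not the SDK's output_stream; it
assumes a single owned std::ostream pointer and counts closes so the
single-close property is observable.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <utility>

// Hypothetical simplification of an owning stream wrapper.
struct owned_stream
{
    std::ostream*     stream = nullptr;
    static inline int close_count = 0;  // counts real closes (C++17 inline variable)

    owned_stream() = default;
    explicit owned_stream(std::ostream* s)
    : stream{s}
    {}

    owned_stream(const owned_stream&) = delete;
    owned_stream& operator=(const owned_stream&) = delete;

    // Take ownership and null the source so its destructor is a no-op.
    owned_stream(owned_stream&& rhs) noexcept
    : stream{std::exchange(rhs.stream, nullptr)}
    {}

    owned_stream& operator=(owned_stream&& rhs) noexcept
    {
        if(this != &rhs)
        {
            close();  // release whatever this object owned before
            stream = std::exchange(rhs.stream, nullptr);
        }
        return *this;
    }

    ~owned_stream() { close(); }

    void close()
    {
        if(!stream) return;
        delete stream;
        stream = nullptr;
        ++close_count;
    }
};
```

With defaulted move operations the pointer would simply be copied, both objects
would delete it, and the second delete is the double-close described above.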

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration

- generatePerfetto.cpp: Move output_stream into shared_state to prevent
  use-after-free race conditions during Perfetto callback execution
- run-ci.py: Simplify and consolidate sanitizer environment variable
  configuration for better maintainability

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions

The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required
for CTest to properly pass suppression files to the sanitizers during
memcheck runs.

Co-Authored-By: Claude <noreply@anthropic.com>

* Revert "[rccl] Remove orphaned rocSHMEM gitlink"

This reverts commit 1ad21003941355658fff8114fa27768f11a948f7.

* [rocprofiler-sdk] Revert registration.cpp changes

Revert changes to registration.cpp to match develop branch.

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Remove suppression file content printing from run-ci.py

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix output_stream move ctor/assignment operator

* Fix erroneous revert of registration.cpp

* Fix handling of fini status in correlation ID construction

* [rocprofiler-sdk] Fix OMPT segfault during finalization

Add nullptr checks in OMPT tracing code to handle the case where
correlation_tracing_service::construct() returns nullptr during
finalization. This fixes segfaults in openmp-target-sample and
tests.integration.execute.openmp-tools.

The correlation ID construction now returns nullptr when fini_status > 0,
but the OMPT callbacks were not checking for this, causing crashes when
dereferencing the null pointer during OpenMP runtime shutdown.

Changes:
- event_common(): Return nullptr early if correlation ID is null
- event(): Check for nullptr before calling sub_ref_count()
- ompt_task_create_callback(): Return early if correlation ID is null
- ompt_task_schedule_callback(): Return early if correlation ID is null

* [rocprofiler-sdk] Fix HSA API tracing segfault during finalization

Add nullptr check in hsa_api_impl::functor after correlation ID
construction. During finalization, correlation_service::construct()
returns nullptr, and without this check the code would dereference
the null pointer when accessing corr_id->internal.

This fixes the SEGV at address 0x000000000008 (null + 8 byte offset)
that occurs when HSA async event threads call hsa_signal_destroy
during runtime shutdown after finalization has started.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2026-01-27 13:27:54 -05:00

256 lines
9.8 KiB
C++

// MIT License
//
// Copyright (c) 2024-2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
//
// undefine NDEBUG so asserts stay enabled in release builds
#ifdef NDEBUG
# undef NDEBUG
#endif
#include "trace_callbacks.hpp"

#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <set>
#include <string>
#include <vector>

namespace ATTTest
{
namespace Agent
{
rocprofiler_client_id_t* client_id   = nullptr;
rocprofiler_context_id_t agent_ctx   = {};
rocprofiler_context_id_t tracing_ctx = {};

// Callback state allocated on heap to control destruction order
struct CallbackState
{
    std::atomic<bool> isprofiling{false};
    std::atomic<bool> stop_profiling{false};
    std::mutex        mut{};
    std::set<int>     captured_ids{};
};

CallbackState* callback_state = nullptr;

void
tool_fini(void* tool_data)
{
    // Stop contexts to ensure no more callbacks are dispatched before static destruction
    rocprofiler_stop_context(tracing_ctx);
    rocprofiler_stop_context(agent_ctx);

    // Call the shared finalize logic
    Callbacks::finalize(tool_data);

    // Clean up heap-allocated callback state after finalize
    delete callback_state;
    callback_state = nullptr;
}

void
dispatch_tracing_callback(rocprofiler_callback_tracing_record_t record,
                          rocprofiler_user_data_t* /* user_data */,
                          void* /* userdata */)
{
    if(record.kind != ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH) return;
    if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT) return;

    // Check if callback_state is still valid (may be null during shutdown)
    if(!callback_state) return;

    assert(record.payload);
    auto* rdata = static_cast<rocprofiler_callback_tracing_kernel_dispatch_data_t*>(record.payload);
    auto  dispatch_id = rdata->dispatch_info.dispatch_id;

    // Choose two dispatches to begin(6) and end(10) the trace
    constexpr uint64_t begin_dispatch = 6;
    constexpr uint64_t end_dispatch   = 10;

    if(record.phase == ROCPROFILER_CALLBACK_PHASE_ENTER)
    {
        if(dispatch_id == begin_dispatch)
        {
            ROCPROFILER_CALL(rocprofiler_start_context(agent_ctx), "context start");
            callback_state->isprofiling.store(true);
        }

        if(callback_state->isprofiling && dispatch_id <= end_dispatch)
        {
            std::unique_lock<std::mutex> lk(callback_state->mut);
            callback_state->captured_ids.insert(dispatch_id);
        }

        if(dispatch_id > end_dispatch) callback_state->stop_profiling.store(true);
        return;
    }

    assert(record.phase == ROCPROFILER_CALLBACK_PHASE_NONE);
    if(!callback_state->isprofiling) return;

    std::unique_lock<std::mutex> lk(callback_state->mut);
    callback_state->captured_ids.erase(dispatch_id);

    if(!callback_state->captured_ids.empty() || callback_state->stop_profiling == false) return;

    // Only the thread that flips isprofiling from true to false stops the context
    bool _exp = true;
    if(!callback_state->isprofiling.compare_exchange_strong(_exp, false, std::memory_order_relaxed))
        return;

    ROCPROFILER_CALL(rocprofiler_stop_context(agent_ctx), "context stop");
}

rocprofiler_status_t
query_available_agents(rocprofiler_agent_version_t /* version */,
                       const void** agents,
                       size_t       num_agents,
                       void*        user_data)
{
    rocprofiler_user_data_t user{};
    user.ptr = user_data;

    for(size_t idx = 0; idx < num_agents; idx++)
    {
        const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents[idx]);
        if(agent->type != ROCPROFILER_AGENT_TYPE_GPU) continue;

        uint64_t buffer_size_gb = 1;

        // Are we testing for larger buffers?
        if(const char* var = std::getenv("ATT_LARGE_BUFFER_TEST"); var && std::atoi(var))
        {
            // To fully test this feature, we need >4GB per shader engine (>8GB total).
            // Some RDNA GPUs only have 8GB of VRAM, so we have to use 5GB total = 2.5GB per SE.
            uint64_t total_memory = 0;
            for(uint32_t i = 0; i < agent->mem_banks_count; i++)
                total_memory += agent->mem_banks[i].size_in_bytes;

            // Check we have >11GB VRAM. If so, allocate 10GB.
            if(total_memory > (11ul << 30))
                buffer_size_gb = 10;
            else
                buffer_size_gb = 5;
        }

        uint64_t buffer_size_bytes = buffer_size_gb << 30;
        if(agent->gfx_target_version / 10000 == 11u)
            buffer_size_bytes = 255ul << 20;  // gfx11 limitation

        auto parameters = std::vector<rocprofiler_thread_trace_parameter_t>{};
        parameters.push_back({ROCPROFILER_THREAD_TRACE_PARAMETER_TARGET_CU, {1}});
        parameters.push_back({ROCPROFILER_THREAD_TRACE_PARAMETER_SIMD_SELECT, {0xF}});
        parameters.push_back({ROCPROFILER_THREAD_TRACE_PARAMETER_BUFFER_SIZE, {buffer_size_bytes}});
        parameters.push_back({ROCPROFILER_THREAD_TRACE_PARAMETER_SHADER_ENGINE_MASK, {0x3}});

        static const bool extra_args =
            std::getenv("ATT_NODETAIL") ? std::stoi(std::getenv("ATT_NODETAIL")) != 0 : false;

        if(extra_args)
        {
            // Don't generate instruction profiling, only occupancy and shaderdata
            parameters.emplace_back(rocprofiler_thread_trace_parameter_t{
                ROCPROFILER_THREAD_TRACE_PARAMETER_NO_DETAIL, {1}});
        }

        ROCPROFILER_CALL(
            rocprofiler_configure_device_thread_trace_service(agent_ctx,
                                                              agent->id,
                                                              parameters.data(),
                                                              parameters.size(),
                                                              Callbacks::shader_data_callback,
                                                              user),
            "thread trace service configure");
    }

    return ROCPROFILER_STATUS_SUCCESS;
}

int
tool_init(rocprofiler_client_finalize_t /* fini_func */, void* /* tool_data */)
{
    Callbacks::init();

    // Allocate callback state on heap for controlled destruction order
    callback_state = new CallbackState{};

    ROCPROFILER_CALL(rocprofiler_create_context(&tracing_ctx), "context creation");
    ROCPROFILER_CALL(rocprofiler_create_context(&agent_ctx), "context creation");

    ROCPROFILER_CALL(
        rocprofiler_configure_callback_tracing_service(tracing_ctx,
                                                       ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT,
                                                       nullptr,
                                                       0,
                                                       Callbacks::tool_codeobj_tracing_callback,
                                                       nullptr),
        "code object tracing service configure");

    ROCPROFILER_CALL(
        rocprofiler_configure_callback_tracing_service(tracing_ctx,
                                                       ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
                                                       nullptr,
                                                       0,
                                                       dispatch_tracing_callback,
                                                       nullptr),
        "dispatch tracing service configure");

    ROCPROFILER_CALL(rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
                                                        &query_available_agents,
                                                        sizeof(rocprofiler_agent_t),
                                                        nullptr),
                     "Failed to find GPU agents");

    int valid_ctx = 0;
    ROCPROFILER_CALL(rocprofiler_context_is_valid(agent_ctx, &valid_ctx), "validity check");
    assert(valid_ctx != 0);
    ROCPROFILER_CALL(rocprofiler_context_is_valid(tracing_ctx, &valid_ctx), "validity check");
    assert(valid_ctx != 0);

    ROCPROFILER_CALL(rocprofiler_start_context(tracing_ctx), "context start");

    // no errors
    return 0;
}
} // namespace Agent
} // namespace ATTTest

extern "C" rocprofiler_tool_configure_result_t*
rocprofiler_configure(uint32_t /* version */,
                      const char* /* runtime_version */,
                      uint32_t                 priority,
                      rocprofiler_client_id_t* id)
{
    // only activate if main tool
    if(priority > 0) return nullptr;

    // set the client name
    id->name = "ATT_test_agent";

    // store client info
    ATTTest::Agent::client_id = id;

    // create configure data
    static auto cfg =
        rocprofiler_tool_configure_result_t{sizeof(rocprofiler_tool_configure_result_t),
                                            &ATTTest::Agent::tool_init,
                                            &ATTTest::Agent::tool_fini,
                                            nullptr};

    // return pointer to configure data
    return &cfg;
}
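
The compare-exchange guard at the end of dispatch_tracing_callback ensures that
only one completing dispatch stops the agent context. A minimal standalone
sketch of that pattern follows; stop_once and try_stop are hypothetical names,
and a counter stands in for the rocprofiler_stop_context call.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical illustration of the "stop exactly once" guard.
struct stop_once
{
    std::atomic<bool> is_profiling{true};
    int               stop_calls = 0;  // stand-in for the real stop call

    // Returns true only for the caller that flips is_profiling from true to false.
    bool try_stop()
    {
        bool expected = true;
        if(!is_profiling.compare_exchange_strong(expected, false)) return false;
        ++stop_calls;
        return true;
    }
};
```

A plain load followed by a store would let two threads both observe true and
both issue the stop; the single atomic read-modify-write closes that window.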