1517a398bf
* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements Buffer Pool Design ------------------ Replace the fixed array-based double buffer with a dynamic pool design to fix race conditions that caused "internal correlation id was retired prematurely" errors. The original design had a race where flush callbacks could be delivered out-of-order: when buffer 0 fills and begins flushing, writes go to buffer 1. If buffer 1 fills before buffer 0's flush completes, the buffer index wraps back to 0 (which may still be flushing). Independent flush tasks submitted to the thread pool can complete out of order. The new pool design: - Uses a std::deque of buffer instances that grows as needed - Allocates buffers from the pool when the current buffer needs to flush - Serializes flushes with a mutex to ensure FIFO callback ordering - Returns buffers to the pool after flush completion - Eliminates the race between buffer selection and write operations New Unit Tests -------------- - buffer_correlation_ordering.cpp: Tests that API records are always delivered before their corresponding retirement records - buffer_ordering_stress.cpp: Stress tests buffer flush ordering under high contention with multiple threads rapidly filling buffers HSA Tool Hooks -------------- Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that waits for pending flush tasks before tool finalization, preventing "retired prematurely" errors during HSA shutdown. Sanitizer Improvements ---------------------- - LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder - LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup) - TSAN: Added suppression for false positive on C++11 thread-safe static initialization in create_write_functor - ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto - Disabled attachment tests for sanitizers due to library preloading issues Other Fixes ----------- - Thread-trace agent test: Use heap-allocated callback state - Correlation ID: Refactored reference counting and finalization ordering * [rocprofiler-sdk] Revert buffer pool design changes Revert buffer.cpp and buffer.hpp to the original double-buffer design from develop branch. The pool-based redesign introduced concerns about: - Signal safety (mutex vs atomic_flag) - API changes (flush() return type) - Complexity of the new design This revert removes: - Dynamic buffer pool with std::deque - std::mutex/condition_variable synchronization - buffer_correlation_ordering.cpp test - buffer_ordering_stress.cpp test The underlying buffer flush ordering issue will need to be addressed with a different approach that preserves the original API and synchronization characteristics. * [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization - Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks - Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning - Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp) - Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior: - hsa/queue.cpp (lines 105, 210) - hsa/async_copy.cpp (line 344) - hsa/hsa_barrier.cpp (line 43) - buffer.cpp (lines 107, 138, 185) This ensures no correlation IDs are created once finalization starts (fini_status != 0), preventing races between finalization and ongoing tracing operations. * [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation Buffer records are not guaranteed to arrive in any specific order. Tests and samples should use timestamps for temporal ordering validation instead. Changes: - samples/external_correlation_id_request: Replace 'retired prematurely' arrival order check with timestamp-based validation that retirement timestamp >= max(end_timestamps) for records with the same correlation ID - tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check - tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check - tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check Correlation IDs are not guaranteed to be monotonically increasing when records are sorted by timestamp. Temporal ordering should be validated using the timestamp fields in each record. * [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal Restore the SYSTEM keyword to target_include_directories for rocprofiler-sdk-fmt to match develop branch. * [rccl] Remove orphaned rocSHMEM gitlink Remove orphaned submodule reference that was introduced during a merge but never had a corresponding .gitmodules entry, causing CI failures with "fatal: no submodule mapping found in .gitmodules". * [rocprofiler-sdk] Add HSA ABI version 0x09 support Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release functions (added in rocr-runtime SWDEV-561708). * [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations This commit consolidates fixes for handling the finalization status during buffer flush operations across the SDK. Changes: - Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully when flushing buffers, as this indicates buffers were already flushed during finalization (not an error condition) - HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check for fini_status to allow operations during finalization process - buffer.cpp: Revert fini_status checks to use > 0 for consistency - correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging to prevent correlation ID creation after finalization starts Files modified: - source/lib/rocprofiler-sdk-tool/tool.cpp - tests/tools/json-tool.cpp - source/lib/rocprofiler-sdk/tests/registration.cpp - source/lib/rocprofiler-sdk/tests/roctx.cpp - samples/api_buffered_tracing/client.cpp - samples/counter_collection/buffered_client.cpp - samples/counter_collection/device_counting_async_client.cpp - samples/external_correlation_id_request/client.cpp - samples/pc_sampling/client.cpp - source/lib/rocprofiler-sdk/buffer.cpp - source/lib/rocprofiler-sdk/context/correlation_id.cpp - source/lib/rocprofiler-sdk/hsa/queue.cpp - source/lib/rocprofiler-sdk/hsa/async_copy.cpp - source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp * [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls in samples and tools. The ERROR_FINALIZED handling was overly complex and the hsa_tool_hooks OnUnload synchronization is no longer needed. Changes: - Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code - Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL - Simplify buffer flush in tool.cpp and json-tool.cpp - Remove ERROR_FINALIZED special handling from test files Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Fix output_stream move semantics to null source pointers The default move constructor and move assignment operator for output_stream did not null out the source's pointers after the move. This caused double-close when the moved-from temporary was destroyed, leading to use-after-free crashes (SIGSEGV in std::ostream::sentry). Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration - generatePerfetto.cpp: Move output_stream into shared_state to prevent use-after-free race conditions during Perfetto callback execution - run-ci.py: Simplify and consolidate sanitizer environment variable configuration for better maintainability Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required for CTest to properly pass suppression files to the sanitizers during memcheck runs. Co-Authored-By: Claude <noreply@anthropic.com> * Revert "[rccl] Remove orphaned rocSHMEM gitlink" This reverts commit 1ad21003941355658fff8114fa27768f11a948f7. * [rocprofiler-sdk] Revert registration.cpp changes Revert changes to registration.cpp to match develop branch. Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Remove suppression file content printing from run-ci.py Co-Authored-By: Claude <noreply@anthropic.com> * Fix output_stream move ctor/assignment operator * Fix erroneous revert of registration.cpp * Fix handling of fini status in correlation ID construction * [rocprofiler-sdk] Fix OMPT segfault during finalization Add nullptr checks in OMPT tracing code to handle the case where correlation_tracing_service::construct() returns nullptr during finalization. This fixes segfaults in openmp-target-sample and tests.integration.execute.openmp-tools. The correlation ID construction now returns nullptr when fini_status > 0, but the OMPT callbacks were not checking for this, causing crashes when dereferencing the null pointer during OpenMP runtime shutdown. Changes: - event_common(): Return nullptr early if correlation ID is null - event(): Check for nullptr before calling sub_ref_count() - ompt_task_create_callback(): Return early if correlation ID is null - ompt_task_schedule_callback(): Return early if correlation ID is null * [rocprofiler-sdk] Fix HSA API tracing segfault during finalization Add nullptr check in hsa_api_impl::functor after correlation ID construction. During finalization, correlation_service::construct() returns nullptr, and without this check the code would dereference the null pointer when accessing corr_id->internal. This fixes the SEGV at address 0x000000000008 (null + 8 byte offset) that occurs when HSA async event threads call hsa_signal_destroy during runtime shutdown after finalization has started. --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
715 строки
30 KiB
C++
715 строки
30 KiB
C++
// MIT License
|
|
//
|
|
/* Copyright (c) 2022-2025 Advanced Micro Devices, Inc.
|
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
of this software and associated documentation files (the "Software"), to deal
|
|
in the Software without restriction, including without limitation the rights
|
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
copies of the Software, and to permit persons to whom the Software is
|
|
furnished to do so, subject to the following conditions:
|
|
|
|
The above copyright notice and this permission notice shall be included in
|
|
all copies or substantial portions of the Software.
|
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
|
THE SOFTWARE. */
|
|
|
|
#include "lib/rocprofiler-sdk/hsa/queue.hpp"
|
|
#include "lib/common/scope_destructor.hpp"
|
|
#include "lib/common/utility.hpp"
|
|
#include "lib/rocprofiler-sdk/code_object/code_object.hpp"
|
|
#include "lib/rocprofiler-sdk/context/context.hpp"
|
|
#include "lib/rocprofiler-sdk/hsa/details/fmt.hpp"
|
|
#include "lib/rocprofiler-sdk/hsa/hsa.hpp"
|
|
#include "lib/rocprofiler-sdk/hsa/queue_controller.hpp"
|
|
#include "lib/rocprofiler-sdk/kernel_dispatch/profiling_time.hpp"
|
|
#include "lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp"
|
|
#include "lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp"
|
|
#include "lib/rocprofiler-sdk/pc_sampling/service.hpp"
|
|
#include "lib/rocprofiler-sdk/registration.hpp"
|
|
#include "lib/rocprofiler-sdk/tracing/tracing.hpp"
|
|
|
|
#include <rocprofiler-sdk/callback_tracing.h>
|
|
#include <rocprofiler-sdk/external_correlation.h>
|
|
#include <rocprofiler-sdk/fwd.h>
|
|
|
|
#include <fmt/format.h>
|
|
#include <fmt/ranges.h>
|
|
#include <hsa/hsa.h>
|
|
#include <hsa/hsa_ext_amd.h>
|
|
|
|
#include <atomic>
|
|
#include <memory>
|
|
|
|
// static assert for rocprofiler_packet ABI compatibility
|
|
static_assert(sizeof(hsa_ext_amd_aql_pm4_packet_t) == sizeof(hsa_kernel_dispatch_packet_t),
|
|
"unexpected ABI incompatibility");
|
|
static_assert(sizeof(hsa_ext_amd_aql_pm4_packet_t) == sizeof(hsa_barrier_and_packet_t),
|
|
"unexpected ABI incompatibility");
|
|
static_assert(sizeof(hsa_ext_amd_aql_pm4_packet_t) == sizeof(hsa_barrier_or_packet_t),
|
|
"unexpected ABI incompatibility");
|
|
static_assert(offsetof(hsa_ext_amd_aql_pm4_packet_t, completion_signal) ==
|
|
offsetof(hsa_kernel_dispatch_packet_t, completion_signal),
|
|
"unexpected ABI incompatibility");
|
|
static_assert(offsetof(hsa_ext_amd_aql_pm4_packet_t, completion_signal) ==
|
|
offsetof(hsa_barrier_and_packet_t, completion_signal),
|
|
"unexpected ABI incompatibility");
|
|
static_assert(offsetof(hsa_ext_amd_aql_pm4_packet_t, completion_signal) ==
|
|
offsetof(hsa_barrier_or_packet_t, completion_signal),
|
|
"unexpected ABI incompatibility");
|
|
|
|
namespace rocprofiler
|
|
{
|
|
namespace hsa
|
|
{
|
|
namespace
|
|
{
|
|
template <typename DomainT, typename... Args>
|
|
inline bool
|
|
context_filter(const context::context* ctx, DomainT domain, Args... args)
|
|
{
|
|
if constexpr(std::is_same<DomainT, rocprofiler_buffer_tracing_kind_t>::value)
|
|
{
|
|
return (ctx->buffered_tracer && ctx->buffered_tracer->domains(domain, args...));
|
|
}
|
|
else if constexpr(std::is_same<DomainT, rocprofiler_callback_tracing_kind_t>::value)
|
|
{
|
|
return (ctx->callback_tracer && ctx->callback_tracer->domains(domain, args...));
|
|
}
|
|
else
|
|
{
|
|
static_assert(common::mpl::assert_false<DomainT>::value, "unsupported domain type");
|
|
return false;
|
|
}
|
|
}
|
|
|
|
bool
|
|
context_filter(const context::context* ctx)
|
|
{
|
|
return (context_filter(ctx, ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH) ||
|
|
context_filter(ctx, ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH));
|
|
}
|
|
|
|
bool
|
|
AsyncSignalHandler(hsa_signal_value_t /*signal_v*/, void* data)
|
|
{
|
|
if(!data) return true;
|
|
|
|
// if we have fully finalized, delete the data and return
|
|
if(registration::get_fini_status() > 0)
|
|
{
|
|
auto* _session = static_cast<Queue::queue_info_session_t**>(data);
|
|
delete _session;
|
|
return false;
|
|
}
|
|
|
|
auto& shared_ptr_info = *static_cast<std::shared_ptr<Queue::queue_info_session_t>*>(data);
|
|
auto& queue_info_session = *shared_ptr_info;
|
|
|
|
auto dispatch_time = kernel_dispatch::get_dispatch_time(queue_info_session);
|
|
|
|
kernel_dispatch::dispatch_complete(queue_info_session, dispatch_time);
|
|
|
|
// Calls our internal callbacks to callers who need to be notified post
|
|
// kernel execution.
|
|
queue_info_session.queue.signal_callback([&](const auto& map) {
|
|
for(const auto& [client_id, cb_pair] : map)
|
|
{
|
|
cb_pair.second(queue_info_session.queue,
|
|
queue_info_session.kernel_pkt,
|
|
shared_ptr_info,
|
|
queue_info_session.inst_pkt,
|
|
dispatch_time);
|
|
}
|
|
});
|
|
|
|
if(queue_info_session.is_serialized)
|
|
{
|
|
CHECK_NOTNULL(hsa::get_queue_controller())
|
|
->serializer(&queue_info_session.queue)
|
|
.wlock([&](auto& serializer) {
|
|
serializer.kernel_completion_signal(queue_info_session.queue);
|
|
});
|
|
}
|
|
|
|
// Delete signals and packets, signal we have completed.
|
|
if(queue_info_session.interrupt_signal.handle != 0u)
|
|
{
|
|
#if !defined(NDEBUG)
|
|
CHECK_NOTNULL(hsa::get_queue_controller())->_debug_signals.wlock([&](auto& signals) {
|
|
signals.erase(queue_info_session.interrupt_signal.handle);
|
|
});
|
|
#endif
|
|
hsa::get_core_table()->hsa_signal_store_screlease_fn(queue_info_session.interrupt_signal,
|
|
-1);
|
|
hsa::get_core_table()->hsa_signal_destroy_fn(queue_info_session.interrupt_signal);
|
|
}
|
|
if(queue_info_session.kernel_pkt.ext_amd_aql_pm4.completion_signal.handle != 0u)
|
|
{
|
|
hsa::get_core_table()->hsa_signal_destroy_fn(
|
|
queue_info_session.kernel_pkt.ext_amd_aql_pm4.completion_signal);
|
|
}
|
|
|
|
// we need to decrement this reference count at the end of the functions
|
|
auto* _corr_id = queue_info_session.correlation_id;
|
|
if(_corr_id)
|
|
{
|
|
ROCP_FATAL_IF(_corr_id->get_ref_count() == 0)
|
|
<< "reference counter for correlation id " << _corr_id->internal << " from thread "
|
|
<< _corr_id->thread_idx << " has no reference count";
|
|
_corr_id->sub_kern_count();
|
|
_corr_id->sub_ref_count();
|
|
}
|
|
|
|
queue_info_session.queue.async_complete();
|
|
delete &shared_ptr_info;
|
|
|
|
return false;
|
|
}
|
|
|
|
template <typename Integral = uint64_t>
|
|
constexpr Integral
|
|
bit_mask(int first, int last)
|
|
{
|
|
assert(last >= first && "Error: hsa_support::bit_mask -> invalid argument");
|
|
size_t num_bits = last - first + 1;
|
|
return ((num_bits >= sizeof(Integral) * 8) ? ~Integral{0}
|
|
/* num_bits exceed the size of Integral */
|
|
: ((Integral{1} << num_bits) - 1))
|
|
<< first;
|
|
}
|
|
|
|
/* Extract bits [last:first] from t. */
|
|
template <typename Integral>
|
|
constexpr Integral
|
|
bit_extract(Integral x, int first, int last)
|
|
{
|
|
return (x >> first) & bit_mask<Integral>(0, last - first);
|
|
}
|
|
|
|
/**
|
|
* @brief This function is a queue write interceptor. It intercepts the
|
|
* packet write function. Creates an instance of packet class with the raw
|
|
* pointer. invoke the populate function of the packet class which returns a
|
|
* pointer to the packet. This packet is written into the queue by this
|
|
* interceptor by invoking the writer function.
|
|
*/
|
|
void
|
|
WriteInterceptor(const void* packets,
|
|
uint64_t pkt_count,
|
|
uint64_t,
|
|
void* data,
|
|
hsa_amd_queue_intercept_packet_writer writer)
|
|
{
|
|
if(registration::get_fini_status() > 0)
|
|
{
|
|
writer(packets, pkt_count);
|
|
return;
|
|
}
|
|
|
|
using callback_record_t = Queue::queue_info_session_t::callback_record_t;
|
|
|
|
// unique sequence id for the dispatch
|
|
static auto sequence_counter = std::atomic<rocprofiler_dispatch_id_t>{0};
|
|
|
|
auto&& CreateBarrierPacket = [](hsa_signal_t* dependency_signal,
|
|
hsa_signal_t* completion_signal,
|
|
std::vector<rocprofiler_packet>& _packets) {
|
|
hsa_barrier_and_packet_t barrier{};
|
|
barrier.header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
|
|
barrier.header |= 1 << HSA_PACKET_HEADER_BARRIER;
|
|
if(dependency_signal != nullptr) barrier.dep_signal[0] = *dependency_signal;
|
|
if(completion_signal != nullptr) barrier.completion_signal = *completion_signal;
|
|
_packets.emplace_back(barrier);
|
|
};
|
|
|
|
ROCP_FATAL_IF(data == nullptr) << "WriteInterceptor was not passed a pointer to the queue";
|
|
|
|
auto& queue = *static_cast<Queue*>(data);
|
|
|
|
// We have no packets or no one who needs to be notified, do nothing.
|
|
if(pkt_count == 0 ||
|
|
(queue.get_notifiers() == 0 && context::get_active_contexts(context_filter).empty()))
|
|
{
|
|
writer(packets, pkt_count);
|
|
return;
|
|
}
|
|
|
|
auto tracing_data_v = tracing::tracing_data{};
|
|
tracing::populate_contexts(ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
|
|
ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH,
|
|
tracing_data_v);
|
|
// these are for the services (dispatch counter collection, pc sampling, ATT) which use
|
|
// the queue/queue_controller callback mechanism
|
|
const auto queue_callback_context_filter = [](const context::context* ctx) {
|
|
return (ctx->counter_collection || ctx->pc_sampler || ctx->dispatch_thread_trace);
|
|
};
|
|
|
|
for(const auto* itr : context::get_active_contexts(queue_callback_context_filter))
|
|
tracing_data_v.external_correlation_ids.emplace(itr, tracing::empty_user_data);
|
|
|
|
const auto* packets_arr = static_cast<const rocprofiler_packet*>(packets);
|
|
auto transformed_packets = std::vector<rocprofiler_packet>{};
|
|
|
|
// Searching accross all the packets given during this write
|
|
for(size_t i = 0; i < pkt_count; ++i)
|
|
{
|
|
const auto& original_packet = packets_arr[i].kernel_dispatch;
|
|
auto packet_type = bit_extract(original_packet.header,
|
|
HSA_PACKET_HEADER_TYPE,
|
|
HSA_PACKET_HEADER_TYPE + HSA_PACKET_HEADER_WIDTH_TYPE - 1);
|
|
if(packet_type != HSA_PACKET_TYPE_KERNEL_DISPATCH)
|
|
{
|
|
transformed_packets.emplace_back(packets_arr[i]);
|
|
continue;
|
|
}
|
|
|
|
auto* corr_id = context::get_latest_correlation_id();
|
|
context::correlation_id* _corr_id_pop = nullptr;
|
|
|
|
if(!corr_id)
|
|
{
|
|
constexpr auto ref_count = 1;
|
|
corr_id = context::correlation_tracing_service::construct(ref_count);
|
|
_corr_id_pop = corr_id;
|
|
}
|
|
|
|
if(!corr_id)
|
|
{
|
|
// During finalization - just write packet through without tracing
|
|
transformed_packets.emplace_back(packets_arr[i]);
|
|
continue;
|
|
}
|
|
|
|
// increase the reference count to denote that this correlation id is being used in a kernel
|
|
corr_id->add_ref_count();
|
|
corr_id->add_kern_count();
|
|
|
|
auto thr_id = (corr_id) ? corr_id->thread_idx : common::get_tid();
|
|
auto user_data = rocprofiler_user_data_t{.value = 0};
|
|
auto internal_corr_id = (corr_id) ? corr_id->internal : 0;
|
|
auto ancestor_corr_id = (corr_id) ? corr_id->ancestor : 0;
|
|
|
|
// if we constructed a correlation id, this decrements the reference count after the
|
|
// underlying function returns
|
|
auto _corr_id_dtor = common::scope_destructor{[_corr_id_pop]() {
|
|
if(_corr_id_pop)
|
|
{
|
|
context::pop_latest_correlation_id(_corr_id_pop);
|
|
_corr_id_pop->sub_ref_count();
|
|
}
|
|
}};
|
|
|
|
tracing::populate_external_correlation_ids(
|
|
tracing_data_v.external_correlation_ids,
|
|
thr_id,
|
|
ROCPROFILER_EXTERNAL_CORRELATION_REQUEST_KERNEL_DISPATCH,
|
|
ROCPROFILER_KERNEL_DISPATCH_ENQUEUE,
|
|
internal_corr_id);
|
|
|
|
queue.async_started();
|
|
|
|
const auto original_completion_signal = original_packet.completion_signal;
|
|
const bool existing_completion_signal = (original_completion_signal.handle != 0);
|
|
const uint64_t kernel_id = code_object::get_kernel_id(original_packet.kernel_object);
|
|
|
|
// Copy kernel pkt, copy is to allow for signal to be modified
|
|
rocprofiler_packet kernel_pkt = packets_arr[i];
|
|
// create our own signal that we can get a callback on. if there is an original completion
|
|
// signal we will create a barrier packet, assign the original completion signal that that
|
|
// barrier packet, and add it right after the kernel packet
|
|
queue.create_signal(0, &kernel_pkt.kernel_dispatch.completion_signal);
|
|
|
|
// computes the "size" based on the offset of reserved_padding field
|
|
constexpr auto kernel_dispatch_info_rt_size =
|
|
common::compute_runtime_sizeof<rocprofiler_kernel_dispatch_info_t>();
|
|
|
|
static_assert(kernel_dispatch_info_rt_size < sizeof(rocprofiler_kernel_dispatch_info_t),
|
|
"failed to compute size field based on offset of reserved_padding field");
|
|
|
|
auto dispatch_id = ++sequence_counter;
|
|
auto callback_record = callback_record_t{
|
|
sizeof(callback_record_t),
|
|
rocprofiler_timestamp_t{0},
|
|
rocprofiler_timestamp_t{0},
|
|
rocprofiler_kernel_dispatch_info_t{
|
|
.size = kernel_dispatch_info_rt_size,
|
|
.agent_id = queue.get_agent().get_rocp_agent()->id,
|
|
.queue_id = queue.get_id(),
|
|
.kernel_id = kernel_id,
|
|
.dispatch_id = dispatch_id,
|
|
.private_segment_size = kernel_pkt.kernel_dispatch.private_segment_size,
|
|
.group_segment_size = kernel_pkt.kernel_dispatch.group_segment_size,
|
|
.workgroup_size = rocprofiler_dim3_t{kernel_pkt.kernel_dispatch.workgroup_size_x,
|
|
kernel_pkt.kernel_dispatch.workgroup_size_y,
|
|
kernel_pkt.kernel_dispatch.workgroup_size_z},
|
|
.grid_size = rocprofiler_dim3_t{kernel_pkt.kernel_dispatch.grid_size_x,
|
|
kernel_pkt.kernel_dispatch.grid_size_y,
|
|
kernel_pkt.kernel_dispatch.grid_size_z},
|
|
.reserved_padding = {0}}};
|
|
|
|
{
|
|
auto tracer_data = callback_record;
|
|
tracing::execute_phase_enter_callbacks(tracing_data_v.callback_contexts,
|
|
thr_id,
|
|
internal_corr_id,
|
|
tracing_data_v.external_correlation_ids,
|
|
ancestor_corr_id,
|
|
ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
|
|
ROCPROFILER_KERNEL_DISPATCH_ENQUEUE,
|
|
tracer_data);
|
|
}
|
|
|
|
// map all the external correlation ids (after enqueue enter phase) for all the contexts
|
|
// captured by the info session
|
|
tracing::update_external_correlation_ids(
|
|
tracing_data_v.external_correlation_ids,
|
|
thr_id,
|
|
ROCPROFILER_EXTERNAL_CORRELATION_REQUEST_KERNEL_DISPATCH);
|
|
|
|
// Stores the instrumentation pkt (i.e. AQL packets for counter collection)
|
|
// along with an ID of the client we got the packet from (this will be returned via
|
|
// completed_cb_t)
|
|
auto inst_pkt = inst_pkt_t{};
|
|
|
|
// True if any service (ATT,SPM,CC) requests this dispatch to be serialized
|
|
bool bRequest_Serialize = false;
|
|
|
|
// Signal callbacks that a kernel_pkt is being enqueued
|
|
queue.signal_callback([&](const auto& map) {
|
|
for(const auto& [client_id, cb_pair] : map)
|
|
{
|
|
auto [packet, bSerial] = cb_pair.first(queue,
|
|
kernel_pkt,
|
|
kernel_id,
|
|
dispatch_id,
|
|
&user_data,
|
|
tracing_data_v.external_correlation_ids,
|
|
corr_id);
|
|
bRequest_Serialize |= bSerial;
|
|
if(packet) inst_pkt.push_back(std::make_pair(std::move(packet), client_id));
|
|
}
|
|
});
|
|
|
|
bool inserted_before = false;
|
|
if(bRequest_Serialize)
|
|
{
|
|
inserted_before = true;
|
|
CHECK_NOTNULL(hsa::get_queue_controller())
|
|
->serializer(&queue)
|
|
.rlock([&](const auto& serializer) {
|
|
for(auto& s_pkt : serializer.kernel_dispatch(queue))
|
|
transformed_packets.emplace_back(s_pkt.ext_amd_aql_pm4);
|
|
});
|
|
}
|
|
for(const auto& pkt_injection : inst_pkt)
|
|
{
|
|
for(const auto& pkt : pkt_injection.first->before_krn_pkt)
|
|
{
|
|
inserted_before = true;
|
|
transformed_packets.emplace_back(pkt);
|
|
}
|
|
}
|
|
|
|
#if ROCPROFILER_SDK_HSA_PC_SAMPLING > 0
|
|
if(pc_sampling::is_pc_sample_service_configured(queue.get_agent().get_rocp_agent()->id))
|
|
{
|
|
transformed_packets.emplace_back(pc_sampling::hsa::generate_marker_packet_for_kernel(
|
|
corr_id, tracing_data_v.external_correlation_ids, dispatch_id));
|
|
}
|
|
#endif
|
|
|
|
// emplace the kernel packet
|
|
transformed_packets.emplace_back(kernel_pkt);
|
|
// If a profiling packet was inserted, wait for completion before executing the dispatch
|
|
if(inserted_before)
|
|
transformed_packets.back().kernel_dispatch.header |= 1 << HSA_PACKET_HEADER_BARRIER;
|
|
|
|
// if the original completion signal exists, trigger it via a barrier packet
|
|
if(existing_completion_signal)
|
|
{
|
|
auto barrier = hsa_barrier_and_packet_t{};
|
|
barrier.header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
|
|
barrier.header |= (1 << HSA_PACKET_HEADER_BARRIER);
|
|
barrier.completion_signal = original_completion_signal;
|
|
transformed_packets.emplace_back(barrier);
|
|
}
|
|
|
|
bool injected_end_pkt = false;
|
|
for(const auto& pkt_injection : inst_pkt)
|
|
{
|
|
for(const auto& pkt : pkt_injection.first->after_krn_pkt)
|
|
{
|
|
transformed_packets.emplace_back(pkt);
|
|
injected_end_pkt = true;
|
|
}
|
|
}
|
|
|
|
auto completion_signal = hsa_signal_t{.handle = 0};
|
|
auto interrupt_signal = hsa_signal_t{.handle = 0};
|
|
if(injected_end_pkt)
|
|
{
|
|
// Adding a barrier packet with the original packet's completion signal.
|
|
queue.create_signal(0, &interrupt_signal);
|
|
completion_signal = interrupt_signal;
|
|
transformed_packets.back().ext_amd_aql_pm4.completion_signal = interrupt_signal;
|
|
CreateBarrierPacket(&interrupt_signal, &interrupt_signal, transformed_packets);
|
|
}
|
|
else
|
|
{
|
|
completion_signal = kernel_pkt.kernel_dispatch.completion_signal;
|
|
get_core_table()->hsa_signal_store_screlease_fn(completion_signal, 0);
|
|
}
|
|
|
|
ROCP_FATAL_IF(packet_type != HSA_PACKET_TYPE_KERNEL_DISPATCH)
|
|
<< "get_kernel_id below might need to be updated";
|
|
|
|
// Enqueue the signal into the handler. Will call completed_cb when
|
|
// signal completes.
|
|
|
|
{
|
|
Queue::queue_info_session_t info_session{.queue = queue,
|
|
.inst_pkt = std::move(inst_pkt),
|
|
.interrupt_signal = interrupt_signal,
|
|
.tid = thr_id,
|
|
.enqueue_ts = common::timestamp_ns(),
|
|
.user_data = user_data,
|
|
.correlation_id = corr_id,
|
|
.kernel_pkt = kernel_pkt,
|
|
.callback_record = callback_record,
|
|
.tracing_data = tracing_data_v,
|
|
.is_serialized = bRequest_Serialize};
|
|
|
|
auto shared = std::make_shared<Queue::queue_info_session_t>(std::move(info_session));
|
|
|
|
queue.signal_async_handler(completion_signal,
|
|
new std::shared_ptr<Queue::queue_info_session_t>(shared));
|
|
|
|
auto tracer_data = callback_record;
|
|
tracing::execute_phase_exit_callbacks(tracing_data_v.callback_contexts,
|
|
tracing_data_v.external_correlation_ids,
|
|
ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
|
|
ROCPROFILER_KERNEL_DISPATCH_ENQUEUE,
|
|
tracer_data);
|
|
}
|
|
}
|
|
|
|
// Command is only executed if GLOG_v=2 or higher, otherwise it is a no-op
|
|
ROCP_TRACE << fmt::format(
|
|
"QueueID {}: {}", queue.get_id().handle, fmt::join(transformed_packets, fmt::format(" ")));
|
|
|
|
writer(transformed_packets.data(), transformed_packets.size());
|
|
}
|
|
} // namespace
|
|
|
|
Queue::Queue(const AgentCache& agent, CoreApiTable table)
|
|
: _core_api(table)
|
|
, _agent(agent)
|
|
{
|
|
_core_api.hsa_signal_create_fn(0, 0, nullptr, &_active_kernels);
|
|
}
|
|
|
|
Queue::Queue(const AgentCache& agent,
|
|
uint32_t size,
|
|
hsa_queue_type32_t type,
|
|
void (*callback)(hsa_status_t status, hsa_queue_t* source, void* data),
|
|
void* data,
|
|
uint32_t private_segment_size,
|
|
uint32_t group_segment_size,
|
|
CoreApiTable core_api,
|
|
AmdExtTable ext_api,
|
|
hsa_queue_t** queue)
|
|
: _core_api(core_api)
|
|
, _ext_api(ext_api)
|
|
, _agent(agent)
|
|
{
|
|
ROCP_HSA_TABLE_CALL(FATAL,
|
|
_ext_api.hsa_amd_queue_intercept_create_fn(_agent.get_hsa_agent(),
|
|
size,
|
|
type,
|
|
callback,
|
|
data,
|
|
private_segment_size,
|
|
group_segment_size,
|
|
&_intercept_queue))
|
|
<< "Could not create intercept queue";
|
|
|
|
ROCP_HSA_TABLE_CALL(FATAL,
|
|
_ext_api.hsa_amd_profiling_set_profiler_enabled_fn(_intercept_queue, true))
|
|
<< "Could not setup intercept profiler";
|
|
|
|
if(!context::get_registered_contexts([](const context::context* ctx) {
|
|
return (ctx->counter_collection || ctx->device_counter_collection ||
|
|
ctx->dispatch_thread_trace || ctx->device_thread_trace);
|
|
}).empty())
|
|
{
|
|
CHECK(_agent.cpu_pool().handle != 0);
|
|
CHECK(_agent.get_hsa_agent().handle != 0);
|
|
|
|
// Set state of the queue to allow profiling
|
|
aql::set_profiler_active_on_queue(
|
|
_agent.cpu_pool(), _agent.get_hsa_agent(), [&](hsa::rocprofiler_packet pkt) {
|
|
hsa_signal_t completion;
|
|
create_signal(0, &completion);
|
|
pkt.ext_amd_aql_pm4.completion_signal = completion;
|
|
counters::submitPacket(_intercept_queue, &pkt);
|
|
constexpr auto timeout_hint =
|
|
std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::seconds{1});
|
|
hsa_signal_value_t val;
|
|
for(int i = 0; i < 3; i++)
|
|
{
|
|
val = core_api.hsa_signal_wait_scacquire_fn(completion,
|
|
HSA_SIGNAL_CONDITION_EQ,
|
|
0,
|
|
timeout_hint.count(),
|
|
HSA_WAIT_STATE_ACTIVE);
|
|
if(val == 0)
|
|
{
|
|
core_api.hsa_signal_destroy_fn(completion);
|
|
return;
|
|
}
|
|
}
|
|
ROCP_FATAL << "Could not set agent to be profiled - Signal Value: " << val;
|
|
});
|
|
}
|
|
|
|
ROCP_HSA_TABLE_CALL(
|
|
FATAL,
|
|
_ext_api.hsa_amd_queue_intercept_register_fn(_intercept_queue, WriteInterceptor, this))
|
|
<< "Could not register interceptor";
|
|
|
|
create_signal(0, &ready_signal);
|
|
create_signal(0, &block_signal);
|
|
create_signal(0, &_active_kernels);
|
|
_core_api.hsa_signal_store_screlease_fn(ready_signal, 0);
|
|
_core_api.hsa_signal_store_screlease_fn(_active_kernels, 0);
|
|
*queue = _intercept_queue;
|
|
}
|
|
|
|
Queue::Queue(
|
|
const AgentCache& agent,
|
|
CoreApiTable core_api,
|
|
AmdExtTable ext_api,
|
|
hsa_queue_t* queue,
|
|
set_write_interceptor_t set_write_interceptor) // NOLINT(performance-unnecessary-value-param)
|
|
: _core_api(core_api)
|
|
, _ext_api(ext_api)
|
|
, _agent(agent)
|
|
, _intercept_queue(queue)
|
|
{
|
|
if(!context::get_registered_contexts([](const context::context* ctx) {
|
|
return (ctx->counter_collection || ctx->device_counter_collection ||
|
|
ctx->dispatch_thread_trace || ctx->device_thread_trace);
|
|
}).empty())
|
|
{
|
|
CHECK(_agent.cpu_pool().handle != 0);
|
|
CHECK(_agent.get_hsa_agent().handle != 0);
|
|
|
|
// Set state of the queue to allow profiling
|
|
aql::set_profiler_active_on_queue(
|
|
_agent.cpu_pool(), _agent.get_hsa_agent(), [&](hsa::rocprofiler_packet pkt) {
|
|
hsa_signal_t completion;
|
|
create_signal(0, &completion);
|
|
pkt.ext_amd_aql_pm4.completion_signal = completion;
|
|
counters::submitPacket(_intercept_queue, &pkt);
|
|
constexpr auto timeout_hint =
|
|
std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::seconds{1});
|
|
if(core_api.hsa_signal_wait_relaxed_fn(completion,
|
|
HSA_SIGNAL_CONDITION_EQ,
|
|
0,
|
|
timeout_hint.count(),
|
|
HSA_WAIT_STATE_ACTIVE) != 0)
|
|
{
|
|
ROCP_FATAL << "Could not set agent to be profiled";
|
|
}
|
|
core_api.hsa_signal_destroy_fn(completion);
|
|
});
|
|
}
|
|
|
|
set_write_interceptor(WriteInterceptor, this);
|
|
|
|
create_signal(0, &ready_signal);
|
|
create_signal(0, &block_signal);
|
|
create_signal(0, &_active_kernels);
|
|
_core_api.hsa_signal_store_screlease_fn(ready_signal, 0);
|
|
_core_api.hsa_signal_store_screlease_fn(_active_kernels, 0);
|
|
}
|
|
|
|
Queue::~Queue()
|
|
{
|
|
sync();
|
|
_core_api.hsa_signal_destroy_fn(_active_kernels);
|
|
}
|
|
|
|
void
|
|
Queue::signal_async_handler(const hsa_signal_t& signal, void* data) const
|
|
{
|
|
#if !defined(NDEBUG)
|
|
CHECK_NOTNULL(hsa::get_queue_controller())->_debug_signals.wlock([&](auto& signals) {
|
|
signals[signal.handle] = signal;
|
|
});
|
|
#endif
|
|
hsa_status_t status = _ext_api.hsa_amd_signal_async_handler_fn(
|
|
signal, HSA_SIGNAL_CONDITION_EQ, -1, AsyncSignalHandler, data);
|
|
ROCP_FATAL_IF(status != HSA_STATUS_SUCCESS && status != HSA_STATUS_INFO_BREAK)
|
|
<< "Error: hsa_amd_signal_async_handler failed with error code " << status
|
|
<< " :: " << hsa::get_hsa_status_string(status);
|
|
}
|
|
|
|
void
|
|
Queue::create_signal(uint32_t attribute, hsa_signal_t* signal) const
|
|
{
|
|
hsa_status_t status = _ext_api.hsa_amd_signal_create_fn(1, 0, nullptr, attribute, signal);
|
|
ROCP_FATAL_IF(status != HSA_STATUS_SUCCESS && status != HSA_STATUS_INFO_BREAK)
|
|
<< "Error: hsa_amd_signal_create failed with error code " << status
|
|
<< " :: " << hsa::get_hsa_status_string(status);
|
|
}
|
|
|
|
void
|
|
Queue::sync() const
|
|
{
|
|
if(_active_kernels.handle != 0u)
|
|
{
|
|
_core_api.hsa_signal_wait_relaxed_fn(
|
|
_active_kernels, HSA_SIGNAL_CONDITION_EQ, 0, UINT64_MAX, HSA_WAIT_STATE_ACTIVE);
|
|
}
|
|
}
|
|
|
|
void
|
|
Queue::register_callback(ClientID id, queue_cb_t enqueue_cb, completed_cb_t complete_cb)
|
|
{
|
|
_callbacks.wlock([&](auto& map) {
|
|
ROCP_FATAL_IF(rocprofiler::common::get_val(map, id)) << "ID already exists!";
|
|
_notifiers++;
|
|
map[id] = std::make_pair(enqueue_cb, complete_cb);
|
|
});
|
|
}
|
|
|
|
void
|
|
Queue::remove_callback(ClientID id)
|
|
{
|
|
_callbacks.wlock([&](auto& map) {
|
|
if(map.erase(id) == 1) _notifiers--;
|
|
});
|
|
}
|
|
|
|
queue_state
|
|
Queue::get_state() const
|
|
{
|
|
return _state;
|
|
}
|
|
|
|
void
|
|
Queue::set_state(queue_state state)
|
|
{
|
|
_state = state;
|
|
}
|
|
} // namespace hsa
|
|
} // namespace rocprofiler
|