Files
rocm-systems/source/lib/rocprofiler-sdk/hsa/queue.cpp
T
Ammar ELWazir 987ae3cc47 PC Sampling Support (#715)
* cmake formatting (cmake-format) (#188)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* source formatting (clang-format v11) (#189)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: design of the pc sampling data struct; guarding parts of code that uses ROCr marker packets

* source formatting (clang-format v11) (#191)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* cmake formatting (cmake-format) (#192)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: shadow variable fix

* pcs: fix for compiler errors reported by CI/CD

* source formatting (clang-format v11) (#193)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: docs fix; samples uses rocprofiler::rocprofiler library

* cmake formatting (cmake-format) (#195)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: client in samples folder fixed

* pcs: client requires rocprofiler package as dependency

* pcs: client uses single context

* source formatting (clang-format v11) (#196)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: client using single buffer; no buffer destroy in client

* pcs: client::setup explicitly called from the example

* pcs: rocprofiler_pc_sample_record_t updated

* pcs: fixed init of external correlation id

* source formatting (clang-format v11) (#198)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: remove outdated files; update CMakeLists

* cmake formatting (cmake-format) (#212)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: using rocprofiler_agent_id_t

* pcs: Removing trailing whitespaces

Co-authored-by: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>

* source formatting (clang-format v11) (#214)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: mapping agent_id to the agent

* source formatting (clang-format v11) (#215)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: const while iterating over agents

* source formatting (clang-format v11) (#216)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: calling get_buffer instead of get_buffers

* pcs: workgroup typo

* pcs: documentation for the public PC sampling API

* pcs: queue_cb_t signature adaptation

* pcs: mocks removed

* pcs: updating HsaApiTable with HSA/ROCr PC sampling API

* pcs: querying available PC sampling configs through IOCTL

* pcs: create the PCS session in IOCTL

* pcs: first actual PC samples delivered to the rocprofiler's client :)

* pcs: works with marker packet too

* pcs: using HSA table to call pc sampling related functions

* pcs: using ioctl instead of kfd in naming

* pcs: configuration service test fixed

* pcs: sample processing test fixed

* pcs: marker packet macro wrapper removed

* pcs: marker packet is part of the rocprofiler_packet union

* pcs: one fixme added

* pcs: client that uses pc-sampling and code obj tracing

* pcs: client that supprts PC sampling and code obj tracing refactored

* pcs: show more info for each PC sample

* pcs: hex output for the samples that do not belong to the matmul kernel

* pcs: querying avail configuration happens immediately before configuring

* pcs: hsa_ven_amd_pcs_create_from_id renamed

* pcs: using hsa_stop; accessing a buffer by id from parser

* pcs: includes reworked, tests returned to life

* pcs: rocrofiler dir removed as outdated

* cmake formatting (cmake-format) (#271)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* source formatting (clang-format v11) (#272)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: some warnings fixed

* source formatting (clang-format v11) (#273)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* cmake formatting (cmake-format) (#274)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: show MI200 relevant information in the sample

* pcs: queue cb fixed; rocr.h include fixed

* source formatting (clang-format v11) (#296)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: getting hsa_agent and the doorbell_id from hsa_queue

* source formatting (clang-format v11) (#297)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: correlation ID logic fixed

* source formatting (clang-format v11) (#303)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: pure pc sampling example fixed

* source formatting (clang-format v11) (#307)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* cmake formatting (cmake-format) (#308)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: interval value if the PC sampling is already configured

* pcs: ROCPROFILER_STATUS_ERROR_PC_SAMPLING_ALREADY_CONFIGURED

New status code if another process configured PC sampling service with different configuration.
Samples are extended to consider this case and retry if it happens.

* pcs: hsa_amd_queue_get_info mocked in tests

* source formatting (clang-format v11) (#328)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs (tests): query configs after configuring service

* source formatting (clang-format v11) (#329)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: sample checks workgroup_id_* and wave_id

* source formatting (clang-format v11) (#330)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs samples: running samples on the device 0

* pcs: kfd_ioctl updated

* pcs: ioctl config struct changed fields names

* pcs: status when PC sampling is configured by another process is renamed

* pcs: HSA PC sampling API table fixed

* pcs: tmp hack to be able to use HSA pc sampling table

* source formatting (clang-format v11) (#443)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs service use CIDs generated by HIP API tracing service

* source formatting (clang-format v11) (#455)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* cmake formatting (cmake-format) (#456)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: CID manager

* pcs: explicit flush with no delivered data executes retirement logic

* source formatting (clang-format v11) (#464)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: rocprofiler_query_pc_sampling_agent_configurations docs update

* source formatting (clang-format v11) (#465)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: rocprofiler_configure_pc_sampling_service docs update

* pcs: explicit sync introduced in PCSCIDManager

* pcs: new logic for retiring CIDs in PC sampling service documented

* pcs: queue interception cb signature updated

* source formatting (clang-format v11) (#471)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: if no agents supports PC sampling, fail gracefully

* elaborating when KFD returns EBUSY and EEXIST

* pcs: the second PC sampling examples fails gracefully

* code samples use only single kernel for now

* pcs: CID manager refactored

* source formatting (clang-format v11) (#481)

Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>

* pcs: ioctl update

* source formatting (clang-format v11) (#531)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs:code sample to test PC sampling applied on concurrent kernels

* source formatting (clang-format v11) (#533)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: pc sampling strest test included

* cmake formatting (cmake-format) (#539)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* source formatting (clang-format v11) (#540)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: standalone benchmark

* cmake formatting (cmake-format) (#555)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: glance in external correlation IDs

* source formatting (clang-format v11) (#557)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* another change in ioctl interface

* pcs: update queue interceptor callbacks and samples accroding to the agent 0 version

* source formatting (clang-format v11) (#611)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: avoid running problematic PC sampling test

* pcs: guarding tests not to fail on architectures not supporting PC sampling

* source formatting (clang-format v11) (#617)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: check IOCTL version prior to each KFD call

* pcs: ioctl refactoring

* pcs: PC sampling service increases the ref_count of the correlation ID of the kernel dispatch

* cmake formatting (cmake-format) (#631)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* source formatting (clang-format v11) (#632)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: PC sampling service provides external correlation IDs

* source formatting (clang-format v11) (#644)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: use rocprofiler_dim3_t for workgrou_ip

* source formatting (clang-format v11) (#645)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: minor fixes

* pcs: updating the documentation for the pc sampling API functions

* pcs: api table and queue controller fix

* pcs: don't generate marker packets for the agent if PC sampling is not configured on it

* pcs: multi-GPU and single-GPU clients

* source formatting (clang-format v11) (#700)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: warning and errors fixed

* source formatting (clang-format v11) (#702)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: clang compiler errors and warnings fixed

* source formatting (clang-format v11) (#716)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: const reference in cid manager

* source formatting (clang-format v11) (#717)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: const & func in manager explicit

* pcs: test to cover creating PC sampling service of agent that does not exist

* pcs: generate marker packets if service is active

* source formatting (clang-format v11) (#719)

Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com>

* pcs: refactoring hsa_adapter; use the correlation_id->thread_idx

* Update source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp

* Update source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp

* Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp

* Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp

* Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp

* Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp

* Update source/lib/rocprofiler-sdk/pc_sampling/utils.cpp

* Update utils.cpp

* moving pc-sampling tests and samples to pc-sampling label

* Format fix

* pcs: use configured instead of active service

* Update source/lib/rocprofiler-sdk/pc_sampling/service.cpp

* pcs: ensure configuring PC sampling on the HSA level is called only once

* pcs: minor fix

* Update CMakeLists.txt

* Update CMakeLists.txt

* Update CMakeLists.txt

* Update CMakeLists.txt

* pcs: refactoring IOCTL integration

* Update source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt

Co-authored-by: Ammar ELWazir <ammar.elwazir@amd.com>

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* pcs: reverting back what bot doubled

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* pcs: retesting the bot

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* pcs: why bot fails on this IOCTL status

* pcs: why failing on <vector>

* Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* pcs: returning commits removed by bot

* pcs: formatting locally

* pcs: clients are flushing buffers inside the tool_fini

* pcs: sync function in public API

* pcs: sync prior to unloading the code object

* pcs: sync function requires context

* pcs: client uses CID retirement service

* pcs: test for flusing internal ROCr buffers

* pcs: source formatting

* Update source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* pcs: code samples refactoring

* pcs: public API header refactored

* pcs: rocprofiler_buffer_flush drains internal PC sampling buffers too

* pcs: remove unnecessary functions

* pcs: do not call hsa's copytables

* pcs: include reordering

* pcs: using ROCP_ERROR inside PC sampling implementation

* pcs: pc_sampling sample uses ostream instean of printfs

* pcs: pc_sampling_codeobj tracing using ostream instead of prints

* pcs: registering once for interceptor callbacks

* pcs: do not generate internal CIDs if not in debug mode

* pcs: rebasing fixed; missing external correlation IDs

* pcs: code formatting

* enable kernel tracing service to receive external correlation IDs

* pcs: using ROCPROFILER_STATUS_ERROR_INCOMPATIBLE_KERNEL

* pcs: polishing parser

* formatting

* updating parser to use workgroup_id

* kfd_ioctl.h extracted in details folder

* refactoring

* pcs: preparing to generate code object information

* flush internal buffers prior to unloading code object

* pcs: generating marker records

* pcs: wrap code_object's shutdown function

* ROCR_VISIBLE_DEVICES and HIP_VISISBLE_DEVICES unsupported at the moment

* documenting the ignorance of ROCR/HIP_VISIBLE_DEVICES

* pcs: separate structs for code object loading/unloading markers

* pcs: inst_pkt_t changed the namespace

* pcs: removing wrapper around the shutdown function

* pcs: size in record field

* pcs: documentation refactoring + typdefs

* renaming PCSAgentConfig to PCSAgentSession

* pcs: service does not keep a pointer to the context

* pcs: static assertions related to the versioning

* pcs: rocprofiler_pc_sampling_configuration_t size field

* pcs: report API unimplemented unleass explicitly enabled

* pcs: skip tests if KFD does not support PC sampling

* pcs: if ROCr hides some devices, no PC samples will be delivered for it

* pcs: hip error check after kernel launch

* formatting

* removing PCS info from agent.h

* fix based on review

* Update continuous integration workflow

- use mi200 runner for code coverage (supports PC sampling)
- split sanitizer jobs across navi3, vega20, and mi300

* Updating pc sampling test labels

* ROCP_PC_SAMPLING_ENABLED env in CI

* ROCP_PC_SAMPLING_ENABLED for all CI mi200 jobs

* Rearrange sanitizer assignments

* fixes according to review

* removed unused functions

* pcs: rocprofiler_agent_id_t instead of handle as a key in map

* Update source/lib/rocprofiler-sdk/context/context.hpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* removing drm_fd from the agent.h

* pcs: removing one sample due to complexity

* pcs: refactoring sample

* simplifying sample

* new lines

* Improve queue_control enable intercepter logic

* Update lib/rocprofiler-sdk/hsa/types.hpp

- handle amd_ext size for HSA 1.12.0

* ROCP_PC_SAMPLING_ENABLED -> ROCPROFILER_PC_SAMPLING_BETA_ENABLED

* Update hsa_adapter.cpp

- anonymous namespace + remove debug

* parser update

* Apply suggestions from code review

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: vlaindic <vlaindic@users.noreply.github.com>
Co-authored-by: vlaindic <vladimir.indic@amd.com>
Co-authored-by: vlaindic <vlaindic@amd.com>
Co-authored-by: Vladimir Indic <139573562+vlaindic@users.noreply.github.com>
Co-authored-by: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Co-authored-by: gobhardw <gopesh.bhardwaj@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2024-05-24 09:49:44 -05:00

575 строки
23 KiB
C++

// MIT License
//
/* Copyright (c) 2022 Advanced Micro Devices, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE. */
#include "lib/rocprofiler-sdk/hsa/queue.hpp"
#include "lib/common/scope_destructor.hpp"
#include "lib/common/utility.hpp"
#include "lib/rocprofiler-sdk/code_object/code_object.hpp"
#include "lib/rocprofiler-sdk/context/context.hpp"
#include "lib/rocprofiler-sdk/hsa/details/fmt.hpp"
#include "lib/rocprofiler-sdk/hsa/hsa.hpp"
#include "lib/rocprofiler-sdk/hsa/queue_controller.hpp"
#include "lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp"
#include "lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp"
#include "lib/rocprofiler-sdk/pc_sampling/service.hpp"
#include "lib/rocprofiler-sdk/registration.hpp"
#include "lib/rocprofiler-sdk/tracing/tracing.hpp"
#include <rocprofiler-sdk/callback_tracing.h>
#include <rocprofiler-sdk/external_correlation.h>
#include <rocprofiler-sdk/fwd.h>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <atomic>
// static assert for rocprofiler_packet ABI compatibility
static_assert(sizeof(hsa_ext_amd_aql_pm4_packet_t) == sizeof(hsa_kernel_dispatch_packet_t),
"unexpected ABI incompatibility");
static_assert(sizeof(hsa_ext_amd_aql_pm4_packet_t) == sizeof(hsa_barrier_and_packet_t),
"unexpected ABI incompatibility");
static_assert(sizeof(hsa_ext_amd_aql_pm4_packet_t) == sizeof(hsa_barrier_or_packet_t),
"unexpected ABI incompatibility");
static_assert(offsetof(hsa_ext_amd_aql_pm4_packet_t, completion_signal) ==
offsetof(hsa_kernel_dispatch_packet_t, completion_signal),
"unexpected ABI incompatibility");
static_assert(offsetof(hsa_ext_amd_aql_pm4_packet_t, completion_signal) ==
offsetof(hsa_barrier_and_packet_t, completion_signal),
"unexpected ABI incompatibility");
static_assert(offsetof(hsa_ext_amd_aql_pm4_packet_t, completion_signal) ==
offsetof(hsa_barrier_or_packet_t, completion_signal),
"unexpected ABI incompatibility");
namespace rocprofiler
{
namespace hsa
{
namespace
{
template <typename DomainT, typename... Args>
inline bool
context_filter(const context::context* ctx, DomainT domain, Args... args)
{
if constexpr(std::is_same<DomainT, rocprofiler_buffer_tracing_kind_t>::value)
{
return (ctx->buffered_tracer && ctx->buffered_tracer->domains(domain, args...));
}
else if constexpr(std::is_same<DomainT, rocprofiler_callback_tracing_kind_t>::value)
{
return (ctx->callback_tracer && ctx->callback_tracer->domains(domain, args...));
}
else
{
static_assert(common::mpl::assert_false<DomainT>::value, "unsupported domain type");
return false;
}
}
bool
context_filter(const context::context* ctx)
{
return (context_filter(ctx, ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH) ||
context_filter(ctx, ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH));
}
bool
AsyncSignalHandler(hsa_signal_value_t /*signal_v*/, void* data)
{
if(!data) return true;
// if we have fully finalized, delete the data and return
if(registration::get_fini_status() > 0)
{
auto* _session = static_cast<Queue::queue_info_session_t**>(data);
delete _session;
return false;
}
auto& queue_info_session = *static_cast<Queue::queue_info_session_t*>(data);
kernel_dispatch::dispatch_complete(queue_info_session);
// Calls our internal callbacks to callers who need to be notified post
// kernel execution.
queue_info_session.queue.signal_callback([&](const auto& map) {
for(const auto& [client_id, cb_pair] : map)
{
cb_pair.second(queue_info_session.queue,
queue_info_session.kernel_pkt,
queue_info_session,
queue_info_session.inst_pkt);
}
});
// Delete signals and packets, signal we have completed.
if(queue_info_session.interrupt_signal.handle != 0u)
{
#if !defined(NDEBUG)
CHECK_NOTNULL(hsa::get_queue_controller())->_debug_signals.wlock([&](auto& signals) {
signals.erase(queue_info_session.interrupt_signal.handle);
});
#endif
hsa::get_core_table()->hsa_signal_store_screlease_fn(queue_info_session.interrupt_signal,
-1);
hsa::get_core_table()->hsa_signal_destroy_fn(queue_info_session.interrupt_signal);
}
if(queue_info_session.kernel_pkt.ext_amd_aql_pm4.completion_signal.handle != 0u)
{
hsa::get_core_table()->hsa_signal_destroy_fn(
queue_info_session.kernel_pkt.ext_amd_aql_pm4.completion_signal);
}
// we need to decrement this reference count at the end of the functions
auto* _corr_id = queue_info_session.correlation_id;
if(_corr_id)
{
ROCP_FATAL_IF(_corr_id->get_ref_count() == 0)
<< "reference counter for correlation id " << _corr_id->internal << " from thread "
<< _corr_id->thread_idx << " has no reference count";
_corr_id->sub_kern_count();
_corr_id->sub_ref_count();
}
queue_info_session.queue.async_complete();
delete static_cast<Queue::queue_info_session_t*>(data);
return false;
}
template <typename Integral = uint64_t>
constexpr Integral
bit_mask(int first, int last)
{
assert(last >= first && "Error: hsa_support::bit_mask -> invalid argument");
size_t num_bits = last - first + 1;
return ((num_bits >= sizeof(Integral) * 8) ? ~Integral{0}
/* num_bits exceed the size of Integral */
: ((Integral{1} << num_bits) - 1))
<< first;
}
/* Extract bits [last:first] from t. */
template <typename Integral>
constexpr Integral
bit_extract(Integral x, int first, int last)
{
return (x >> first) & bit_mask<Integral>(0, last - first);
}
/**
* @brief This function is a queue write interceptor. It intercepts the
* packet write function. Creates an instance of packet class with the raw
* pointer. invoke the populate function of the packet class which returns a
* pointer to the packet. This packet is written into the queue by this
* interceptor by invoking the writer function.
*/
void
WriteInterceptor(const void* packets,
uint64_t pkt_count,
uint64_t,
void* data,
hsa_amd_queue_intercept_packet_writer writer)
{
if(registration::get_fini_status() > 0)
{
writer(packets, pkt_count);
return;
}
using callback_record_t = Queue::queue_info_session_t::callback_record_t;
// unique sequence id for the dispatch
static auto sequence_counter = std::atomic<rocprofiler_dispatch_id_t>{0};
auto&& CreateBarrierPacket = [](hsa_signal_t* dependency_signal,
hsa_signal_t* completion_signal,
std::vector<rocprofiler_packet>& _packets) {
hsa_barrier_and_packet_t barrier{};
barrier.header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
barrier.header |= 1 << HSA_PACKET_HEADER_BARRIER;
if(dependency_signal != nullptr) barrier.dep_signal[0] = *dependency_signal;
if(completion_signal != nullptr) barrier.completion_signal = *completion_signal;
_packets.emplace_back(barrier);
};
ROCP_FATAL_IF(data == nullptr) << "WriteInterceptor was not passed a pointer to the queue";
auto& queue = *static_cast<Queue*>(data);
// We have no packets or no one who needs to be notified, do nothing.
if(pkt_count == 0 ||
(queue.get_notifiers() == 0 && context::get_active_contexts(context_filter).empty()))
{
writer(packets, pkt_count);
return;
}
auto tracing_data_v = tracing::tracing_data{};
tracing::populate_contexts(ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH,
tracing_data_v);
const auto* packets_arr = static_cast<const rocprofiler_packet*>(packets);
auto transformed_packets = std::vector<rocprofiler_packet>{};
// Searching accross all the packets given during this write
for(size_t i = 0; i < pkt_count; ++i)
{
const auto& original_packet = packets_arr[i].kernel_dispatch;
auto packet_type = bit_extract(original_packet.header,
HSA_PACKET_HEADER_TYPE,
HSA_PACKET_HEADER_TYPE + HSA_PACKET_HEADER_WIDTH_TYPE - 1);
if(packet_type != HSA_PACKET_TYPE_KERNEL_DISPATCH)
{
transformed_packets.emplace_back(packets_arr[i]);
continue;
}
auto* corr_id = context::get_latest_correlation_id();
context::correlation_id* _corr_id_pop = nullptr;
if(!corr_id)
{
constexpr auto ref_count = 1;
corr_id = context::correlation_tracing_service::construct(ref_count);
_corr_id_pop = corr_id;
}
// increase the reference count to denote that this correlation id is being used in a kernel
corr_id->add_ref_count();
corr_id->add_kern_count();
auto thr_id = (corr_id) ? corr_id->thread_idx : common::get_tid();
auto user_data = rocprofiler_user_data_t{.value = 0};
auto internal_corr_id = (corr_id) ? corr_id->internal : 0;
// if we constructed a correlation id, this decrements the reference count after the
// underlying function returns
auto _corr_id_dtor = common::scope_destructor{[_corr_id_pop]() {
if(_corr_id_pop)
{
context::pop_latest_correlation_id(_corr_id_pop);
_corr_id_pop->sub_ref_count();
}
}};
tracing::populate_external_correlation_ids(
tracing_data_v.external_correlation_ids,
thr_id,
ROCPROFILER_EXTERNAL_CORRELATION_REQUEST_KERNEL_DISPATCH,
ROCPROFILER_KERNEL_DISPATCH_ENQUEUE,
internal_corr_id);
queue.async_started();
// Copy kernel pkt, copy is to allow for signal to be modified
rocprofiler_packet kernel_pkt = packets_arr[i];
uint64_t kernel_id = code_object::get_kernel_id(kernel_pkt.kernel_dispatch.kernel_object);
queue.create_signal(HSA_AMD_SIGNAL_AMD_GPU_ONLY,
&kernel_pkt.ext_amd_aql_pm4.completion_signal);
// computes the "size" based on the offset of reserved_padding field
constexpr auto kernel_dispatch_info_rt_size =
common::compute_runtime_sizeof<rocprofiler_kernel_dispatch_info_t>();
static_assert(kernel_dispatch_info_rt_size < sizeof(rocprofiler_kernel_dispatch_info_t),
"failed to compute size field based on offset of reserved_padding field");
auto dispatch_id = ++sequence_counter;
auto callback_record = callback_record_t{
sizeof(callback_record_t),
rocprofiler_timestamp_t{0},
rocprofiler_timestamp_t{0},
rocprofiler_kernel_dispatch_info_t{
.size = kernel_dispatch_info_rt_size,
.agent_id = queue.get_agent().get_rocp_agent()->id,
.queue_id = queue.get_id(),
.kernel_id = kernel_id,
.dispatch_id = dispatch_id,
.private_segment_size = kernel_pkt.kernel_dispatch.private_segment_size,
.group_segment_size = kernel_pkt.kernel_dispatch.group_segment_size,
.workgroup_size = rocprofiler_dim3_t{kernel_pkt.kernel_dispatch.workgroup_size_x,
kernel_pkt.kernel_dispatch.workgroup_size_y,
kernel_pkt.kernel_dispatch.workgroup_size_z},
.grid_size = rocprofiler_dim3_t{kernel_pkt.kernel_dispatch.grid_size_x,
kernel_pkt.kernel_dispatch.grid_size_y,
kernel_pkt.kernel_dispatch.grid_size_z},
.reserved_padding = {0}}};
{
auto tracer_data = callback_record;
tracing::execute_phase_enter_callbacks(tracing_data_v.callback_contexts,
thr_id,
internal_corr_id,
tracing_data_v.external_correlation_ids,
ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
ROCPROFILER_KERNEL_DISPATCH_ENQUEUE,
tracer_data);
}
// map all the external correlation ids (after enqueue enter phase) for all the contexts
// captured by the info session
tracing::update_external_correlation_ids(
tracing_data_v.external_correlation_ids,
thr_id,
ROCPROFILER_EXTERNAL_CORRELATION_REQUEST_KERNEL_DISPATCH);
// Stores the instrumentation pkt (i.e. AQL packets for counter collection)
// along with an ID of the client we got the packet from (this will be returned via
// completed_cb_t)
auto inst_pkt = inst_pkt_t{};
// Signal callbacks that a kernel_pkt is being enqueued
queue.signal_callback([&](const auto& map) {
for(const auto& [client_id, cb_pair] : map)
{
if(auto maybe_pkt = cb_pair.first(queue,
kernel_pkt,
kernel_id,
dispatch_id,
&user_data,
tracing_data_v.external_correlation_ids,
corr_id))
{
inst_pkt.push_back(std::make_pair(std::move(maybe_pkt), client_id));
}
}
});
bool inserted_before = false;
for(const auto& pkt_injection : inst_pkt)
{
for(const auto& pkt : pkt_injection.first->before_krn_pkt)
{
inserted_before = true;
transformed_packets.emplace_back(pkt);
}
}
// Barrier packet is last packet inserted into queue
if(inserted_before)
{
CreateBarrierPacket(nullptr, nullptr, transformed_packets);
}
if(pc_sampling::is_pc_sample_service_configured(queue.get_agent().get_rocp_agent()->id))
{
transformed_packets.emplace_back(pc_sampling::hsa::generate_marker_packet_for_kernel(
corr_id, tracing_data_v.external_correlation_ids));
}
transformed_packets.emplace_back(kernel_pkt);
// Make a copy of the original packet, adding its signal to a barrier
// packet and create a new signal for it to get timestamps
if(original_packet.completion_signal.handle != 0u)
{
hsa_barrier_and_packet_t barrier{};
barrier.header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
barrier.header |= 1 << HSA_PACKET_HEADER_BARRIER;
barrier.completion_signal = original_packet.completion_signal;
transformed_packets.emplace_back(barrier);
}
hsa_signal_t interrupt_signal{};
// Adding a barrier packet with the original packet's completion signal.
queue.create_signal(0, &interrupt_signal);
bool injected_end_pkt = false;
for(const auto& pkt_injection : inst_pkt)
{
for(const auto& pkt : pkt_injection.first->after_krn_pkt)
{
transformed_packets.emplace_back(pkt);
injected_end_pkt = true;
}
}
if(injected_end_pkt)
{
transformed_packets.back().ext_amd_aql_pm4.completion_signal = interrupt_signal;
CreateBarrierPacket(&interrupt_signal, &interrupt_signal, transformed_packets);
}
else
{
get_core_table()->hsa_signal_store_screlease_fn(interrupt_signal, 0);
hsa_barrier_and_packet_t barrier{};
barrier.header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
barrier.completion_signal = interrupt_signal;
transformed_packets.emplace_back(barrier);
}
ROCP_FATAL_IF(packet_type != HSA_PACKET_TYPE_KERNEL_DISPATCH)
<< "get_kernel_id below might need to be updated";
// Enqueue the signal into the handler. Will call completed_cb when
// signal completes.
queue.signal_async_handler(
interrupt_signal,
new Queue::queue_info_session_t{.queue = queue,
.inst_pkt = std::move(inst_pkt),
.interrupt_signal = interrupt_signal,
.tid = thr_id,
.enqueue_ts = common::timestamp_ns(),
.user_data = user_data,
.correlation_id = corr_id,
.kernel_pkt = kernel_pkt,
.callback_record = callback_record,
.tracing_data = tracing_data_v});
{
auto tracer_data = callback_record;
tracing::execute_phase_exit_callbacks(tracing_data_v.callback_contexts,
tracing_data_v.external_correlation_ids,
ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH,
ROCPROFILER_KERNEL_DISPATCH_ENQUEUE,
tracer_data);
}
}
// Command is only executed if GLOG_v=2 or higher, otherwise it is a no-op
ROCP_TRACE << fmt::format(
"QueueID {}: {}", queue.get_id().handle, fmt::join(transformed_packets, fmt::format(" ")));
writer(transformed_packets.data(), transformed_packets.size());
}
} // namespace
Queue::Queue(const AgentCache& agent, CoreApiTable table)
: _core_api(table)
, _agent(agent)
{
_core_api.hsa_signal_create_fn(0, 0, nullptr, &_active_kernels);
}
Queue::Queue(const AgentCache& agent,
uint32_t size,
hsa_queue_type32_t type,
void (*callback)(hsa_status_t status, hsa_queue_t* source, void* data),
void* data,
uint32_t private_segment_size,
uint32_t group_segment_size,
CoreApiTable core_api,
AmdExtTable ext_api,
hsa_queue_t** queue)
: _core_api(core_api)
, _ext_api(ext_api)
, _agent(agent)
{
LOG_IF(FATAL,
_ext_api.hsa_amd_queue_intercept_create_fn(_agent.get_hsa_agent(),
size,
type,
callback,
data,
private_segment_size,
group_segment_size,
&_intercept_queue) != HSA_STATUS_SUCCESS)
<< "Could not create intercept queue";
LOG_IF(FATAL,
_ext_api.hsa_amd_profiling_set_profiler_enabled_fn(_intercept_queue, true) !=
HSA_STATUS_SUCCESS)
<< "Could not setup intercept profiler";
LOG_IF(FATAL,
_ext_api.hsa_amd_queue_intercept_register_fn(_intercept_queue, WriteInterceptor, this))
<< "Could not register interceptor";
create_signal(0, &ready_signal);
create_signal(0, &block_signal);
create_signal(0, &_active_kernels);
_core_api.hsa_signal_store_screlease_fn(ready_signal, 0);
_core_api.hsa_signal_store_screlease_fn(_active_kernels, 0);
*queue = _intercept_queue;
}
Queue::~Queue()
{
sync();
_core_api.hsa_signal_destroy_fn(_active_kernels);
}
void
Queue::signal_async_handler(const hsa_signal_t& signal, Queue::queue_info_session_t* data) const
{
#if !defined(NDEBUG)
CHECK_NOTNULL(hsa::get_queue_controller())->_debug_signals.wlock([&](auto& signals) {
signals[signal.handle] = signal;
});
#endif
hsa_status_t status = _ext_api.hsa_amd_signal_async_handler_fn(
signal, HSA_SIGNAL_CONDITION_EQ, -1, AsyncSignalHandler, static_cast<void*>(data));
ROCP_FATAL_IF(status != HSA_STATUS_SUCCESS && status != HSA_STATUS_INFO_BREAK)
<< "Error: hsa_amd_signal_async_handler failed";
}
void
Queue::create_signal(uint32_t attribute, hsa_signal_t* signal) const
{
hsa_status_t status = _ext_api.hsa_amd_signal_create_fn(1, 0, nullptr, attribute, signal);
ROCP_FATAL_IF(status != HSA_STATUS_SUCCESS && status != HSA_STATUS_INFO_BREAK)
<< "Error: hsa_amd_signal_create failed";
}
void
Queue::sync() const
{
if(_active_kernels.handle != 0u)
{
_core_api.hsa_signal_wait_relaxed_fn(
_active_kernels, HSA_SIGNAL_CONDITION_EQ, 0, -1, HSA_WAIT_STATE_ACTIVE);
}
}
void
Queue::register_callback(ClientID id, queue_cb_t enqueue_cb, completed_cb_t complete_cb)
{
_callbacks.wlock([&](auto& map) {
ROCP_FATAL_IF(rocprofiler::common::get_val(map, id)) << "ID already exists!";
_notifiers++;
map[id] = std::make_pair(enqueue_cb, complete_cb);
});
}
void
Queue::remove_callback(ClientID id)
{
_callbacks.wlock([&](auto& map) {
if(map.erase(id) == 1) _notifiers--;
});
}
queue_state
Queue::get_state() const
{
return _state;
}
void
Queue::set_state(queue_state state)
{
_state = state;
}
} // namespace hsa
} // namespace rocprofiler