Files
rocm-systems/projects/rocprofiler-sdk/tests/bin/module-loading-test/main.cpp
T
Benjamin Welton 1517a398bf [rocprofiler-sdk] Buffer finalization fixes and HSA ABI 0x09 support (#2318)
* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements

Buffer Pool Design
------------------
Replace the fixed array-based double buffer with a dynamic pool design to
fix race conditions that caused "internal correlation id was retired
prematurely" errors.

The original design had a race where flush callbacks could be delivered
out-of-order: when buffer 0 fills and begins flushing, writes go to
buffer 1. If buffer 1 fills before buffer 0's flush completes, the
buffer index wraps back to 0 (which may still be flushing). Independent
flush tasks submitted to the thread pool can complete out of order.

The new pool design:
- Uses a std::deque of buffer instances that grows as needed
- Allocates buffers from the pool when the current buffer needs to flush
- Serializes flushes with a mutex to ensure FIFO callback ordering
- Returns buffers to the pool after flush completion
- Eliminates the race between buffer selection and write operations

New Unit Tests
--------------
- buffer_correlation_ordering.cpp: Tests that API records are always
  delivered before their corresponding retirement records
- buffer_ordering_stress.cpp: Stress tests buffer flush ordering under
  high contention with multiple threads rapidly filling buffers

HSA Tool Hooks
--------------
Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that
waits for pending flush tasks before tool finalization, preventing
"retired prematurely" errors during HSA shutdown.

Sanitizer Improvements
----------------------
- LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder
- LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup)
- TSAN: Added suppression for false positive on C++11 thread-safe static
  initialization in create_write_functor
- ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto
- Disabled attachment tests for sanitizers due to library preloading issues

Other Fixes
-----------
- Thread-trace agent test: Use heap-allocated callback state
- Correlation ID: Refactored reference counting and finalization ordering

* [rocprofiler-sdk] Revert buffer pool design changes

Revert buffer.cpp and buffer.hpp to the original double-buffer
design from develop branch. The pool-based redesign introduced
concerns about:
- Signal safety (mutex vs atomic_flag)
- API changes (flush() return type)
- Complexity of the new design

This revert removes:
- Dynamic buffer pool with std::deque
- std::mutex/condition_variable synchronization
- buffer_correlation_ordering.cpp test
- buffer_ordering_stress.cpp test

The underlying buffer flush ordering issue will need to be
addressed with a different approach that preserves the original
API and synchronization characteristics.

* [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization

- Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks
- Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning
- Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp)
- Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior:
  - hsa/queue.cpp (lines 105, 210)
  - hsa/async_copy.cpp (line 344)
  - hsa/hsa_barrier.cpp (line 43)
  - buffer.cpp (lines 107, 138, 185)

This ensures no correlation IDs are created once finalization starts (fini_status != 0),
preventing races between finalization and ongoing tracing operations.

* [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation

Buffer records are not guaranteed to arrive in any specific order. Tests and
samples should use timestamps for temporal ordering validation instead.

Changes:
- samples/external_correlation_id_request: Replace 'retired prematurely' arrival
  order check with timestamp-based validation that retirement timestamp >=
  max(end_timestamps) for records with the same correlation ID
- tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check

Correlation IDs are not guaranteed to be monotonically increasing when records
are sorted by timestamp. Temporal ordering should be validated using the
timestamp fields in each record.

* [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal

Restore the SYSTEM keyword to target_include_directories for
rocprofiler-sdk-fmt to match develop branch.

* [rccl] Remove orphaned rocSHMEM gitlink

Remove orphaned submodule reference that was introduced during a merge
but never had a corresponding .gitmodules entry, causing CI failures
with "fatal: no submodule mapping found in .gitmodules".

* [rocprofiler-sdk] Add HSA ABI version 0x09 support

Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which
introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release
functions (added in rocr-runtime SWDEV-561708).

* [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations

This commit consolidates fixes for handling the finalization status during
buffer flush operations across the SDK.

Changes:
- Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully
  when flushing buffers, as this indicates buffers were already flushed
  during finalization (not an error condition)
- HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check
  for fini_status to allow operations during finalization process
- buffer.cpp: Revert fini_status checks to use > 0 for consistency
- correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging
  to prevent correlation ID creation after finalization starts

Files modified:
- source/lib/rocprofiler-sdk-tool/tool.cpp
- tests/tools/json-tool.cpp
- source/lib/rocprofiler-sdk/tests/registration.cpp
- source/lib/rocprofiler-sdk/tests/roctx.cpp
- samples/api_buffered_tracing/client.cpp
- samples/counter_collection/buffered_client.cpp
- samples/counter_collection/device_counting_async_client.cpp
- samples/external_correlation_id_request/client.cpp
- samples/pc_sampling/client.cpp
- source/lib/rocprofiler-sdk/buffer.cpp
- source/lib/rocprofiler-sdk/context/correlation_id.cpp
- source/lib/rocprofiler-sdk/hsa/queue.cpp
- source/lib/rocprofiler-sdk/hsa/async_copy.cpp
- source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp

* [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling

Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls
in samples and tools. The ERROR_FINALIZED handling was overly complex
and the hsa_tool_hooks OnUnload synchronization is no longer needed.

Changes:
- Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code
- Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL
- Simplify buffer flush in tool.cpp and json-tool.cpp
- Remove ERROR_FINALIZED special handling from test files

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Fix output_stream move semantics to null source pointers

The default move constructor and move assignment operator for
output_stream did not null out the source's pointers after the move.
This caused double-close when the moved-from temporary was destroyed,
leading to use-after-free crashes (SIGSEGV in std::ostream::sentry).

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration

- generatePerfetto.cpp: Move output_stream into shared_state to prevent
  use-after-free race conditions during Perfetto callback execution
- run-ci.py: Simplify and consolidate sanitizer environment variable
  configuration for better maintainability

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions

The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required
for CTest to properly pass suppression files to the sanitizers during
memcheck runs.

Co-Authored-By: Claude <noreply@anthropic.com>

* Revert "[rccl] Remove orphaned rocSHMEM gitlink"

This reverts commit 1ad21003941355658fff8114fa27768f11a948f7.

* [rocprofiler-sdk] Revert registration.cpp changes

Revert changes to registration.cpp to match develop branch.

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Remove suppression file content printing from run-ci.py

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix output_stream move ctor/assignment operator

* Fix erroneous revert of registration.cpp

* Fix handling of fini status in correlation ID construction

* [rocprofiler-sdk] Fix OMPT segfault during finalization

Add nullptr checks in OMPT tracing code to handle the case where
correlation_tracing_service::construct() returns nullptr during
finalization. This fixes segfaults in openmp-target-sample and
tests.integration.execute.openmp-tools.

The correlation ID construction now returns nullptr when fini_status > 0,
but the OMPT callbacks were not checking for this, causing crashes when
dereferencing the null pointer during OpenMP runtime shutdown.

Changes:
- event_common(): Return nullptr early if correlation ID is null
- event(): Check for nullptr before calling sub_ref_count()
- ompt_task_create_callback(): Return early if correlation ID is null
- ompt_task_schedule_callback(): Return early if correlation ID is null

* [rocprofiler-sdk] Fix HSA API tracing segfault during finalization

Add nullptr check in hsa_api_impl::functor after correlation ID
construction. During finalization, correlation_service::construct()
returns nullptr, and without this check the code would dereference
the null pointer when accessing corr_id->internal.

This fixes the SEGV at address 0x000000000008 (null + 8 byte offset)
that occurs when HSA async event threads call hsa_signal_destroy
during runtime shutdown after finalization has started.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2026-01-27 13:27:54 -05:00

254 строки
9.0 KiB
C++

// MIT License
//
// Copyright (c) 2023-2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
/**
* @file tests/bin/code-object-multi-threaded/main.cpp
*
* @brief Multi-threaded code object loading stress test for rocprofiler-sdk
*
* This test verifies thread-safety when multiple threads concurrently load
* HIP modules on different GPUs, which triggers concurrent calls to
* executable_freeze_internal and code object tracing callbacks.
*
* Data races tested:
* - user_data map access (now protected with Synchronized)
* - contexts vector assignment (now protected with wlock)
* - is_shutdown flag (now atomic)
* - end_notified/beg_notified flags (now atomic)
* - executable_destroy serialization (now protected with mutex)
*/
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <hip/hiprtc.h>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>
#define HIP_CHECK(expr) \
do \
{ \
hipError_t _err = (expr); \
if(_err != hipSuccess) \
{ \
std::cerr << "HIP error " << hipGetErrorString(_err) << " at " << __FILE__ << ":" \
<< __LINE__ << "\n"; \
std::abort(); \
} \
} while(0)
// Simple kernel code as string for runtime compilation/loading
static const char* kernel_code = R"(
extern "C" __global__ void dynamic_kernel_a(int* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n) data[idx] = idx * 2;
}
extern "C" __global__ void dynamic_kernel_b(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n) data[idx] = idx * 3.14f;
}
extern "C" __global__ void dynamic_kernel_c(int* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n) data[idx] = idx * idx;
}
)";
// Static kernels as fallback
__global__ void
test_kernel_a()
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx == 0) printf("Kernel A\n");
}
__global__ void
test_kernel_b(int* data)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = idx * 2;
}
__global__ void
test_kernel_c(float* data, int n)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < n) data[idx] = idx * 3.14f;
}
void
gpu_worker(int gpu_id, int iterations)
{
// Set device for this thread
HIP_CHECK(hipSetDevice(gpu_id));
const int n = 1024;
// Each thread loads modules in parallel - this triggers concurrent
// calls to executable_freeze_internal and code object callbacks
for(int i = 0; i < iterations; ++i)
{
// Use hiprtc to compile kernel source at runtime, then load the compiled module
// This triggers parallel module loading across threads
hiprtcProgram prog{};
hiprtcResult rtc_result =
hiprtcCreateProgram(&prog, kernel_code, "dynamic_kernels.cu", 0, nullptr, nullptr);
bool use_dynamic_loading = (rtc_result == HIPRTC_SUCCESS);
if(use_dynamic_loading)
{
// Compile the program
rtc_result = hiprtcCompileProgram(prog, 0, nullptr);
if(rtc_result != HIPRTC_SUCCESS)
{
size_t log_size{0};
hiprtcGetProgramLogSize(prog, &log_size);
if(log_size > 1)
{
auto log = std::vector<char>(log_size);
hiprtcGetProgramLog(prog, log.data());
std::cerr << "Compilation failed:\n" << log.data() << "\n";
}
use_dynamic_loading = false;
}
}
if(use_dynamic_loading)
{
// Get the compiled code
size_t code_size{0};
hiprtcGetCodeSize(prog, &code_size);
std::vector<char> code(code_size);
hiprtcGetCode(prog, code.data());
hiprtcDestroyProgram(&prog);
// Load the compiled module - THIS is what triggers the parallel code object loading
hipModule_t module;
hipError_t err = hipModuleLoadData(&module, code.data());
if(err == hipSuccess)
{
// Get functions from the module
hipFunction_t func_a, func_b, func_c;
HIP_CHECK(hipModuleGetFunction(&func_a, module, "dynamic_kernel_a"));
HIP_CHECK(hipModuleGetFunction(&func_b, module, "dynamic_kernel_b"));
HIP_CHECK(hipModuleGetFunction(&func_c, module, "dynamic_kernel_c"));
// Allocate device memory
int* d_int_data{nullptr};
float* d_float_data{nullptr};
HIP_CHECK(hipMalloc(&d_int_data, n * sizeof(int)));
HIP_CHECK(hipMalloc(&d_float_data, n * sizeof(float)));
// Launch kernels using module functions
int n_arg = n;
void* args_a[] = {&d_int_data, &n_arg};
HIP_CHECK(hipModuleLaunchKernel(
func_a, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args_a, nullptr));
void* args_b[] = {&d_float_data, &n_arg};
HIP_CHECK(hipModuleLaunchKernel(
func_b, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args_b, nullptr));
void* args_c[] = {&d_int_data, &n_arg};
HIP_CHECK(hipModuleLaunchKernel(
func_c, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args_c, nullptr));
}
else
{
use_dynamic_loading = false;
}
}
if(!use_dynamic_loading)
{
// Fallback to regular kernel launches if module loading fails
int* d_int_data{nullptr};
float* d_float_data{nullptr};
HIP_CHECK(hipMalloc(&d_int_data, n * sizeof(int)));
HIP_CHECK(hipMalloc(&d_float_data, n * sizeof(float)));
test_kernel_a<<<1, 64>>>();
HIP_CHECK(hipGetLastError());
test_kernel_b<<<(n + 255) / 256, 256>>>(d_int_data);
HIP_CHECK(hipGetLastError());
test_kernel_c<<<(n + 255) / 256, 256>>>(d_float_data, n);
HIP_CHECK(hipGetLastError());
}
}
}
int
main()
{
std::cout << "Multi-Threaded Code Object Loading Test\n";
int num_gpus = 0;
HIP_CHECK(hipGetDeviceCount(&num_gpus));
if(num_gpus == 0)
{
std::cerr << "No GPUs found. Test requires at least one GPU.\n";
return 1;
}
std::cout << "Found " << num_gpus << " GPU(s)\n";
// Cap at 64 threads to avoid overwhelming the system while still testing concurrency
int num_threads = std::min(std::thread::hardware_concurrency(), 64u);
int threads_per_gpu = num_threads / num_gpus;
std::cout << "Launching " << num_threads << " threads\n";
constexpr int iterations = 3;
// Create worker threads
auto threads = std::vector<std::thread>{};
threads.reserve(num_threads);
for(int gpu_id = 0; gpu_id < num_gpus; ++gpu_id)
{
for(int thread_id = 0; thread_id < threads_per_gpu; ++thread_id)
{
threads.emplace_back(gpu_worker, gpu_id, iterations);
}
}
// Wait for all threads to complete
for(auto& t : threads)
{
t.join();
}
std::cout << "Test completed successfully\n";
return 0;
}