1517a398bf
* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements Buffer Pool Design ------------------ Replace the fixed array-based double buffer with a dynamic pool design to fix race conditions that caused "internal correlation id was retired prematurely" errors. The original design had a race where flush callbacks could be delivered out-of-order: when buffer 0 fills and begins flushing, writes go to buffer 1. If buffer 1 fills before buffer 0's flush completes, the buffer index wraps back to 0 (which may still be flushing). Independent flush tasks submitted to the thread pool can complete out of order. The new pool design: - Uses a std::deque of buffer instances that grows as needed - Allocates buffers from the pool when the current buffer needs to flush - Serializes flushes with a mutex to ensure FIFO callback ordering - Returns buffers to the pool after flush completion - Eliminates the race between buffer selection and write operations New Unit Tests -------------- - buffer_correlation_ordering.cpp: Tests that API records are always delivered before their corresponding retirement records - buffer_ordering_stress.cpp: Stress tests buffer flush ordering under high contention with multiple threads rapidly filling buffers HSA Tool Hooks -------------- Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that waits for pending flush tasks before tool finalization, preventing "retired prematurely" errors during HSA shutdown. Sanitizer Improvements ---------------------- - LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder - LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup) - TSAN: Added suppression for false positive on C++11 thread-safe static initialization in create_write_functor - ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto - Disabled attachment tests for sanitizers due to library preloading issues Other Fixes ----------- - Thread-trace agent test: Use heap-allocated callback state - Correlation ID: Refactored reference counting and finalization ordering * [rocprofiler-sdk] Revert buffer pool design changes Revert buffer.cpp and buffer.hpp to the original double-buffer design from develop branch. The pool-based redesign introduced concerns about: - Signal safety (mutex vs atomic_flag) - API changes (flush() return type) - Complexity of the new design This revert removes: - Dynamic buffer pool with std::deque - std::mutex/condition_variable synchronization - buffer_correlation_ordering.cpp test - buffer_ordering_stress.cpp test The underlying buffer flush ordering issue will need to be addressed with a different approach that preserves the original API and synchronization characteristics. * [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization - Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks - Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning - Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp) - Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior: - hsa/queue.cpp (lines 105, 210) - hsa/async_copy.cpp (line 344) - hsa/hsa_barrier.cpp (line 43) - buffer.cpp (lines 107, 138, 185) This ensures no correlation IDs are created once finalization starts (fini_status != 0), preventing races between finalization and ongoing tracing operations. * [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation Buffer records are not guaranteed to arrive in any specific order. Tests and samples should use timestamps for temporal ordering validation instead. Changes: - samples/external_correlation_id_request: Replace 'retired prematurely' arrival order check with timestamp-based validation that retirement timestamp >= max(end_timestamps) for records with the same correlation ID - tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check - tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check - tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check Correlation IDs are not guaranteed to be monotonically increasing when records are sorted by timestamp. Temporal ordering should be validated using the timestamp fields in each record. * [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal Restore the SYSTEM keyword to target_include_directories for rocprofiler-sdk-fmt to match develop branch. * [rccl] Remove orphaned rocSHMEM gitlink Remove orphaned submodule reference that was introduced during a merge but never had a corresponding .gitmodules entry, causing CI failures with "fatal: no submodule mapping found in .gitmodules". * [rocprofiler-sdk] Add HSA ABI version 0x09 support Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release functions (added in rocr-runtime SWDEV-561708). * [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations This commit consolidates fixes for handling the finalization status during buffer flush operations across the SDK. Changes: - Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully when flushing buffers, as this indicates buffers were already flushed during finalization (not an error condition) - HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check for fini_status to allow operations during finalization process - buffer.cpp: Revert fini_status checks to use > 0 for consistency - correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging to prevent correlation ID creation after finalization starts Files modified: - source/lib/rocprofiler-sdk-tool/tool.cpp - tests/tools/json-tool.cpp - source/lib/rocprofiler-sdk/tests/registration.cpp - source/lib/rocprofiler-sdk/tests/roctx.cpp - samples/api_buffered_tracing/client.cpp - samples/counter_collection/buffered_client.cpp - samples/counter_collection/device_counting_async_client.cpp - samples/external_correlation_id_request/client.cpp - samples/pc_sampling/client.cpp - source/lib/rocprofiler-sdk/buffer.cpp - source/lib/rocprofiler-sdk/context/correlation_id.cpp - source/lib/rocprofiler-sdk/hsa/queue.cpp - source/lib/rocprofiler-sdk/hsa/async_copy.cpp - source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp * [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls in samples and tools. The ERROR_FINALIZED handling was overly complex and the hsa_tool_hooks OnUnload synchronization is no longer needed. Changes: - Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code - Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL - Simplify buffer flush in tool.cpp and json-tool.cpp - Remove ERROR_FINALIZED special handling from test files Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Fix output_stream move semantics to null source pointers The default move constructor and move assignment operator for output_stream did not null out the source's pointers after the move. This caused double-close when the moved-from temporary was destroyed, leading to use-after-free crashes (SIGSEGV in std::ostream::sentry). Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration - generatePerfetto.cpp: Move output_stream into shared_state to prevent use-after-free race conditions during Perfetto callback execution - run-ci.py: Simplify and consolidate sanitizer environment variable configuration for better maintainability Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required for CTest to properly pass suppression files to the sanitizers during memcheck runs. Co-Authored-By: Claude <noreply@anthropic.com> * Revert "[rccl] Remove orphaned rocSHMEM gitlink" This reverts commit 1ad21003941355658fff8114fa27768f11a948f7. * [rocprofiler-sdk] Revert registration.cpp changes Revert changes to registration.cpp to match develop branch. Co-Authored-By: Claude <noreply@anthropic.com> * [rocprofiler-sdk] Remove suppression file content printing from run-ci.py Co-Authored-By: Claude <noreply@anthropic.com> * Fix output_stream move ctor/assignment operator * Fix erroneous revert of registration.cpp * Fix handling of fini status in correlation ID construction * [rocprofiler-sdk] Fix OMPT segfault during finalization Add nullptr checks in OMPT tracing code to handle the case where correlation_tracing_service::construct() returns nullptr during finalization. This fixes segfaults in openmp-target-sample and tests.integration.execute.openmp-tools. The correlation ID construction now returns nullptr when fini_status > 0, but the OMPT callbacks were not checking for this, causing crashes when dereferencing the null pointer during OpenMP runtime shutdown. Changes: - event_common(): Return nullptr early if correlation ID is null - event(): Check for nullptr before calling sub_ref_count() - ompt_task_create_callback(): Return early if correlation ID is null - ompt_task_schedule_callback(): Return early if correlation ID is null * [rocprofiler-sdk] Fix HSA API tracing segfault during finalization Add nullptr check in hsa_api_impl::functor after correlation ID construction. During finalization, correlation_service::construct() returns nullptr, and without this check the code would dereference the null pointer when accessing corr_id->internal. This fixes the SEGV at address 0x000000000008 (null + 8 byte offset) that occurs when HSA async event threads call hsa_signal_destroy during runtime shutdown after finalization has started. --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
254 líneas
9.0 KiB
C++
254 líneas
9.0 KiB
C++
// MIT License
|
|
//
|
|
// Copyright (c) 2023-2025 Advanced Micro Devices, Inc. All rights reserved.
|
|
//
|
|
// Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
// of this software and associated documentation files (the "Software"), to deal
|
|
// in the Software without restriction, including without limitation the rights
|
|
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
// copies of the Software, and to permit persons to whom the Software is
|
|
// furnished to do so, subject to the following conditions:
|
|
//
|
|
// The above copyright notice and this permission notice shall be included in all
|
|
// copies or substantial portions of the Software.
|
|
//
|
|
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
// SOFTWARE.
|
|
|
|
/**
|
|
* @file tests/bin/code-object-multi-threaded/main.cpp
|
|
*
|
|
* @brief Multi-threaded code object loading stress test for rocprofiler-sdk
|
|
*
|
|
* This test verifies thread-safety when multiple threads concurrently load
|
|
* HIP modules on different GPUs, which triggers concurrent calls to
|
|
* executable_freeze_internal and code object tracing callbacks.
|
|
*
|
|
* Data races tested:
|
|
* - user_data map access (now protected with Synchronized)
|
|
* - contexts vector assignment (now protected with wlock)
|
|
* - is_shutdown flag (now atomic)
|
|
* - end_notified/beg_notified flags (now atomic)
|
|
* - executable_destroy serialization (now protected with mutex)
|
|
*/
|
|
|
|
#include <hip/hip_runtime.h>
|
|
#include <hip/hip_runtime_api.h>
|
|
#include <hip/hiprtc.h>
|
|
|
|
#include <chrono>
|
|
#include <cstdlib>
|
|
#include <cstring>
|
|
#include <iostream>
|
|
#include <thread>
|
|
#include <vector>
|
|
|
|
#define HIP_CHECK(expr) \
|
|
do \
|
|
{ \
|
|
hipError_t _err = (expr); \
|
|
if(_err != hipSuccess) \
|
|
{ \
|
|
std::cerr << "HIP error " << hipGetErrorString(_err) << " at " << __FILE__ << ":" \
|
|
<< __LINE__ << "\n"; \
|
|
std::abort(); \
|
|
} \
|
|
} while(0)
|
|
|
|
// Simple kernel code as string for runtime compilation/loading
|
|
static const char* kernel_code = R"(
|
|
extern "C" __global__ void dynamic_kernel_a(int* data, int n) {
|
|
int idx = blockIdx.x * blockDim.x + threadIdx.x;
|
|
if(idx < n) data[idx] = idx * 2;
|
|
}
|
|
|
|
extern "C" __global__ void dynamic_kernel_b(float* data, int n) {
|
|
int idx = blockIdx.x * blockDim.x + threadIdx.x;
|
|
if(idx < n) data[idx] = idx * 3.14f;
|
|
}
|
|
|
|
extern "C" __global__ void dynamic_kernel_c(int* data, int n) {
|
|
int idx = blockIdx.x * blockDim.x + threadIdx.x;
|
|
if(idx < n) data[idx] = idx * idx;
|
|
}
|
|
)";
|
|
|
|
// Static kernels as fallback
|
|
__global__ void
|
|
test_kernel_a()
|
|
{
|
|
int idx = blockIdx.x * blockDim.x + threadIdx.x;
|
|
if(idx == 0) printf("Kernel A\n");
|
|
}
|
|
|
|
__global__ void
|
|
test_kernel_b(int* data)
|
|
{
|
|
int idx = blockIdx.x * blockDim.x + threadIdx.x;
|
|
data[idx] = idx * 2;
|
|
}
|
|
|
|
__global__ void
|
|
test_kernel_c(float* data, int n)
|
|
{
|
|
int idx = blockIdx.x * blockDim.x + threadIdx.x;
|
|
if(idx < n) data[idx] = idx * 3.14f;
|
|
}
|
|
|
|
void
|
|
gpu_worker(int gpu_id, int iterations)
|
|
{
|
|
// Set device for this thread
|
|
HIP_CHECK(hipSetDevice(gpu_id));
|
|
|
|
const int n = 1024;
|
|
|
|
// Each thread loads modules in parallel - this triggers concurrent
|
|
// calls to executable_freeze_internal and code object callbacks
|
|
for(int i = 0; i < iterations; ++i)
|
|
{
|
|
// Use hiprtc to compile kernel source at runtime, then load the compiled module
|
|
// This triggers parallel module loading across threads
|
|
hiprtcProgram prog{};
|
|
hiprtcResult rtc_result =
|
|
hiprtcCreateProgram(&prog, kernel_code, "dynamic_kernels.cu", 0, nullptr, nullptr);
|
|
|
|
bool use_dynamic_loading = (rtc_result == HIPRTC_SUCCESS);
|
|
|
|
if(use_dynamic_loading)
|
|
{
|
|
// Compile the program
|
|
rtc_result = hiprtcCompileProgram(prog, 0, nullptr);
|
|
|
|
if(rtc_result != HIPRTC_SUCCESS)
|
|
{
|
|
size_t log_size{0};
|
|
hiprtcGetProgramLogSize(prog, &log_size);
|
|
if(log_size > 1)
|
|
{
|
|
auto log = std::vector<char>(log_size);
|
|
hiprtcGetProgramLog(prog, log.data());
|
|
std::cerr << "Compilation failed:\n" << log.data() << "\n";
|
|
}
|
|
use_dynamic_loading = false;
|
|
}
|
|
}
|
|
|
|
if(use_dynamic_loading)
|
|
{
|
|
// Get the compiled code
|
|
size_t code_size{0};
|
|
hiprtcGetCodeSize(prog, &code_size);
|
|
|
|
std::vector<char> code(code_size);
|
|
hiprtcGetCode(prog, code.data());
|
|
hiprtcDestroyProgram(&prog);
|
|
|
|
// Load the compiled module - THIS is what triggers the parallel code object loading
|
|
hipModule_t module;
|
|
hipError_t err = hipModuleLoadData(&module, code.data());
|
|
|
|
if(err == hipSuccess)
|
|
{
|
|
// Get functions from the module
|
|
hipFunction_t func_a, func_b, func_c;
|
|
HIP_CHECK(hipModuleGetFunction(&func_a, module, "dynamic_kernel_a"));
|
|
HIP_CHECK(hipModuleGetFunction(&func_b, module, "dynamic_kernel_b"));
|
|
HIP_CHECK(hipModuleGetFunction(&func_c, module, "dynamic_kernel_c"));
|
|
|
|
// Allocate device memory
|
|
int* d_int_data{nullptr};
|
|
float* d_float_data{nullptr};
|
|
HIP_CHECK(hipMalloc(&d_int_data, n * sizeof(int)));
|
|
HIP_CHECK(hipMalloc(&d_float_data, n * sizeof(float)));
|
|
|
|
// Launch kernels using module functions
|
|
int n_arg = n;
|
|
void* args_a[] = {&d_int_data, &n_arg};
|
|
HIP_CHECK(hipModuleLaunchKernel(
|
|
func_a, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args_a, nullptr));
|
|
|
|
void* args_b[] = {&d_float_data, &n_arg};
|
|
HIP_CHECK(hipModuleLaunchKernel(
|
|
func_b, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args_b, nullptr));
|
|
|
|
void* args_c[] = {&d_int_data, &n_arg};
|
|
HIP_CHECK(hipModuleLaunchKernel(
|
|
func_c, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args_c, nullptr));
|
|
}
|
|
else
|
|
{
|
|
use_dynamic_loading = false;
|
|
}
|
|
}
|
|
|
|
if(!use_dynamic_loading)
|
|
{
|
|
// Fallback to regular kernel launches if module loading fails
|
|
int* d_int_data{nullptr};
|
|
float* d_float_data{nullptr};
|
|
HIP_CHECK(hipMalloc(&d_int_data, n * sizeof(int)));
|
|
HIP_CHECK(hipMalloc(&d_float_data, n * sizeof(float)));
|
|
|
|
test_kernel_a<<<1, 64>>>();
|
|
HIP_CHECK(hipGetLastError());
|
|
|
|
test_kernel_b<<<(n + 255) / 256, 256>>>(d_int_data);
|
|
HIP_CHECK(hipGetLastError());
|
|
|
|
test_kernel_c<<<(n + 255) / 256, 256>>>(d_float_data, n);
|
|
HIP_CHECK(hipGetLastError());
|
|
}
|
|
}
|
|
}
|
|
|
|
int
|
|
main()
|
|
{
|
|
std::cout << "Multi-Threaded Code Object Loading Test\n";
|
|
|
|
int num_gpus = 0;
|
|
HIP_CHECK(hipGetDeviceCount(&num_gpus));
|
|
|
|
if(num_gpus == 0)
|
|
{
|
|
std::cerr << "No GPUs found. Test requires at least one GPU.\n";
|
|
return 1;
|
|
}
|
|
|
|
std::cout << "Found " << num_gpus << " GPU(s)\n";
|
|
|
|
// Cap at 64 threads to avoid overwhelming the system while still testing concurrency
|
|
int num_threads = std::min(std::thread::hardware_concurrency(), 64u);
|
|
int threads_per_gpu = num_threads / num_gpus;
|
|
std::cout << "Launching " << num_threads << " threads\n";
|
|
|
|
constexpr int iterations = 3;
|
|
|
|
// Create worker threads
|
|
auto threads = std::vector<std::thread>{};
|
|
threads.reserve(num_threads);
|
|
|
|
for(int gpu_id = 0; gpu_id < num_gpus; ++gpu_id)
|
|
{
|
|
for(int thread_id = 0; thread_id < threads_per_gpu; ++thread_id)
|
|
{
|
|
threads.emplace_back(gpu_worker, gpu_id, iterations);
|
|
}
|
|
}
|
|
|
|
// Wait for all threads to complete
|
|
for(auto& t : threads)
|
|
{
|
|
t.join();
|
|
}
|
|
|
|
std::cout << "Test completed successfully\n";
|
|
return 0;
|
|
}
|