[rocprof-sys] Fix segfault from thread ID array overflow (#2172)

**Thread limit configuration and enforcement:**

* Added a check in `CMakeLists.txt` to ensure `ROCPROFSYS_MAX_THREADS` is at least 128, automatically setting it to 128 with a warning if a lower value is provided.
* Replaced hardcoded thread limit (`allowed_max_threads`) in `pthread_create_gotcha.cpp` with the configurable `ROCPROFSYS_MAX_THREADS` value, ensuring all runtime checks and warnings use the actual configured limit.

**Documentation improvements:**

* Updated the development guide to explain the new thread limit behavior, including how exceeding the limit is handled gracefully, how to configure it, and the build-time validation rules.

**Test updates:**

* Modified thread limit tests to use the configurable `ROCPROFSYS_MAX_THREADS` value instead of a hardcoded limit and expanded the range of tested thread values.
* Increased test timeouts to accommodate larger thread counts and ensure reliability with higher limits.
anujshuk-amd
2026-01-08 00:33:37 +05:30
Committed by GitHub
Parent 050e88ee71
Commit 596ffce5fe
5 changed files with 60 additions and 25 deletions
@@ -17,6 +17,11 @@ Full documentation for ROCm Systems Profiler is available at [https://rocm.docs.
- By default, tracing uses deferred trace generation (cached mode) for improved performance and minimal runtime overhead.
- `--trace` / `-T` CLI flag enables tracing with cached mode by default.
- `--trace-legacy` / `-L` CLI flag enables legacy direct mode for tracing.
+- Changed thread storage allocation from a hard-coded 4096-element array to a compile-time computed size derived from the ROCPROFSYS_MAX_THREADS configuration flag.
+### Resolved issues
+- Fixed application termination with segfault when thread creation surpasses ROCPROFSYS_MAX_THREADS configuration.
### Removed
@@ -275,9 +275,13 @@ else()
math(EXPR ROCPROFSYS_THREAD_COUNT "16 * ${ROCPROFSYS_PROCESSOR_COUNT}")
compute_pow2_ceil(ROCPROFSYS_THREAD_COUNT "16 * ${ROCPROFSYS_PROCESSOR_COUNT}")
-# set the default to 2048 if it could not be calculated
+# Fatal error if pow2 calculation failed (e.g., Python3 not found)
if(ROCPROFSYS_THREAD_COUNT LESS 2)
-set(ROCPROFSYS_THREAD_COUNT 2048)
+rocprofiler_systems_message(
+FATAL_ERROR
+"Failed to compute power of 2 ceiling for ROCPROFSYS_THREAD_COUNT. "
+"Ensure dependency is available. Processor count: ${ROCPROFSYS_PROCESSOR_COUNT}"
+)
endif()
endif()
@@ -327,14 +327,46 @@ Thread-data class
Currently, most thread data is effectively stored in a static
``std::array<std::unique_ptr<T>, ROCPROFSYS_MAX_THREADS>`` instance.
-``ROCPROFSYS_MAX_THREADS`` is a value defined a compile-time and set to ``2048``
-for release builds. During finalization,
+``ROCPROFSYS_MAX_THREADS`` is a value defined at compile-time for release builds. During finalization,
ROCm Systems Profiler iterates through the thread-data and transforms that data
into something that can be passed along to Perfetto and/or Timemory.
-The downside of the current model is that if the user exceeds ``ROCPROFSYS_MAX_THREADS``,
-a segmentation fault occurs. To fix this issue,
-a new model is being adopted which has all the benefits of this model
-but permits dynamic expansion.
+In the current model, if the user exceeds ``ROCPROFSYS_MAX_THREADS`` at runtime,
+thread creation fails gracefully with a warning message, excess threads operate with thread-local fallback,
+and profiling is skipped and not persisted to output files for threads beyond ``ROCPROFSYS_MAX_THREADS``.
+To support truly dynamic thread limits without compile-time constraints, a new model is being adopted which
+has all the benefits of static allocation but permits dynamic expansion beyond ``ROCPROFSYS_MAX_THREADS``.
+Currently, the thread limit can be increased at compile-time using the ``ROCPROFSYS_MAX_THREADS`` CMake configuration option.
+Configuring thread limits
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ROCm Systems Profiler uses a single CMake configuration option to control thread-related memory allocation:
+* ``ROCPROFSYS_MAX_THREADS``: Maximum number of threads supported (default if not explicitly set: ``128`` if nproc < 8, otherwise ``pow2_ceil(16 * nproc)``; must be a power of 2)
+This setting controls:
+* Thread ID manager capacity (maximum thread IDs that can be tracked)
+* Storage array sizes for thread-local data across the codebase
+* Timemory's internal thread storage (``TIMEMORY_MAX_THREADS``)
+**Build-time validation:**
+CMake enforces that ``ROCPROFSYS_MAX_THREADS`` must be a power of 2:
+.. code-block:: cmake
+# Valid: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, ... any power of 2
+# Invalid: 100, 3000, 5000, 10000, ... (FATAL_ERROR)
+**Example: Building with custom thread limit**
+.. code-block:: shell
+# Build with support for 8192 threads
+cmake -B build \
+-DROCPROFSYS_MAX_THREADS=8192 \
+..
+cmake --build build
Sampling model
========================================
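As a rough illustration of the default and validation rules described in the documentation hunk above, the following C++ sketch recomputes `pow2_ceil(16 * nproc)` and checks the power-of-two constraint. The helper names (`is_pow2`, `pow2_ceil`, `default_max_threads`, `check_thread_limit_rules`) are hypothetical and exist only for this example; the actual build performs this logic in CMake, and the sketch does not model the 128 minimum mentioned in the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration; none of these names exist in rocprof-sys.

// True when n is a nonzero power of two (exactly one bit set).
constexpr bool is_pow2(std::size_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Smallest power of two that is >= n (returns 1 for n == 0).
// Assumes n is small enough that doubling p never overflows.
constexpr std::size_t pow2_ceil(std::size_t n)
{
    std::size_t p = 1;
    while(p < n) p <<= 1;
    return p;
}

// Default described in the docs: 128 when nproc < 8, otherwise pow2_ceil(16 * nproc).
constexpr std::size_t default_max_threads(std::size_t nproc)
{
    return (nproc < 8) ? 128 : pow2_ceil(16 * nproc);
}

// Spot checks of the rules above; meant to be called from a test or from main().
inline void check_thread_limit_rules()
{
    assert(default_max_threads(4) == 128);   // small hosts fall back to 128
    assert(default_max_threads(12) == 256);  // 16 * 12 = 192 rounds up to 256
    assert(is_pow2(8192));                   // accepted by the power-of-two rule
    assert(!is_pow2(3000));                  // would be rejected at configure time
}
```

For instance, a 12-core host gives 16 * 12 = 192, which rounds up to 256, while a 64-core host gives exactly 1024.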
@@ -64,9 +64,6 @@ namespace component
{
using bundle_t = tim::lightweight_tuple<comp::wall_clock>;
using category_region_t = tim::lightweight_tuple<category_region<category::pthread>>;
-// The maximum limit for the number of threads is set at 4096. declared and stored in the
-// set_storage struct's `types.hpp` file.
-constexpr size_t allowed_max_threads = 4096;
namespace
{
@@ -187,7 +184,7 @@ pthread_create_gotcha::wrapper::operator()() const
const auto& _parent_info = thread_info::get(m_config.parent_tid, InternalTID);
const auto& _info = thread_info::init(m_config.offset);
auto _sequent_value = _info->index_data ? _info->index_data->sequent_value : -1;
-if(static_cast<size_t>(_sequent_value) >= allowed_max_threads)
+if(static_cast<size_t>(_sequent_value) >= ROCPROFSYS_MAX_THREADS)
{
static std::once_flag thread_limit_warning_flag;
std::call_once(thread_limit_warning_flag, []() {
@@ -196,7 +193,7 @@ pthread_create_gotcha::wrapper::operator()() const
"[rocprof-sys][WARNING] Maximum allowed thread limit (%zu) "
"reached. Further thread creation and profiling will be "
"disabled to prevent resource exhaustion.\n",
-allowed_max_threads);
+static_cast<size_t>(ROCPROFSYS_MAX_THREADS));
});
return m_routine(m_arg);
}
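The guard in the hunk above can be studied in isolation with a minimal sketch of a fixed-size, index-checked thread table. Everything here is an assumption made for illustration: `kMaxThreads` stands in for `ROCPROFSYS_MAX_THREADS`, `thread_record`, `storage`, `try_register`, and `demo_guard` are hypothetical names, and an atomic flag is used for the one-time warning instead of the `std::call_once` seen in the diff, so the example needs no threading library at link time.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <memory>

// Stand-in for the compile-time limit; the real value comes from the build configuration.
constexpr std::size_t kMaxThreads = 128;

// Hypothetical per-thread record, standing in for the profiler's thread data.
struct thread_record
{
    long sequent_value = -1;
};

// Fixed-size storage indexed by the thread's sequential ID.
inline std::array<std::unique_ptr<thread_record>, kMaxThreads>& storage()
{
    static std::array<std::unique_ptr<thread_record>, kMaxThreads> data{};
    return data;
}

// Returns the slot for `sequent_value`, or nullptr when the index is outside the array.
// A negative value converts to a very large size_t, so the same comparison rejects it.
inline thread_record* try_register(long sequent_value)
{
    if(static_cast<std::size_t>(sequent_value) >= kMaxThreads)
    {
        // Warn only the first time the limit is hit.
        static std::atomic<bool> warned{ false };
        if(!warned.exchange(true))
        {
            std::fprintf(stderr,
                         "[sketch][WARNING] thread limit (%zu) reached; "
                         "further threads are not profiled\n",
                         kMaxThreads);
        }
        return nullptr;
    }
    auto& slot = storage()[static_cast<std::size_t>(sequent_value)];
    if(!slot) slot = std::make_unique<thread_record>();
    slot->sequent_value = sequent_value;
    return slot.get();
}

// Small self-check that exercises the in-range and out-of-range paths.
inline void demo_guard()
{
    assert(try_register(0) != nullptr);
    assert(try_register(static_cast<long>(kMaxThreads)) == nullptr);
}
```

The point of the sketch is that the comparison and the array size come from the same constant, so the check cannot drift from the storage it protects, which is the mismatch the commit removes by dropping the separate hard-coded `allowed_max_threads`.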
@@ -40,21 +40,18 @@ set(_thread_limit_environment
"ROCPROFSYS_TIMEMORY_COMPONENTS=wall_clock,peak_rss,page_rss"
)
-# Maximum allowed threads
-set(ALLOWED_MAX_THREADS 4096)
+math(EXPR THREAD_VAL_1 "${ROCPROFSYS_MAX_THREADS} - 1")
+math(EXPR THREAD_VAL_2 "${ROCPROFSYS_MAX_THREADS} + 24")
-math(EXPR THREAD_VAL_1 "${ROCPROFSYS_MAX_THREADS} + 24")
-math(EXPR THREAD_VAL_2 "${ALLOWED_MAX_THREADS} + 1")
-set(THREAD_VALUES ${THREAD_VAL_1} ${THREAD_VAL_2})
+set(THREAD_VALUES ${THREAD_VAL_1} ${THREAD_VAL_2} ${ROCPROFSYS_MAX_THREADS})
# Loop over thread values
foreach(THREADS IN LISTS THREAD_VALUES)
set(THREAD_PASS_VALUE ${THREADS})
math(EXPR THREAD_FAIL_VALUE "${THREADS} + 1")
-if(${THREADS} GREATER_EQUAL ${ALLOWED_MAX_THREADS})
-math(EXPR THREAD_PASS_VALUE "${ALLOWED_MAX_THREADS} - 1")
-math(EXPR THREAD_FAIL_VALUE "${THREADS}")
+if(${THREADS} GREATER_EQUAL ${ROCPROFSYS_MAX_THREADS})
+math(EXPR THREAD_PASS_VALUE "${ROCPROFSYS_MAX_THREADS} - 1")
+math(EXPR THREAD_FAIL_VALUE "${ROCPROFSYS_MAX_THREADS} + 1")
endif()
set(_thread_limit_pass_regex "\\|${THREAD_PASS_VALUE}>>>")
@@ -72,9 +69,9 @@ foreach(THREADS IN LISTS THREAD_VALUES)
REWRITE_ARGS -e -v 2 -i 1024 --label return args
RUNTIME_ARGS -e -v 1 -i 1024 --label return args
RUN_ARGS 35 2 ${THREADS}
-SAMPLING_TIMEOUT 180
-REWRITE_TIMEOUT 180
-RUNTIME_TIMEOUT 360
+SAMPLING_TIMEOUT 480
+REWRITE_TIMEOUT 480
+RUNTIME_TIMEOUT 480
RUNTIME_PASS_REGEX "${_thread_limit_pass_regex}"
SAMPLING_PASS_REGEX "${_thread_limit_pass_regex}"
REWRITE_RUN_PASS_REGEX "${_thread_limit_pass_regex}"