Fix double-free crash when librocm_smi64.so and libamd_smi.so are loaded together (#2531)

Problem:
When a TheRock-based PyTorch package is installed along with amdsmi,
importing torch causes a double-free crash on exit
(GitHub issue ROCm/TheRock#2269).

Root cause:
Both librocm_smi64.so and libamd_smi.so export the C++ static member
'amd::smi::Device::devInfoTypesStrings'. When the libraries are loaded with
RTLD_GLOBAL, the dynamic linker resolves libamd_smi.so's reference to this
symbol to the one in librocm_smi64.so. This causes:
1. librocm_smi64.so registers its destructor for devInfoTypesStrings
2. libamd_smi.so also registers a destructor, but for the SAME address
3. On exit, both destructors run on the same object -> double-free

Fix:
Change devInfoTypesStrings from a class static member to a file-local
static variable. This ensures the symbol has internal linkage and is not
exported, preventing the symbol collision.

Changes:
- rocm_smi_device.h: Remove static member declaration
- rocm_smi_device.cc: Change from 'Device::devInfoTypesStrings' to
  file-local 'static const std::map<...> devInfoTypesStrings'
- rocm_smi.cc: Remove the global alias to the (now removed) class member

Tested on gfx1151. `import torch` crashed on exit before the fix, and
doesn't crash after the fix.

Committed by: GitHub
Parent: 29cd25df66
Commit: 1883f736ad
@@ -244,7 +244,7 @@ class Device {
   rsmi_status_t dev_log_gpu_metrics(std::ostringstream& outstream_metrics);
   AMGpuMetricsPublicLatestTupl_t dev_copy_internal_to_external_metrics();
 
-  static const std::map<DevInfoTypes, const char*> devInfoTypesStrings;
+
   void set_smi_device_id(uint32_t i) { m_device_id = i; }
   void set_smi_partition_id(uint32_t i) { m_partition_id = i; }
   static const char* get_type_string(DevInfoTypes type);
@@ -84,7 +84,6 @@ using amd::smi::monitorTypesToString;
 using amd::smi::getRSMIStatusString;
 using amd::smi::AMDGpuMetricsUnitType_t;
 using amd::smi::AMDGpuMetricTypeId_t;
-auto &devInfoTypesStrings = amd::smi::Device::devInfoTypesStrings;
 
 static const uint32_t kMaxOverdriveLevel = 20;
 static const float kEnergyCounterResolution = 15.3F;
@@ -379,8 +379,7 @@ static const std::map<DevInfoTypes, uint8_t> kDevInfoVarTypeToRSMIVariant = {
   {kDevDFCountersAvailable, RSMI_EVNT_GRP_XGMI}
 };
 
-const std::map<DevInfoTypes, const char*>
-Device::devInfoTypesStrings = {
+static const std::map<DevInfoTypes, const char*> devInfoTypesStrings = {
   {kDevPerfLevel, "kDevPerfLevel"},
   {kDevOverDriveLevel, "kDevOverDriveLevel"},
   {kDevMemOverDriveLevel, "kDevMemOverDriveLevel"},