Fix double-free crash when librocm_smi64.so and libamd_smi.so are loaded together (#2531)

Problem:
When a TheRock-based PyTorch package is installed alongside amdsmi, importing
torch causes a double-free crash on exit (GitHub issue ROCm/TheRock#2269).

Root cause:
Both librocm_smi64.so and libamd_smi.so export the C++ static member
'amd::smi::Device::devInfoTypesStrings'. When both libraries are loaded with
RTLD_GLOBAL, the dynamic linker binds libamd_smi.so's reference to this
symbol to the definition in librocm_smi64.so. This causes:
1. librocm_smi64.so registers its destructor for devInfoTypesStrings
2. libamd_smi.so also registers a destructor, but for the SAME address
3. On exit, both destructors run on the same object -> double-free
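
The sequence above can be modelled in a single translation unit. This is a
sketch, not the loader's actual mechanism: `ExitRegistry`,
`register_destructor`, `shared_symbol_case` and `local_symbol_case` are
hypothetical names, and the registry only stands in for the per-library
destructor registration performed by the C++ runtime. It shows that when two
libraries register a destructor for one address, teardown visits that address
twice, whereas separate objects are each visited once.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for the runtime's list of exit-time destructors.
struct ExitRegistry {
    std::vector<const void*> pending;  // addresses of objects awaiting destruction

    void register_destructor(const void* object) { pending.push_back(object); }

    // Counts the objects that would be destroyed more than once at exit.
    std::size_t count_double_destroys() const {
        std::map<const void*, int> visits;
        for (const void* p : pending) ++visits[p];
        std::size_t doubles = 0;
        for (const auto& entry : visits) {
            if (entry.second > 1) ++doubles;
        }
        return doubles;
    }
};

// Exported symbol: both libraries end up referring to one definition.
inline std::size_t shared_symbol_case() {
    int rocm_smi_object = 0;                    // models the librocm_smi64.so definition
    const void* rocm_view = &rocm_smi_object;
    const void* amd_view = &rocm_smi_object;    // libamd_smi.so bound to the same address
    ExitRegistry registry;
    registry.register_destructor(rocm_view);    // step 1
    registry.register_destructor(amd_view);     // step 2: same address again
    return registry.count_double_destroys();    // step 3: one object destroyed twice
}

// Internal linkage: each library keeps its own private object.
inline std::size_t local_symbol_case() {
    int rocm_smi_object = 0;
    int amd_smi_object = 0;
    ExitRegistry registry;
    registry.register_destructor(&rocm_smi_object);
    registry.register_destructor(&amd_smi_object);
    return registry.count_double_destroys();
}
```

In this model `shared_symbol_case()` reports one doubly-destroyed object and
`local_symbol_case()` reports none.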

Fix:
Change devInfoTypesStrings from a class static member to a file-local static
variable. This ensures the symbol has internal linkage and is not exported,
preventing the symbol collision.
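
A minimal sketch of the resulting layout, assuming a reduced `DevInfoTypes`
enum with three of its enumerators and a hypothetical `sketch` namespace in
place of `amd::smi`. The body of `get_type_string` is an assumption made for
illustration; the real implementation is not part of this commit.

```cpp
#include <cassert>
#include <cstring>
#include <map>

namespace sketch {  // hypothetical namespace, stands in for amd::smi

// Reduced enum for illustration; the real one has many more entries.
enum DevInfoTypes { kDevPerfLevel, kDevOverDriveLevel, kDevMemOverDriveLevel };

class Device {
 public:
    // Before the fix the class declared the map as a static member, which
    // gives it external linkage:
    //   static const std::map<DevInfoTypes, const char*> devInfoTypesStrings;
    static const char* get_type_string(DevInfoTypes type);
};

// After the fix: a file-local variable with internal linkage, so it is not
// exported from the shared library.
static const std::map<DevInfoTypes, const char*> devInfoTypesStrings = {
    {kDevPerfLevel, "kDevPerfLevel"},
    {kDevOverDriveLevel, "kDevOverDriveLevel"},
    {kDevMemOverDriveLevel, "kDevMemOverDriveLevel"},
};

// Assumed body: look the type up in the file-local map.
const char* Device::get_type_string(DevInfoTypes type) {
    auto it = devInfoTypesStrings.find(type);
    return it != devInfoTypesStrings.end() ? it->second : "Unknown";
}

}  // namespace sketch
```

Code in the same translation unit keeps using the map as before. Code in other
files has to go through an accessor such as `get_type_string`.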

Changes:
- rocm_smi_device.h: Remove static member declaration
- rocm_smi_device.cc: Change from 'Device::devInfoTypesStrings' to file-local
  'static const std::map<...> devInfoTypesStrings'
- rocm_smi.cc: Remove the global alias to the (now removed) class member

Tested on gfx1151: `import torch` crashed on exit before the fix and no longer
crashes after it.
Matthias Gehre
2026-01-15 17:43:47 +01:00
Committed by GitHub
Parent 29cd25df66
Commit 1883f736ad
3 changed files: 2 additions and 4 deletions
rocm_smi_device.h: +1 -1
@@ -244,7 +244,7 @@ class Device {
 rsmi_status_t dev_log_gpu_metrics(std::ostringstream& outstream_metrics);
 AMGpuMetricsPublicLatestTupl_t dev_copy_internal_to_external_metrics();
-static const std::map<DevInfoTypes, const char*> devInfoTypesStrings;
 void set_smi_device_id(uint32_t i) { m_device_id = i; }
 void set_smi_partition_id(uint32_t i) { m_partition_id = i; }
 static const char* get_type_string(DevInfoTypes type);
rocm_smi.cc: -1
@@ -84,7 +84,6 @@ using amd::smi::monitorTypesToString;
 using amd::smi::getRSMIStatusString;
 using amd::smi::AMDGpuMetricsUnitType_t;
 using amd::smi::AMDGpuMetricTypeId_t;
-auto &devInfoTypesStrings = amd::smi::Device::devInfoTypesStrings;
 static const uint32_t kMaxOverdriveLevel = 20;
 static const float kEnergyCounterResolution = 15.3F;
rocm_smi_device.cc: +1 -2
@@ -379,8 +379,7 @@ static const std::map<DevInfoTypes, uint8_t> kDevInfoVarTypeToRSMIVariant = {
 {kDevDFCountersAvailable, RSMI_EVNT_GRP_XGMI}
 };
-const std::map<DevInfoTypes, const char*>
-Device::devInfoTypesStrings = {
+static const std::map<DevInfoTypes, const char*> devInfoTypesStrings = {
 {kDevPerfLevel, "kDevPerfLevel"},
 {kDevOverDriveLevel, "kDevOverDriveLevel"},
 {kDevMemOverDriveLevel, "kDevMemOverDriveLevel"},