AI/rocm-systems

Fork 0

Dateien

T

AidanBeltonS 607d66e87c Add messages to static asserts to prevent warnings (#1011 )

2026-01-13 14:02:36 +00:00

16 KiB

Originalformat Permalink Blame Verlauf

CommandPool Refactoring Analysis: Moving from Global Singleton to Per-Stream Pools

Executive Summary

This document analyzes the feasibility and impact of moving the CommandPool from a global static singleton instance to per-stream instances within HostQueue. This change aims to eliminate contention bottlenecks in multithreaded applications with many concurrent streams.

Current Architecture

CommandPool Implementation

The CommandPool is currently implemented as a static singleton with the following characteristics:

Location: rocclr/platform/command.cpp (lines 338-408)
Access Pattern: CommandPool::instance() returns a static singleton
Thread Safety: Protected by a single std::mutex mutex_
Storage: Ring buffer with 64 entries (q_size_ = 64)
Memory Management:
- Allocates aligned memory using std::aligned_alloc(maxAlignment_, maxSize_)
- Reuses deallocated command memory when pool is not full
- Frees memory when pool is full

Command Types Using CommandPool

The following command types use the pool via custom operator new and release() methods:

ReadMemoryCommand - operator new() and release() (lines 684-707)
WriteMemoryCommand - operator new() and release() (lines 714-733)
FillMemoryCommand - operator new() and release() (lines 744-765)
CopyMemoryCommand - operator new() and release() (lines 799-820)
CopyMemoryP2PCommand - operator new() and release() (lines 1003-1024)
Marker - operator new() and release() (lines 1027-1048)

Current Allocation Flow

// Command creation
void* ReadMemoryCommand::operator new(size_t size) {
  void* ptr = CommandPool::instance().allocate();  // Global singleton access
  // ...
  return ptr;
}

// Command destruction
uint ReadMemoryCommand::release() {
  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  if (newCount == 0) {
    if (terminate()) {
      CommandPool::instance().deallocate(this);  // Global singleton access
      return 0;
    }
  }
  return newCount;
}

Problem: Contention Bottleneck

In a multithreaded application with many streams:

All threads compete for the same global CommandPool::instance() mutex
High-frequency command allocation/deallocation creates lock contention
Performance degrades as the number of concurrent streams increases
The single mutex serializes all command pool operations across all streams

Proposed Solution: Per-Stream CommandPool

Architecture Changes

Move CommandPool into HostQueue
- Each HostQueue instance owns its own CommandPool
- Eliminates cross-stream contention
- Commands allocated from a stream's pool are returned to the same pool
Update Command Allocation
- Commands already have a queue_ pointer (set in constructor)
- Commands can access their queue's pool via queue()->commandPool()
- No need to pass additional parameters
Lifecycle Management
- Pool created when HostQueue is constructed
- Pool destroyed when HostQueue is destroyed
- No global cleanup needed

Implementation Plan

Step 1: Move CommandPool Class Definition

Keep CommandPool class definition in command.cpp (or move to header if needed)
Remove static instance() method
Make it a regular class (no singleton pattern)

Step 2: Add CommandPool to HostQueue

// In commandqueue.hpp
class HostQueue : public CommandQueue {
  // ... existing members ...
private:
  CommandPool commandPool_;  // Per-queue command pool
};

Step 3: Update Command Allocation Methods

// Before:
void* ReadMemoryCommand::operator new(size_t size) {
  void* ptr = CommandPool::instance().allocate();
  // ...
}

// After:
void* ReadMemoryCommand::operator new(size_t size, HostQueue& queue) {
  void* ptr = queue.commandPool().allocate();
  // ...
}

Challenge: operator new is called with new ReadMemoryCommand(...), but we need access to the queue. The queue is passed to the constructor, not operator new.

Solution: Commands already store queue_ pointer. We can use a two-phase approach:

Phase 1: Allocate memory (may need temporary global pool or direct allocation)
Phase 2: After construction, move to queue's pool (not practical)

Better Solution: Use placement new or modify allocation pattern:

Option A: Allocate from queue's pool before construction
Option B: Store pool reference in command and deallocate to correct pool
Option C: Use a thread-local or queue-specific allocation mechanism

Step 4: Update Command Deallocation

// Before:
uint ReadMemoryCommand::release() {
  // ...
  CommandPool::instance().deallocate(this);
  // ...
}

// After:
uint ReadMemoryCommand::release() {
  // ...
  queue_->commandPool().deallocate(this);  // Use queue's pool
  // ...
}

Note: This is straightforward since queue_ is already available in the command.

Detailed Implementation Strategy

Option 1: Two-Phase Allocation (Recommended)

Since operator new is called before the constructor, we need a way to get the queue reference. However, commands are always created with a queue parameter:

// Current pattern:
ReadMemoryCommand* cmd = new ReadMemoryCommand(queue, ...);

// The queue is available at call site!

Solution: Use a placement-new-like pattern or thread-local storage:

Thread-Local Queue Context (Simpler but less clean):

thread_local HostQueue* g_currentQueue = nullptr;

void* ReadMemoryCommand::operator new(size_t size) {
  HostQueue* queue = g_currentQueue;
  if (queue) {
    return queue->commandPool().allocate();
  }
  // Fallback to direct allocation
  return std::aligned_alloc(alignof(ReadMemoryCommand), size);
}

Queue Parameter in operator new (Requires syntax change):

// This would require: new(queue) ReadMemoryCommand(...)
// Not standard C++ placement new syntax

Allocate from Queue Before Construction (Most practical):

// At call sites, allocate from queue first:
void* mem = queue.commandPool().allocate();
ReadMemoryCommand* cmd = new(mem) ReadMemoryCommand(queue, ...);

This requires changing all call sites (80+ locations).

Option 2: Deferred Pool Assignment (Hybrid Approach)

Allocate commands using a temporary mechanism (direct allocation or small per-thread pool)
After construction, commands have queue_ pointer
On deallocation, return to the correct queue's pool
Problem: Can't reuse memory from different queues efficiently

Option 3: Queue-Scoped Allocation Helper (Recommended)

Create a helper that wraps command creation:

template<typename CmdType, typename... Args>
CmdType* createCommand(HostQueue& queue, Args&&... args) {
  void* mem = queue.commandPool().allocate();
  return new(mem) CmdType(queue, std::forward<Args>(args)...);
}

// Usage:
auto cmd = createCommand<ReadMemoryCommand>(queue, CL_COMMAND_READ_BUFFER, ...);

This requires updating all 80+ call sites but provides clean semantics.

Code Changes Required

Files to Modify

rocclr/platform/commandqueue.hpp
- Add CommandPool commandPool_; member to HostQueue
- Add CommandPool& commandPool() accessor method
rocclr/platform/commandqueue.cpp
- Initialize commandPool_ in HostQueue constructor
- Implement commandPool() accessor
rocclr/platform/command.cpp
- Remove CommandPool::instance() static method
- Update all 6 command types' operator new() methods
- Update all 6 command types' release() methods to use queue_->commandPool()
All command creation sites (80+ locations):
- hipamd/src/hip_memory.cpp
- hipamd/src/hip_stream.cpp
- hipamd/src/hip_event.cpp
- rocclr/platform/commandqueue.cpp
- opencl/amdocl/cl_execute.cpp
- opencl/amdocl/cl_memobj.cpp
- And others...

Benefits

Eliminates Contention: Each stream has its own pool, no cross-stream locking
Better Locality: Commands allocated from a stream are reused by the same stream
Scalability: Performance scales with number of streams (no global bottleneck)
Memory Efficiency: Per-stream pools can be sized appropriately
Thread Safety: Each pool only accessed by its stream's thread (mostly)

Challenges and Considerations

Challenge 1: operator new() Timing

operator new() is called before constructor
Queue reference not available in operator new() signature
Solution: Use helper function or thread-local context

Challenge 2: Cross-Queue Command References

Commands may reference events from other queues
Commands are destroyed when reference count reaches zero
Impact: Low - commands are typically destroyed by their owning queue

Challenge 3: Memory Pool Sizing

Current: 64 entries shared across all streams
Per-stream: 64 entries per stream
Memory Impact: N streams × 64 entries × maxSize_ bytes
Mitigation: Could make pool size configurable or smaller per-stream

Challenge 4: Thread Safety Within Queue

Commands may be allocated/deallocated from different threads
HostQueue::append() may be called from any thread
Solution: CommandPool mutex still needed, but contention is per-stream only

Challenge 5: Backward Compatibility

Need to ensure no regression in single-stream scenarios
Performance should be equal or better

Testing Considerations

Single Stream: Verify no performance regression
Multiple Streams: Measure contention reduction
High Concurrency: Test with many concurrent streams
Memory Leaks: Ensure pools are properly cleaned up
Command Lifecycle: Verify commands are correctly returned to pools

Migration Strategy

Phase 1: Implement per-queue pools alongside global pool (feature flag)
Phase 2: Update command allocation to use queue pools
Phase 3: Update all call sites to use new allocation pattern
Phase 4: Remove global pool after validation
Phase 5: Performance testing and optimization

Alternative: Thread-Local Pool

Instead of per-queue pools, consider thread-local pools:

Simpler implementation (no queue parameter needed)
Still reduces contention (per-thread instead of global)
Drawback: Threads may service multiple queues, less optimal locality

Recommendation

Proceed with per-queue CommandPool implementation using Option 3 (Queue-Scoped Allocation Helper):

High Impact: Eliminates major contention bottleneck
Manageable Complexity: Clear ownership model (queue owns pool)
Good Locality: Commands reused within same stream
Incremental Migration: Can be done with feature flags

The main effort is updating ~80 call sites to use the allocation helper, but this provides the cleanest semantics and best performance.

Implementation Example

Step 1: Modify CommandPool Class

// In command.cpp - Remove singleton pattern
class CommandPool {
public:
  CommandPool() {
    static_assert(((q_size_ & (q_size_ - 1)) == 0) && "q_size must be power of 2");
  }
  
  // Remove: static CommandPool& instance();
  
  template <typename CmdType>
  void deallocate(CmdType *ptr) {
    // ... existing implementation ...
  }
  
  void *allocate() {
    // ... existing implementation ...
  }
  
  // ... rest of implementation unchanged ...
};

Step 2: Add CommandPool to HostQueue

// In commandqueue.hpp
class HostQueue : public CommandQueue {
  // ... existing members ...
  
public:
  // Accessor for command pool
  CommandPool& commandPool() { return commandPool_; }
  const CommandPool& commandPool() const { return commandPool_; }

private:
  CommandPool commandPool_;  // Per-queue command pool
  // ... rest of members ...
};

// In commandqueue.cpp - Initialize in constructor
HostQueue::HostQueue(Context& context, Device& device, ...)
    : CommandQueue(...),
      commandPool_(),  // Initialize pool
      // ... other initializations ...
{
  // ... existing constructor code ...
}

Step 3: Create Allocation Helper

// In command.hpp or a new command_utils.hpp
namespace amd {

// Helper function to create commands using queue's pool
template<typename CmdType, typename... Args>
CmdType* createCommand(HostQueue& queue, Args&&... args) {
  void* mem = queue.commandPool().allocate();
  if (mem == nullptr) {
    return nullptr;
  }
  return new(mem) CmdType(queue, std::forward<Args>(args)...);
}

} // namespace amd

Step 4: Update Command Deallocation

// In command.cpp - Update all 6 command types
uint ReadMemoryCommand::release() {
  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  if (newCount == 0) {
    if (terminate()) {
      // Use queue's pool instead of global singleton
      queue_->commandPool().deallocate(this);
      return 0;
    }
  }
  return newCount;
}

// Repeat for: WriteMemoryCommand, FillMemoryCommand, 
//            CopyMemoryCommand, CopyMemoryP2PCommand, Marker

Step 5: Update Command Allocation (Remove operator new)

Since we're using placement new via the helper, we can either:

Keep operator new as fallback (for compatibility)
Remove it entirely (cleaner, but requires all call sites updated)

Option A: Keep as fallback

void* ReadMemoryCommand::operator new(size_t size) {
  // Fallback: direct allocation if helper not used
  return std::aligned_alloc(alignof(ReadMemoryCommand), size);
}

Option B: Remove operator new (preferred after migration)

Step 6: Update Call Sites

// Before:
amd::ReadMemoryCommand* cmd = new amd::ReadMemoryCommand(
    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);

// After:
amd::ReadMemoryCommand* cmd = amd::createCommand<amd::ReadMemoryCommand>(
    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);

Migration Example: hip_memory.cpp

// Current code (line 587):
command = new amd::ReadMemoryCommand(*pStream, CL_COMMAND_READ_BUFFER, waitList,
                                     *srcBuffer, origin, size, dst, rowPitch, slicePitch);

// Migrated code:
command = amd::createCommand<amd::ReadMemoryCommand>(*pStream, CL_COMMAND_READ_BUFFER, 
                                                      waitList, *srcBuffer, origin, size, 
                                                      dst, rowPitch, slicePitch);

Performance Impact Estimate

Current Bottleneck

Single mutex protecting global pool
N threads contending for same lock
Lock hold time: ~100-500ns per allocation/deallocation
Contention cost: O(N) threads × lock overhead

After Refactoring

N mutexes (one per stream)
1 thread per stream typically (or small number)
Lock hold time: Same (~100-500ns)
Contention cost: O(1) per stream

Expected Improvement

Single stream: No change (or slight improvement from better locality)
Multiple streams: Near-linear scaling with number of streams
High concurrency (16+ streams): 10-100x improvement in allocation throughput

Risk Assessment

Low Risk

✅ Command deallocation (already has queue pointer)
✅ Pool initialization/destruction (RAII in HostQueue)
✅ Memory management (same algorithm, just per-queue)

Medium Risk

⚠️ Call site updates (80+ locations, but mechanical)
⚠️ Testing coverage (need multi-stream scenarios)
⚠️ Backward compatibility during migration

Mitigation Strategies

Feature flag: Enable per-queue pools behind flag
Gradual migration: Update call sites incrementally
Fallback mechanism: Keep global pool as fallback initially
Comprehensive testing: Multi-stream stress tests

Conclusion

Moving CommandPool from a global singleton to per-queue instances is highly recommended:

Solves real performance problem in multithreaded applications
Clear implementation path with manageable complexity
Significant scalability improvement expected
Low risk with proper testing and gradual migration

The main implementation effort is mechanical (updating call sites), and the architectural change is sound and well-scoped.

16 KiB Originalformat Permalink Blame Verlauf

CommandPool Refactoring Analysis: Moving from Global Singleton to Per-Stream Pools

Executive Summary

Current Architecture

CommandPool Implementation

Command Types Using CommandPool

Current Allocation Flow

Problem: Contention Bottleneck

Proposed Solution: Per-Stream CommandPool

Architecture Changes

Implementation Plan

Step 1: Move CommandPool Class Definition

Step 2: Add CommandPool to HostQueue

Step 3: Update Command Allocation Methods

Step 4: Update Command Deallocation

Detailed Implementation Strategy

Option 1: Two-Phase Allocation (Recommended)

Option 2: Deferred Pool Assignment (Hybrid Approach)

Option 3: Queue-Scoped Allocation Helper (Recommended)

Code Changes Required

Files to Modify

Benefits

Challenges and Considerations

Challenge 1: operator new() Timing

Challenge 2: Cross-Queue Command References

Challenge 3: Memory Pool Sizing

Challenge 4: Thread Safety Within Queue

Challenge 5: Backward Compatibility

Testing Considerations

Migration Strategy

Alternative: Thread-Local Pool

Recommendation

Implementation Example

Step 1: Modify CommandPool Class

Step 2: Add CommandPool to HostQueue

Step 3: Create Allocation Helper

Step 4: Update Command Deallocation

Step 5: Update Command Allocation (Remove operator new)

Step 6: Update Call Sites

Migration Example: hip_memory.cpp

Performance Impact Estimate

Current Bottleneck

After Refactoring

Expected Improvement

Risk Assessment

Low Risk

Medium Risk

Mitigation Strategies

Conclusion

16 KiB

Originalformat Permalink Blame Verlauf