Dateien
rocm-systems/projects/clr/COMMAND_POOL_ANALYSIS.md

16 KiB

CommandPool Refactoring Analysis: Moving from Global Singleton to Per-Stream Pools

Executive Summary

This document analyzes the feasibility and impact of moving the CommandPool from a global static singleton instance to per-stream instances within HostQueue. This change aims to eliminate contention bottlenecks in multithreaded applications with many concurrent streams.

Current Architecture

CommandPool Implementation

The CommandPool is currently implemented as a static singleton with the following characteristics:

  • Location: rocclr/platform/command.cpp (lines 338-408)
  • Access Pattern: CommandPool::instance() returns a static singleton
  • Thread Safety: Protected by a single std::mutex mutex_
  • Storage: Ring buffer with 64 entries (q_size_ = 64)
  • Memory Management:
    • Allocates aligned memory using std::aligned_alloc(maxAlignment_, maxSize_)
    • Reuses deallocated command memory when pool is not full
    • Frees memory when pool is full

Command Types Using CommandPool

The following command types use the pool via custom operator new and release() methods:

  1. ReadMemoryCommand - operator new() and release() (lines 684-707)
  2. WriteMemoryCommand - operator new() and release() (lines 714-733)
  3. FillMemoryCommand - operator new() and release() (lines 744-765)
  4. CopyMemoryCommand - operator new() and release() (lines 799-820)
  5. CopyMemoryP2PCommand - operator new() and release() (lines 1003-1024)
  6. Marker - operator new() and release() (lines 1027-1048)

Current Allocation Flow

// Command creation
void* ReadMemoryCommand::operator new(size_t size) {
  void* ptr = CommandPool::instance().allocate();  // Global singleton access
  // ...
  return ptr;
}

// Command destruction
uint ReadMemoryCommand::release() {
  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  if (newCount == 0) {
    if (terminate()) {
      CommandPool::instance().deallocate(this);  // Global singleton access
      return 0;
    }
  }
  return newCount;
}

Problem: Contention Bottleneck

In a multithreaded application with many streams:

  • All threads compete for the same global CommandPool::instance() mutex
  • High-frequency command allocation/deallocation creates lock contention
  • Performance degrades as the number of concurrent streams increases
  • The single mutex serializes all command pool operations across all streams

Proposed Solution: Per-Stream CommandPool

Architecture Changes

  1. Move CommandPool into HostQueue

    • Each HostQueue instance owns its own CommandPool
    • Eliminates cross-stream contention
    • Commands allocated from a stream's pool are returned to the same pool
  2. Update Command Allocation

    • Commands already have a queue_ pointer (set in constructor)
    • Commands can access their queue's pool via queue()->commandPool()
    • No need to pass additional parameters
  3. Lifecycle Management

    • Pool created when HostQueue is constructed
    • Pool destroyed when HostQueue is destroyed
    • No global cleanup needed

Implementation Plan

Step 1: Move CommandPool Class Definition

  • Keep CommandPool class definition in command.cpp (or move to header if needed)
  • Remove static instance() method
  • Make it a regular class (no singleton pattern)

Step 2: Add CommandPool to HostQueue

// In commandqueue.hpp
class HostQueue : public CommandQueue {
  // ... existing members ...
private:
  CommandPool commandPool_;  // Per-queue command pool
};

Step 3: Update Command Allocation Methods

// Before:
void* ReadMemoryCommand::operator new(size_t size) {
  void* ptr = CommandPool::instance().allocate();
  // ...
}

// After:
void* ReadMemoryCommand::operator new(size_t size, HostQueue& queue) {
  void* ptr = queue.commandPool().allocate();
  // ...
}

Challenge: operator new is called with new ReadMemoryCommand(...), but we need access to the queue. The queue is passed to the constructor, not operator new.

Solution: Commands already store queue_ pointer. We can use a two-phase approach:

  • Phase 1: Allocate memory (may need temporary global pool or direct allocation)
  • Phase 2: After construction, move to queue's pool (not practical)

Better Solution: Use placement new or modify allocation pattern:

  • Option A: Allocate from queue's pool before construction
  • Option B: Store pool reference in command and deallocate to correct pool
  • Option C: Use a thread-local or queue-specific allocation mechanism

Step 4: Update Command Deallocation

// Before:
uint ReadMemoryCommand::release() {
  // ...
  CommandPool::instance().deallocate(this);
  // ...
}

// After:
uint ReadMemoryCommand::release() {
  // ...
  queue_->commandPool().deallocate(this);  // Use queue's pool
  // ...
}

Note: This is straightforward since queue_ is already available in the command.

Detailed Implementation Strategy

Since operator new is called before the constructor, we need a way to get the queue reference. However, commands are always created with a queue parameter:

// Current pattern:
ReadMemoryCommand* cmd = new ReadMemoryCommand(queue, ...);

// The queue is available at call site!

Solution: Use a placement-new-like pattern or thread-local storage:

  1. Thread-Local Queue Context (Simpler but less clean):

    thread_local HostQueue* g_currentQueue = nullptr;
    
    void* ReadMemoryCommand::operator new(size_t size) {
      HostQueue* queue = g_currentQueue;
      if (queue) {
        return queue->commandPool().allocate();
      }
      // Fallback to direct allocation
      return std::aligned_alloc(alignof(ReadMemoryCommand), size);
    }
    
  2. Queue Parameter in operator new (Requires syntax change):

    // This would require: new(queue) ReadMemoryCommand(...)
    // Not standard C++ placement new syntax
    
  3. Allocate from Queue Before Construction (Most practical):

    // At call sites, allocate from queue first:
    void* mem = queue.commandPool().allocate();
    ReadMemoryCommand* cmd = new(mem) ReadMemoryCommand(queue, ...);
    

    This requires changing all call sites (80+ locations).

Option 2: Deferred Pool Assignment (Hybrid Approach)

  1. Allocate commands using a temporary mechanism (direct allocation or small per-thread pool)
  2. After construction, commands have queue_ pointer
  3. On deallocation, return to the correct queue's pool
  4. Problem: Can't reuse memory from different queues efficiently

Create a helper that wraps command creation:

template<typename CmdType, typename... Args>
CmdType* createCommand(HostQueue& queue, Args&&... args) {
  void* mem = queue.commandPool().allocate();
  return new(mem) CmdType(queue, std::forward<Args>(args)...);
}

// Usage:
auto cmd = createCommand<ReadMemoryCommand>(queue, CL_COMMAND_READ_BUFFER, ...);

This requires updating all 80+ call sites but provides clean semantics.

Code Changes Required

Files to Modify

  1. rocclr/platform/commandqueue.hpp

    • Add CommandPool commandPool_; member to HostQueue
    • Add CommandPool& commandPool() accessor method
  2. rocclr/platform/commandqueue.cpp

    • Initialize commandPool_ in HostQueue constructor
    • Implement commandPool() accessor
  3. rocclr/platform/command.cpp

    • Remove CommandPool::instance() static method
    • Update all 6 command types' operator new() methods
    • Update all 6 command types' release() methods to use queue_->commandPool()
  4. All command creation sites (80+ locations):

    • hipamd/src/hip_memory.cpp
    • hipamd/src/hip_stream.cpp
    • hipamd/src/hip_event.cpp
    • rocclr/platform/commandqueue.cpp
    • opencl/amdocl/cl_execute.cpp
    • opencl/amdocl/cl_memobj.cpp
    • And others...

Benefits

  1. Eliminates Contention: Each stream has its own pool, no cross-stream locking
  2. Better Locality: Commands allocated from a stream are reused by the same stream
  3. Scalability: Performance scales with number of streams (no global bottleneck)
  4. Memory Efficiency: Per-stream pools can be sized appropriately
  5. Thread Safety: Each pool only accessed by its stream's thread (mostly)

Challenges and Considerations

Challenge 1: operator new() Timing

  • operator new() is called before constructor
  • Queue reference not available in operator new() signature
  • Solution: Use helper function or thread-local context

Challenge 2: Cross-Queue Command References

  • Commands may reference events from other queues
  • Commands are destroyed when reference count reaches zero
  • Impact: Low - commands are typically destroyed by their owning queue

Challenge 3: Memory Pool Sizing

  • Current: 64 entries shared across all streams
  • Per-stream: 64 entries per stream
  • Memory Impact: N streams × 64 entries × maxSize_ bytes
  • Mitigation: Could make pool size configurable or smaller per-stream

Challenge 4: Thread Safety Within Queue

  • Commands may be allocated/deallocated from different threads
  • HostQueue::append() may be called from any thread
  • Solution: CommandPool mutex still needed, but contention is per-stream only

Challenge 5: Backward Compatibility

  • Need to ensure no regression in single-stream scenarios
  • Performance should be equal or better

Testing Considerations

  1. Single Stream: Verify no performance regression
  2. Multiple Streams: Measure contention reduction
  3. High Concurrency: Test with many concurrent streams
  4. Memory Leaks: Ensure pools are properly cleaned up
  5. Command Lifecycle: Verify commands are correctly returned to pools

Migration Strategy

  1. Phase 1: Implement per-queue pools alongside global pool (feature flag)
  2. Phase 2: Update command allocation to use queue pools
  3. Phase 3: Update all call sites to use new allocation pattern
  4. Phase 4: Remove global pool after validation
  5. Phase 5: Performance testing and optimization

Alternative: Thread-Local Pool

Instead of per-queue pools, consider thread-local pools:

  • Simpler implementation (no queue parameter needed)
  • Still reduces contention (per-thread instead of global)
  • Drawback: Threads may service multiple queues, less optimal locality

Recommendation

Proceed with per-queue CommandPool implementation using Option 3 (Queue-Scoped Allocation Helper):

  1. High Impact: Eliminates major contention bottleneck
  2. Manageable Complexity: Clear ownership model (queue owns pool)
  3. Good Locality: Commands reused within same stream
  4. Incremental Migration: Can be done with feature flags

The main effort is updating ~80 call sites to use the allocation helper, but this provides the cleanest semantics and best performance.

Implementation Example

Step 1: Modify CommandPool Class

// In command.cpp - Remove singleton pattern
class CommandPool {
public:
  CommandPool() {
    static_assert(((q_size_ & (q_size_ - 1)) == 0) && "q_size must be power of 2");
  }
  
  // Remove: static CommandPool& instance();
  
  template <typename CmdType>
  void deallocate(CmdType *ptr) {
    // ... existing implementation ...
  }
  
  void *allocate() {
    // ... existing implementation ...
  }
  
  // ... rest of implementation unchanged ...
};

Step 2: Add CommandPool to HostQueue

// In commandqueue.hpp
class HostQueue : public CommandQueue {
  // ... existing members ...
  
public:
  // Accessor for command pool
  CommandPool& commandPool() { return commandPool_; }
  const CommandPool& commandPool() const { return commandPool_; }

private:
  CommandPool commandPool_;  // Per-queue command pool
  // ... rest of members ...
};
// In commandqueue.cpp - Initialize in constructor
HostQueue::HostQueue(Context& context, Device& device, ...)
    : CommandQueue(...),
      commandPool_(),  // Initialize pool
      // ... other initializations ...
{
  // ... existing constructor code ...
}

Step 3: Create Allocation Helper

// In command.hpp or a new command_utils.hpp
namespace amd {

// Helper function to create commands using queue's pool
template<typename CmdType, typename... Args>
CmdType* createCommand(HostQueue& queue, Args&&... args) {
  void* mem = queue.commandPool().allocate();
  if (mem == nullptr) {
    return nullptr;
  }
  return new(mem) CmdType(queue, std::forward<Args>(args)...);
}

} // namespace amd

Step 4: Update Command Deallocation

// In command.cpp - Update all 6 command types
uint ReadMemoryCommand::release() {
  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  if (newCount == 0) {
    if (terminate()) {
      // Use queue's pool instead of global singleton
      queue_->commandPool().deallocate(this);
      return 0;
    }
  }
  return newCount;
}

// Repeat for: WriteMemoryCommand, FillMemoryCommand, 
//            CopyMemoryCommand, CopyMemoryP2PCommand, Marker

Step 5: Update Command Allocation (Remove operator new)

Since we're using placement new via the helper, we can either:

  • Keep operator new as fallback (for compatibility)
  • Remove it entirely (cleaner, but requires all call sites updated)

Option A: Keep as fallback

void* ReadMemoryCommand::operator new(size_t size) {
  // Fallback: direct allocation if helper not used
  return std::aligned_alloc(alignof(ReadMemoryCommand), size);
}

Option B: Remove operator new (preferred after migration)

Step 6: Update Call Sites

// Before:
amd::ReadMemoryCommand* cmd = new amd::ReadMemoryCommand(
    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);

// After:
amd::ReadMemoryCommand* cmd = amd::createCommand<amd::ReadMemoryCommand>(
    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);

Migration Example: hip_memory.cpp

// Current code (line 587):
command = new amd::ReadMemoryCommand(*pStream, CL_COMMAND_READ_BUFFER, waitList,
                                     *srcBuffer, origin, size, dst, rowPitch, slicePitch);

// Migrated code:
command = amd::createCommand<amd::ReadMemoryCommand>(*pStream, CL_COMMAND_READ_BUFFER, 
                                                      waitList, *srcBuffer, origin, size, 
                                                      dst, rowPitch, slicePitch);

Performance Impact Estimate

Current Bottleneck

  • Single mutex protecting global pool
  • N threads contending for same lock
  • Lock hold time: ~100-500ns per allocation/deallocation
  • Contention cost: O(N) threads × lock overhead

After Refactoring

  • N mutexes (one per stream)
  • 1 thread per stream typically (or small number)
  • Lock hold time: Same (~100-500ns)
  • Contention cost: O(1) per stream

Expected Improvement

  • Single stream: No change (or slight improvement from better locality)
  • Multiple streams: Near-linear scaling with number of streams
  • High concurrency (16+ streams): 10-100x improvement in allocation throughput

Risk Assessment

Low Risk

  • Command deallocation (already has queue pointer)
  • Pool initialization/destruction (RAII in HostQueue)
  • Memory management (same algorithm, just per-queue)

Medium Risk

  • ⚠️ Call site updates (80+ locations, but mechanical)
  • ⚠️ Testing coverage (need multi-stream scenarios)
  • ⚠️ Backward compatibility during migration

Mitigation Strategies

  1. Feature flag: Enable per-queue pools behind flag
  2. Gradual migration: Update call sites incrementally
  3. Fallback mechanism: Keep global pool as fallback initially
  4. Comprehensive testing: Multi-stream stress tests

Conclusion

Moving CommandPool from a global singleton to per-queue instances is highly recommended:

  • Solves real performance problem in multithreaded applications
  • Clear implementation path with manageable complexity
  • Significant scalability improvement expected
  • Low risk with proper testing and gradual migration

The main implementation effort is mechanical (updating call sites), and the architectural change is sound and well-scoped.