CommandPool Refactoring Analysis: Moving from Global Singleton to Per-Stream Pools
Executive Summary
This document analyzes the feasibility and impact of moving the CommandPool from a global static singleton instance to per-stream instances within HostQueue. This change aims to eliminate contention bottlenecks in multithreaded applications with many concurrent streams.
Current Architecture
CommandPool Implementation
The CommandPool is currently implemented as a static singleton with the following characteristics:
- Location: rocclr/platform/command.cpp (lines 338-408)
- Access Pattern: CommandPool::instance() returns a static singleton
- Thread Safety: Protected by a single std::mutex mutex_
- Storage: Ring buffer with 64 entries (q_size_ = 64)
- Memory Management:
  - Allocates aligned memory using std::aligned_alloc(maxAlignment_, maxSize_)
  - Reuses deallocated command memory when pool is not full
  - Frees memory when pool is full
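The characteristics listed above can be sketched as a small mutex-protected block cache. This is only an illustration under stated assumptions: SketchPool, kBlockSize and kAlign are hypothetical stand-ins (the real maxSize_ and maxAlignment_ values are not given in this document), and a simple stack replaces the ring buffer for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical stand-in for CommandPool: a mutex-protected cache of
// fixed-size blocks, capped at kSlots cached entries.
class SketchPool {
 public:
  static constexpr std::size_t kSlots = 64;       // mirrors q_size_ = 64
  static constexpr std::size_t kBlockSize = 256;  // assumed stand-in for maxSize_
  static constexpr std::size_t kAlign = 64;       // assumed stand-in for maxAlignment_

  ~SketchPool() {
    for (std::size_t i = 0; i < count_; ++i) std::free(slots_[i]);
  }

  void* allocate() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (count_ > 0) return slots_[--count_];        // reuse a cached block
    return std::aligned_alloc(kAlign, kBlockSize);  // otherwise allocate a fresh one
  }

  void deallocate(void* ptr) {
    assert(ptr != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    if (count_ < kSlots) {
      slots_[count_++] = ptr;  // pool not full: keep the block for reuse
    } else {
      std::free(ptr);          // pool full: release the memory
    }
  }

  std::size_t cached() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return count_;
  }

 private:
  mutable std::mutex mutex_;  // the single lock every caller contends on
  std::array<void*, kSlots> slots_{};
  std::size_t count_ = 0;
};
```

Every allocate() and deallocate() call takes the same mutex, which is the behavior the rest of this analysis is concerned with.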
Command Types Using CommandPool
The following command types use the pool via custom operator new and release() methods:
- ReadMemoryCommand - operator new() and release() (lines 684-707)
- WriteMemoryCommand - operator new() and release() (lines 714-733)
- FillMemoryCommand - operator new() and release() (lines 744-765)
- CopyMemoryCommand - operator new() and release() (lines 799-820)
- CopyMemoryP2PCommand - operator new() and release() (lines 1003-1024)
- Marker - operator new() and release() (lines 1027-1048)
Current Allocation Flow
// Command creation
void* ReadMemoryCommand::operator new(size_t size) {
  void* ptr = CommandPool::instance().allocate();  // Global singleton access
  // ...
  return ptr;
}

// Command destruction
uint ReadMemoryCommand::release() {
  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  if (newCount == 0) {
    if (terminate()) {
      CommandPool::instance().deallocate(this);  // Global singleton access
      return 0;
    }
  }
  return newCount;
}
Problem: Contention Bottleneck
In a multithreaded application with many streams:
- All threads compete for the same global CommandPool::instance() mutex
- High-frequency command allocation/deallocation creates lock contention
- Performance degrades as the number of concurrent streams increases
- The single mutex serializes all command pool operations across all streams
Proposed Solution: Per-Stream CommandPool
Architecture Changes
- Move CommandPool into HostQueue
  - Each HostQueue instance owns its own CommandPool
  - Eliminates cross-stream contention
  - Commands allocated from a stream's pool are returned to the same pool
- Update Command Allocation
  - Commands already have a queue_ pointer (set in constructor)
  - Commands can access their queue's pool via queue()->commandPool()
  - No need to pass additional parameters
- Lifecycle Management
  - Pool created when HostQueue is constructed
  - Pool destroyed when HostQueue is destroyed
  - No global cleanup needed
Implementation Plan
Step 1: Move CommandPool Class Definition
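- Move the CommandPool class definition from command.cpp into a header that commandqueue.hpp can include: Step 2 stores CommandPool by value in HostQueue, which requires the complete type (keeping the definition in command.cpp only works if HostQueue holds the pool through a pointer)
- Remove static instance() method
- Make it a regular class (no singleton pattern)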
Step 2: Add CommandPool to HostQueue
// In commandqueue.hpp
class HostQueue : public CommandQueue {
  // ... existing members ...
 private:
  CommandPool commandPool_;  // Per-queue command pool
};
Step 3: Update Command Allocation Methods
// Before:
void* ReadMemoryCommand::operator new(size_t size) {
  void* ptr = CommandPool::instance().allocate();
  // ...
}

// After:
void* ReadMemoryCommand::operator new(size_t size, HostQueue& queue) {
  void* ptr = queue.commandPool().allocate();
  // ...
}
Challenge: operator new is called with new ReadMemoryCommand(...), but we need access to the queue. The queue is passed to the constructor, not operator new.
Solution: Commands already store queue_ pointer. We can use a two-phase approach:
- Phase 1: Allocate memory (may need temporary global pool or direct allocation)
- Phase 2: After construction, move to queue's pool (not practical)
Better Solution: Use placement new or modify allocation pattern:
- Option A: Allocate from queue's pool before construction
- Option B: Store pool reference in command and deallocate to correct pool
- Option C: Use a thread-local or queue-specific allocation mechanism
Step 4: Update Command Deallocation
// Before:
uint ReadMemoryCommand::release() {
  // ...
  CommandPool::instance().deallocate(this);
  // ...
}

// After:
uint ReadMemoryCommand::release() {
  // ...
  queue_->commandPool().deallocate(this);  // Use queue's pool
  // ...
}
Note: This is straightforward since queue_ is already available in the command.
Detailed Implementation Strategy
Option 1: Two-Phase Allocation
Since operator new is called before the constructor, we need a way to get the queue reference. However, commands are always created with a queue parameter:
// Current pattern:
ReadMemoryCommand* cmd = new ReadMemoryCommand(queue, ...);
// The queue is available at call site!
Solution: Use a placement-new-like pattern or thread-local storage:
- Thread-Local Queue Context (Simpler but less clean):
    thread_local HostQueue* g_currentQueue = nullptr;

    void* ReadMemoryCommand::operator new(size_t size) {
      HostQueue* queue = g_currentQueue;
      if (queue) {
        return queue->commandPool().allocate();
      }
      // Fallback to direct allocation
      return std::aligned_alloc(alignof(ReadMemoryCommand), size);
    }
- Queue Parameter in operator new (Requires syntax change):
    // This would require: new(queue) ReadMemoryCommand(...)
    // Valid C++ (a user-defined placement allocation function), but it is
    // non-default syntax that every call site would have to adopt
- Allocate from Queue Before Construction (Most practical):
    // At call sites, allocate from queue first:
    void* mem = queue.commandPool().allocate();
    // ::new bypasses any class-scope operator new that ReadMemoryCommand still declares
    ReadMemoryCommand* cmd = ::new(mem) ReadMemoryCommand(queue, ...);
  This requires changing all call sites (80+ locations).
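To make the queue-parameter variant concrete, here is a minimal sketch of a class-scope placement allocation function that takes the queue as an extra argument. DemoPool, DemoQueue and DemoCommand are hypothetical stand-ins, not the real ROCclr types, and the pool is reduced to counters around malloc/free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for CommandPool: counts calls and forwards to malloc/free.
struct DemoPool {
  std::size_t allocations = 0;
  std::size_t deallocations = 0;
  void* allocate(std::size_t size) { ++allocations; return std::malloc(size); }
  void deallocate(void* ptr) { assert(ptr != nullptr); ++deallocations; std::free(ptr); }
};

// Hypothetical stand-in for HostQueue.
struct DemoQueue {
  DemoPool pool;
  DemoPool& commandPool() { return pool; }
};

class DemoCommand {
 public:
  explicit DemoCommand(DemoQueue& queue) : queue_(&queue) {}

  // Placement allocation function: selected by `new (queue) DemoCommand(...)`.
  // Declared noexcept so a null return makes the new-expression yield nullptr.
  static void* operator new(std::size_t size, DemoQueue& queue) noexcept {
    return queue.commandPool().allocate(size);
  }

  // Matching placement deallocation function: invoked by the compiler only
  // when the constructor throws after the allocation succeeded.
  static void operator delete(void* ptr, DemoQueue& queue) {
    queue.commandPool().deallocate(ptr);
  }

  // Mirrors the release() pattern: destroy the object, then hand the memory
  // back to the pool of the queue that allocated it.
  void release() {
    DemoQueue* owner = queue_;
    this->~DemoCommand();
    owner->commandPool().deallocate(this);
  }

 private:
  DemoQueue* queue_;
};
```

A call site would then read `DemoCommand* cmd = new (queue) DemoCommand(queue);`, which shows the cost of this variant: the queue appears twice and every creation site changes.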
Option 2: Deferred Pool Assignment (Hybrid Approach)
- Allocate commands using a temporary mechanism (direct allocation or small per-thread pool)
- After construction, commands have queue_ pointer
- On deallocation, return to the correct queue's pool
- Problem: Can't reuse memory from different queues efficiently
Option 3: Queue-Scoped Allocation Helper (Recommended)
Create a helper that wraps command creation:
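template <typename CmdType, typename... Args>
CmdType* createCommand(HostQueue& queue, Args&&... args) {
  void* mem = queue.commandPool().allocate();
  if (mem == nullptr) {
    return nullptr;
  }
  // ::new selects global placement new even if CmdType declares its own operator new
  return ::new (mem) CmdType(queue, std::forward<Args>(args)...);
}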
// Usage:
auto cmd = createCommand<ReadMemoryCommand>(queue, CL_COMMAND_READ_BUFFER, ...);
This requires updating all 80+ call sites but provides clean semantics.
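The helper approach can be exercised end to end with a self-contained miniature. MiniPool, MiniQueue and MiniCommand are hypothetical stand-ins for CommandPool, HostQueue and a command type; the pool here is uncapped and uses malloc, and kBlockSize is an assumed value, so this is a sketch of the ownership model and not of the real allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <utility>
#include <vector>

// Hypothetical miniature of a per-queue pool of fixed-size blocks.
class MiniPool {
 public:
  static constexpr std::size_t kBlockSize = 128;  // assumed stand-in for maxSize_

  ~MiniPool() {
    for (void* block : free_) std::free(block);
  }
  void* allocate() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!free_.empty()) {
      void* block = free_.back();
      free_.pop_back();
      return block;  // reuse a block previously returned to this pool
    }
    return std::malloc(kBlockSize);
  }
  void deallocate(void* block) {
    assert(block != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push_back(block);
  }

 private:
  std::mutex mutex_;  // contended only by users of the owning queue
  std::vector<void*> free_;
};

class MiniQueue {
 public:
  MiniPool& commandPool() { return pool_; }

 private:
  MiniPool pool_;  // each queue owns its pool (RAII lifetime)
};

class MiniCommand {
 public:
  MiniCommand(MiniQueue& queue, int type) : queue_(&queue), type_(type) {}
  int type() const { return type_; }

  // Destroy the command and return its block to the pool it came from.
  void release() {
    MiniQueue* owner = queue_;
    this->~MiniCommand();
    owner->commandPool().deallocate(this);
  }

 private:
  MiniQueue* queue_;
  int type_;
};

// Queue-scoped creation helper: allocate from the queue's pool, then construct in place.
template <typename CmdType, typename... Args>
CmdType* createCommand(MiniQueue& queue, Args&&... args) {
  static_assert(sizeof(CmdType) <= MiniPool::kBlockSize, "command must fit in a pool block");
  void* mem = queue.commandPool().allocate();
  if (mem == nullptr) return nullptr;
  return ::new (mem) CmdType(queue, std::forward<Args>(args)...);
}
```

Usage follows the pattern above: `MiniCommand* cmd = createCommand<MiniCommand>(queue, 1);` followed by `cmd->release();`. Two queues never touch each other's mutex, and a block released on one queue is reused by the next command created on that same queue.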
Code Changes Required
Files to Modify
- rocclr/platform/commandqueue.hpp
  - Add CommandPool commandPool_; member to HostQueue
  - Add CommandPool& commandPool() accessor method
- rocclr/platform/commandqueue.cpp
  - Initialize commandPool_ in HostQueue constructor
  - Implement commandPool() accessor
- rocclr/platform/command.cpp
  - Remove CommandPool::instance() static method
  - Update all 6 command types' operator new() methods
  - Update all 6 command types' release() methods to use queue_->commandPool()
- All command creation sites (80+ locations):
  - hipamd/src/hip_memory.cpp
  - hipamd/src/hip_stream.cpp
  - hipamd/src/hip_event.cpp
  - rocclr/platform/commandqueue.cpp
  - opencl/amdocl/cl_execute.cpp
  - opencl/amdocl/cl_memobj.cpp
  - And others...
Benefits
- Eliminates Contention: Each stream has its own pool, no cross-stream locking
- Better Locality: Commands allocated from a stream are reused by the same stream
- Scalability: Performance scales with number of streams (no global bottleneck)
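- Memory Efficiency: Per-stream pools can be sized independently, although total pool memory grows with the number of streams (see Challenge 3)
- Thread Safety: Each pool is accessed mostly by the threads that submit to its stream; the per-pool mutex is still required (see Challenge 4)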
Challenges and Considerations
Challenge 1: operator new() Timing
- operator new() is called before constructor
- Queue reference not available in operator new() signature
- Solution: Use helper function or thread-local context
Challenge 2: Cross-Queue Command References
- Commands may reference events from other queues
- Commands are destroyed when reference count reaches zero
- Impact: Low - commands are typically destroyed by their owning queue
Challenge 3: Memory Pool Sizing
- Current: 64 entries shared across all streams
- Per-stream: 64 entries per stream
- Memory Impact: N streams × 64 entries × maxSize_ bytes
- Mitigation: Could make pool size configurable or smaller per-stream
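The memory formula can be checked with a small compile-time calculation. The block size of 512 bytes and the count of 128 streams are assumed values chosen only for illustration; the real maxSize_ is not stated in this document. The result is an upper bound, since a pool only caches blocks that have actually been returned to it.

```cpp
#include <cstddef>

// Worst-case bytes cached by per-stream pools: streams x entries x block size.
constexpr std::size_t worstCasePoolBytes(std::size_t streams,
                                         std::size_t entriesPerPool,
                                         std::size_t blockBytes) {
  return streams * entriesPerPool * blockBytes;
}

// Assumed inputs: 64 entries per pool (from q_size_), 512-byte blocks (hypothetical).
static_assert(worstCasePoolBytes(1, 64, 512) == 32 * 1024,
              "one pool caches at most 32 KiB");
static_assert(worstCasePoolBytes(128, 64, 512) == 4 * 1024 * 1024,
              "128 pools cache at most 4 MiB");
```

Under these assumptions the footprint grows linearly with the stream count, which is why a smaller or configurable per-stream pool size is listed as a mitigation.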
Challenge 4: Thread Safety Within Queue
- Commands may be allocated/deallocated from different threads
- HostQueue::append() may be called from any thread
- Solution: CommandPool mutex still needed, but contention is per-stream only
Challenge 5: Backward Compatibility
- Need to ensure no regression in single-stream scenarios
- Performance should be equal or better
Testing Considerations
- Single Stream: Verify no performance regression
- Multiple Streams: Measure contention reduction
- High Concurrency: Test with many concurrent streams
- Memory Leaks: Ensure pools are properly cleaned up
- Command Lifecycle: Verify commands are correctly returned to pools
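For the memory-leak and command-lifecycle checks, one possible approach is a counting test double that stands in for the pool during tests. CountingPool is a hypothetical helper, not part of the code base; it tracks how many blocks are outstanding so a test can assert that every command was returned to the pool it was allocated from.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical test double: counts blocks handed out and not yet returned.
class CountingPool {
 public:
  ~CountingPool() {
    // A non-zero count at teardown means a command leaked or went to another pool.
    assert(outstanding_.load() == 0 && "block leaked or returned to a different pool");
  }

  void* allocate() {
    outstanding_.fetch_add(1);
    return std::malloc(kBlockSize);
  }

  void deallocate(void* block) {
    const std::size_t before = outstanding_.fetch_sub(1);
    assert(before > 0 && "block returned to a pool that did not allocate it");
    (void)before;  // silence unused warning when assertions are disabled
    std::free(block);
  }

  std::size_t outstanding() const { return outstanding_.load(); }

 private:
  static constexpr std::size_t kBlockSize = 128;  // assumed block size
  std::atomic<std::size_t> outstanding_{0};       // atomic so stress tests can share it
};
```

A multi-stream test would create one such pool per queue, run its workload, and then check that outstanding() is zero for each pool before the queues are destroyed.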
Migration Strategy
- Phase 1: Implement per-queue pools alongside global pool (feature flag)
- Phase 2: Update command allocation to use queue pools
- Phase 3: Update all call sites to use new allocation pattern
- Phase 4: Remove global pool after validation
- Phase 5: Performance testing and optimization
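The feature flag in Phase 1 could be implemented as a single selection point that both allocation and release go through. The sketch below is an assumption about how that might look: FlagPool, FlagQueue, perQueuePoolsEnabled() and poolFor() are hypothetical names, and how the flag is actually set (environment variable, build option, runtime setting) is left open.

```cpp
#include <cassert>

// Hypothetical stand-in for CommandPool; the id only makes instances distinguishable.
struct FlagPool {
  int id;
};

// Hypothetical stand-in for HostQueue with its own pool.
struct FlagQueue {
  FlagPool pool{1};
  FlagPool& commandPool() { return pool; }
};

// Stand-in for the existing CommandPool::instance() singleton.
inline FlagPool& globalPool() {
  static FlagPool pool{0};
  return pool;
}

// Hypothetical feature flag; defaults to the existing global-pool behavior.
inline bool& perQueuePoolsEnabled() {
  static bool enabled = false;
  return enabled;
}

// Single selection point used by both allocation and release paths.
inline FlagPool& poolFor(FlagQueue* queue) {
  assert(queue != nullptr);
  return perQueuePoolsEnabled() ? queue->commandPool() : globalPool();
}
```

Routing every allocate and deallocate through one function such as poolFor() keeps the two paths consistent, and it makes Phase 4 a matter of deleting the global branch.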
Alternative: Thread-Local Pool
Instead of per-queue pools, consider thread-local pools:
- Simpler implementation (no queue parameter needed)
- Still reduces contention (per-thread instead of global)
- Drawback: Threads may service multiple queues, less optimal locality
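For comparison, a thread-local pool could look like the following sketch. ThreadCache and threadPool() are hypothetical names and kBlockSize is an assumed value. No mutex is needed because each thread only touches its own instance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical per-thread block cache: lock-free because it is never shared.
class ThreadCache {
 public:
  static constexpr std::size_t kBlockSize = 128;  // assumed stand-in for maxSize_
  static constexpr std::size_t kMaxCached = 64;   // same cap as the current pool

  ~ThreadCache() {
    for (void* block : free_) std::free(block);
  }

  void* allocate() {
    if (!free_.empty()) {
      void* block = free_.back();
      free_.pop_back();
      return block;
    }
    return std::malloc(kBlockSize);
  }

  void deallocate(void* block) {
    assert(block != nullptr);
    if (free_.size() < kMaxCached) {
      free_.push_back(block);  // cached by the thread that releases the block
    } else {
      std::free(block);
    }
  }

  std::size_t cached() const { return free_.size(); }

 private:
  std::vector<void*> free_;
};

// One cache per thread, created on first use and destroyed at thread exit.
inline ThreadCache& threadPool() {
  thread_local ThreadCache cache;
  return cache;
}
```

A block released on a different thread than the one that allocated it ends up in the releasing thread's cache, so memory can migrate between threads. That is one concrete form of the locality drawback noted above.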
Recommendation
Proceed with per-queue CommandPool implementation using Option 3 (Queue-Scoped Allocation Helper):
- High Impact: Eliminates major contention bottleneck
- Manageable Complexity: Clear ownership model (queue owns pool)
- Good Locality: Commands reused within same stream
- Incremental Migration: Can be done with feature flags
The main effort is updating ~80 call sites to use the allocation helper, but this provides the cleanest semantics and best performance.
Implementation Example
Step 1: Modify CommandPool Class
// In command.cpp (or a header visible to commandqueue.hpp) - Remove singleton pattern
class CommandPool {
 public:
  CommandPool() {
    static_assert((q_size_ & (q_size_ - 1)) == 0, "q_size_ must be a power of 2");
  }
  // Remove: static CommandPool& instance();

  template <typename CmdType>
  void deallocate(CmdType* ptr) {
    // ... existing implementation ...
  }

  void* allocate() {
    // ... existing implementation ...
  }
  // ... rest of implementation unchanged ...
};
Step 2: Add CommandPool to HostQueue
// In commandqueue.hpp
class HostQueue : public CommandQueue {
  // ... existing members ...
 public:
  // Accessor for command pool
  CommandPool& commandPool() { return commandPool_; }
  const CommandPool& commandPool() const { return commandPool_; }

 private:
  CommandPool commandPool_;  // Per-queue command pool
  // ... rest of members ...
};

// In commandqueue.cpp - Initialize in constructor
HostQueue::HostQueue(Context& context, Device& device, ...)
    : CommandQueue(...),
      commandPool_(),  // Initialize pool
      // ... other initializations ...
{
  // ... existing constructor code ...
}
Step 3: Create Allocation Helper
// In command.hpp or a new command_utils.hpp
#include <new>      // placement new
#include <utility>  // std::forward

namespace amd {

// Helper function to create commands using queue's pool
template <typename CmdType, typename... Args>
CmdType* createCommand(HostQueue& queue, Args&&... args) {
  void* mem = queue.commandPool().allocate();
  if (mem == nullptr) {
    return nullptr;
  }
  // ::new selects global placement new even if CmdType declares its own operator new
  return ::new (mem) CmdType(queue, std::forward<Args>(args)...);
}

}  // namespace amd
Step 4: Update Command Deallocation
// In command.cpp - Update all 6 command types
uint ReadMemoryCommand::release() {
  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  if (newCount == 0) {
    if (terminate()) {
      // Use queue's pool instead of global singleton
      queue_->commandPool().deallocate(this);
      return 0;
    }
  }
  return newCount;
}
// Repeat for: WriteMemoryCommand, FillMemoryCommand,
// CopyMemoryCommand, CopyMemoryP2PCommand, Marker
Step 5: Update Command Allocation (Remove operator new)
Since we're using placement new via the helper, we can either:
- Keep operator new as fallback (for compatibility)
- Remove it entirely (cleaner, but requires all call sites updated)
Option A: Keep as fallback
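void* ReadMemoryCommand::operator new(size_t size) {
  // Fallback: direct allocation if helper not used.
  // Caution: release() hands this block to the queue's pool, which may reuse it
  // for any pooled command type, so the fallback should allocate the pool's
  // block size and alignment (maxSize_, maxAlignment_), not sizeof(ReadMemoryCommand).
  return std::aligned_alloc(alignof(ReadMemoryCommand), size);
}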
Option B: Remove operator new (preferred after migration)
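One C++ detail matters while a class-scope operator new is still present (Option A, or any point before the migration finishes): a class that declares its own operator new(size_t) hides the global placement form, so a plain `new (mem) T(...)` no longer compiles for that class. Writing `::new (mem) T(...)` forces lookup in the global scope. The sketch below demonstrates this with FallbackCommand and constructIn, both hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical command type that keeps a class-scope operator new as a fallback.
class FallbackCommand {
 public:
  explicit FallbackCommand(int value) : value_(value) {}
  int value() const { return value_; }

  // Class-scope allocation function: used by `new FallbackCommand(...)`.
  static void* operator new(std::size_t size) {
    void* mem = std::malloc(size);
    if (mem == nullptr) throw std::bad_alloc();
    return mem;
  }
  static void operator delete(void* ptr) { std::free(ptr); }

 private:
  int value_;
};

// `new (mem) FallbackCommand(value)` would fail to compile here, because name
// lookup finds only the class-scope operator new(std::size_t).
// `::new` skips class scope and uses the global placement form instead.
inline FallbackCommand* constructIn(void* mem, int value) {
  assert(mem != nullptr);
  return ::new (mem) FallbackCommand(value);
}
```

With Option B the hiding problem disappears, but using `::new` in the helper is harmless in both cases.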
Step 6: Update Call Sites
// Before:
amd::ReadMemoryCommand* cmd = new amd::ReadMemoryCommand(
    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);

// After:
amd::ReadMemoryCommand* cmd = amd::createCommand<amd::ReadMemoryCommand>(
    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);
Migration Example: hip_memory.cpp
// Current code (line 587):
command = new amd::ReadMemoryCommand(*pStream, CL_COMMAND_READ_BUFFER, waitList,
                                     *srcBuffer, origin, size, dst, rowPitch, slicePitch);

// Migrated code:
command = amd::createCommand<amd::ReadMemoryCommand>(*pStream, CL_COMMAND_READ_BUFFER,
                                                     waitList, *srcBuffer, origin, size,
                                                     dst, rowPitch, slicePitch);
Performance Impact Estimate
Current Bottleneck
- Single mutex protecting global pool
- N threads contending for same lock
- Lock hold time: ~100-500ns per allocation/deallocation (rough estimate)
- Contention cost: grows with the number of threads N competing for the single lock
After Refactoring
- N mutexes (one per stream)
- Typically 1 thread per stream (or a small number)
- Lock hold time: Same (~100-500ns, rough estimate)
- Contention cost: limited to the threads sharing one stream, independent of the total stream count
Expected Improvement
- Single stream: No change (or slight improvement from better locality)
- Multiple streams: Allocation throughput expected to scale near-linearly with the number of streams
- High concurrency (16+ streams): Estimated 10-100x improvement in allocation throughput, to be confirmed by benchmarking
Risk Assessment
Low Risk
- ✅ Command deallocation (already has queue pointer)
- ✅ Pool initialization/destruction (RAII in HostQueue)
- ✅ Memory management (same algorithm, just per-queue)
Medium Risk
- ⚠️ Call site updates (80+ locations, but mechanical)
- ⚠️ Testing coverage (need multi-stream scenarios)
- ⚠️ Backward compatibility during migration
Mitigation Strategies
- Feature flag: Enable per-queue pools behind flag
- Gradual migration: Update call sites incrementally
- Fallback mechanism: Keep global pool as fallback initially
- Comprehensive testing: Multi-stream stress tests
Conclusion
Moving CommandPool from a global singleton to per-queue instances is highly recommended:
- Solves real performance problem in multithreaded applications
- Clear implementation path with manageable complexity
- Significant scalability improvement expected
- Low risk with proper testing and gradual migration
The main implementation effort is mechanical (updating call sites), and the architectural change is sound and well-scoped.