diff --git a/projects/clr/COMMAND_POOL_ANALYSIS.md b/projects/clr/COMMAND_POOL_ANALYSIS.md new file mode 100644 index 0000000000..cfd446fb9e --- /dev/null +++ b/projects/clr/COMMAND_POOL_ANALYSIS.md @@ -0,0 +1,490 @@ +# CommandPool Refactoring Analysis: Moving from Global Singleton to Per-Stream Pools + +## Executive Summary + +This document analyzes the feasibility and impact of moving the `CommandPool` from a global static singleton instance to per-stream instances within `HostQueue`. This change aims to eliminate contention bottlenecks in multithreaded applications with many concurrent streams. + +## Current Architecture + +### CommandPool Implementation + +The `CommandPool` is currently implemented as a **static singleton** with the following characteristics: + +- **Location**: `rocclr/platform/command.cpp` (lines 338-408) +- **Access Pattern**: `CommandPool::instance()` returns a static singleton +- **Thread Safety**: Protected by a single `std::mutex mutex_` +- **Storage**: Ring buffer with 64 entries (`q_size_ = 64`) +- **Memory Management**: + - Allocates aligned memory using `std::aligned_alloc(maxAlignment_, maxSize_)` + - Reuses deallocated command memory when pool is not full + - Frees memory when pool is full + +### Command Types Using CommandPool + +The following command types use the pool via custom `operator new` and `release()` methods: + +1. **ReadMemoryCommand** - `operator new()` and `release()` (lines 684-707) +2. **WriteMemoryCommand** - `operator new()` and `release()` (lines 714-733) +3. **FillMemoryCommand** - `operator new()` and `release()` (lines 744-765) +4. **CopyMemoryCommand** - `operator new()` and `release()` (lines 799-820) +5. **CopyMemoryP2PCommand** - `operator new()` and `release()` (lines 1003-1024) +6. **Marker** - `operator new()` and `release()` (lines 1027-1048) + +### Current Allocation Flow + +```cpp +// Command creation +void* ReadMemoryCommand::operator new(size_t size) { + void* ptr = CommandPool::instance().allocate(); // Global singleton access + // ... + return ptr; +} + +// Command destruction +uint ReadMemoryCommand::release() { + uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1; + if (newCount == 0) { + if (terminate()) { + CommandPool::instance().deallocate(this); // Global singleton access + return 0; + } + } + return newCount; +} +``` + +### Problem: Contention Bottleneck + +In a multithreaded application with many streams: +- **All threads** compete for the same global `CommandPool::instance()` mutex +- High-frequency command allocation/deallocation creates lock contention +- Performance degrades as the number of concurrent streams increases +- The single mutex serializes all command pool operations across all streams + +## Proposed Solution: Per-Stream CommandPool + +### Architecture Changes + +1. **Move CommandPool into HostQueue** + - Each `HostQueue` instance owns its own `CommandPool` + - Eliminates cross-stream contention + - Commands allocated from a stream's pool are returned to the same pool + +2. **Update Command Allocation** + - Commands already have a `queue_` pointer (set in constructor) + - Commands can access their queue's pool via `queue()->commandPool()` + - No need to pass additional parameters + +3. **Lifecycle Management** + - Pool created when `HostQueue` is constructed + - Pool destroyed when `HostQueue` is destroyed + - No global cleanup needed + +### Implementation Plan + +#### Step 1: Move CommandPool Class Definition +- Keep `CommandPool` class definition in `command.cpp` (or move to header if needed) +- Remove static `instance()` method +- Make it a regular class (no singleton pattern) + +#### Step 2: Add CommandPool to HostQueue +```cpp +// In commandqueue.hpp +class HostQueue : public CommandQueue { + // ... existing members ... +private: + CommandPool commandPool_; // Per-queue command pool +}; +``` + +#### Step 3: Update Command Allocation Methods +```cpp +// Before: +void* ReadMemoryCommand::operator new(size_t size) { + void* ptr = CommandPool::instance().allocate(); + // ... +} + +// After: +void* ReadMemoryCommand::operator new(size_t size, HostQueue& queue) { + void* ptr = queue.commandPool().allocate(); + // ... +} +``` + +**Challenge**: `operator new` is called with `new ReadMemoryCommand(...)`, but we need access to the queue. The queue is passed to the constructor, not `operator new`. + +**Solution**: Commands already store `queue_` pointer. We can use a two-phase approach: +- Phase 1: Allocate memory (may need temporary global pool or direct allocation) +- Phase 2: After construction, move to queue's pool (not practical) + +**Better Solution**: Use placement new or modify allocation pattern: +- Option A: Allocate from queue's pool before construction +- Option B: Store pool reference in command and deallocate to correct pool +- Option C: Use a thread-local or queue-specific allocation mechanism + +#### Step 4: Update Command Deallocation +```cpp +// Before: +uint ReadMemoryCommand::release() { + // ... + CommandPool::instance().deallocate(this); + // ... +} + +// After: +uint ReadMemoryCommand::release() { + // ... + queue_->commandPool().deallocate(this); // Use queue's pool + // ... +} +``` + +**Note**: This is straightforward since `queue_` is already available in the command. + +### Detailed Implementation Strategy + +#### Option 1: Two-Phase Allocation (Recommended) + +Since `operator new` is called before the constructor, we need a way to get the queue reference. However, commands are always created with a queue parameter: + +```cpp +// Current pattern: +ReadMemoryCommand* cmd = new ReadMemoryCommand(queue, ...); + +// The queue is available at call site! +``` + +**Solution**: Use a placement-new-like pattern or thread-local storage: + +1. **Thread-Local Queue Context** (Simpler but less clean): + ```cpp + thread_local HostQueue* g_currentQueue = nullptr; + + void* ReadMemoryCommand::operator new(size_t size) { + HostQueue* queue = g_currentQueue; + if (queue) { + return queue->commandPool().allocate(); + } + // Fallback to direct allocation + return std::aligned_alloc(alignof(ReadMemoryCommand), size); + } + ``` + +2. **Queue Parameter in operator new** (Requires syntax change): + ```cpp + // This would require: new(queue) ReadMemoryCommand(...) + // Not standard C++ placement new syntax + ``` + +3. **Allocate from Queue Before Construction** (Most practical): + ```cpp + // At call sites, allocate from queue first: + void* mem = queue.commandPool().allocate(); + ReadMemoryCommand* cmd = new(mem) ReadMemoryCommand(queue, ...); + ``` + This requires changing all call sites (80+ locations). + +#### Option 2: Deferred Pool Assignment (Hybrid Approach) + +1. Allocate commands using a temporary mechanism (direct allocation or small per-thread pool) +2. After construction, commands have `queue_` pointer +3. On deallocation, return to the correct queue's pool +4. **Problem**: Can't reuse memory from different queues efficiently + +#### Option 3: Queue-Scoped Allocation Helper (Recommended) + +Create a helper that wraps command creation: + +```cpp +template +CmdType* createCommand(HostQueue& queue, Args&&... args) { + void* mem = queue.commandPool().allocate(); + return new(mem) CmdType(queue, std::forward(args)...); +} + +// Usage: +auto cmd = createCommand(queue, CL_COMMAND_READ_BUFFER, ...); +``` + +This requires updating all 80+ call sites but provides clean semantics. + +### Code Changes Required + +#### Files to Modify + +1. **rocclr/platform/commandqueue.hpp** + - Add `CommandPool commandPool_;` member to `HostQueue` + - Add `CommandPool& commandPool()` accessor method + +2. **rocclr/platform/commandqueue.cpp** + - Initialize `commandPool_` in `HostQueue` constructor + - Implement `commandPool()` accessor + +3. **rocclr/platform/command.cpp** + - Remove `CommandPool::instance()` static method + - Update all 6 command types' `operator new()` methods + - Update all 6 command types' `release()` methods to use `queue_->commandPool()` + +4. **All command creation sites** (80+ locations): + - `hipamd/src/hip_memory.cpp` + - `hipamd/src/hip_stream.cpp` + - `hipamd/src/hip_event.cpp` + - `rocclr/platform/commandqueue.cpp` + - `opencl/amdocl/cl_execute.cpp` + - `opencl/amdocl/cl_memobj.cpp` + - And others... + +### Benefits + +1. **Eliminates Contention**: Each stream has its own pool, no cross-stream locking +2. **Better Locality**: Commands allocated from a stream are reused by the same stream +3. **Scalability**: Performance scales with number of streams (no global bottleneck) +4. **Memory Efficiency**: Per-stream pools can be sized appropriately +5. **Thread Safety**: Each pool only accessed by its stream's thread (mostly) + +### Challenges and Considerations + +#### Challenge 1: operator new() Timing +- `operator new()` is called before constructor +- Queue reference not available in `operator new()` signature +- **Solution**: Use helper function or thread-local context + +#### Challenge 2: Cross-Queue Command References +- Commands may reference events from other queues +- Commands are destroyed when reference count reaches zero +- **Impact**: Low - commands are typically destroyed by their owning queue + +#### Challenge 3: Memory Pool Sizing +- Current: 64 entries shared across all streams +- Per-stream: 64 entries per stream +- **Memory Impact**: N streams × 64 entries × maxSize_ bytes +- **Mitigation**: Could make pool size configurable or smaller per-stream + +#### Challenge 4: Thread Safety Within Queue +- Commands may be allocated/deallocated from different threads +- `HostQueue::append()` may be called from any thread +- **Solution**: CommandPool mutex still needed, but contention is per-stream only + +#### Challenge 5: Backward Compatibility +- Need to ensure no regression in single-stream scenarios +- Performance should be equal or better + +### Testing Considerations + +1. **Single Stream**: Verify no performance regression +2. **Multiple Streams**: Measure contention reduction +3. **High Concurrency**: Test with many concurrent streams +4. **Memory Leaks**: Ensure pools are properly cleaned up +5. **Command Lifecycle**: Verify commands are correctly returned to pools + +### Migration Strategy + +1. **Phase 1**: Implement per-queue pools alongside global pool (feature flag) +2. **Phase 2**: Update command allocation to use queue pools +3. **Phase 3**: Update all call sites to use new allocation pattern +4. **Phase 4**: Remove global pool after validation +5. **Phase 5**: Performance testing and optimization + +### Alternative: Thread-Local Pool + +Instead of per-queue pools, consider thread-local pools: +- Simpler implementation (no queue parameter needed) +- Still reduces contention (per-thread instead of global) +- **Drawback**: Threads may service multiple queues, less optimal locality + +### Recommendation + +**Proceed with per-queue CommandPool implementation** using Option 3 (Queue-Scoped Allocation Helper): + +1. **High Impact**: Eliminates major contention bottleneck +2. **Manageable Complexity**: Clear ownership model (queue owns pool) +3. **Good Locality**: Commands reused within same stream +4. **Incremental Migration**: Can be done with feature flags + +The main effort is updating ~80 call sites to use the allocation helper, but this provides the cleanest semantics and best performance. + +## Implementation Example + +### Step 1: Modify CommandPool Class + +```cpp +// In command.cpp - Remove singleton pattern +class CommandPool { +public: + CommandPool() { + static_assert(((q_size_ & (q_size_ - 1)) == 0) && "q_size must be power of 2"); + } + + // Remove: static CommandPool& instance(); + + template + void deallocate(CmdType *ptr) { + // ... existing implementation ... + } + + void *allocate() { + // ... existing implementation ... + } + + // ... rest of implementation unchanged ... +}; +``` + +### Step 2: Add CommandPool to HostQueue + +```cpp +// In commandqueue.hpp +class HostQueue : public CommandQueue { + // ... existing members ... + +public: + // Accessor for command pool + CommandPool& commandPool() { return commandPool_; } + const CommandPool& commandPool() const { return commandPool_; } + +private: + CommandPool commandPool_; // Per-queue command pool + // ... rest of members ... +}; +``` + +```cpp +// In commandqueue.cpp - Initialize in constructor +HostQueue::HostQueue(Context& context, Device& device, ...) + : CommandQueue(...), + commandPool_(), // Initialize pool + // ... other initializations ... +{ + // ... existing constructor code ... +} +``` + +### Step 3: Create Allocation Helper + +```cpp +// In command.hpp or a new command_utils.hpp +namespace amd { + +// Helper function to create commands using queue's pool +template +CmdType* createCommand(HostQueue& queue, Args&&... args) { + void* mem = queue.commandPool().allocate(); + if (mem == nullptr) { + return nullptr; + } + return new(mem) CmdType(queue, std::forward(args)...); +} + +} // namespace amd +``` + +### Step 4: Update Command Deallocation + +```cpp +// In command.cpp - Update all 6 command types +uint ReadMemoryCommand::release() { + uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1; + if (newCount == 0) { + if (terminate()) { + // Use queue's pool instead of global singleton + queue_->commandPool().deallocate(this); + return 0; + } + } + return newCount; +} + +// Repeat for: WriteMemoryCommand, FillMemoryCommand, +// CopyMemoryCommand, CopyMemoryP2PCommand, Marker +``` + +### Step 5: Update Command Allocation (Remove operator new) + +Since we're using placement new via the helper, we can either: +- Keep `operator new` as fallback (for compatibility) +- Remove it entirely (cleaner, but requires all call sites updated) + +**Option A: Keep as fallback** +```cpp +void* ReadMemoryCommand::operator new(size_t size) { + // Fallback: direct allocation if helper not used + return std::aligned_alloc(alignof(ReadMemoryCommand), size); +} +``` + +**Option B: Remove operator new** (preferred after migration) + +### Step 6: Update Call Sites + +```cpp +// Before: +amd::ReadMemoryCommand* cmd = new amd::ReadMemoryCommand( + *pStream, CL_COMMAND_READ_BUFFER, waitList, ...); + +// After: +amd::ReadMemoryCommand* cmd = amd::createCommand( + *pStream, CL_COMMAND_READ_BUFFER, waitList, ...); +``` + +### Migration Example: hip_memory.cpp + +```cpp +// Current code (line 587): +command = new amd::ReadMemoryCommand(*pStream, CL_COMMAND_READ_BUFFER, waitList, + *srcBuffer, origin, size, dst, rowPitch, slicePitch); + +// Migrated code: +command = amd::createCommand(*pStream, CL_COMMAND_READ_BUFFER, + waitList, *srcBuffer, origin, size, + dst, rowPitch, slicePitch); +``` + +## Performance Impact Estimate + +### Current Bottleneck +- **Single mutex** protecting global pool +- **N threads** contending for same lock +- Lock hold time: ~100-500ns per allocation/deallocation +- **Contention cost**: O(N) threads × lock overhead + +### After Refactoring +- **N mutexes** (one per stream) +- **1 thread** per stream typically (or small number) +- Lock hold time: Same (~100-500ns) +- **Contention cost**: O(1) per stream + +### Expected Improvement +- **Single stream**: No change (or slight improvement from better locality) +- **Multiple streams**: Near-linear scaling with number of streams +- **High concurrency (16+ streams)**: 10-100x improvement in allocation throughput + +## Risk Assessment + +### Low Risk +- ✅ Command deallocation (already has queue pointer) +- ✅ Pool initialization/destruction (RAII in HostQueue) +- ✅ Memory management (same algorithm, just per-queue) + +### Medium Risk +- ⚠️ Call site updates (80+ locations, but mechanical) +- ⚠️ Testing coverage (need multi-stream scenarios) +- ⚠️ Backward compatibility during migration + +### Mitigation Strategies +1. **Feature flag**: Enable per-queue pools behind flag +2. **Gradual migration**: Update call sites incrementally +3. **Fallback mechanism**: Keep global pool as fallback initially +4. **Comprehensive testing**: Multi-stream stress tests + +## Conclusion + +Moving CommandPool from a global singleton to per-queue instances is **highly recommended**: + +- **Solves real performance problem** in multithreaded applications +- **Clear implementation path** with manageable complexity +- **Significant scalability improvement** expected +- **Low risk** with proper testing and gradual migration + +The main implementation effort is mechanical (updating call sites), and the architectural change is sound and well-scoped. + diff --git a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp4.h b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp4.h index 6773e6f908..0bd5145959 100644 --- a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp4.h +++ b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp4.h @@ -39,9 +39,9 @@ typedef __hip_fp8_storage_t __hip_fp4_storage_t; typedef __hip_fp8_storage_t __hip_fp4x2_storage_t; typedef __hip_fp8x2_storage_t __hip_fp4x4_storage_t; -static_assert(sizeof(__hip_fp4_storage_t[4]) == sizeof(uint32_t)); -static_assert(sizeof(__hip_fp4x2_storage_t[4]) == sizeof(uint32_t)); -static_assert(sizeof(__hip_fp4x4_storage_t[2]) == sizeof(uint32_t)); +static_assert(sizeof(__hip_fp4_storage_t[4]) == sizeof(uint32_t), ""); +static_assert(sizeof(__hip_fp4x2_storage_t[4]) == sizeof(uint32_t), ""); +static_assert(sizeof(__hip_fp4x4_storage_t[2]) == sizeof(uint32_t), ""); enum __hip_fp4_interpretation_t { __HIP_E2M1 = 0, @@ -278,7 +278,7 @@ struct __hip_fp4_e2m1 { } __FP4_HOST_DEVICE__ operator __hip_bfloat16_raw() const { - static_assert(sizeof(__hip_bfloat16_raw[2]) == sizeof(__amd_bf16x2_storage_t)); + static_assert(sizeof(__hip_bfloat16_raw[2]) == sizeof(__amd_bf16x2_storage_t), ""); union { __hip_bfloat16_raw bf16_raw[2]; __amd_bf16x2_storage_t bf16x2; @@ -336,7 +336,7 @@ struct __hip_fp4x2_e2m1 { } __FP4_HOST_DEVICE__ operator __hip_bfloat162_raw() const { - static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t)); + static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t), ""); union { __hip_bfloat162_raw bf162_raw; __amd_bf16x2_storage_t bf16x2; diff --git a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp6.h b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp6.h index c88aee8475..cbfb0aad07 100644 --- a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp6.h +++ b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp6.h @@ -39,16 +39,15 @@ typedef __hip_fp8_storage_t __hip_fp6_storage_t; typedef __hip_fp8x2_storage_t __hip_fp6x2_storage_t; typedef __hip_fp8x4_storage_t __hip_fp6x4_storage_t; -static_assert(sizeof(__hip_fp6_storage_t[4]) == sizeof(uint32_t)); -static_assert(sizeof(__hip_fp6x2_storage_t[2]) == sizeof(uint32_t)); -static_assert(sizeof(__hip_fp6x4_storage_t[2]) == sizeof(uint64_t)); +static_assert(sizeof(__hip_fp6_storage_t[4]) == sizeof(uint32_t), ""); +static_assert(sizeof(__hip_fp6x2_storage_t[2]) == sizeof(uint32_t), ""); +static_assert(sizeof(__hip_fp6x4_storage_t[2]) == sizeof(uint64_t), ""); enum __hip_fp6_interpretation_t { __HIP_E3M2 = 0, /**< FP6 E3M2 Type*/ __HIP_E2M3 = 1, /**< FP6 E2M3 Type */ }; - // Note: Ignore rounding input on AMD GPUs for now. At the moment AMD GPUs do not support rounding // modes, all the inputs are rounded to nearest or use an input to do stochastic rounding. // We hide the rounding variable to not trigger the unused variable compiler warning. @@ -381,7 +380,7 @@ struct __hip_fp6_e2m3 { return __hip_cvt_fp6_to_halfraw(__x, __HIP_E2M3); } __FP6_HOST_DEVICE__ operator __hip_bfloat16_raw() const { - static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t)); + static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t), ""); union { __hip_bfloat16_raw bf16_raw; __amd_bf16_storage_t bf16; @@ -450,7 +449,7 @@ struct __hip_fp6_e3m2 { return __hip_cvt_fp6_to_halfraw(__x, __HIP_E3M2); } __FP6_HOST_DEVICE__ operator __hip_bfloat16_raw() const { - static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t)); + static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t), ""); union { __hip_bfloat16_raw bf16_raw; __amd_bf16_storage_t bf16; @@ -502,7 +501,7 @@ struct __hip_fp6x2_e2m3 { return __hip_cvt_fp6x2_to_halfraw2(__x, __HIP_E2M3); } __FP6_HOST_DEVICE__ operator __hip_bfloat162_raw() const { - static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t)); + static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t), ""); union { __hip_bfloat162_raw bf162_raw; __amd_bf16x2_storage_t bf16x2; @@ -562,7 +561,7 @@ struct __hip_fp6x2_e3m2 { return __hip_cvt_fp6x2_to_halfraw2(__x, __HIP_E3M2); } __FP6_HOST_DEVICE__ operator __hip_bfloat162_raw() const { - static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t)); + static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t), ""); union { __hip_bfloat162_raw bf162_raw; __amd_bf16x2_storage_t bf16x2; diff --git a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp8.h b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp8.h index 8e4487d92d..519a8aad61 100644 --- a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp8.h +++ b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_fp8.h @@ -114,9 +114,9 @@ #if !defined(__HIPCC_RTC__) static_assert(CHAR_BIT == 8, "byte size should be of 8 bits"); #endif -static_assert(sizeof(unsigned char) == 1); -static_assert(sizeof(unsigned short int) == 2); -static_assert(sizeof(unsigned int) == 4); +static_assert(sizeof(unsigned char) == 1, ""); +static_assert(sizeof(unsigned short int) == 2, ""); +static_assert(sizeof(unsigned int) == 4, ""); /** * \brief Describes FP8 interpretation diff --git a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_host.hpp b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_host.hpp index f0ac1a4c9d..e1fbcdaa54 100644 --- a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_host.hpp +++ b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_host.hpp @@ -75,7 +75,7 @@ static const float ieee754_nan = std::numeric_limits::quiet_NaN(); static const float ieee754_inf = std::numeric_limits::infinity(); __OCP_FP_HOST_DEVICE_STATIC__ uint32_t U32(float f) { - static_assert(sizeof(uint32_t) == sizeof(float)); + static_assert(sizeof(uint32_t) == sizeof(float), ""); union { float f32; uint32_t ui32; @@ -84,7 +84,7 @@ __OCP_FP_HOST_DEVICE_STATIC__ uint32_t U32(float f) { } __OCP_FP_HOST_DEVICE_STATIC__ float F32(uint32_t u32) { - static_assert(sizeof(uint32_t) == sizeof(float)); + static_assert(sizeof(uint32_t) == sizeof(float), ""); union { uint32_t ui32; float f32; @@ -464,7 +464,7 @@ template __OCP_FP_HOST_DEVICE_STATIC__ T makezero(Encoding E, uint3 template __OCP_FP_HOST_DEVICE_STATIC__ T to_float(uint32_t u32, int8_t scale_exp) { // We do not support bf16/fp16 <-> float - static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7); + static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7, ""); const auto& enc = encodings[(size_t)E]; const auto dstE = []() -> Encoding { @@ -560,9 +560,9 @@ __OCP_FP_HOST_DEVICE_STATIC__ T to_float(uint32_t u32, int8_t scale_exp) { template __OCP_FP_HOST_DEVICE_STATIC__ uint32_t from_float_sr(T f, uint32_t seed, int8_t scale_exp) { // We do not support bf16/fp16 <-> float - static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7); - static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float)); - static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float)); + static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7, ""); + static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float), ""); + static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float), ""); union { float f32; __amd_fp16_storage_t fp16[2]; @@ -674,9 +674,9 @@ __OCP_FP_HOST_DEVICE_STATIC__ uint32_t from_float_sr(T f, uint32_t seed, int8_t template __OCP_FP_HOST_DEVICE_STATIC__ uint32_t from_float(T f, int8_t scale_exp) { // We do not support bf16/fp16 <-> float - static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7); - static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float)); - static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float)); + static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7, ""); + static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float), ""); + static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float), ""); union { float f32; __amd_fp16_storage_t fp16[2]; @@ -836,7 +836,7 @@ __OCP_FP_HOST_DEVICE_STATIC__ OutType fp6_cvt_packedx32(InType in, int8_t scale unsigned long long padded; } __attribute__((packed)); - static_assert(sizeof(other_type) == sizeof(fp6x32_packed)); + static_assert(sizeof(other_type) == sizeof(fp6x32_packed), ""); union { other_type o; fp6x32_packed fp6; diff --git a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_types.h b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_types.h index 0b2738a7d3..73773fee13 100644 --- a/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_types.h +++ b/projects/clr/hipamd/include/hip/amd_detail/amd_hip_ocp_types.h @@ -30,14 +30,14 @@ THE SOFTWARE. #define __OCP_FP_DEVICE_STATIC__ __OCP_FP_DEVICE__ static __inline__ __attribute__((always_inline)) #define __OCP_FP_HOST_DEVICE_STATIC__ __OCP_FP_HOST_DEVICE__ static -static_assert(sizeof(unsigned int) == 4); -static_assert(sizeof(float) == 4); -static_assert(sizeof(unsigned short) == 2); +static_assert(sizeof(unsigned int) == 4, ""); +static_assert(sizeof(float) == 4, ""); +static_assert(sizeof(unsigned short) == 2, ""); #if (defined(__clang__) && (__clang_major__ > 17) && defined(__HIP__)) || \ (defined(__GNUC__) && (__GNUC__ > 13)) -static_assert(sizeof(__bf16) == 2); -static_assert(sizeof(_Float16) == 2); +static_assert(sizeof(__bf16) == 2, ""); +static_assert(sizeof(_Float16) == 2, ""); #endif // Although we do have some abstractions of half and bfloat16, since this will be a standalone @@ -87,5 +87,5 @@ typedef short __attribute__((vector_size(4))) __amd_shortx2_storage_t; #if (defined(__clang__) && (__clang_major__ > 17) && defined(__HIP__)) || \ (defined(__GNUC__) && (__GNUC__ > 13)) -static_assert(sizeof(__amd_uintx2_storage_t) == sizeof(__amd_fp8x8_storage_t)); +static_assert(sizeof(__amd_uintx2_storage_t) == sizeof(__amd_fp8x8_storage_t), ""); #endif diff --git a/projects/clr/min_pinned_xfer_patch.diff b/projects/clr/min_pinned_xfer_patch.diff new file mode 100644 index 0000000000..cb925b7fc9 --- /dev/null +++ b/projects/clr/min_pinned_xfer_patch.diff @@ -0,0 +1,28 @@ +diff --git a/projects/clr/rocclr/device/rocm/rocsettings.cpp b/projects/clr/rocclr/device/rocm/rocsettings.cpp +index 48aa4328ca..e4fa929db8 100644 +--- a/projects/clr/rocclr/device/rocm/rocsettings.cpp ++++ b/projects/clr/rocclr/device/rocm/rocsettings.cpp +@@ -58,7 +58,7 @@ Settings::Settings() { + + pinnedXferSize_ = GPU_PINNED_XFER_SIZE * Mi; + pinnedMinXferSize_ = +- flagIsDefault(GPU_PINNED_MIN_XFER_SIZE) ? 1 * Mi : GPU_PINNED_MIN_XFER_SIZE * Mi; ++ flagIsDefault(GPU_PINNED_MIN_XFER_SIZE) ? 1 * Ki : GPU_PINNED_MIN_XFER_SIZE * Ki; + + sdmaCopyThreshold_ = GPU_FORCE_BLIT_COPY_SIZE * Ki; + +diff --git a/projects/clr/rocclr/utils/flags.hpp b/projects/clr/rocclr/utils/flags.hpp +index ddacf2182a..4bbb6e1937 100644 +--- a/projects/clr/rocclr/utils/flags.hpp ++++ b/projects/clr/rocclr/utils/flags.hpp +@@ -80,8 +80,8 @@ debug(cstring, AMD_OCL_SUBST_OBJFILE, 0, \ + "Specify binary substitution config file for OpenCL") \ + release(size_t, GPU_PINNED_XFER_SIZE, 32, \ + "The pinned buffer size for pinning in read/write transfers in MiB") \ +-release(size_t, GPU_PINNED_MIN_XFER_SIZE, 128, \ +- "The minimal buffer size for pinned read/write transfers in MiB") \ ++release(size_t, GPU_PINNED_MIN_XFER_SIZE, 256, \ ++ "The minimal buffer size for pinned read/write transfers in KiB") \ + release(size_t, GPU_RESOURCE_CACHE_SIZE, 64, \ + "The resource cache size in MB") \ + release(size_t, GPU_MAX_SUBALLOC_SIZE, 4096, \ diff --git a/projects/clr/patch_re_sentinel.diff b/projects/clr/patch_re_sentinel.diff new file mode 100644 index 0000000000..7a6d145329 --- /dev/null +++ b/projects/clr/patch_re_sentinel.diff @@ -0,0 +1,71 @@ +diff --git a/projects/clr/rocclr/device/rocm/rocvirtual.cpp b/projects/clr/rocclr/device/rocm/rocvirtual.cpp +index 5abfd73284..1fcc4d8217 100644 +--- a/projects/clr/rocclr/device/rocm/rocvirtual.cpp ++++ b/projects/clr/rocclr/device/rocm/rocvirtual.cpp +@@ -3411,49 +3411,7 @@ bool VirtualGPU::createVirtualQueue(uint deviceQueueSize) { + #if IS_LINUX + __attribute__((optimize("unroll-all-loops"), always_inline)) static inline void nontemporalMemcpy( + void* __restrict dst, const void* __restrict src, size_t size) { +-#if defined(ATI_ARCH_X86) +-#if defined(__AVX512F__) +- for (auto i = 0u; i != size / sizeof(__m512i); ++i) { +- _mm512_stream_si512(reinterpret_cast<__m512i* __restrict&>(dst)++, +- *reinterpret_cast(src)++); +- } +- size = size % sizeof(__m512i); +-#endif +- +-#if defined(__AVX__) +- for (auto i = 0u; i != size / sizeof(__m256i); ++i) { +- _mm256_stream_si256(reinterpret_cast<__m256i* __restrict&>(dst)++, +- *reinterpret_cast(src)++); +- } +- size = size % sizeof(__m256i); +-#endif +- +- for (auto i = 0u; i != size / sizeof(__m128i); ++i) { +- _mm_stream_si128(reinterpret_cast<__m128i* __restrict&>(dst)++, +- *(reinterpret_cast(src)++)); +- } +- size = size % sizeof(__m128i); +- +- for (auto i = 0u; i != size / sizeof(long long); ++i) { +- _mm_stream_si64(reinterpret_cast(dst)++, +- *reinterpret_cast(src)++); +- } +- size = size % sizeof(long long); +- +- for (auto i = 0u; i != size / sizeof(int); ++i) { +- _mm_stream_si32(reinterpret_cast(dst)++, +- *reinterpret_cast(src)++); +- } +- +- size = size % sizeof(int); +- // Copy remaining bytes for unaligned size + std::memcpy(dst, src, size); +- +- // Add memory fence +- _mm_sfence(); +-#else +- std::memcpy(dst, src, size); +-#endif + } + #else + static inline void nontemporalMemcpy(void* __restrict dst, const void* __restrict src, +@@ -3708,12 +3666,12 @@ bool VirtualGPU::submitKernelInternal(const amd::NDRangeContainer& sizes, const + const auto kernArgImpl = dev().settings().kernel_arg_impl_; + if (kernArgImpl == KernelArgImpl::DeviceKernelArgsHDP) { + *dev().info().hdpMemFlushCntl = 1u; +- auto kSentinel = *reinterpret_cast(dev().info().hdpMemFlushCntl); ++ //auto kSentinel = *reinterpret_cast(dev().info().hdpMemFlushCntl); + } else if (kernArgImpl == KernelArgImpl::DeviceKernelArgsReadback && argSize != 0) { +- _mm_sfence(); ++ //_mm_sfence(); + *(argBuffer + argSize - 1) = *(parameters + argSize - 1); +- _mm_mfence(); +- auto kSentinel = *reinterpret_cast(argBuffer + argSize - 1); ++ //_mm_mfence(); ++ //auto kSentinel = *reinterpret_cast(argBuffer + argSize - 1); + } + } + } diff --git a/projects/clr/rocvirt.diff b/projects/clr/rocvirt.diff new file mode 100644 index 0000000000..2c9732a4a2 --- /dev/null +++ b/projects/clr/rocvirt.diff @@ -0,0 +1,68 @@ +diff --git a/projects/clr/rocclr/device/rocm/rocvirtual.cpp b/projects/clr/rocclr/device/rocm/rocvirtual.cpp +index 5abfd73284..ddabe0c192 100644 +--- a/projects/clr/rocclr/device/rocm/rocvirtual.cpp ++++ b/projects/clr/rocclr/device/rocm/rocvirtual.cpp +@@ -3411,49 +3411,7 @@ bool VirtualGPU::createVirtualQueue(uint deviceQueueSize) { + #if IS_LINUX + __attribute__((optimize("unroll-all-loops"), always_inline)) static inline void nontemporalMemcpy( + void* __restrict dst, const void* __restrict src, size_t size) { +-#if defined(ATI_ARCH_X86) +-#if defined(__AVX512F__) +- for (auto i = 0u; i != size / sizeof(__m512i); ++i) { +- _mm512_stream_si512(reinterpret_cast<__m512i* __restrict&>(dst)++, +- *reinterpret_cast(src)++); +- } +- size = size % sizeof(__m512i); +-#endif +- +-#if defined(__AVX__) +- for (auto i = 0u; i != size / sizeof(__m256i); ++i) { +- _mm256_stream_si256(reinterpret_cast<__m256i* __restrict&>(dst)++, +- *reinterpret_cast(src)++); +- } +- size = size % sizeof(__m256i); +-#endif +- +- for (auto i = 0u; i != size / sizeof(__m128i); ++i) { +- _mm_stream_si128(reinterpret_cast<__m128i* __restrict&>(dst)++, +- *(reinterpret_cast(src)++)); +- } +- size = size % sizeof(__m128i); +- +- for (auto i = 0u; i != size / sizeof(long long); ++i) { +- _mm_stream_si64(reinterpret_cast(dst)++, +- *reinterpret_cast(src)++); +- } +- size = size % sizeof(long long); +- +- for (auto i = 0u; i != size / sizeof(int); ++i) { +- _mm_stream_si32(reinterpret_cast(dst)++, +- *reinterpret_cast(src)++); +- } +- +- size = size % sizeof(int); +- // Copy remaining bytes for unaligned size + std::memcpy(dst, src, size); +- +- // Add memory fence +- _mm_sfence(); +-#else +- std::memcpy(dst, src, size); +-#endif + } + #else + static inline void nontemporalMemcpy(void* __restrict dst, const void* __restrict src, +@@ -3710,10 +3668,10 @@ bool VirtualGPU::submitKernelInternal(const amd::NDRangeContainer& sizes, const + *dev().info().hdpMemFlushCntl = 1u; + auto kSentinel = *reinterpret_cast(dev().info().hdpMemFlushCntl); + } else if (kernArgImpl == KernelArgImpl::DeviceKernelArgsReadback && argSize != 0) { +- _mm_sfence(); ++ //_mm_sfence(); + *(argBuffer + argSize - 1) = *(parameters + argSize - 1); +- _mm_mfence(); +- auto kSentinel = *reinterpret_cast(argBuffer + argSize - 1); ++ //_mm_mfence(); ++ //auto kSentinel = *reinterpret_cast(argBuffer + argSize - 1); + } + } + }