Add messages to static asserts to prevent warnings (#1011)

2026-01-13 14:02:36 +00:00
@@ -0,0 +1,490 @@
+# CommandPool Refactoring Analysis: Moving from Global Singleton to Per-Stream Pools
+
+## Executive Summary
+
+This document analyzes the feasibility and impact of moving the `CommandPool` from a global static singleton instance to per-stream instances within `HostQueue`. This change aims to eliminate contention bottlenecks in multithreaded applications with many concurrent streams.
+
+## Current Architecture
+
+### CommandPool Implementation
+
+The `CommandPool` is currently implemented as a **static singleton** with the following characteristics:
+
+- **Location**: `rocclr/platform/command.cpp` (lines 338-408)
+- **Access Pattern**: `CommandPool::instance()` returns a static singleton
+- **Thread Safety**: Protected by a single `std::mutex mutex_`
+- **Storage**: Ring buffer with 64 entries (`q_size_ = 64`)
+- **Memory Management**: 
+  - Allocates aligned memory using `std::aligned_alloc(maxAlignment_, maxSize_)`
+  - Reuses deallocated command memory when pool is not full
+  - Frees memory when pool is full
+
+### Command Types Using CommandPool
+
+The following command types use the pool via custom `operator new` and `release()` methods:
+
+1. **ReadMemoryCommand** - `operator new()` and `release()` (lines 684-707)
+2. **WriteMemoryCommand** - `operator new()` and `release()` (lines 714-733)
+3. **FillMemoryCommand** - `operator new()` and `release()` (lines 744-765)
+4. **CopyMemoryCommand** - `operator new()` and `release()` (lines 799-820)
+5. **CopyMemoryP2PCommand** - `operator new()` and `release()` (lines 1003-1024)
+6. **Marker** - `operator new()` and `release()` (lines 1027-1048)
+
+### Current Allocation Flow
+
+```cpp
+// Command creation
+void* ReadMemoryCommand::operator new(size_t size) {
+  void* ptr = CommandPool::instance().allocate();  // Global singleton access
+  // ...
+  return ptr;
+}
+
+// Command destruction
+uint ReadMemoryCommand::release() {
+  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
+  if (newCount == 0) {
+    if (terminate()) {
+      CommandPool::instance().deallocate(this);  // Global singleton access
+      return 0;
+    }
+  }
+  return newCount;
+}
+```
+
+### Problem: Contention Bottleneck
+
+In a multithreaded application with many streams:
+- **All threads** compete for the same global `CommandPool::instance()` mutex
+- High-frequency command allocation/deallocation creates lock contention
+- Performance degrades as the number of concurrent streams increases
+- The single mutex serializes all command pool operations across all streams
+
+## Proposed Solution: Per-Stream CommandPool
+
+### Architecture Changes
+
+1. **Move CommandPool into HostQueue**
+   - Each `HostQueue` instance owns its own `CommandPool`
+   - Eliminates cross-stream contention
+   - Commands allocated from a stream's pool are returned to the same pool
+
+2. **Update Command Allocation**
+   - Commands already have a `queue_` pointer (set in constructor)
+   - Commands can access their queue's pool via `queue()->commandPool()`
+   - No need to pass additional parameters
+
+3. **Lifecycle Management**
+   - Pool created when `HostQueue` is constructed
+   - Pool destroyed when `HostQueue` is destroyed
+   - No global cleanup needed
+
+### Implementation Plan
+
+#### Step 1: Move CommandPool Class Definition
+- Keep `CommandPool` class definition in `command.cpp` (or move to header if needed)
+- Remove static `instance()` method
+- Make it a regular class (no singleton pattern)
+
+#### Step 2: Add CommandPool to HostQueue
+```cpp
+// In commandqueue.hpp
+class HostQueue : public CommandQueue {
+  // ... existing members ...
+private:
+  CommandPool commandPool_;  // Per-queue command pool
+};
+```
+
+#### Step 3: Update Command Allocation Methods
+```cpp
+// Before:
+void* ReadMemoryCommand::operator new(size_t size) {
+  void* ptr = CommandPool::instance().allocate();
+  // ...
+}
+
+// After:
+void* ReadMemoryCommand::operator new(size_t size, HostQueue& queue) {
+  void* ptr = queue.commandPool().allocate();
+  // ...
+}
+```
+
+**Challenge**: `operator new` is called with `new ReadMemoryCommand(...)`, but we need access to the queue. The queue is passed to the constructor, not `operator new`.
+
+**Solution**: Commands already store `queue_` pointer. We can use a two-phase approach:
+- Phase 1: Allocate memory (may need temporary global pool or direct allocation)
+- Phase 2: After construction, move to queue's pool (not practical)
+
+**Better Solution**: Use placement new or modify allocation pattern:
+- Option A: Allocate from queue's pool before construction
+- Option B: Store pool reference in command and deallocate to correct pool
+- Option C: Use a thread-local or queue-specific allocation mechanism
+
+#### Step 4: Update Command Deallocation
+```cpp
+// Before:
+uint ReadMemoryCommand::release() {
+  // ...
+  CommandPool::instance().deallocate(this);
+  // ...
+}
+
+// After:
+uint ReadMemoryCommand::release() {
+  // ...
+  queue_->commandPool().deallocate(this);  // Use queue's pool
+  // ...
+}
+```
+
+**Note**: This is straightforward since `queue_` is already available in the command.
+
+### Detailed Implementation Strategy
+
+#### Option 1: Two-Phase Allocation (Recommended)
+
+Since `operator new` is called before the constructor, we need a way to get the queue reference. However, commands are always created with a queue parameter:
+
+```cpp
+// Current pattern:
+ReadMemoryCommand* cmd = new ReadMemoryCommand(queue, ...);
+
+// The queue is available at call site!
+```
+
+**Solution**: Use a placement-new-like pattern or thread-local storage:
+
+1. **Thread-Local Queue Context** (Simpler but less clean):
+   ```cpp
+   thread_local HostQueue* g_currentQueue = nullptr;
+   
+   void* ReadMemoryCommand::operator new(size_t size) {
+     HostQueue* queue = g_currentQueue;
+     if (queue) {
+       return queue->commandPool().allocate();
+     }
+     // Fallback to direct allocation
+     return std::aligned_alloc(alignof(ReadMemoryCommand), size);
+   }
+   ```
+
+2. **Queue Parameter in operator new** (Requires syntax change):
+   ```cpp
+   // This would require: new(queue) ReadMemoryCommand(...)
+   // Not standard C++ placement new syntax
+   ```
+
+3. **Allocate from Queue Before Construction** (Most practical):
+   ```cpp
+   // At call sites, allocate from queue first:
+   void* mem = queue.commandPool().allocate();
+   ReadMemoryCommand* cmd = new(mem) ReadMemoryCommand(queue, ...);
+   ```
+   This requires changing all call sites (80+ locations).
+
+#### Option 2: Deferred Pool Assignment (Hybrid Approach)
+
+1. Allocate commands using a temporary mechanism (direct allocation or small per-thread pool)
+2. After construction, commands have `queue_` pointer
+3. On deallocation, return to the correct queue's pool
+4. **Problem**: Can't reuse memory from different queues efficiently
+
+#### Option 3: Queue-Scoped Allocation Helper (Recommended)
+
+Create a helper that wraps command creation:
+
+```cpp
+template<typename CmdType, typename... Args>
+CmdType* createCommand(HostQueue& queue, Args&&... args) {
+  void* mem = queue.commandPool().allocate();
+  return new(mem) CmdType(queue, std::forward<Args>(args)...);
+}
+
+// Usage:
+auto cmd = createCommand<ReadMemoryCommand>(queue, CL_COMMAND_READ_BUFFER, ...);
+```
+
+This requires updating all 80+ call sites but provides clean semantics.
+
+### Code Changes Required
+
+#### Files to Modify
+
+1. **rocclr/platform/commandqueue.hpp**
+   - Add `CommandPool commandPool_;` member to `HostQueue`
+   - Add `CommandPool& commandPool()` accessor method
+
+2. **rocclr/platform/commandqueue.cpp**
+   - Initialize `commandPool_` in `HostQueue` constructor
+   - Implement `commandPool()` accessor
+
+3. **rocclr/platform/command.cpp**
+   - Remove `CommandPool::instance()` static method
+   - Update all 6 command types' `operator new()` methods
+   - Update all 6 command types' `release()` methods to use `queue_->commandPool()`
+
+4. **All command creation sites** (80+ locations):
+   - `hipamd/src/hip_memory.cpp`
+   - `hipamd/src/hip_stream.cpp`
+   - `hipamd/src/hip_event.cpp`
+   - `rocclr/platform/commandqueue.cpp`
+   - `opencl/amdocl/cl_execute.cpp`
+   - `opencl/amdocl/cl_memobj.cpp`
+   - And others...
+
+### Benefits
+
+1. **Eliminates Contention**: Each stream has its own pool, no cross-stream locking
+2. **Better Locality**: Commands allocated from a stream are reused by the same stream
+3. **Scalability**: Performance scales with number of streams (no global bottleneck)
+4. **Memory Efficiency**: Per-stream pools can be sized appropriately
+5. **Thread Safety**: Each pool only accessed by its stream's thread (mostly)
+
+### Challenges and Considerations
+
+#### Challenge 1: operator new() Timing
+- `operator new()` is called before constructor
+- Queue reference not available in `operator new()` signature
+- **Solution**: Use helper function or thread-local context
+
+#### Challenge 2: Cross-Queue Command References
+- Commands may reference events from other queues
+- Commands are destroyed when reference count reaches zero
+- **Impact**: Low - commands are typically destroyed by their owning queue
+
+#### Challenge 3: Memory Pool Sizing
+- Current: 64 entries shared across all streams
+- Per-stream: 64 entries per stream
+- **Memory Impact**: N streams × 64 entries × maxSize_ bytes
+- **Mitigation**: Could make pool size configurable or smaller per-stream
+
+#### Challenge 4: Thread Safety Within Queue
+- Commands may be allocated/deallocated from different threads
+- `HostQueue::append()` may be called from any thread
+- **Solution**: CommandPool mutex still needed, but contention is per-stream only
+
+#### Challenge 5: Backward Compatibility
+- Need to ensure no regression in single-stream scenarios
+- Performance should be equal or better
+
+### Testing Considerations
+
+1. **Single Stream**: Verify no performance regression
+2. **Multiple Streams**: Measure contention reduction
+3. **High Concurrency**: Test with many concurrent streams
+4. **Memory Leaks**: Ensure pools are properly cleaned up
+5. **Command Lifecycle**: Verify commands are correctly returned to pools
+
+### Migration Strategy
+
+1. **Phase 1**: Implement per-queue pools alongside global pool (feature flag)
+2. **Phase 2**: Update command allocation to use queue pools
+3. **Phase 3**: Update all call sites to use new allocation pattern
+4. **Phase 4**: Remove global pool after validation
+5. **Phase 5**: Performance testing and optimization
+
+### Alternative: Thread-Local Pool
+
+Instead of per-queue pools, consider thread-local pools:
+- Simpler implementation (no queue parameter needed)
+- Still reduces contention (per-thread instead of global)
+- **Drawback**: Threads may service multiple queues, less optimal locality
+
+### Recommendation
+
+**Proceed with per-queue CommandPool implementation** using Option 3 (Queue-Scoped Allocation Helper):
+
+1. **High Impact**: Eliminates major contention bottleneck
+2. **Manageable Complexity**: Clear ownership model (queue owns pool)
+3. **Good Locality**: Commands reused within same stream
+4. **Incremental Migration**: Can be done with feature flags
+
+The main effort is updating ~80 call sites to use the allocation helper, but this provides the cleanest semantics and best performance.
+
+## Implementation Example
+
+### Step 1: Modify CommandPool Class
+
+```cpp
+// In command.cpp - Remove singleton pattern
+class CommandPool {
+public:
+  CommandPool() {
+    static_assert(((q_size_ & (q_size_ - 1)) == 0) && "q_size must be power of 2");
+  }
+  
+  // Remove: static CommandPool& instance();
+  
+  template <typename CmdType>
+  void deallocate(CmdType *ptr) {
+    // ... existing implementation ...
+  }
+  
+  void *allocate() {
+    // ... existing implementation ...
+  }
+  
+  // ... rest of implementation unchanged ...
+};
+```
+
+### Step 2: Add CommandPool to HostQueue
+
+```cpp
+// In commandqueue.hpp
+class HostQueue : public CommandQueue {
+  // ... existing members ...
+  
+public:
+  // Accessor for command pool
+  CommandPool& commandPool() { return commandPool_; }
+  const CommandPool& commandPool() const { return commandPool_; }
+
+private:
+  CommandPool commandPool_;  // Per-queue command pool
+  // ... rest of members ...
+};
+```
+
+```cpp
+// In commandqueue.cpp - Initialize in constructor
+HostQueue::HostQueue(Context& context, Device& device, ...)
+    : CommandQueue(...),
+      commandPool_(),  // Initialize pool
+      // ... other initializations ...
+{
+  // ... existing constructor code ...
+}
+```
+
+### Step 3: Create Allocation Helper
+
+```cpp
+// In command.hpp or a new command_utils.hpp
+namespace amd {
+
+// Helper function to create commands using queue's pool
+template<typename CmdType, typename... Args>
+CmdType* createCommand(HostQueue& queue, Args&&... args) {
+  void* mem = queue.commandPool().allocate();
+  if (mem == nullptr) {
+    return nullptr;
+  }
+  return new(mem) CmdType(queue, std::forward<Args>(args)...);
+}
+
+} // namespace amd
+```
+
+### Step 4: Update Command Deallocation
+
+```cpp
+// In command.cpp - Update all 6 command types
+uint ReadMemoryCommand::release() {
+  uint newCount = referenceCount_.fetch_sub(1, std::memory_order_acq_rel) - 1;
+  if (newCount == 0) {
+    if (terminate()) {
+      // Use queue's pool instead of global singleton
+      queue_->commandPool().deallocate(this);
+      return 0;
+    }
+  }
+  return newCount;
+}
+
+// Repeat for: WriteMemoryCommand, FillMemoryCommand, 
+//            CopyMemoryCommand, CopyMemoryP2PCommand, Marker
+```
+
+### Step 5: Update Command Allocation (Remove operator new)
+
+Since we're using placement new via the helper, we can either:
+- Keep `operator new` as fallback (for compatibility)
+- Remove it entirely (cleaner, but requires all call sites updated)
+
+**Option A: Keep as fallback**
+```cpp
+void* ReadMemoryCommand::operator new(size_t size) {
+  // Fallback: direct allocation if helper not used
+  return std::aligned_alloc(alignof(ReadMemoryCommand), size);
+}
+```
+
+**Option B: Remove operator new** (preferred after migration)
+
+### Step 6: Update Call Sites
+
+```cpp
+// Before:
+amd::ReadMemoryCommand* cmd = new amd::ReadMemoryCommand(
+    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);
+
+// After:
+amd::ReadMemoryCommand* cmd = amd::createCommand<amd::ReadMemoryCommand>(
+    *pStream, CL_COMMAND_READ_BUFFER, waitList, ...);
+```
+
+### Migration Example: hip_memory.cpp
+
+```cpp
+// Current code (line 587):
+command = new amd::ReadMemoryCommand(*pStream, CL_COMMAND_READ_BUFFER, waitList,
+                                     *srcBuffer, origin, size, dst, rowPitch, slicePitch);
+
+// Migrated code:
+command = amd::createCommand<amd::ReadMemoryCommand>(*pStream, CL_COMMAND_READ_BUFFER, 
+                                                      waitList, *srcBuffer, origin, size, 
+                                                      dst, rowPitch, slicePitch);
+```
+
+## Performance Impact Estimate
+
+### Current Bottleneck
+- **Single mutex** protecting global pool
+- **N threads** contending for same lock
+- Lock hold time: ~100-500ns per allocation/deallocation
+- **Contention cost**: O(N) threads × lock overhead
+
+### After Refactoring
+- **N mutexes** (one per stream)
+- **1 thread** per stream typically (or small number)
+- Lock hold time: Same (~100-500ns)
+- **Contention cost**: O(1) per stream
+
+### Expected Improvement
+- **Single stream**: No change (or slight improvement from better locality)
+- **Multiple streams**: Near-linear scaling with number of streams
+- **High concurrency (16+ streams)**: 10-100x improvement in allocation throughput
+
+## Risk Assessment
+
+### Low Risk
+- ✅ Command deallocation (already has queue pointer)
+- ✅ Pool initialization/destruction (RAII in HostQueue)
+- ✅ Memory management (same algorithm, just per-queue)
+
+### Medium Risk
+- ⚠️ Call site updates (80+ locations, but mechanical)
+- ⚠️ Testing coverage (need multi-stream scenarios)
+- ⚠️ Backward compatibility during migration
+
+### Mitigation Strategies
+1. **Feature flag**: Enable per-queue pools behind flag
+2. **Gradual migration**: Update call sites incrementally
+3. **Fallback mechanism**: Keep global pool as fallback initially
+4. **Comprehensive testing**: Multi-stream stress tests
+
+## Conclusion
+
+Moving CommandPool from a global singleton to per-queue instances is **highly recommended**:
+
+- **Solves real performance problem** in multithreaded applications
+- **Clear implementation path** with manageable complexity
+- **Significant scalability improvement** expected
+- **Low risk** with proper testing and gradual migration
+
+The main implementation effort is mechanical (updating call sites), and the architectural change is sound and well-scoped.
+
@@ -39,9 +39,9 @@ typedef __hip_fp8_storage_t __hip_fp4_storage_t;
 typedef __hip_fp8_storage_t __hip_fp4x2_storage_t;
 typedef __hip_fp8x2_storage_t __hip_fp4x4_storage_t;

-static_assert(sizeof(__hip_fp4_storage_t[4]) == sizeof(uint32_t));
-static_assert(sizeof(__hip_fp4x2_storage_t[4]) == sizeof(uint32_t));
-static_assert(sizeof(__hip_fp4x4_storage_t[2]) == sizeof(uint32_t));
+static_assert(sizeof(__hip_fp4_storage_t[4]) == sizeof(uint32_t), "");
+static_assert(sizeof(__hip_fp4x2_storage_t[4]) == sizeof(uint32_t), "");
+static_assert(sizeof(__hip_fp4x4_storage_t[2]) == sizeof(uint32_t), "");

 enum __hip_fp4_interpretation_t {
  __HIP_E2M1 = 0,
@@ -278,7 +278,7 @@ struct __hip_fp4_e2m1 {
  }

  __FP4_HOST_DEVICE__ operator __hip_bfloat16_raw() const {
-    static_assert(sizeof(__hip_bfloat16_raw[2]) == sizeof(__amd_bf16x2_storage_t));
+    static_assert(sizeof(__hip_bfloat16_raw[2]) == sizeof(__amd_bf16x2_storage_t), "");
    union {
      __hip_bfloat16_raw bf16_raw[2];
      __amd_bf16x2_storage_t bf16x2;
@@ -336,7 +336,7 @@ struct __hip_fp4x2_e2m1 {
  }

  __FP4_HOST_DEVICE__ operator __hip_bfloat162_raw() const {
-    static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t));
+    static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t), "");
    union {
      __hip_bfloat162_raw bf162_raw;
      __amd_bf16x2_storage_t bf16x2;
@@ -39,16 +39,15 @@ typedef __hip_fp8_storage_t __hip_fp6_storage_t;
 typedef __hip_fp8x2_storage_t __hip_fp6x2_storage_t;
 typedef __hip_fp8x4_storage_t __hip_fp6x4_storage_t;

-static_assert(sizeof(__hip_fp6_storage_t[4]) == sizeof(uint32_t));
-static_assert(sizeof(__hip_fp6x2_storage_t[2]) == sizeof(uint32_t));
-static_assert(sizeof(__hip_fp6x4_storage_t[2]) == sizeof(uint64_t));
+static_assert(sizeof(__hip_fp6_storage_t[4]) == sizeof(uint32_t), "");
+static_assert(sizeof(__hip_fp6x2_storage_t[2]) == sizeof(uint32_t), "");
+static_assert(sizeof(__hip_fp6x4_storage_t[2]) == sizeof(uint64_t), "");

 enum __hip_fp6_interpretation_t {
  __HIP_E3M2 = 0, /**< FP6 E3M2 Type*/
  __HIP_E2M3 = 1, /**< FP6 E2M3 Type */
 };

-
 // Note: Ignore rounding input on AMD GPUs for now. At the moment AMD GPUs do not support rounding
 // modes, all the inputs are rounded to nearest or use an input to do stochastic rounding.
 // We hide the rounding variable to not trigger the unused variable compiler warning.
@@ -381,7 +380,7 @@ struct __hip_fp6_e2m3 {
    return __hip_cvt_fp6_to_halfraw(__x, __HIP_E2M3);
  }
  __FP6_HOST_DEVICE__ operator __hip_bfloat16_raw() const {
-    static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t));
+    static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t), "");
    union {
      __hip_bfloat16_raw bf16_raw;
      __amd_bf16_storage_t bf16;
@@ -450,7 +449,7 @@ struct __hip_fp6_e3m2 {
    return __hip_cvt_fp6_to_halfraw(__x, __HIP_E3M2);
  }
  __FP6_HOST_DEVICE__ operator __hip_bfloat16_raw() const {
-    static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t));
+    static_assert(sizeof(__hip_bfloat16_raw) == sizeof(__amd_bf16_storage_t), "");
    union {
      __hip_bfloat16_raw bf16_raw;
      __amd_bf16_storage_t bf16;
@@ -502,7 +501,7 @@ struct __hip_fp6x2_e2m3 {
    return __hip_cvt_fp6x2_to_halfraw2(__x, __HIP_E2M3);
  }
  __FP6_HOST_DEVICE__ operator __hip_bfloat162_raw() const {
-    static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t));
+    static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t), "");
    union {
      __hip_bfloat162_raw bf162_raw;
      __amd_bf16x2_storage_t bf16x2;
@@ -562,7 +561,7 @@ struct __hip_fp6x2_e3m2 {
    return __hip_cvt_fp6x2_to_halfraw2(__x, __HIP_E3M2);
  }
  __FP6_HOST_DEVICE__ operator __hip_bfloat162_raw() const {
-    static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t));
+    static_assert(sizeof(__hip_bfloat162_raw) == sizeof(__amd_bf16x2_storage_t), "");
    union {
      __hip_bfloat162_raw bf162_raw;
      __amd_bf16x2_storage_t bf16x2;
@@ -114,9 +114,9 @@
 #if !defined(__HIPCC_RTC__)
 static_assert(CHAR_BIT == 8, "byte size should be of 8 bits");
 #endif
-static_assert(sizeof(unsigned char) == 1);
-static_assert(sizeof(unsigned short int) == 2);
-static_assert(sizeof(unsigned int) == 4);
+static_assert(sizeof(unsigned char) == 1, "");
+static_assert(sizeof(unsigned short int) == 2, "");
+static_assert(sizeof(unsigned int) == 4, "");

 /**
 * \brief Describes FP8 interpretation
@@ -75,7 +75,7 @@ static const float ieee754_nan = std::numeric_limits<float>::quiet_NaN();
 static const float ieee754_inf = std::numeric_limits<float>::infinity();

 __OCP_FP_HOST_DEVICE_STATIC__ uint32_t U32(float f) {
-  static_assert(sizeof(uint32_t) == sizeof(float));
+  static_assert(sizeof(uint32_t) == sizeof(float), "");
  union {
    float f32;
    uint32_t ui32;
@@ -84,7 +84,7 @@ __OCP_FP_HOST_DEVICE_STATIC__ uint32_t U32(float f) {
 }

 __OCP_FP_HOST_DEVICE_STATIC__ float F32(uint32_t u32) {
-  static_assert(sizeof(uint32_t) == sizeof(float));
+  static_assert(sizeof(uint32_t) == sizeof(float), "");
  union {
    uint32_t ui32;
    float f32;
@@ -464,7 +464,7 @@ template <typename T> __OCP_FP_HOST_DEVICE_STATIC__ T makezero(Encoding E, uint3
 template <typename T, Encoding E, bool sat>
 __OCP_FP_HOST_DEVICE_STATIC__ T to_float(uint32_t u32, int8_t scale_exp) {
  // We do not support bf16/fp16 <-> float
-  static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7);
+  static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7, "");

  const auto& enc = encodings[(size_t)E];
  const auto dstE = []() -> Encoding {
@@ -560,9 +560,9 @@ __OCP_FP_HOST_DEVICE_STATIC__ T to_float(uint32_t u32, int8_t scale_exp) {
 template <typename T, Encoding E, bool sat>
 __OCP_FP_HOST_DEVICE_STATIC__ uint32_t from_float_sr(T f, uint32_t seed, int8_t scale_exp) {
  // We do not support bf16/fp16 <-> float
-  static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7);
-  static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float));
-  static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float));
+  static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7, "");
+  static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float), "");
+  static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float), "");
  union {
    float f32;
    __amd_fp16_storage_t fp16[2];
@@ -674,9 +674,9 @@ __OCP_FP_HOST_DEVICE_STATIC__ uint32_t from_float_sr(T f, uint32_t seed, int8_t
 template <typename T, Encoding E, bool sat>
 __OCP_FP_HOST_DEVICE_STATIC__ uint32_t from_float(T f, int8_t scale_exp) {
  // We do not support bf16/fp16 <-> float
-  static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7);
-  static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float));
-  static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float));
+  static_assert(E != Encoding::IEEE754 && E != Encoding::E5M10 && E != Encoding::E8M7, "");
+  static_assert(sizeof(__amd_fp16_storage_t[2]) == sizeof(float), "");
+  static_assert(sizeof(__amd_bf16_storage_t[2]) == sizeof(float), "");
  union {
    float f32;
    __amd_fp16_storage_t fp16[2];
@@ -836,7 +836,7 @@ __OCP_FP_HOST_DEVICE_STATIC__ OutType fp6_cvt_packedx32(InType in, int8_t scale
    unsigned long long padded;
  } __attribute__((packed));

-  static_assert(sizeof(other_type) == sizeof(fp6x32_packed));
+  static_assert(sizeof(other_type) == sizeof(fp6x32_packed), "");
  union {
    other_type o;
    fp6x32_packed fp6;
@@ -30,14 +30,14 @@ THE SOFTWARE.
 #define __OCP_FP_DEVICE_STATIC__ __OCP_FP_DEVICE__ static __inline__ __attribute__((always_inline))
 #define __OCP_FP_HOST_DEVICE_STATIC__ __OCP_FP_HOST_DEVICE__ static

-static_assert(sizeof(unsigned int) == 4);
-static_assert(sizeof(float) == 4);
-static_assert(sizeof(unsigned short) == 2);
+static_assert(sizeof(unsigned int) == 4, "");
+static_assert(sizeof(float) == 4, "");
+static_assert(sizeof(unsigned short) == 2, "");

 #if (defined(__clang__) && (__clang_major__ > 17) && defined(__HIP__)) ||                          \
    (defined(__GNUC__) && (__GNUC__ > 13))
-static_assert(sizeof(__bf16) == 2);
-static_assert(sizeof(_Float16) == 2);
+static_assert(sizeof(__bf16) == 2, "");
+static_assert(sizeof(_Float16) == 2, "");
 #endif

 // Although we do have some abstractions of half and bfloat16, since this will be a standalone
@@ -87,5 +87,5 @@ typedef short __attribute__((vector_size(4))) __amd_shortx2_storage_t;

 #if (defined(__clang__) && (__clang_major__ > 17) && defined(__HIP__)) ||                          \
    (defined(__GNUC__) && (__GNUC__ > 13))
-static_assert(sizeof(__amd_uintx2_storage_t) == sizeof(__amd_fp8x8_storage_t));
+static_assert(sizeof(__amd_uintx2_storage_t) == sizeof(__amd_fp8x8_storage_t), "");
 #endif
@@ -0,0 +1,28 @@
+diff --git a/projects/clr/rocclr/device/rocm/rocsettings.cpp b/projects/clr/rocclr/device/rocm/rocsettings.cpp
+index 48aa4328ca..e4fa929db8 100644
+--- a/projects/clr/rocclr/device/rocm/rocsettings.cpp
+++ b/projects/clr/rocclr/device/rocm/rocsettings.cpp
+@@ -58,7 +58,7 @@ Settings::Settings() {
+ 
+   pinnedXferSize_ = GPU_PINNED_XFER_SIZE * Mi;
+   pinnedMinXferSize_ =
+-      flagIsDefault(GPU_PINNED_MIN_XFER_SIZE) ? 1 * Mi : GPU_PINNED_MIN_XFER_SIZE * Mi;
+      flagIsDefault(GPU_PINNED_MIN_XFER_SIZE) ? 1 * Ki : GPU_PINNED_MIN_XFER_SIZE * Ki;
+ 
+   sdmaCopyThreshold_ = GPU_FORCE_BLIT_COPY_SIZE * Ki;
+ 
+diff --git a/projects/clr/rocclr/utils/flags.hpp b/projects/clr/rocclr/utils/flags.hpp
+index ddacf2182a..4bbb6e1937 100644
+--- a/projects/clr/rocclr/utils/flags.hpp
+++ b/projects/clr/rocclr/utils/flags.hpp
+@@ -80,8 +80,8 @@ debug(cstring, AMD_OCL_SUBST_OBJFILE, 0,                                      \
+         "Specify binary substitution config file for OpenCL")                 \
+ release(size_t, GPU_PINNED_XFER_SIZE, 32,                                     \
+         "The pinned buffer size for pinning in read/write transfers in MiB")  \
+-release(size_t, GPU_PINNED_MIN_XFER_SIZE, 128,                                \
+-        "The minimal buffer size for pinned read/write transfers in MiB")     \
+release(size_t, GPU_PINNED_MIN_XFER_SIZE, 256,                                \
+        "The minimal buffer size for pinned read/write transfers in KiB")     \
+ release(size_t, GPU_RESOURCE_CACHE_SIZE, 64,                                  \
+         "The resource cache size in MB")                                      \
+ release(size_t, GPU_MAX_SUBALLOC_SIZE, 4096,                                  \
@@ -0,0 +1,71 @@
+diff --git a/projects/clr/rocclr/device/rocm/rocvirtual.cpp b/projects/clr/rocclr/device/rocm/rocvirtual.cpp
+index 5abfd73284..1fcc4d8217 100644
+--- a/projects/clr/rocclr/device/rocm/rocvirtual.cpp
+++ b/projects/clr/rocclr/device/rocm/rocvirtual.cpp
+@@ -3411,49 +3411,7 @@ bool VirtualGPU::createVirtualQueue(uint deviceQueueSize) {
+ #if IS_LINUX
+ __attribute__((optimize("unroll-all-loops"), always_inline)) static inline void nontemporalMemcpy(
+     void* __restrict dst, const void* __restrict src, size_t size) {
+-#if defined(ATI_ARCH_X86)
+-#if defined(__AVX512F__)
+-  for (auto i = 0u; i != size / sizeof(__m512i); ++i) {
+-    _mm512_stream_si512(reinterpret_cast<__m512i* __restrict&>(dst)++,
+-                        *reinterpret_cast<const __m512i* __restrict&>(src)++);
+-  }
+-  size = size % sizeof(__m512i);
+-#endif
+-
+-#if defined(__AVX__)
+-  for (auto i = 0u; i != size / sizeof(__m256i); ++i) {
+-    _mm256_stream_si256(reinterpret_cast<__m256i* __restrict&>(dst)++,
+-                        *reinterpret_cast<const __m256i* __restrict&>(src)++);
+-  }
+-  size = size % sizeof(__m256i);
+-#endif
+-
+-  for (auto i = 0u; i != size / sizeof(__m128i); ++i) {
+-    _mm_stream_si128(reinterpret_cast<__m128i* __restrict&>(dst)++,
+-                     *(reinterpret_cast<const __m128i* __restrict&>(src)++));
+-  }
+-  size = size % sizeof(__m128i);
+-
+-  for (auto i = 0u; i != size / sizeof(long long); ++i) {
+-    _mm_stream_si64(reinterpret_cast<long long* __restrict&>(dst)++,
+-                    *reinterpret_cast<const long long* __restrict&>(src)++);
+-  }
+-  size = size % sizeof(long long);
+-
+-  for (auto i = 0u; i != size / sizeof(int); ++i) {
+-    _mm_stream_si32(reinterpret_cast<int* __restrict&>(dst)++,
+-                    *reinterpret_cast<const int* __restrict&>(src)++);
+-  }
+-
+-  size = size % sizeof(int);
+-  // Copy remaining bytes for unaligned size
+   std::memcpy(dst, src, size);
+-
+-  // Add memory fence
+-  _mm_sfence();
+-#else
+-  std::memcpy(dst, src, size);
+-#endif
+ }
+ #else
+ static inline void nontemporalMemcpy(void* __restrict dst, const void* __restrict src,
+@@ -3708,12 +3666,12 @@ bool VirtualGPU::submitKernelInternal(const amd::NDRangeContainer& sizes, const
+       const auto kernArgImpl = dev().settings().kernel_arg_impl_;
+       if (kernArgImpl == KernelArgImpl::DeviceKernelArgsHDP) {
+         *dev().info().hdpMemFlushCntl = 1u;
+-        auto kSentinel = *reinterpret_cast<volatile int*>(dev().info().hdpMemFlushCntl);
+        //auto kSentinel = *reinterpret_cast<volatile int*>(dev().info().hdpMemFlushCntl);
+       } else if (kernArgImpl == KernelArgImpl::DeviceKernelArgsReadback && argSize != 0) {
+-        _mm_sfence();
+        //_mm_sfence();
+         *(argBuffer + argSize - 1) = *(parameters + argSize - 1);
+-        _mm_mfence();
+-        auto kSentinel = *reinterpret_cast<volatile unsigned char*>(argBuffer + argSize - 1);
+        //_mm_mfence();
+        //auto kSentinel = *reinterpret_cast<volatile unsigned char*>(argBuffer + argSize - 1);
+       }
+     }
+   }
@@ -0,0 +1,68 @@
+diff --git a/projects/clr/rocclr/device/rocm/rocvirtual.cpp b/projects/clr/rocclr/device/rocm/rocvirtual.cpp
+index 5abfd73284..ddabe0c192 100644
+--- a/projects/clr/rocclr/device/rocm/rocvirtual.cpp
+++ b/projects/clr/rocclr/device/rocm/rocvirtual.cpp
+@@ -3411,49 +3411,7 @@ bool VirtualGPU::createVirtualQueue(uint deviceQueueSize) {
+ #if IS_LINUX
+ __attribute__((optimize("unroll-all-loops"), always_inline)) static inline void nontemporalMemcpy(
+     void* __restrict dst, const void* __restrict src, size_t size) {
+-#if defined(ATI_ARCH_X86)
+-#if defined(__AVX512F__)
+-  for (auto i = 0u; i != size / sizeof(__m512i); ++i) {
+-    _mm512_stream_si512(reinterpret_cast<__m512i* __restrict&>(dst)++,
+-                        *reinterpret_cast<const __m512i* __restrict&>(src)++);
+-  }
+-  size = size % sizeof(__m512i);
+-#endif
+-
+-#if defined(__AVX__)
+-  for (auto i = 0u; i != size / sizeof(__m256i); ++i) {
+-    _mm256_stream_si256(reinterpret_cast<__m256i* __restrict&>(dst)++,
+-                        *reinterpret_cast<const __m256i* __restrict&>(src)++);
+-  }
+-  size = size % sizeof(__m256i);
+-#endif
+-
+-  for (auto i = 0u; i != size / sizeof(__m128i); ++i) {
+-    _mm_stream_si128(reinterpret_cast<__m128i* __restrict&>(dst)++,
+-                     *(reinterpret_cast<const __m128i* __restrict&>(src)++));
+-  }
+-  size = size % sizeof(__m128i);
+-
+-  for (auto i = 0u; i != size / sizeof(long long); ++i) {
+-    _mm_stream_si64(reinterpret_cast<long long* __restrict&>(dst)++,
+-                    *reinterpret_cast<const long long* __restrict&>(src)++);
+-  }
+-  size = size % sizeof(long long);
+-
+-  for (auto i = 0u; i != size / sizeof(int); ++i) {
+-    _mm_stream_si32(reinterpret_cast<int* __restrict&>(dst)++,
+-                    *reinterpret_cast<const int* __restrict&>(src)++);
+-  }
+-
+-  size = size % sizeof(int);
+-  // Copy remaining bytes for unaligned size
+   std::memcpy(dst, src, size);
+-
+-  // Add memory fence
+-  _mm_sfence();
+-#else
+-  std::memcpy(dst, src, size);
+-#endif
+ }
+ #else
+ static inline void nontemporalMemcpy(void* __restrict dst, const void* __restrict src,
+@@ -3710,10 +3668,10 @@ bool VirtualGPU::submitKernelInternal(const amd::NDRangeContainer& sizes, const
+         *dev().info().hdpMemFlushCntl = 1u;
+         auto kSentinel = *reinterpret_cast<volatile int*>(dev().info().hdpMemFlushCntl);
+       } else if (kernArgImpl == KernelArgImpl::DeviceKernelArgsReadback && argSize != 0) {
+-        _mm_sfence();
+        //_mm_sfence();
+         *(argBuffer + argSize - 1) = *(parameters + argSize - 1);
+-        _mm_mfence();
+-        auto kSentinel = *reinterpret_cast<volatile unsigned char*>(argBuffer + argSize - 1);
+        //_mm_mfence();
+        //auto kSentinel = *reinterpret_cast<volatile unsigned char*>(argBuffer + argSize - 1);
+       }
+     }
+   }