diff --git a/.gitignore b/.gitignore
index 6f3f8d07cc..9e1a5ff76c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 .cline_storage
+/projects/hip/_build
diff --git a/projects/hip/docs/how-to/performance_guidelines.rst b/projects/hip/docs/how-to/performance_guidelines.rst
index 75bc420a55..f26367775e 100644
--- a/projects/hip/docs/how-to/performance_guidelines.rst
+++ b/projects/hip/docs/how-to/performance_guidelines.rst
@@ -10,10 +10,10 @@ Performance guidelines
 *******************************************************************************
 
 The AMD HIP performance guidelines provide practical, actionable techniques for
-optimizing application performance on AMD GPUs. This guide focuses on 
+optimizing application performance on AMD GPUs. This guide focuses on
 step-by-step instructions and best practices for improving performance.
 
-For theoretical foundations and performance concepts, see 
+For theoretical foundations and performance concepts, see
 :doc:`../understand/performance_optimization`.
 
 Optimization workflow
@@ -22,33 +22,33 @@ Optimization workflow
 Follow this systematic approach to optimize GPU performance:
 
 1. **Profile and measure baseline**
-   
+
    Use ``rocprofv3`` to identify bottlenecks:
-   
+
    .. code-block:: bash
-   
-      rocprofv3 --stats ./your_application
-   
-   Collect metrics on kernel execution time, memory bandwidth, occupancy, and 
-   CU utilization.
+
+      rocprofv3 --stats --<tracing_option> -- <application_path>
+
+   Collect metrics on kernel execution time, memory bandwidth, occupancy, and
+   CU utilization. For more details on using ``rocprofv3`` for application tracing and profiling, see :doc:`rocprofv3 documentation <rocprofiler-sdk:how-to/using-rocprofv3>`.
 
 2. **Analyze metrics to identify bottlenecks**
-   
-   Determine if kernels are compute-bound or memory-bound. Check arithmetic 
+
+   Determine if kernels are compute-bound or memory-bound. Check arithmetic
    intensity, memory bandwidth achieved vs peak, and compute throughput.
-   
+
    For understanding the roofline model, see :ref:`roofline_model`.
 
 3. **Apply targeted optimizations**
-   
+
    Based on identified bottlenecks, apply techniques from this guide.
 
 4. **Verify improvements**
-   
+
    Re-profile to confirm performance gains.
 
 5. **Iterate**
-   
+
    Repeat until performance goals are met.
 
 .. _parallel execution:
@@ -70,9 +70,9 @@ To enable parallel execution across the host and devices:
 
 For parallel workloads:
 
-* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for 
+* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for
   intra-block synchronization
-* Use global memory with separate kernel invocations for inter-block 
+* Use global memory with separate kernel invocations for inter-block
   synchronization (has overhead, minimize when possible)
 
 Device level
@@ -103,7 +103,7 @@ Memory throughput optimization
 The first step in maximizing memory throughput is to minimize low-bandwidth
 data transfers between the host and the device.
 
-Additionally, maximize the use of on-chip memory (shared memory and caches) and 
+Additionally, maximize the use of on-chip memory (shared memory and caches) and
 minimize transfers with global memory.
 
 .. _data transfer:
@@ -130,14 +130,14 @@ effective bandwidth.
    for (int i = 0; i < n; i++) {
        hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...);
    }
-   
+
    // Use a single large transfer
    hipMemcpy(d_data, h_data, n * sizeof(float), ...);
 
 **Use page-locked memory for transfers**
 
-Page-locked (pinned) memory cannot be swapped to disk by the operating system, 
-allowing the GPU to access it directly via DMA without CPU involvement. This 
+Page-locked (pinned) memory cannot be swapped to disk by the operating system,
+allowing the GPU to access it directly via DMA without CPU involvement. This
 eliminates an extra copy through a staging buffer and achieves higher bandwidth.
 
 .. code-block:: cuda
@@ -149,8 +149,8 @@ eliminates an extra copy through a staging buffer and achieves higher bandwidth.
 
 **Use mapped memory on integrated systems**
 
-On integrated GPUs (APUs), the CPU and GPU share the same physical memory. 
-Mapped page-locked memory allows zero-copy access, where the GPU reads directly 
+On integrated GPUs (APUs), the CPU and GPU share the same physical memory.
+Mapped page-locked memory allows zero-copy access, where the GPU reads directly
 from host memory without requiring an explicit transfer, eliminating redundant copies.
 
 .. code-block:: cuda
@@ -169,8 +169,8 @@ Device memory access
 
 **Ensure proper alignment**
 
-Memory hardware loads data in aligned chunks (typically 128 bytes). Using 
-naturally aligned data types ensures each access maps to a single memory 
+Memory hardware loads data in aligned chunks (typically 128 bytes). Using
+naturally aligned data types ensures each access maps to a single memory
 transaction, maximizing bandwidth and avoiding split transactions.
 
 .. code-block:: cuda
@@ -178,7 +178,7 @@ transaction, maximizing bandwidth and avoiding split transactions.
    // Use naturally aligned types
    float4 data;  // 16-byte aligned
    float2 data;  // 8-byte aligned
-   
+
    // Ensure structure alignment
    struct __align__(16) MyStruct {
        float4 data;
@@ -186,8 +186,8 @@ transaction, maximizing bandwidth and avoiding split transactions.
 
 **Optimize 2D array access**
 
-Padding 2D arrays to multiples of the wavefront size ensures each row starts 
-at an aligned memory boundary. This allows consecutive threads accessing the 
+Padding 2D arrays to multiples of the wavefront size ensures each row starts
+at an aligned memory boundary. This allows consecutive threads accessing the
 same row to generate coalesced memory transactions, thereby maximizing
 bandwidth.
 
@@ -196,14 +196,14 @@ bandwidth.
    // Ensure array width is multiple of warp size
    int width = ((actual_width + warpSize - 1) / warpSize) * warpSize;
    hipMalloc(&array, width * height * sizeof(float));
-   
+
    // Access pattern
    int idx = x + width * y;  // width should be warp-aligned
 
 **Coalesce memory accesses**
 
-When consecutive threads in a wavefront access consecutive memory addresses, 
-the hardware combines these into a single wide transaction. Non-coalesced 
+When consecutive threads in a wavefront access consecutive memory addresses,
+the hardware combines these into a single wide transaction. Non-coalesced
 patterns require multiple transactions, reducing effective bandwidth.
 
 .. code-block:: cuda
@@ -211,7 +211,7 @@ patterns require multiple transactions, reducing effective bandwidth.
    // Good: consecutive threads access consecutive addresses
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] = value;
-   
+
    // Bad: strided access
    int idx = threadIdx.x * stride;  // Non-coalesced if stride > 1
    data[idx] = value;
@@ -220,8 +220,8 @@ For understanding memory coalescing theory, see :ref:`memory_hierarchy_theory`.
 
 **Use shared memory for data reuse**
 
-Shared memory (LDS) provides low-latency on-chip storage shared across threads 
-in a block. Loading data into shared memory once and reusing it many times 
+Shared memory (LDS) provides low-latency on-chip storage shared across threads
+in a block. Loading data into shared memory once and reusing it many times
 reduces global memory traffic, particularly effective for tiled algorithms such
 as matrix multiplication.
 
@@ -229,18 +229,18 @@ as matrix multiplication.
 
    __global__ void optimized_kernel(float* input, float* output) {
        __shared__ float tile[TILE_SIZE][TILE_SIZE];
-       
+
        // Load data into shared memory
        tile[threadIdx.y][threadIdx.x] = input[...];
        __syncthreads();
-       
+
        // Reuse data from fast shared memory
        float result = 0;
        for (int i = 0; i < TILE_SIZE; i++) {
            result += tile[threadIdx.y][i] * tile[i][threadIdx.x];
        }
        __syncthreads();
-       
+
        output[...] = result;
    }
 
@@ -256,7 +256,7 @@ shifts addresses to avoid systematic conflicts.
    // Bad: power-of-2 stride causes conflicts
    __shared__ float data[32][32];
    float value = data[threadIdx.x][threadIdx.y];
-   
+
    // Good: padding avoids conflicts
    __shared__ float data[32][33];  // Extra column
    float value = data[threadIdx.x][threadIdx.y];
@@ -265,8 +265,8 @@ For bank conflict theory, see :ref:`bank_conflicts_theory`.
 
 **Use texture memory for 2D spatial access**
 
-Texture memory provides hardware-accelerated 2D filtering and caching optimized 
-for spatial locality. It automatically handles boundary conditions and can 
+Texture memory provides hardware-accelerated 2D filtering and caching optimized
+for spatial locality. It automatically handles boundary conditions and can
 interpolate values, making it ideal for image processing and nearby-neighbor access patterns.
 
 .. code-block:: cuda
@@ -274,7 +274,7 @@ interpolate values, making it ideal for image processing and nearby-neighbor acc
    // Create texture object
    hipTextureObject_t texObj;
    hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
-   
+
    // Access in kernel
    float value = tex2D<float>(texObj, x, y);
 
@@ -288,8 +288,8 @@ Arithmetic instructions
 
 **Use efficient operations**
 
-Division requires many more hardware cycles than multiplication. Similarly, 
-bitwise operations (shifts, AND, OR) are single-cycle instructions on integer 
+Division requires many more hardware cycles than multiplication. Similarly,
+bitwise operations (shifts, AND, OR) are single-cycle instructions on integer
 units, making them far more efficient than equivalent arithmetic for power-of-two calculations.
 
 .. code-block:: cuda
@@ -297,15 +297,15 @@ units, making them far more efficient than equivalent arithmetic for power-of-tw
    // Prefer multiplication over division
    float result = value * 0.5f;     // Fast
    float result = value / 2.0f;     // Slower
-   
+
    // Use bitwise operations for powers of 2
    int index = threadIdx.x << 2;    // Multiply by 4
    int mask = (1 << n) - 1;         // Create bit mask
 
 **Use single-precision when possible**
 
-AMD GPUs have significantly higher throughput for single-precision (FP32) 
-operations compared to double-precision (FP64). Using single-precision math 
+AMD GPUs have significantly higher throughput for single-precision (FP32)
+operations compared to double-precision (FP64). Using single-precision math
 functions can deliver substantial performance gains when FP64 accuracy is not required.
 
 .. code-block:: cuda
@@ -313,7 +313,7 @@ functions can deliver substantial performance gains when FP64 accuracy is not re
    // Single-precision (faster)
    float result = sinf(x);
    float result = expf(x);
-   
+
    // Double-precision (slower, use only when necessary)
    double result = sin(x);
    double result = exp(x);
@@ -340,8 +340,8 @@ Control flow optimization
 
 **Minimize divergence**
 
-When threads in a wavefront take different execution paths, the hardware 
-serializes both branches, executing each path with only the relevant threads 
+When threads in a wavefront take different execution paths, the hardware
+serializes both branches, executing each path with only the relevant threads
 active. This reduces effective parallelism and wastes cycles on inactive threads.
 
 .. code-block:: cuda
@@ -350,7 +350,7 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
    if (threadIdx.x < 32) {
        // All threads in first half-warp execute
    }
-   
+
    // Bad: divergence within warp
    if (data[threadIdx.x] > threshold) {
        // Some threads execute, others don't
@@ -358,8 +358,8 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
 
 **Use branch hints for predictable conditions**
 
-Providing hints about branch likelihood helps the compiler generate better 
-instruction ordering and can improve the branch predictor's accuracy, reducing 
+Providing hints about branch likelihood helps the compiler generate better
+instruction ordering and can improve the branch predictor's accuracy, reducing
 pipeline stalls when the prediction proves correct.
 
 .. code-block:: cuda
@@ -367,7 +367,7 @@ pipeline stalls when the prediction proves correct.
    if (__builtin_expect(rare_condition, 0)) {
        // Unlikely branch
    }
-   
+
    // C++20 attribute
    if (common_condition) [[likely]] {
        // Likely branch
@@ -375,9 +375,9 @@ pipeline stalls when the prediction proves correct.
 
 **Avoid divergent warps**
 
-When divergence is unavoidable, restructure the code to separate divergent paths 
-into different kernel launches or use predication (branchless programming) to 
-keep all threads active, though computing unnecessary values may be acceptable 
+When divergence is unavoidable, restructure the code to separate divergent paths
+into different kernel launches or use predication (branchless programming) to
+keep all threads active, though computing unnecessary values may be acceptable
 if it avoids the serialization penalty.
 
 .. code-block:: cuda
@@ -388,7 +388,7 @@ if it avoids the serialization penalty.
    } else {
        result = compute_odd();
    }
-   
+
    // Consider separating into different kernels or using predication
 
 Synchronization
@@ -396,32 +396,32 @@ Synchronization
 
 **Use minimal synchronization**
 
-Each synchronization point stalls all threads in a block until the slowest one 
-reaches the barrier. Minimize synchronizations by carefully analyzing data 
-dependencies—only synchronize when threads genuinely need to exchange data 
+Each synchronization point stalls all threads in a block until the slowest one
+reaches the barrier. Minimize synchronizations by carefully analyzing data
+dependencies—only synchronize when threads genuinely need to exchange data
 through shared memory.
 
 .. code-block:: cuda
 
    __global__ void kernel() {
        __shared__ float data[256];
-       
+
        // Load phase
        data[threadIdx.x] = input[...];
        __syncthreads();  // Necessary sync
-       
+
        // Compute phase - no sync needed if threads are independent
        float result = compute(data[...]);
-       
+
        // Store phase - sync only if needed
        output[...] = result;
    }
 
 **Use streams for async execution**
 
-Streams enable concurrent execution of independent operations. Commands in 
-different streams can overlap in time, allowing kernel execution and memory 
-transfers to run simultaneously. This maximizes GPU utilization by keeping 
+Streams enable concurrent execution of independent operations. Commands in
+different streams can overlap in time, allowing kernel execution and memory
+transfers to run simultaneously. This maximizes GPU utilization by keeping
 multiple execution engines busy concurrently.
 
 .. code-block:: cuda
@@ -429,11 +429,11 @@ multiple execution engines busy concurrently.
    hipStream_t stream1, stream2;
    hipStreamCreate(&stream1);
    hipStreamCreate(&stream2);
-   
+
    // Overlap independent operations
    kernel1<<<grid, block, 0, stream1>>>(...);
    kernel2<<<grid, block, 0, stream2>>>(...);
-   
+
    hipStreamSynchronize(stream1);
    hipStreamSynchronize(stream2);
 
@@ -444,9 +444,9 @@ High register usage can limit occupancy. Follow these steps:
 
 **Minimize live variables**
 
-The compiler allocates registers for every variable that must remain accessible. 
-Reducing the number of simultaneously live variables frees registers, allowing 
-more wavefronts to fit on each CU. Chaining function calls trades some redundant 
+The compiler allocates registers for every variable that must remain accessible.
+Reducing the number of simultaneously live variables frees registers, allowing
+more wavefronts to fit on each CU. Chaining function calls trades some redundant
 computation for lower register usage.
 
 .. code-block:: cuda
@@ -456,35 +456,35 @@ computation for lower register usage.
    float b = compute_b();
    float c = compute_c();
    float result = combine(a, b, c);
-   
+
    // Recompute or chain operations
    float result = combine(compute_a(), compute_b(), compute_c());
 
 **Use shared memory for temporary storage**
 
-Per-thread arrays stored in registers consume valuable register space, limiting 
-occupancy. Moving temporary storage to shared memory trades register usage for 
-shared memory usage, often allowing higher occupancy since shared memory limits 
+Per-thread arrays stored in registers consume valuable register space, limiting
+occupancy. Moving temporary storage to shared memory trades register usage for
+shared memory usage, often allowing higher occupancy since shared memory limits
 are typically less restrictive.
 
 .. code-block:: cuda
 
    // Instead of per-thread arrays (uses registers)
    float temp[100];
-   
+
    // Use shared memory
    __shared__ float temp[blockDim.x][100];
    float* my_temp = temp[threadIdx.x];
 
 **Adjust launch bounds**
 
-The ``__launch_bounds__`` attribute provides hints to the compiler about expected 
-thread block size and minimum blocks per CU. This guides register allocation 
+The ``__launch_bounds__`` attribute provides hints to the compiler about expected
+thread block size and minimum blocks per CU. This guides register allocation
 decisions, potentially trading per-thread register count for higher occupancy.
 
 .. code-block:: cuda
 
-   __global__ void 
+   __global__ void
    __launch_bounds__(256, 4)  // 256 threads, 4 blocks per CU
    my_kernel() {
        // Kernel code
@@ -492,8 +492,8 @@ decisions, potentially trading per-thread register count for higher occupancy.
 
 **Check register usage during compilation**
 
-The compiler can report per-kernel register usage statistics. Monitoring this 
-output helps identify kernels consuming excessive registers, guiding optimization 
+The compiler can report per-kernel register usage statistics. Monitoring this
+output helps identify kernels consuming excessive registers, guiding optimization
 efforts toward reducing register pressure in the most impactful areas.
 
 .. code-block:: bash
@@ -513,22 +513,22 @@ Use techniques from "Managing register pressure" above.
 
 **Reduce shared memory usage per block**
 
-Each CU has limited shared memory that must be divided among resident blocks. 
-Reducing per-block shared memory usage allows more blocks to reside simultaneously, 
+Each CU has limited shared memory that must be divided among resident blocks.
+Reducing per-block shared memory usage allows more blocks to reside simultaneously,
 increasing occupancy and improving latency hiding through greater thread-level parallelism.
 
 .. code-block:: cuda
 
    // Allocate only what's needed
    __shared__ float tile[TILE_SIZE][TILE_SIZE];
-   
+
    // Or use dynamic allocation
    extern __shared__ float dynamic_shared[];
 
 **Optimize block size**
 
-AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples 
-of 64 prevents partial wavefronts that waste execution slots. Larger blocks 
+AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples
+of 64 prevents partial wavefronts that waste execution slots. Larger blocks
 (128-256 threads) typically achieve better occupancy and resource utilization.
 
 .. code-block:: cuda
@@ -537,14 +537,14 @@ of 64 prevents partial wavefronts that waste execution slots. Larger blocks
    dim3 block(64);    // Good for AMD GPUs (wavefront=64)
    dim3 block(128);   // Common choice
    dim3 block(256);   // Good for high-occupancy kernels
-   
+
    // Avoid very small blocks
    dim3 block(32);    // May waste resources
 
 **Profile occupancy**
 
-Profiling tools report the ratio of active wavefronts to maximum possible 
-wavefronts per CU. Low occupancy suggests resource constraints (registers or 
+Profiling tools report the ratio of active wavefronts to maximum possible
+wavefronts per CU. Low occupancy suggests resource constraints (registers or
 shared memory) are limiting parallelism and may indicate opportunities for optimization.
 
 .. code-block:: bash
@@ -561,8 +561,8 @@ allocation calls over time. To optimize:
 
 **Allocate early, deallocate late**
 
-Frequent allocation and deallocation causes memory fragmentation and increases 
-allocator overhead. Reusing allocations across iterations amortizes the cost 
+Frequent allocation and deallocation causes memory fragmentation and increases
+allocator overhead. Reusing allocations across iterations amortizes the cost
 of memory management and maintains better memory locality.
 
 .. code-block:: cuda
@@ -574,7 +574,7 @@ of memory management and maintains better memory locality.
        // Use temp
        hipFree(temp);
    }
-   
+
    // Good: allocate once
    float* temp;
    hipMalloc(&temp, size);
@@ -585,22 +585,22 @@ of memory management and maintains better memory locality.
 
 **Avoid allocating all available memory**
 
-Reserving some memory headroom prevents allocation failures and system instability. 
-The driver and runtime need workspace for internal operations, and leaving a 
+Reserving some memory headroom prevents allocation failures and system instability.
+The driver and runtime need workspace for internal operations, and leaving a
 safety margin ensures stable operation without unexpected out-of-memory errors.
 
 .. code-block:: cuda
 
    size_t free, total;
    hipMemGetInfo(&free, &total);
-   
+
    // Don't allocate all free memory
    size_t safe_size = free * 0.9;  // Leave some margin
 
 **Use managed memory for oversubscription**
 
-Managed memory automatically migrates data between host and device on demand, 
-allowing allocations larger than physical GPU memory. Prefetching hints help 
+Managed memory automatically migrates data between host and device on demand,
+allowing allocations larger than physical GPU memory. Prefetching hints help
 the runtime optimize page placement, reducing migration overhead during kernel execution.
 
 .. code-block:: cuda
@@ -608,7 +608,7 @@ the runtime optimize page placement, reducing migration overhead during kernel e
    // Allows exceeding physical memory
    float* data;
    hipMallocManaged(&data, large_size);
-   
+
    // Optionally prefetch to device
    hipMemPrefetchAsync(data, size, device, stream);
 
@@ -623,5 +623,5 @@ Key optimization techniques:
 * **Manage resources**: Balance registers, shared memory, and occupancy
 * **Minimize divergence**: Structure control flow to keep warps coherent
 
-For understanding the theory behind these techniques, refer to 
+For understanding the theory behind these techniques, refer to
 :doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`.