diff --git a/.gitignore b/.gitignore index 6f3f8d07cc..9e1a5ff76c 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,2 @@ .cline_storage +/projects/hip/_build diff --git a/projects/hip/docs/how-to/performance_guidelines.rst b/projects/hip/docs/how-to/performance_guidelines.rst index 75bc420a55..f26367775e 100644 --- a/projects/hip/docs/how-to/performance_guidelines.rst +++ b/projects/hip/docs/how-to/performance_guidelines.rst @@ -10,10 +10,10 @@ Performance guidelines ******************************************************************************* The AMD HIP performance guidelines provide practical, actionable techniques for -optimizing application performance on AMD GPUs. This guide focuses on +optimizing application performance on AMD GPUs. This guide focuses on step-by-step instructions and best practices for improving performance. -For theoretical foundations and performance concepts, see +For theoretical foundations and performance concepts, see :doc:`../understand/performance_optimization`. Optimization workflow @@ -22,33 +22,33 @@ Optimization workflow Follow this systematic approach to optimize GPU performance: 1. **Profile and measure baseline** - + Use ``rocprofv3`` to identify bottlenecks: - + .. code-block:: bash - - rocprofv3 --stats ./your_application - - Collect metrics on kernel execution time, memory bandwidth, occupancy, and - CU utilization. + + rocprofv3 --stats -- -- + + Collect metrics on kernel execution time, memory bandwidth, occupancy, and + CU utilization. For more details on using ``rocprofv3`` for application tracing and profiling, see :doc:`rocprofv3 documentation `. 2. **Analyze metrics to identify bottlenecks** - - Determine if kernels are compute-bound or memory-bound. Check arithmetic + + Determine if kernels are compute-bound or memory-bound. Check arithmetic intensity, memory bandwidth achieved vs peak, and compute throughput. - + For understanding the roofline model, see :ref:`roofline_model`. 3. **Apply targeted optimizations** - + Based on identified bottlenecks, apply techniques from this guide. 4. **Verify improvements** - + Re-profile to confirm performance gains. 5. **Iterate** - + Repeat until performance goals are met. .. _parallel execution: @@ -70,9 +70,9 @@ To enable parallel execution across the host and devices: For parallel workloads: -* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for +* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for intra-block synchronization -* Use global memory with separate kernel invocations for inter-block +* Use global memory with separate kernel invocations for inter-block synchronization (has overhead, minimize when possible) Device level @@ -103,7 +103,7 @@ Memory throughput optimization The first step in maximizing memory throughput is to minimize low-bandwidth data transfers between the host and the device. -Additionally, maximize the use of on-chip memory (shared memory and caches) and +Additionally, maximize the use of on-chip memory (shared memory and caches) and minimize transfers with global memory. .. _data transfer: @@ -130,14 +130,14 @@ effective bandwidth. for (int i = 0; i < n; i++) { hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...); } - + // Use a single large transfer hipMemcpy(d_data, h_data, n * sizeof(float), ...); **Use page-locked memory for transfers** -Page-locked (pinned) memory cannot be swapped to disk by the operating system, -allowing the GPU to access it directly via DMA without CPU involvement. This +Page-locked (pinned) memory cannot be swapped to disk by the operating system, +allowing the GPU to access it directly via DMA without CPU involvement. This eliminates an extra copy through a staging buffer and achieves higher bandwidth. .. code-block:: cuda @@ -149,8 +149,8 @@ eliminates an extra copy through a staging buffer and achieves higher bandwidth. **Use mapped memory on integrated systems** -On integrated GPUs (APUs), the CPU and GPU share the same physical memory. -Mapped page-locked memory allows zero-copy access, where the GPU reads directly +On integrated GPUs (APUs), the CPU and GPU share the same physical memory. +Mapped page-locked memory allows zero-copy access, where the GPU reads directly from host memory without requiring an explicit transfer, eliminating redundant copies. .. code-block:: cuda @@ -169,8 +169,8 @@ Device memory access **Ensure proper alignment** -Memory hardware loads data in aligned chunks (typically 128 bytes). Using -naturally aligned data types ensures each access maps to a single memory +Memory hardware loads data in aligned chunks (typically 128 bytes). Using +naturally aligned data types ensures each access maps to a single memory transaction, maximizing bandwidth and avoiding split transactions. .. code-block:: cuda @@ -178,7 +178,7 @@ transaction, maximizing bandwidth and avoiding split transactions. // Use naturally aligned types float4 data; // 16-byte aligned float2 data; // 8-byte aligned - + // Ensure structure alignment struct __align__(16) MyStruct { float4 data; @@ -186,8 +186,8 @@ transaction, maximizing bandwidth and avoiding split transactions. **Optimize 2D array access** -Padding 2D arrays to multiples of the wavefront size ensures each row starts -at an aligned memory boundary. This allows consecutive threads accessing the +Padding 2D arrays to multiples of the wavefront size ensures each row starts +at an aligned memory boundary. This allows consecutive threads accessing the same row to generate coalesced memory transactions, thereby maximizing bandwidth. @@ -196,14 +196,14 @@ bandwidth. // Ensure array width is multiple of warp size int width = ((actual_width + warpSize - 1) / warpSize) * warpSize; hipMalloc(&array, width * height * sizeof(float)); - + // Access pattern int idx = x + width * y; // width should be warp-aligned **Coalesce memory accesses** -When consecutive threads in a wavefront access consecutive memory addresses, -the hardware combines these into a single wide transaction. Non-coalesced +When consecutive threads in a wavefront access consecutive memory addresses, +the hardware combines these into a single wide transaction. Non-coalesced patterns require multiple transactions, reducing effective bandwidth. .. code-block:: cuda @@ -211,7 +211,7 @@ patterns require multiple transactions, reducing effective bandwidth. // Good: consecutive threads access consecutive addresses int idx = threadIdx.x + blockIdx.x * blockDim.x; data[idx] = value; - + // Bad: strided access int idx = threadIdx.x * stride; // Non-coalesced if stride > 1 data[idx] = value; @@ -220,8 +220,8 @@ For understanding memory coalescing theory, see :ref:`memory_hierarchy_theory`. **Use shared memory for data reuse** -Shared memory (LDS) provides low-latency on-chip storage shared across threads -in a block. Loading data into shared memory once and reusing it many times +Shared memory (LDS) provides low-latency on-chip storage shared across threads +in a block. Loading data into shared memory once and reusing it many times reduces global memory traffic, particularly effective for tiled algorithms such as matrix multiplication. @@ -229,18 +229,18 @@ as matrix multiplication. __global__ void optimized_kernel(float* input, float* output) { __shared__ float tile[TILE_SIZE][TILE_SIZE]; - + // Load data into shared memory tile[threadIdx.y][threadIdx.x] = input[...]; __syncthreads(); - + // Reuse data from fast shared memory float result = 0; for (int i = 0; i < TILE_SIZE; i++) { result += tile[threadIdx.y][i] * tile[i][threadIdx.x]; } __syncthreads(); - + output[...] = result; } @@ -256,7 +256,7 @@ shifts addresses to avoid systematic conflicts. // Bad: power-of-2 stride causes conflicts __shared__ float data[32][32]; float value = data[threadIdx.x][threadIdx.y]; - + // Good: padding avoids conflicts __shared__ float data[32][33]; // Extra column float value = data[threadIdx.x][threadIdx.y]; @@ -265,8 +265,8 @@ For bank conflict theory, see :ref:`bank_conflicts_theory`. **Use texture memory for 2D spatial access** -Texture memory provides hardware-accelerated 2D filtering and caching optimized -for spatial locality. It automatically handles boundary conditions and can +Texture memory provides hardware-accelerated 2D filtering and caching optimized +for spatial locality. It automatically handles boundary conditions and can interpolate values, making it ideal for image processing and nearby-neighbor access patterns. .. code-block:: cuda @@ -274,7 +274,7 @@ interpolate values, making it ideal for image processing and nearby-neighbor acc // Create texture object hipTextureObject_t texObj; hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL); - + // Access in kernel float value = tex2D(texObj, x, y); @@ -288,8 +288,8 @@ Arithmetic instructions **Use efficient operations** -Division requires many more hardware cycles than multiplication. Similarly, -bitwise operations (shifts, AND, OR) are single-cycle instructions on integer +Division requires many more hardware cycles than multiplication. Similarly, +bitwise operations (shifts, AND, OR) are single-cycle instructions on integer units, making them far more efficient than equivalent arithmetic for power-of-two calculations. .. code-block:: cuda @@ -297,15 +297,15 @@ units, making them far more efficient than equivalent arithmetic for power-of-tw // Prefer multiplication over division float result = value * 0.5f; // Fast float result = value / 2.0f; // Slower - + // Use bitwise operations for powers of 2 int index = threadIdx.x << 2; // Multiply by 4 int mask = (1 << n) - 1; // Create bit mask **Use single-precision when possible** -AMD GPUs have significantly higher throughput for single-precision (FP32) -operations compared to double-precision (FP64). Using single-precision math +AMD GPUs have significantly higher throughput for single-precision (FP32) +operations compared to double-precision (FP64). Using single-precision math functions can deliver substantial performance gains when FP64 accuracy is not required. .. code-block:: cuda @@ -313,7 +313,7 @@ functions can deliver substantial performance gains when FP64 accuracy is not re // Single-precision (faster) float result = sinf(x); float result = expf(x); - + // Double-precision (slower, use only when necessary) double result = sin(x); double result = exp(x); @@ -340,8 +340,8 @@ Control flow optimization **Minimize divergence** -When threads in a wavefront take different execution paths, the hardware -serializes both branches, executing each path with only the relevant threads +When threads in a wavefront take different execution paths, the hardware +serializes both branches, executing each path with only the relevant threads active. This reduces effective parallelism and wastes cycles on inactive threads. .. code-block:: cuda @@ -350,7 +350,7 @@ active. This reduces effective parallelism and wastes cycles on inactive threads if (threadIdx.x < 32) { // All threads in first half-warp execute } - + // Bad: divergence within warp if (data[threadIdx.x] > threshold) { // Some threads execute, others don't @@ -358,8 +358,8 @@ active. This reduces effective parallelism and wastes cycles on inactive threads **Use branch hints for predictable conditions** -Providing hints about branch likelihood helps the compiler generate better -instruction ordering and can improve the branch predictor's accuracy, reducing +Providing hints about branch likelihood helps the compiler generate better +instruction ordering and can improve the branch predictor's accuracy, reducing pipeline stalls when the prediction proves correct. .. code-block:: cuda @@ -367,7 +367,7 @@ pipeline stalls when the prediction proves correct. if (__builtin_expect(rare_condition, 0)) { // Unlikely branch } - + // C++20 attribute if (common_condition) [[likely]] { // Likely branch @@ -375,9 +375,9 @@ pipeline stalls when the prediction proves correct. **Avoid divergent warps** -When divergence is unavoidable, restructure the code to separate divergent paths -into different kernel launches or use predication (branchless programming) to -keep all threads active, though computing unnecessary values may be acceptable +When divergence is unavoidable, restructure the code to separate divergent paths +into different kernel launches or use predication (branchless programming) to +keep all threads active, though computing unnecessary values may be acceptable if it avoids the serialization penalty. .. code-block:: cuda @@ -388,7 +388,7 @@ if it avoids the serialization penalty. } else { result = compute_odd(); } - + // Consider separating into different kernels or using predication Synchronization @@ -396,32 +396,32 @@ Synchronization **Use minimal synchronization** -Each synchronization point stalls all threads in a block until the slowest one -reaches the barrier. Minimize synchronizations by carefully analyzing data -dependencies—only synchronize when threads genuinely need to exchange data +Each synchronization point stalls all threads in a block until the slowest one +reaches the barrier. Minimize synchronizations by carefully analyzing data +dependencies—only synchronize when threads genuinely need to exchange data through shared memory. .. code-block:: cuda __global__ void kernel() { __shared__ float data[256]; - + // Load phase data[threadIdx.x] = input[...]; __syncthreads(); // Necessary sync - + // Compute phase - no sync needed if threads are independent float result = compute(data[...]); - + // Store phase - sync only if needed output[...] = result; } **Use streams for async execution** -Streams enable concurrent execution of independent operations. Commands in -different streams can overlap in time, allowing kernel execution and memory -transfers to run simultaneously. This maximizes GPU utilization by keeping +Streams enable concurrent execution of independent operations. Commands in +different streams can overlap in time, allowing kernel execution and memory +transfers to run simultaneously. This maximizes GPU utilization by keeping multiple execution engines busy concurrently. .. code-block:: cuda @@ -429,11 +429,11 @@ multiple execution engines busy concurrently. hipStream_t stream1, stream2; hipStreamCreate(&stream1); hipStreamCreate(&stream2); - + // Overlap independent operations kernel1<<>>(...); kernel2<<>>(...); - + hipStreamSynchronize(stream1); hipStreamSynchronize(stream2); @@ -444,9 +444,9 @@ High register usage can limit occupancy. Follow these steps: **Minimize live variables** -The compiler allocates registers for every variable that must remain accessible. -Reducing the number of simultaneously live variables frees registers, allowing -more wavefronts to fit on each CU. Chaining function calls trades some redundant +The compiler allocates registers for every variable that must remain accessible. +Reducing the number of simultaneously live variables frees registers, allowing +more wavefronts to fit on each CU. Chaining function calls trades some redundant computation for lower register usage. .. code-block:: cuda @@ -456,35 +456,35 @@ computation for lower register usage. float b = compute_b(); float c = compute_c(); float result = combine(a, b, c); - + // Recompute or chain operations float result = combine(compute_a(), compute_b(), compute_c()); **Use shared memory for temporary storage** -Per-thread arrays stored in registers consume valuable register space, limiting -occupancy. Moving temporary storage to shared memory trades register usage for -shared memory usage, often allowing higher occupancy since shared memory limits +Per-thread arrays stored in registers consume valuable register space, limiting +occupancy. Moving temporary storage to shared memory trades register usage for +shared memory usage, often allowing higher occupancy since shared memory limits are typically less restrictive. .. code-block:: cuda // Instead of per-thread arrays (uses registers) float temp[100]; - + // Use shared memory __shared__ float temp[blockDim.x][100]; float* my_temp = temp[threadIdx.x]; **Adjust launch bounds** -The ``__launch_bounds__`` attribute provides hints to the compiler about expected -thread block size and minimum blocks per CU. This guides register allocation +The ``__launch_bounds__`` attribute provides hints to the compiler about expected +thread block size and minimum blocks per CU. This guides register allocation decisions, potentially trading per-thread register count for higher occupancy. .. code-block:: cuda - __global__ void + __global__ void __launch_bounds__(256, 4) // 256 threads, 4 blocks per CU my_kernel() { // Kernel code @@ -492,8 +492,8 @@ decisions, potentially trading per-thread register count for higher occupancy. **Check register usage during compilation** -The compiler can report per-kernel register usage statistics. Monitoring this -output helps identify kernels consuming excessive registers, guiding optimization +The compiler can report per-kernel register usage statistics. Monitoring this +output helps identify kernels consuming excessive registers, guiding optimization efforts toward reducing register pressure in the most impactful areas. .. code-block:: bash @@ -513,22 +513,22 @@ Use techniques from "Managing register pressure" above. **Reduce shared memory usage per block** -Each CU has limited shared memory that must be divided among resident blocks. -Reducing per-block shared memory usage allows more blocks to reside simultaneously, +Each CU has limited shared memory that must be divided among resident blocks. +Reducing per-block shared memory usage allows more blocks to reside simultaneously, increasing occupancy and improving latency hiding through greater thread-level parallelism. .. code-block:: cuda // Allocate only what's needed __shared__ float tile[TILE_SIZE][TILE_SIZE]; - + // Or use dynamic allocation extern __shared__ float dynamic_shared[]; **Optimize block size** -AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples -of 64 prevents partial wavefronts that waste execution slots. Larger blocks +AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples +of 64 prevents partial wavefronts that waste execution slots. Larger blocks (128-256 threads) typically achieve better occupancy and resource utilization. .. code-block:: cuda @@ -537,14 +537,14 @@ of 64 prevents partial wavefronts that waste execution slots. Larger blocks dim3 block(64); // Good for AMD GPUs (wavefront=64) dim3 block(128); // Common choice dim3 block(256); // Good for high-occupancy kernels - + // Avoid very small blocks dim3 block(32); // May waste resources **Profile occupancy** -Profiling tools report the ratio of active wavefronts to maximum possible -wavefronts per CU. Low occupancy suggests resource constraints (registers or +Profiling tools report the ratio of active wavefronts to maximum possible +wavefronts per CU. Low occupancy suggests resource constraints (registers or shared memory) are limiting parallelism and may indicate opportunities for optimization. .. code-block:: bash @@ -561,8 +561,8 @@ allocation calls over time. To optimize: **Allocate early, deallocate late** -Frequent allocation and deallocation causes memory fragmentation and increases -allocator overhead. Reusing allocations across iterations amortizes the cost +Frequent allocation and deallocation causes memory fragmentation and increases +allocator overhead. Reusing allocations across iterations amortizes the cost of memory management and maintains better memory locality. .. code-block:: cuda @@ -574,7 +574,7 @@ of memory management and maintains better memory locality. // Use temp hipFree(temp); } - + // Good: allocate once float* temp; hipMalloc(&temp, size); @@ -585,22 +585,22 @@ of memory management and maintains better memory locality. **Avoid allocating all available memory** -Reserving some memory headroom prevents allocation failures and system instability. -The driver and runtime need workspace for internal operations, and leaving a +Reserving some memory headroom prevents allocation failures and system instability. +The driver and runtime need workspace for internal operations, and leaving a safety margin ensures stable operation without unexpected out-of-memory errors. .. code-block:: cuda size_t free, total; hipMemGetInfo(&free, &total); - + // Don't allocate all free memory size_t safe_size = free * 0.9; // Leave some margin **Use managed memory for oversubscription** -Managed memory automatically migrates data between host and device on demand, -allowing allocations larger than physical GPU memory. Prefetching hints help +Managed memory automatically migrates data between host and device on demand, +allowing allocations larger than physical GPU memory. Prefetching hints help the runtime optimize page placement, reducing migration overhead during kernel execution. .. code-block:: cuda @@ -608,7 +608,7 @@ the runtime optimize page placement, reducing migration overhead during kernel e // Allows exceeding physical memory float* data; hipMallocManaged(&data, large_size); - + // Optionally prefetch to device hipMemPrefetchAsync(data, size, device, stream); @@ -623,5 +623,5 @@ Key optimization techniques: * **Manage resources**: Balance registers, shared memory, and occupancy * **Minimize divergence**: Structure control flow to keep warps coherent -For understanding the theory behind these techniques, refer to +For understanding the theory behind these techniques, refer to :doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`.