.. meta::
  :description: This chapter describes a set of best practices designed to help
                developers optimize the performance of HIP-capable GPU architectures.
  :keywords: AMD, ROCm, HIP, CUDA, performance, guidelines, optimization, how-to

.. _how_to_performance_guidelines:

*******************************************************************************
Performance guidelines
*******************************************************************************

The AMD HIP performance guidelines provide practical, actionable techniques for
optimizing application performance on AMD GPUs. This guide focuses on
step-by-step instructions and best practices for improving performance.

For theoretical foundations and performance concepts, see
:doc:`../understand/performance_optimization`.

Optimization workflow
=====================

Follow this systematic approach to optimize GPU performance:

1. **Profile and measure baseline**

   Use ``rocprofv3`` to identify bottlenecks:

   .. code-block:: bash

     rocprofv3 --stats --<tracing_option> -- <application_path>

   Collect metrics on kernel execution time, memory bandwidth, occupancy, and
   CU utilization. For more details on using ``rocprofv3`` for application
   tracing and profiling, see
   :doc:`rocprofv3 documentation <rocprofiler-sdk:how-to/using-rocprofv3>`.

2. **Analyze metrics to identify bottlenecks**

   Determine if kernels are compute-bound or memory-bound. Check arithmetic
   intensity, memory bandwidth achieved vs peak, and compute throughput.

   For understanding the roofline model, see :ref:`roofline_model`.

3. **Apply targeted optimizations**

   Based on identified bottlenecks, apply techniques from this guide.

4. **Verify improvements**

   Re-profile to confirm performance gains.

5. **Iterate**

   Repeat until performance goals are met.

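
The compute-bound versus memory-bound decision in step 2 can be sketched as a
small host-side calculation. This is a minimal illustration, not part of any
ROCm tool: ``KernelSample`` and the helper functions are hypothetical names,
and the peak values must come from your own device specifications and profiler
output.

.. code-block:: cpp

  #include <algorithm>

  // Hypothetical container for two numbers taken from a profiler run.
  struct KernelSample {
      double flops;        // floating-point operations executed
      double bytes_moved;  // bytes read from and written to device memory
  };

  // Arithmetic intensity in FLOP per byte.
  double arithmetic_intensity(const KernelSample& k) {
      return k.flops / k.bytes_moved;
  }

  // Roofline bound: min(peak compute, intensity * peak bandwidth).
  double attainable_flops(const KernelSample& k, double peak_flops,
                          double peak_bandwidth) {
      return std::min(peak_flops, arithmetic_intensity(k) * peak_bandwidth);
  }

  // A kernel is memory-bound when its intensity lies below the ridge point
  // (peak compute divided by peak bandwidth).
  bool is_memory_bound(const KernelSample& k, double peak_flops,
                       double peak_bandwidth) {
      return arithmetic_intensity(k) < peak_flops / peak_bandwidth;
  }

A kernel classified as memory-bound is a candidate for the memory throughput
techniques in this guide, while a compute-bound kernel is a candidate for the
instruction throughput techniques.
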
.. _parallel execution:

Parallel execution
==================

For optimal use and to keep all system components busy, the application must
expose as much parallelism as possible and map it efficiently onto the
hardware.

Application level
-----------------

To enable parallel execution across the host and devices:

* Use :ref:`asynchronous calls and streams <asynchronous_how-to>`
* Assign serial workloads to the host
* Assign parallel workloads to the devices

For parallel workloads:

* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for
  intra-block synchronization
* Use global memory with separate kernel invocations for inter-block
  synchronization (this has overhead, so minimize it when possible)

Device level
------------

Maximize parallel execution across multiprocessors:

* Execute multiple kernels concurrently on a device
* Use streams to overlap computation and data transfers
* Keep all multiprocessors busy with enough concurrent kernels
* Avoid launching too many kernels (causes resource contention)

Multiprocessor level
--------------------

Maximize parallel execution within each multiprocessor:

* Keep enough warps resident that one is ready to issue on every clock cycle
* Exploit instruction-level parallelism within warps
* Exploit thread-level parallelism across warps
* Balance resource usage for optimal occupancy

.. _memory optimization:

Memory throughput optimization
==============================

The first step in maximizing memory throughput is to minimize low-bandwidth
data transfers between the host and the device.
Additionally, maximize the use of on-chip memory (shared memory and caches) and
minimize transfers with global memory.

.. _data transfer:

Data transfer optimization
--------------------------

**Minimize host-device transfers**

* Move computations from host to device when possible
* Create, use, and discard intermediate data structures on device
* Avoid unnecessary copies to host memory

**Batch small transfers**

Each memory transfer incurs a fixed overhead from driver calls and PCIe
transaction setup. Consolidating many small transfers into a single large
transfer amortizes this overhead across more data, resulting in much higher
effective bandwidth.

.. code-block:: cuda

  // Instead of many small transfers
  for (int i = 0; i < n; i++) {
      hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...);
  }

  // Use a single large transfer
  hipMemcpy(d_data, h_data, n * sizeof(float), ...);

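
The amortization argument can be made concrete with a simple cost model. The
sketch below assumes that each transfer costs a fixed overhead plus the payload
size divided by the bandwidth. The function names are hypothetical, and the
overhead and bandwidth values are placeholders to be replaced with measurements
from your own system.

.. code-block:: cpp

  // Assumed cost model: time = per-call overhead + bytes / bandwidth.
  double transfer_time(double bytes, double overhead_s, double bandwidth) {
      return overhead_s + bytes / bandwidth;
  }

  // n separate transfers pay the fixed overhead n times.
  double time_many_small(int n, double bytes_each, double overhead_s,
                         double bandwidth) {
      return n * transfer_time(bytes_each, overhead_s, bandwidth);
  }

  // One batched transfer pays the fixed overhead once.
  double time_one_large(int n, double bytes_each, double overhead_s,
                        double bandwidth) {
      return transfer_time(n * bytes_each, overhead_s, bandwidth);
  }

Under this model, the batched copy is faster whenever the per-call overhead is
non-zero, and the gap grows with the number of small transfers that are
combined.
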
**Use page-locked memory for transfers**

Page-locked (pinned) memory cannot be swapped to disk by the operating system,
allowing the GPU to access it directly via DMA without CPU involvement. This
eliminates an extra copy through a staging buffer and achieves higher bandwidth.

.. code-block:: cuda

  float* h_pinned;
  hipHostMalloc(&h_pinned, size);

  // Faster transfers than pageable memory
  hipMemcpy(d_data, h_pinned, size, hipMemcpyHostToDevice);

**Use mapped memory on integrated systems**

On integrated GPUs (APUs), the CPU and GPU share the same physical memory.
Mapped page-locked memory allows zero-copy access, where the GPU reads directly
from host memory without requiring an explicit transfer, eliminating redundant
copies.

.. code-block:: cuda

  int integrated;
  hipDeviceGetAttribute(&integrated, hipDeviceAttributeIntegrated, device);

  if (integrated) {
      // Use mapped page-locked memory - no explicit copy needed
      hipHostMalloc(&ptr, size, hipHostMallocMapped);
  }

.. _device memory access:

Device memory access
--------------------

**Ensure proper alignment**

Memory hardware loads data in aligned chunks (typically 128 bytes). Using
naturally aligned data types ensures each access maps to a single memory
transaction, maximizing bandwidth and avoiding split transactions.

.. code-block:: cuda

  // Use naturally aligned types
  float4 vec4;  // 16-byte aligned
  float2 vec2;  // 8-byte aligned

  // Ensure structure alignment
  struct __align__(16) MyStruct {
      float4 data;
  };

**Optimize 2D array access**

Padding the width of 2D arrays to a multiple of the wavefront size ensures each
row starts at an aligned memory boundary. This allows consecutive threads
accessing the same row to generate coalesced memory transactions, thereby
maximizing bandwidth.

.. code-block:: cuda

  // Ensure array width is multiple of warp size
  int width = ((actual_width + warpSize - 1) / warpSize) * warpSize;
  hipMalloc(&array, width * height * sizeof(float));

  // Access pattern
  int idx = x + width * y;  // width should be warp-aligned

**Coalesce memory accesses**

When consecutive threads in a wavefront access consecutive memory addresses,
the hardware combines these into a single wide transaction. Non-coalesced
patterns require multiple transactions, reducing effective bandwidth.

.. code-block:: cuda

  // Good: consecutive threads access consecutive addresses
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  data[idx] = value;

  // Bad: strided access
  int idx = threadIdx.x * stride;  // Non-coalesced if stride > 1
  data[idx] = value;

For understanding memory coalescing theory, see :ref:`memory_hierarchy_theory`.

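
The cost of a strided pattern can be estimated on the host by counting how many
aligned memory segments one wavefront touches. The sketch below is a simplified
model rather than a description of any specific memory controller. The function
``segments_touched`` is a hypothetical helper, and the 128-byte segment size
and 64-lane wavefront used in the note below are assumptions taken from the
typical values quoted in this guide.

.. code-block:: cpp

  #include <cstddef>
  #include <set>

  // Counts the distinct aligned segments touched when lane i of a wavefront
  // accesses element i * stride of an array that starts at address 0.
  std::size_t segments_touched(std::size_t lanes, std::size_t stride,
                               std::size_t elem_bytes,
                               std::size_t segment_bytes) {
      std::set<std::size_t> segments;
      for (std::size_t lane = 0; lane < lanes; ++lane) {
          std::size_t address = lane * stride * elem_bytes;
          segments.insert(address / segment_bytes);
      }
      return segments.size();
  }

With 64 lanes, 4-byte elements, and 128-byte segments, a stride of 1 touches 2
segments, whereas a stride of 32 touches 64 segments, one per lane.
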
**Use shared memory for data reuse**

Shared memory (LDS) provides low-latency on-chip storage shared across threads
in a block. Loading data into shared memory once and reusing it many times
reduces global memory traffic. This is particularly effective for tiled
algorithms such as matrix multiplication.

.. code-block:: cuda

  __global__ void optimized_kernel(float* input, float* output) {
      __shared__ float tile[TILE_SIZE][TILE_SIZE];

      // Load data into shared memory
      tile[threadIdx.y][threadIdx.x] = input[...];
      __syncthreads();

      // Reuse data from fast shared memory
      float result = 0;
      for (int i = 0; i < TILE_SIZE; i++) {
          result += tile[threadIdx.y][i] * tile[i][threadIdx.x];
      }
      __syncthreads();

      output[...] = result;
  }

**Avoid bank conflicts in shared memory**

Shared memory is organized into banks, each capable of servicing one request per
cycle. When multiple threads in a warp access the same bank simultaneously, the
requests are serialized, reducing throughput. Padding arrays by one element
shifts addresses to avoid systematic conflicts.

.. code-block:: cuda

  // Bad: power-of-2 stride causes conflicts
  __shared__ float data[32][32];
  float value = data[threadIdx.x][threadIdx.y];

  // Good: padding avoids conflicts
  __shared__ float data[32][33];  // Extra column
  float value = data[threadIdx.x][threadIdx.y];

For bank conflict theory, see :ref:`bank_conflicts_theory`.

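
The effect of the extra column can be checked with a small host-side model.
The sketch below assumes that consecutive 4-byte words map to consecutive
banks and that the bank count is a parameter (32 is used in the note below as
an assumption, so check the value for your architecture). The function
``conflict_degree`` is a hypothetical helper.

.. code-block:: cpp

  #include <algorithm>
  #include <vector>

  // Worst-case number of lanes that hit the same bank when lane t reads
  // data[t][column] from a row-major array with row_pitch words per row.
  int conflict_degree(int lanes, int row_pitch, int column, int banks) {
      std::vector<int> hits(banks, 0);
      for (int t = 0; t < lanes; ++t) {
          int word_index = t * row_pitch + column;
          ++hits[word_index % banks];
      }
      return *std::max_element(hits.begin(), hits.end());
  }

Under these assumptions, a row pitch of 32 words sends all 32 lanes to a single
bank, whereas a row pitch of 33 words spreads them over 32 different banks.
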
**Use texture memory for 2D spatial access**

Texture memory provides hardware-accelerated 2D filtering and caching optimized
for spatial locality. It automatically handles boundary conditions and can
interpolate values, making it ideal for image processing and nearby-neighbor
access patterns.

.. code-block:: cuda

  // Create texture object
  hipTextureObject_t texObj;
  hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

  // Access in kernel
  float value = tex2D<float>(texObj, x, y);

.. _instruction optimization:

Instruction throughput optimization
====================================

Arithmetic instructions
-----------------------

**Use efficient operations**

Division requires many more hardware cycles than multiplication. Similarly,
bitwise operations (shifts, AND, OR) are single-cycle instructions on integer
units, making them far more efficient than equivalent arithmetic for
power-of-two calculations.

.. code-block:: cuda

  // Prefer multiplication over division
  float result = value * 0.5f;  // Fast
  float result = value / 2.0f;  // Slower

  // Use bitwise operations for powers of 2
  int index = threadIdx.x << 2;  // Multiply by 4
  int mask = (1 << n) - 1;       // Create bit mask

**Use single-precision when possible**

AMD GPUs have significantly higher throughput for single-precision (FP32)
operations compared to double-precision (FP64). Using single-precision math
functions can deliver substantial performance gains when FP64 accuracy is not
required.

.. code-block:: cuda

  // Single-precision (faster)
  float result = sinf(x);
  float result = expf(x);

  // Double-precision (slower, use only when necessary)
  double result = sin(x);
  double result = exp(x);

**Leverage fast math intrinsics**

Hardware-specific intrinsics bypass certain accuracy checks and use lookup
tables or polynomial approximations, trading slight precision loss for
significantly higher throughput. Use them when the application can tolerate
reduced precision. Intrinsics with an ``_rn`` suffix round to nearest, so
their benefit lies in selecting a specific instruction rather than in reduced
accuracy.

.. code-block:: cuda

  // Intrinsic versions
  float ex = __expf(x);      // Fast exponential
  float lg = __logf(x);      // Fast logarithm
  float sq = __fsqrt_rn(x);  // Square root, round-to-nearest
  float rc = __frcp_rn(x);   // Reciprocal, round-to-nearest

.. _control flow instructions:

Control flow optimization
-------------------------

**Minimize divergence**

When threads in a wavefront take different execution paths, the hardware
executes the branches one after another, running each path with only the
relevant threads active. This reduces effective parallelism and wastes cycles
on inactive threads. A condition that evaluates identically for every thread
of a wavefront does not diverge.

.. code-block:: cuda

  // Good: the condition is uniform within each wavefront
  if (threadIdx.x / warpSize == 0) {
      // All threads of the first wavefront execute together
  }

  // Bad: divergence within a wavefront
  if (data[threadIdx.x] > threshold) {
      // Some threads execute, others don't
  }

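
Whether a condition diverges depends on how its outcomes line up with wavefront
boundaries. The following host-side sketch counts the wavefronts in which
lanes disagree. ``divergent_wavefronts`` is a hypothetical helper, and the
wavefront size is passed in as a parameter because it differs between
architectures.

.. code-block:: cpp

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Counts wavefronts in which at least one lane takes the branch and at
  // least one lane does not. takes_branch[i] is the outcome for thread i.
  std::size_t divergent_wavefronts(const std::vector<bool>& takes_branch,
                                   std::size_t wavefront_size) {
      std::size_t divergent = 0;
      for (std::size_t base = 0; base < takes_branch.size();
           base += wavefront_size) {
          std::size_t end =
              std::min(base + wavefront_size, takes_branch.size());
          bool any_true = false;
          bool any_false = false;
          for (std::size_t i = base; i < end; ++i) {
              if (takes_branch[i]) {
                  any_true = true;
              } else {
                  any_false = true;
              }
          }
          if (any_true && any_false) {
              ++divergent;
          }
      }
      return divergent;
  }

For a 256-thread block and a wavefront size of 64, a condition such as
``i < 64`` diverges in no wavefront, while ``i % 2 == 0`` diverges in all four.
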
**Use branch hints for predictable conditions**

Providing hints about branch likelihood helps the compiler generate better
instruction ordering, for example by placing the common path on the
fall-through side of the branch. Any benefit comes from code layout chosen at
compile time, so measure the effect before relying on it.

.. code-block:: cuda

  if (__builtin_expect(rare_condition, 0)) {
      // Unlikely branch
  }

  // C++20 attribute
  if (common_condition) [[likely]] {
      // Likely branch
  }

**Avoid divergent warps**

When divergence is unavoidable, restructure the code to separate divergent paths
into different kernel launches, or use predication (branchless programming) to
keep all threads active. Computing a value that is then discarded can be
acceptable if it avoids the serialization penalty.

.. code-block:: cuda

  // Instead of:
  if (threadIdx.x % 2 == 0) {
      result = compute_even();
  } else {
      result = compute_odd();
  }

  // Consider separating into different kernels or using predication

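
One way to write the predicated form is to evaluate both paths and pick the
result with a conditional expression. The sketch below is plain C++ with
hypothetical stand-ins for ``compute_even`` and ``compute_odd``. Whether the
compiler emits a select instruction instead of a branch is its own decision, so
inspect the generated code or profile before assuming a gain.

.. code-block:: cpp

  // Hypothetical stand-ins for the two code paths.
  float compute_even(float x) { return x * 2.0f; }
  float compute_odd(float x) { return x + 1.0f; }

  // Both paths are evaluated by every lane; the parity of the lane index
  // only decides which of the two results is kept.
  float select_by_parity(int lane, float x) {
      float even_value = compute_even(x);
      float odd_value = compute_odd(x);
      return (lane % 2 == 0) ? even_value : odd_value;
  }

This pays for both computations on every thread, so it tends to be worthwhile
only when the two paths are short.
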
Synchronization
---------------

**Use minimal synchronization**

Each synchronization point stalls all threads in a block until the slowest one
reaches the barrier. Minimize synchronizations by carefully analyzing data
dependencies: only synchronize when threads genuinely need to exchange data
through shared memory.

.. code-block:: cuda

  __global__ void kernel(const float* input, float* output) {
      __shared__ float data[256];

      // Load phase
      data[threadIdx.x] = input[...];
      __syncthreads();  // Necessary sync

      // Compute phase - no sync needed if threads are independent
      float result = compute(data[...]);

      // Store phase - sync only if needed
      output[...] = result;
  }

**Use streams for async execution**

Streams enable concurrent execution of independent operations. Commands in
different streams can overlap in time, allowing kernel execution and memory
transfers to run simultaneously. This maximizes GPU utilization by keeping
multiple execution engines busy concurrently.

.. code-block:: cuda

  hipStream_t stream1, stream2;
  hipStreamCreate(&stream1);
  hipStreamCreate(&stream2);

  // Overlap independent operations
  kernel1<<<grid, block, 0, stream1>>>(...);
  kernel2<<<grid, block, 0, stream2>>>(...);

  hipStreamSynchronize(stream1);
  hipStreamSynchronize(stream2);

Managing register pressure
==========================

High register usage can limit occupancy. Follow these steps:

**Minimize live variables**

The compiler allocates registers for every variable that must remain accessible.
Reducing the number of simultaneously live variables frees registers, allowing
more wavefronts to fit on each CU. Chaining function calls or recomputing a
value where it is needed can shorten live ranges, trading some redundant
computation for lower register usage. The compiler makes the final allocation
decision, so confirm the effect with the register usage report.

.. code-block:: cuda

  // Instead of storing all intermediate results
  float a = compute_a();
  float b = compute_b();
  float c = compute_c();
  float result = combine(a, b, c);

  // Recompute or chain operations
  float result = combine(compute_a(), compute_b(), compute_c());

**Use shared memory for temporary storage**

Per-thread arrays stored in registers consume valuable register space, limiting
occupancy. Moving temporary storage to shared memory trades register usage for
shared memory usage, often allowing higher occupancy since shared memory limits
are typically less restrictive. A statically sized shared array needs
compile-time dimensions, and the total must fit in the shared memory available
to a block.

.. code-block:: cuda

  // Instead of per-thread arrays (uses registers)
  float temp[100];

  // Use shared memory; BLOCK_SIZE is an assumed compile-time constant
  // that must match the block size used at launch
  constexpr int BLOCK_SIZE = 64;
  __shared__ float temp[BLOCK_SIZE][100];
  float* my_temp = temp[threadIdx.x];

**Adjust launch bounds**

The ``__launch_bounds__`` attribute tells the compiler the maximum thread block
size and, optionally, the minimum number of warps that should be resident per
execution unit. In HIP the second argument counts warps per execution unit,
whereas CUDA counts blocks per multiprocessor. This guides register allocation
decisions, potentially trading per-thread register count for higher occupancy.

.. code-block:: cuda

  __global__ void
  __launch_bounds__(256, 4)  // max 256 threads per block, min 4 warps per EU
  my_kernel() {
      // Kernel code
  }

**Check register usage during compilation**

The compiler can report per-kernel register usage statistics. Monitoring this
output helps identify kernels consuming excessive registers, guiding optimization
efforts toward reducing register pressure in the most impactful areas.

.. code-block:: bash

  hipcc --resource-usage kernel.hip

For register pressure theory, see :ref:`register_pressure_theory`.

Improving occupancy
===================

Higher occupancy helps hide latency. Follow these steps:

**Reduce register usage per thread**

Use techniques from "Managing register pressure" above.

**Reduce shared memory usage per block**

Each CU has limited shared memory that must be divided among resident blocks.
Reducing per-block shared memory usage allows more blocks to reside
simultaneously, increasing occupancy and improving latency hiding through
greater thread-level parallelism.

.. code-block:: cuda

  // Allocate only what's needed
  __shared__ float tile[TILE_SIZE][TILE_SIZE];

  // Or use dynamic allocation
  extern __shared__ float dynamic_shared[];

**Optimize block size**

AMD GPUs execute threads in wavefronts of 64 on CDNA and GCN architectures
(RDNA architectures default to 32). Choosing block sizes as multiples of the
wavefront size prevents partial wavefronts that waste execution slots. Larger
blocks (128-256 threads) typically achieve better occupancy and resource
utilization.

.. code-block:: cuda

  // Use multiples of wavefront size
  dim3 block(64);   // Good for AMD GPUs (wavefront=64)
  dim3 block(128);  // Common choice
  dim3 block(256);  // Good for high-occupancy kernels

  // Avoid very small blocks
  dim3 block(32);   // May waste resources

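
The waste caused by a block size that is not a multiple of the wavefront size
can be computed directly. The helpers below are hypothetical names for a
minimal host-side sketch, with the wavefront size passed in as a parameter.

.. code-block:: cpp

  // Number of wavefronts needed to hold one block (rounded up).
  int wavefronts_per_block(int block_size, int wavefront_size) {
      return (block_size + wavefront_size - 1) / wavefront_size;
  }

  // Lanes that are allocated but stay idle in the last, partially filled
  // wavefront of the block.
  int idle_lanes(int block_size, int wavefront_size) {
      return wavefronts_per_block(block_size, wavefront_size) * wavefront_size
             - block_size;
  }

With a wavefront size of 64, a block of 32 threads leaves 32 lanes idle and a
block of 100 threads leaves 28 lanes idle, whereas a block of 256 threads
leaves none.
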
**Profile occupancy**

Profiling tools report the ratio of active wavefronts to maximum possible
wavefronts per CU. Low occupancy suggests resource constraints (registers or
shared memory) are limiting parallelism and may indicate opportunities for
optimization.

.. code-block:: bash

  rocprofv3 --occupancy -- ./your_application

For occupancy theory, see :ref:`occupancy`.

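
The way register and shared memory usage cap occupancy can be approximated with
a simplified model. The sketch below treats the hardware as a single scheduling
unit with three limits and ignores allocation granularity, so it is only a
rough estimate. ``Limits`` and ``resident_wavefronts`` are hypothetical names,
and all limit values are placeholders to be replaced with the figures for your
device.

.. code-block:: cpp

  #include <algorithm>

  // Assumed hardware limits of one scheduling unit (placeholder model).
  struct Limits {
      int wavefront_slots;     // maximum resident wavefronts
      int registers_per_lane;  // register file entries available to each lane
      int lds_bytes;           // shared memory capacity
  };

  // Resident wavefronts under the tightest of the three limits.
  int resident_wavefronts(const Limits& hw, int regs_per_thread,
                          int lds_bytes_per_block, int wavefronts_per_block) {
      int by_registers = hw.registers_per_lane / regs_per_thread;
      // A block that uses no shared memory is not limited by it.
      int by_lds = hw.wavefront_slots;
      if (lds_bytes_per_block > 0) {
          by_lds = (hw.lds_bytes / lds_bytes_per_block) * wavefronts_per_block;
      }
      return std::min({hw.wavefront_slots, by_registers, by_lds});
  }

  // Occupancy as the fraction of wavefront slots that are filled.
  double occupancy(const Limits& hw, int resident) {
      return static_cast<double>(resident) / hw.wavefront_slots;
  }

Comparing the three intermediate limits shows which resource to reduce first:
the one that produces the smallest value.
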
Minimizing memory thrashing
============================

Applications frequently allocating and freeing memory might experience slower
allocation calls over time. To optimize:

**Allocate early, deallocate late**

Frequent allocation and deallocation causes memory fragmentation and increases
allocator overhead. Reusing allocations across iterations amortizes the cost
of memory management and maintains better memory locality.

.. code-block:: cuda

  // Bad: frequent allocation in loop
  for (int i = 0; i < iterations; i++) {
      float* temp;
      hipMalloc(&temp, size);
      // Use temp
      hipFree(temp);
  }

  // Good: allocate once
  float* temp;
  hipMalloc(&temp, size);
  for (int i = 0; i < iterations; i++) {
      // Reuse temp
  }
  hipFree(temp);

**Avoid allocating all available memory**

Reserving some memory headroom prevents allocation failures and system
instability. The driver and runtime need workspace for internal operations, and
leaving a safety margin ensures stable operation without unexpected
out-of-memory errors.

.. code-block:: cuda

  size_t free_bytes, total_bytes;
  hipMemGetInfo(&free_bytes, &total_bytes);

  // Don't allocate all free memory
  size_t safe_size = static_cast<size_t>(free_bytes * 0.9);  // Leave some margin

**Use managed memory for oversubscription**

Managed memory automatically migrates data between host and device on demand,
allowing allocations larger than physical GPU memory. Prefetching hints help
the runtime optimize page placement, reducing migration overhead during kernel
execution.

.. code-block:: cuda

  // Allows exceeding physical memory
  float* data;
  hipMallocManaged(&data, large_size);

  // Optionally prefetch to device
  hipMemPrefetchAsync(data, large_size, device, stream);

Summary
=======

Key optimization techniques:

* **Profile first**: Use ``rocprofv3`` to identify actual bottlenecks
* **Parallelize effectively**: Maximize work at all levels (application, device, CU)
* **Optimize memory**: Minimize transfers, maximize coalescing, use LDS
* **Manage resources**: Balance registers, shared memory, and occupancy
* **Minimize divergence**: Structure control flow to keep warps coherent

For understanding the theory behind these techniques, refer to
:doc:`../understand/performance_optimization` and
:doc:`../understand/hardware_implementation`.