Correct rocprofv3 usage instructions (#2925)

* Correct rocprofv3 usage

* Apply suggestion from @SwRaw

* Apply suggestion from @SwRaw

* Update .gitignore
This commit is contained in:
Swati Rawat
2026-01-28 22:46:19 +05:30
committed by GitHub
parent d917259953
commit 9de4a2ebb1
2 changed files with 100 additions and 99 deletions
+1
View File
@@ -1 +1,2 @@
.cline_storage .cline_storage
/projects/hip/_build
@@ -10,10 +10,10 @@ Performance guidelines
******************************************************************************* *******************************************************************************
The AMD HIP performance guidelines provide practical, actionable techniques for The AMD HIP performance guidelines provide practical, actionable techniques for
optimizing application performance on AMD GPUs. This guide focuses on optimizing application performance on AMD GPUs. This guide focuses on
step-by-step instructions and best practices for improving performance. step-by-step instructions and best practices for improving performance.
For theoretical foundations and performance concepts, see For theoretical foundations and performance concepts, see
:doc:`../understand/performance_optimization`. :doc:`../understand/performance_optimization`.
Optimization workflow Optimization workflow
@@ -22,33 +22,33 @@ Optimization workflow
Follow this systematic approach to optimize GPU performance: Follow this systematic approach to optimize GPU performance:
1. **Profile and measure baseline** 1. **Profile and measure baseline**
Use ``rocprofv3`` to identify bottlenecks: Use ``rocprofv3`` to identify bottlenecks:
.. code-block:: bash .. code-block:: bash
rocprofv3 --stats ./your_application rocprofv3 --stats --<tracing_option> -- <application_path>
Collect metrics on kernel execution time, memory bandwidth, occupancy, and Collect metrics on kernel execution time, memory bandwidth, occupancy, and
CU utilization. CU utilization. For more details on using ``rocprofv3`` for application tracing and profiling, see :doc:`rocprofv3 documentation <rocprofiler-sdk:how-to/using-rocprofv3>`.
2. **Analyze metrics to identify bottlenecks** 2. **Analyze metrics to identify bottlenecks**
Determine if kernels are compute-bound or memory-bound. Check arithmetic Determine if kernels are compute-bound or memory-bound. Check arithmetic
intensity, memory bandwidth achieved vs peak, and compute throughput. intensity, memory bandwidth achieved vs peak, and compute throughput.
For understanding the roofline model, see :ref:`roofline_model`. For understanding the roofline model, see :ref:`roofline_model`.
3. **Apply targeted optimizations** 3. **Apply targeted optimizations**
Based on identified bottlenecks, apply techniques from this guide. Based on identified bottlenecks, apply techniques from this guide.
4. **Verify improvements** 4. **Verify improvements**
Re-profile to confirm performance gains. Re-profile to confirm performance gains.
5. **Iterate** 5. **Iterate**
Repeat until performance goals are met. Repeat until performance goals are met.
.. _parallel execution: .. _parallel execution:
@@ -70,9 +70,9 @@ To enable parallel execution across the host and devices:
For parallel workloads: For parallel workloads:
* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for * Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for
intra-block synchronization intra-block synchronization
* Use global memory with separate kernel invocations for inter-block * Use global memory with separate kernel invocations for inter-block
synchronization (has overhead, minimize when possible) synchronization (has overhead, minimize when possible)
Device level Device level
@@ -103,7 +103,7 @@ Memory throughput optimization
The first step in maximizing memory throughput is to minimize low-bandwidth The first step in maximizing memory throughput is to minimize low-bandwidth
data transfers between the host and the device. data transfers between the host and the device.
Additionally, maximize the use of on-chip memory (shared memory and caches) and Additionally, maximize the use of on-chip memory (shared memory and caches) and
minimize transfers with global memory. minimize transfers with global memory.
.. _data transfer: .. _data transfer:
@@ -130,14 +130,14 @@ effective bandwidth.
for (int i = 0; i < n; i++) { for (int i = 0; i < n; i++) {
hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...); hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...);
} }
// Use a single large transfer // Use a single large transfer
hipMemcpy(d_data, h_data, n * sizeof(float), ...); hipMemcpy(d_data, h_data, n * sizeof(float), ...);
**Use page-locked memory for transfers** **Use page-locked memory for transfers**
Page-locked (pinned) memory cannot be swapped to disk by the operating system, Page-locked (pinned) memory cannot be swapped to disk by the operating system,
allowing the GPU to access it directly via DMA without CPU involvement. This allowing the GPU to access it directly via DMA without CPU involvement. This
eliminates an extra copy through a staging buffer and achieves higher bandwidth. eliminates an extra copy through a staging buffer and achieves higher bandwidth.
.. code-block:: cuda .. code-block:: cuda
@@ -149,8 +149,8 @@ eliminates an extra copy through a staging buffer and achieves higher bandwidth.
**Use mapped memory on integrated systems** **Use mapped memory on integrated systems**
On integrated GPUs (APUs), the CPU and GPU share the same physical memory. On integrated GPUs (APUs), the CPU and GPU share the same physical memory.
Mapped page-locked memory allows zero-copy access, where the GPU reads directly Mapped page-locked memory allows zero-copy access, where the GPU reads directly
from host memory without requiring an explicit transfer, eliminating redundant copies. from host memory without requiring an explicit transfer, eliminating redundant copies.
.. code-block:: cuda .. code-block:: cuda
@@ -169,8 +169,8 @@ Device memory access
**Ensure proper alignment** **Ensure proper alignment**
Memory hardware loads data in aligned chunks (typically 128 bytes). Using Memory hardware loads data in aligned chunks (typically 128 bytes). Using
naturally aligned data types ensures each access maps to a single memory naturally aligned data types ensures each access maps to a single memory
transaction, maximizing bandwidth and avoiding split transactions. transaction, maximizing bandwidth and avoiding split transactions.
.. code-block:: cuda .. code-block:: cuda
@@ -178,7 +178,7 @@ transaction, maximizing bandwidth and avoiding split transactions.
// Use naturally aligned types // Use naturally aligned types
float4 data; // 16-byte aligned float4 data; // 16-byte aligned
float2 data; // 8-byte aligned float2 data; // 8-byte aligned
// Ensure structure alignment // Ensure structure alignment
struct __align__(16) MyStruct { struct __align__(16) MyStruct {
float4 data; float4 data;
@@ -186,8 +186,8 @@ transaction, maximizing bandwidth and avoiding split transactions.
**Optimize 2D array access** **Optimize 2D array access**
Padding 2D arrays to multiples of the wavefront size ensures each row starts Padding 2D arrays to multiples of the wavefront size ensures each row starts
at an aligned memory boundary. This allows consecutive threads accessing the at an aligned memory boundary. This allows consecutive threads accessing the
same row to generate coalesced memory transactions, thereby maximizing same row to generate coalesced memory transactions, thereby maximizing
bandwidth. bandwidth.
@@ -196,14 +196,14 @@ bandwidth.
// Ensure array width is multiple of warp size // Ensure array width is multiple of warp size
int width = ((actual_width + warpSize - 1) / warpSize) * warpSize; int width = ((actual_width + warpSize - 1) / warpSize) * warpSize;
hipMalloc(&array, width * height * sizeof(float)); hipMalloc(&array, width * height * sizeof(float));
// Access pattern // Access pattern
int idx = x + width * y; // width should be warp-aligned int idx = x + width * y; // width should be warp-aligned
**Coalesce memory accesses** **Coalesce memory accesses**
When consecutive threads in a wavefront access consecutive memory addresses, When consecutive threads in a wavefront access consecutive memory addresses,
the hardware combines these into a single wide transaction. Non-coalesced the hardware combines these into a single wide transaction. Non-coalesced
patterns require multiple transactions, reducing effective bandwidth. patterns require multiple transactions, reducing effective bandwidth.
.. code-block:: cuda .. code-block:: cuda
@@ -211,7 +211,7 @@ patterns require multiple transactions, reducing effective bandwidth.
// Good: consecutive threads access consecutive addresses // Good: consecutive threads access consecutive addresses
int idx = threadIdx.x + blockIdx.x * blockDim.x; int idx = threadIdx.x + blockIdx.x * blockDim.x;
data[idx] = value; data[idx] = value;
// Bad: strided access // Bad: strided access
int idx = threadIdx.x * stride; // Non-coalesced if stride > 1 int idx = threadIdx.x * stride; // Non-coalesced if stride > 1
data[idx] = value; data[idx] = value;
@@ -220,8 +220,8 @@ For understanding memory coalescing theory, see :ref:`memory_hierarchy_theory`.
**Use shared memory for data reuse** **Use shared memory for data reuse**
Shared memory (LDS) provides low-latency on-chip storage shared across threads Shared memory (LDS) provides low-latency on-chip storage shared across threads
in a block. Loading data into shared memory once and reusing it many times in a block. Loading data into shared memory once and reusing it many times
reduces global memory traffic, particularly effective for tiled algorithms such reduces global memory traffic, particularly effective for tiled algorithms such
as matrix multiplication. as matrix multiplication.
@@ -229,18 +229,18 @@ as matrix multiplication.
__global__ void optimized_kernel(float* input, float* output) { __global__ void optimized_kernel(float* input, float* output) {
__shared__ float tile[TILE_SIZE][TILE_SIZE]; __shared__ float tile[TILE_SIZE][TILE_SIZE];
// Load data into shared memory // Load data into shared memory
tile[threadIdx.y][threadIdx.x] = input[...]; tile[threadIdx.y][threadIdx.x] = input[...];
__syncthreads(); __syncthreads();
// Reuse data from fast shared memory // Reuse data from fast shared memory
float result = 0; float result = 0;
for (int i = 0; i < TILE_SIZE; i++) { for (int i = 0; i < TILE_SIZE; i++) {
result += tile[threadIdx.y][i] * tile[i][threadIdx.x]; result += tile[threadIdx.y][i] * tile[i][threadIdx.x];
} }
__syncthreads(); __syncthreads();
output[...] = result; output[...] = result;
} }
@@ -256,7 +256,7 @@ shifts addresses to avoid systematic conflicts.
// Bad: power-of-2 stride causes conflicts // Bad: power-of-2 stride causes conflicts
__shared__ float data[32][32]; __shared__ float data[32][32];
float value = data[threadIdx.x][threadIdx.y]; float value = data[threadIdx.x][threadIdx.y];
// Good: padding avoids conflicts // Good: padding avoids conflicts
__shared__ float data[32][33]; // Extra column __shared__ float data[32][33]; // Extra column
float value = data[threadIdx.x][threadIdx.y]; float value = data[threadIdx.x][threadIdx.y];
@@ -265,8 +265,8 @@ For bank conflict theory, see :ref:`bank_conflicts_theory`.
**Use texture memory for 2D spatial access** **Use texture memory for 2D spatial access**
Texture memory provides hardware-accelerated 2D filtering and caching optimized Texture memory provides hardware-accelerated 2D filtering and caching optimized
for spatial locality. It automatically handles boundary conditions and can for spatial locality. It automatically handles boundary conditions and can
interpolate values, making it ideal for image processing and nearby-neighbor access patterns. interpolate values, making it ideal for image processing and nearby-neighbor access patterns.
.. code-block:: cuda .. code-block:: cuda
@@ -274,7 +274,7 @@ interpolate values, making it ideal for image processing and nearby-neighbor acc
// Create texture object // Create texture object
hipTextureObject_t texObj; hipTextureObject_t texObj;
hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL); hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
// Access in kernel // Access in kernel
float value = tex2D<float>(texObj, x, y); float value = tex2D<float>(texObj, x, y);
@@ -288,8 +288,8 @@ Arithmetic instructions
**Use efficient operations** **Use efficient operations**
Division requires many more hardware cycles than multiplication. Similarly, Division requires many more hardware cycles than multiplication. Similarly,
bitwise operations (shifts, AND, OR) are single-cycle instructions on integer bitwise operations (shifts, AND, OR) are single-cycle instructions on integer
units, making them far more efficient than equivalent arithmetic for power-of-two calculations. units, making them far more efficient than equivalent arithmetic for power-of-two calculations.
.. code-block:: cuda .. code-block:: cuda
@@ -297,15 +297,15 @@ units, making them far more efficient than equivalent arithmetic for power-of-tw
// Prefer multiplication over division // Prefer multiplication over division
float result = value * 0.5f; // Fast float result = value * 0.5f; // Fast
float result = value / 2.0f; // Slower float result = value / 2.0f; // Slower
// Use bitwise operations for powers of 2 // Use bitwise operations for powers of 2
int index = threadIdx.x << 2; // Multiply by 4 int index = threadIdx.x << 2; // Multiply by 4
int mask = (1 << n) - 1; // Create bit mask int mask = (1 << n) - 1; // Create bit mask
**Use single-precision when possible** **Use single-precision when possible**
AMD GPUs have significantly higher throughput for single-precision (FP32) AMD GPUs have significantly higher throughput for single-precision (FP32)
operations compared to double-precision (FP64). Using single-precision math operations compared to double-precision (FP64). Using single-precision math
functions can deliver substantial performance gains when FP64 accuracy is not required. functions can deliver substantial performance gains when FP64 accuracy is not required.
.. code-block:: cuda .. code-block:: cuda
@@ -313,7 +313,7 @@ functions can deliver substantial performance gains when FP64 accuracy is not re
// Single-precision (faster) // Single-precision (faster)
float result = sinf(x); float result = sinf(x);
float result = expf(x); float result = expf(x);
// Double-precision (slower, use only when necessary) // Double-precision (slower, use only when necessary)
double result = sin(x); double result = sin(x);
double result = exp(x); double result = exp(x);
@@ -340,8 +340,8 @@ Control flow optimization
**Minimize divergence** **Minimize divergence**
When threads in a wavefront take different execution paths, the hardware When threads in a wavefront take different execution paths, the hardware
serializes both branches, executing each path with only the relevant threads serializes both branches, executing each path with only the relevant threads
active. This reduces effective parallelism and wastes cycles on inactive threads. active. This reduces effective parallelism and wastes cycles on inactive threads.
.. code-block:: cuda .. code-block:: cuda
@@ -350,7 +350,7 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
if (threadIdx.x < 32) { if (threadIdx.x < 32) {
// All threads in first half-warp execute // All threads in first half-warp execute
} }
// Bad: divergence within warp // Bad: divergence within warp
if (data[threadIdx.x] > threshold) { if (data[threadIdx.x] > threshold) {
// Some threads execute, others don't // Some threads execute, others don't
@@ -358,8 +358,8 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
**Use branch hints for predictable conditions** **Use branch hints for predictable conditions**
Providing hints about branch likelihood helps the compiler generate better Providing hints about branch likelihood helps the compiler generate better
instruction ordering and can improve the branch predictor's accuracy, reducing instruction ordering and can improve the branch predictor's accuracy, reducing
pipeline stalls when the prediction proves correct. pipeline stalls when the prediction proves correct.
.. code-block:: cuda .. code-block:: cuda
@@ -367,7 +367,7 @@ pipeline stalls when the prediction proves correct.
if (__builtin_expect(rare_condition, 0)) { if (__builtin_expect(rare_condition, 0)) {
// Unlikely branch // Unlikely branch
} }
// C++20 attribute // C++20 attribute
if (common_condition) [[likely]] { if (common_condition) [[likely]] {
// Likely branch // Likely branch
@@ -375,9 +375,9 @@ pipeline stalls when the prediction proves correct.
**Avoid divergent warps** **Avoid divergent warps**
When divergence is unavoidable, restructure the code to separate divergent paths When divergence is unavoidable, restructure the code to separate divergent paths
into different kernel launches or use predication (branchless programming) to into different kernel launches or use predication (branchless programming) to
keep all threads active, though computing unnecessary values may be acceptable keep all threads active, though computing unnecessary values may be acceptable
if it avoids the serialization penalty. if it avoids the serialization penalty.
.. code-block:: cuda .. code-block:: cuda
@@ -388,7 +388,7 @@ if it avoids the serialization penalty.
} else { } else {
result = compute_odd(); result = compute_odd();
} }
// Consider separating into different kernels or using predication // Consider separating into different kernels or using predication
Synchronization Synchronization
@@ -396,32 +396,32 @@ Synchronization
**Use minimal synchronization** **Use minimal synchronization**
Each synchronization point stalls all threads in a block until the slowest one Each synchronization point stalls all threads in a block until the slowest one
reaches the barrier. Minimize synchronizations by carefully analyzing data reaches the barrier. Minimize synchronizations by carefully analyzing data
dependencies—only synchronize when threads genuinely need to exchange data dependencies—only synchronize when threads genuinely need to exchange data
through shared memory. through shared memory.
.. code-block:: cuda .. code-block:: cuda
__global__ void kernel() { __global__ void kernel() {
__shared__ float data[256]; __shared__ float data[256];
// Load phase // Load phase
data[threadIdx.x] = input[...]; data[threadIdx.x] = input[...];
__syncthreads(); // Necessary sync __syncthreads(); // Necessary sync
// Compute phase - no sync needed if threads are independent // Compute phase - no sync needed if threads are independent
float result = compute(data[...]); float result = compute(data[...]);
// Store phase - sync only if needed // Store phase - sync only if needed
output[...] = result; output[...] = result;
} }
**Use streams for async execution** **Use streams for async execution**
Streams enable concurrent execution of independent operations. Commands in Streams enable concurrent execution of independent operations. Commands in
different streams can overlap in time, allowing kernel execution and memory different streams can overlap in time, allowing kernel execution and memory
transfers to run simultaneously. This maximizes GPU utilization by keeping transfers to run simultaneously. This maximizes GPU utilization by keeping
multiple execution engines busy concurrently. multiple execution engines busy concurrently.
.. code-block:: cuda .. code-block:: cuda
@@ -429,11 +429,11 @@ multiple execution engines busy concurrently.
hipStream_t stream1, stream2; hipStream_t stream1, stream2;
hipStreamCreate(&stream1); hipStreamCreate(&stream1);
hipStreamCreate(&stream2); hipStreamCreate(&stream2);
// Overlap independent operations // Overlap independent operations
kernel1<<<grid, block, 0, stream1>>>(...); kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block, 0, stream2>>>(...); kernel2<<<grid, block, 0, stream2>>>(...);
hipStreamSynchronize(stream1); hipStreamSynchronize(stream1);
hipStreamSynchronize(stream2); hipStreamSynchronize(stream2);
@@ -444,9 +444,9 @@ High register usage can limit occupancy. Follow these steps:
**Minimize live variables** **Minimize live variables**
The compiler allocates registers for every variable that must remain accessible. The compiler allocates registers for every variable that must remain accessible.
Reducing the number of simultaneously live variables frees registers, allowing Reducing the number of simultaneously live variables frees registers, allowing
more wavefronts to fit on each CU. Chaining function calls trades some redundant more wavefronts to fit on each CU. Chaining function calls trades some redundant
computation for lower register usage. computation for lower register usage.
.. code-block:: cuda .. code-block:: cuda
@@ -456,35 +456,35 @@ computation for lower register usage.
float b = compute_b(); float b = compute_b();
float c = compute_c(); float c = compute_c();
float result = combine(a, b, c); float result = combine(a, b, c);
// Recompute or chain operations // Recompute or chain operations
float result = combine(compute_a(), compute_b(), compute_c()); float result = combine(compute_a(), compute_b(), compute_c());
**Use shared memory for temporary storage** **Use shared memory for temporary storage**
Per-thread arrays stored in registers consume valuable register space, limiting Per-thread arrays stored in registers consume valuable register space, limiting
occupancy. Moving temporary storage to shared memory trades register usage for occupancy. Moving temporary storage to shared memory trades register usage for
shared memory usage, often allowing higher occupancy since shared memory limits shared memory usage, often allowing higher occupancy since shared memory limits
are typically less restrictive. are typically less restrictive.
.. code-block:: cuda .. code-block:: cuda
// Instead of per-thread arrays (uses registers) // Instead of per-thread arrays (uses registers)
float temp[100]; float temp[100];
// Use shared memory // Use shared memory
__shared__ float temp[blockDim.x][100]; __shared__ float temp[blockDim.x][100];
float* my_temp = temp[threadIdx.x]; float* my_temp = temp[threadIdx.x];
**Adjust launch bounds** **Adjust launch bounds**
The ``__launch_bounds__`` attribute provides hints to the compiler about expected The ``__launch_bounds__`` attribute provides hints to the compiler about expected
thread block size and minimum blocks per CU. This guides register allocation thread block size and minimum blocks per CU. This guides register allocation
decisions, potentially trading per-thread register count for higher occupancy. decisions, potentially trading per-thread register count for higher occupancy.
.. code-block:: cuda .. code-block:: cuda
__global__ void __global__ void
__launch_bounds__(256, 4) // 256 threads, 4 blocks per CU __launch_bounds__(256, 4) // 256 threads, 4 blocks per CU
my_kernel() { my_kernel() {
// Kernel code // Kernel code
@@ -492,8 +492,8 @@ decisions, potentially trading per-thread register count for higher occupancy.
**Check register usage during compilation** **Check register usage during compilation**
The compiler can report per-kernel register usage statistics. Monitoring this The compiler can report per-kernel register usage statistics. Monitoring this
output helps identify kernels consuming excessive registers, guiding optimization output helps identify kernels consuming excessive registers, guiding optimization
efforts toward reducing register pressure in the most impactful areas. efforts toward reducing register pressure in the most impactful areas.
.. code-block:: bash .. code-block:: bash
@@ -513,22 +513,22 @@ Use techniques from "Managing register pressure" above.
**Reduce shared memory usage per block** **Reduce shared memory usage per block**
Each CU has limited shared memory that must be divided among resident blocks. Each CU has limited shared memory that must be divided among resident blocks.
Reducing per-block shared memory usage allows more blocks to reside simultaneously, Reducing per-block shared memory usage allows more blocks to reside simultaneously,
increasing occupancy and improving latency hiding through greater thread-level parallelism. increasing occupancy and improving latency hiding through greater thread-level parallelism.
.. code-block:: cuda .. code-block:: cuda
// Allocate only what's needed // Allocate only what's needed
__shared__ float tile[TILE_SIZE][TILE_SIZE]; __shared__ float tile[TILE_SIZE][TILE_SIZE];
// Or use dynamic allocation // Or use dynamic allocation
extern __shared__ float dynamic_shared[]; extern __shared__ float dynamic_shared[];
**Optimize block size** **Optimize block size**
AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples
of 64 prevents partial wavefronts that waste execution slots. Larger blocks of 64 prevents partial wavefronts that waste execution slots. Larger blocks
(128-256 threads) typically achieve better occupancy and resource utilization. (128-256 threads) typically achieve better occupancy and resource utilization.
.. code-block:: cuda .. code-block:: cuda
@@ -537,14 +537,14 @@ of 64 prevents partial wavefronts that waste execution slots. Larger blocks
dim3 block(64); // Good for AMD GPUs (wavefront=64) dim3 block(64); // Good for AMD GPUs (wavefront=64)
dim3 block(128); // Common choice dim3 block(128); // Common choice
dim3 block(256); // Good for high-occupancy kernels dim3 block(256); // Good for high-occupancy kernels
// Avoid very small blocks // Avoid very small blocks
dim3 block(32); // May waste resources dim3 block(32); // May waste resources
**Profile occupancy** **Profile occupancy**
Profiling tools report the ratio of active wavefronts to maximum possible Profiling tools report the ratio of active wavefronts to maximum possible
wavefronts per CU. Low occupancy suggests resource constraints (registers or wavefronts per CU. Low occupancy suggests resource constraints (registers or
shared memory) are limiting parallelism and may indicate opportunities for optimization. shared memory) are limiting parallelism and may indicate opportunities for optimization.
.. code-block:: bash .. code-block:: bash
@@ -561,8 +561,8 @@ allocation calls over time. To optimize:
**Allocate early, deallocate late** **Allocate early, deallocate late**
Frequent allocation and deallocation causes memory fragmentation and increases Frequent allocation and deallocation causes memory fragmentation and increases
allocator overhead. Reusing allocations across iterations amortizes the cost allocator overhead. Reusing allocations across iterations amortizes the cost
of memory management and maintains better memory locality. of memory management and maintains better memory locality.
.. code-block:: cuda .. code-block:: cuda
@@ -574,7 +574,7 @@ of memory management and maintains better memory locality.
// Use temp // Use temp
hipFree(temp); hipFree(temp);
} }
// Good: allocate once // Good: allocate once
float* temp; float* temp;
hipMalloc(&temp, size); hipMalloc(&temp, size);
@@ -585,22 +585,22 @@ of memory management and maintains better memory locality.
**Avoid allocating all available memory** **Avoid allocating all available memory**
Reserving some memory headroom prevents allocation failures and system instability. Reserving some memory headroom prevents allocation failures and system instability.
The driver and runtime need workspace for internal operations, and leaving a The driver and runtime need workspace for internal operations, and leaving a
safety margin ensures stable operation without unexpected out-of-memory errors. safety margin ensures stable operation without unexpected out-of-memory errors.
.. code-block:: cuda .. code-block:: cuda
size_t free, total; size_t free, total;
hipMemGetInfo(&free, &total); hipMemGetInfo(&free, &total);
// Don't allocate all free memory // Don't allocate all free memory
size_t safe_size = free * 0.9; // Leave some margin size_t safe_size = free * 0.9; // Leave some margin
**Use managed memory for oversubscription** **Use managed memory for oversubscription**
Managed memory automatically migrates data between host and device on demand, Managed memory automatically migrates data between host and device on demand,
allowing allocations larger than physical GPU memory. Prefetching hints help allowing allocations larger than physical GPU memory. Prefetching hints help
the runtime optimize page placement, reducing migration overhead during kernel execution. the runtime optimize page placement, reducing migration overhead during kernel execution.
.. code-block:: cuda .. code-block:: cuda
@@ -608,7 +608,7 @@ the runtime optimize page placement, reducing migration overhead during kernel e
// Allows exceeding physical memory // Allows exceeding physical memory
float* data; float* data;
hipMallocManaged(&data, large_size); hipMallocManaged(&data, large_size);
// Optionally prefetch to device // Optionally prefetch to device
hipMemPrefetchAsync(data, size, device, stream); hipMemPrefetchAsync(data, size, device, stream);
@@ -623,5 +623,5 @@ Key optimization techniques:
* **Manage resources**: Balance registers, shared memory, and occupancy * **Manage resources**: Balance registers, shared memory, and occupancy
* **Minimize divergence**: Structure control flow to keep warps coherent * **Minimize divergence**: Structure control flow to keep warps coherent
For understanding the theory behind these techniques, refer to For understanding the theory behind these techniques, refer to
:doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`. :doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`.