Correct rocprofv3 usage instructions (#2925)
* Correct rocprofv3 usage
* Apply suggestion from @SwRaw
* Apply suggestion from @SwRaw
* Update .gitignore
@@ -1 +1,2 @@
.cline_storage
/projects/hip/_build
@@ -10,10 +10,10 @@ Performance guidelines
*******************************************************************************

The AMD HIP performance guidelines provide practical, actionable techniques for
optimizing application performance on AMD GPUs. This guide focuses on
step-by-step instructions and best practices for improving performance.

For theoretical foundations and performance concepts, see
:doc:`../understand/performance_optimization`.

Optimization workflow
@@ -22,33 +22,33 @@ Optimization workflow
Follow this systematic approach to optimize GPU performance:

1. **Profile and measure baseline**

   Use ``rocprofv3`` to identify bottlenecks:

   .. code-block:: bash

      rocprofv3 --stats --<tracing_option> -- <application_path>

   Collect metrics on kernel execution time, memory bandwidth, occupancy, and
   CU utilization. For more details on using ``rocprofv3`` for application tracing and profiling, see :doc:`rocprofv3 documentation <rocprofiler-sdk:how-to/using-rocprofv3>`.

2. **Analyze metrics to identify bottlenecks**

   Determine if kernels are compute-bound or memory-bound. Check arithmetic
   intensity, memory bandwidth achieved vs peak, and compute throughput.

   For understanding the roofline model, see :ref:`roofline_model`.

3. **Apply targeted optimizations**

   Based on identified bottlenecks, apply techniques from this guide.

4. **Verify improvements**

   Re-profile to confirm performance gains.

5. **Iterate**

   Repeat until performance goals are met.
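
The compute-bound versus memory-bound check in step 2 can be sketched on the
host as a roofline-style comparison. This is a minimal illustration, not part
of the HIP API: ``KernelMetrics``, ``arithmetic_intensity`` and
``is_memory_bound`` are hypothetical names, and the peak figures passed in are
placeholders you would replace with the values for your own device.

.. code-block:: cpp

   #include <cassert>

   // Hypothetical per-kernel totals taken from a profiler run.
   struct KernelMetrics {
       double flops;  // floating-point operations executed
       double bytes;  // bytes moved to and from device memory
   };

   // Arithmetic intensity in FLOP per byte.
   constexpr double arithmetic_intensity(const KernelMetrics& m) {
       return m.flops / m.bytes;
   }

   // A kernel is treated as memory-bound when its intensity is below the
   // machine balance (peak FLOP/s divided by peak bytes/s).
   constexpr bool is_memory_bound(const KernelMetrics& m,
                                  double peak_flops_per_s,
                                  double peak_bytes_per_s) {
       return arithmetic_intensity(m) < peak_flops_per_s / peak_bytes_per_s;
   }

With an assumed machine balance of 10 FLOP/byte, a kernel that performs 2 FLOP
for every 12 bytes moved falls on the memory-bound side, so memory throughput
techniques are the ones to try first.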

.. _parallel execution:
@@ -70,9 +70,9 @@ To enable parallel execution across the host and devices:

For parallel workloads:

* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for
  intra-block synchronization
* Use global memory with separate kernel invocations for inter-block
  synchronization (has overhead, minimize when possible)

Device level
@@ -103,7 +103,7 @@ Memory throughput optimization
The first step in maximizing memory throughput is to minimize low-bandwidth
data transfers between the host and the device.

Additionally, maximize the use of on-chip memory (shared memory and caches) and
minimize transfers with global memory.

.. _data transfer:
@@ -130,14 +130,14 @@ effective bandwidth.
   for (int i = 0; i < n; i++) {
       hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...);
   }

   // Use a single large transfer
   hipMemcpy(d_data, h_data, n * sizeof(float), ...);

**Use page-locked memory for transfers**

Page-locked (pinned) memory cannot be swapped to disk by the operating system,
allowing the GPU to access it directly via DMA without CPU involvement. This
eliminates an extra copy through a staging buffer and achieves higher bandwidth.

.. code-block:: cuda
@@ -149,8 +149,8 @@ eliminates an extra copy through a staging buffer and achieves higher bandwidth.

**Use mapped memory on integrated systems**

On integrated GPUs (APUs), the CPU and GPU share the same physical memory.
Mapped page-locked memory allows zero-copy access, where the GPU reads directly
from host memory without requiring an explicit transfer, eliminating redundant copies.

.. code-block:: cuda
@@ -169,8 +169,8 @@ Device memory access

**Ensure proper alignment**

Memory hardware loads data in aligned chunks (typically 128 bytes). Using
naturally aligned data types ensures each access maps to a single memory
transaction, maximizing bandwidth and avoiding split transactions.

.. code-block:: cuda
@@ -178,7 +178,7 @@ transaction, maximizing bandwidth and avoiding split transactions.
   // Use naturally aligned types
   float4 data;  // 16-byte aligned
   float2 data;  // 8-byte aligned

   // Ensure structure alignment
   struct __align__(16) MyStruct {
       float4 data;
@@ -186,8 +186,8 @@ transaction, maximizing bandwidth and avoiding split transactions.

**Optimize 2D array access**

Padding 2D arrays to multiples of the wavefront size ensures each row starts
at an aligned memory boundary. This allows consecutive threads accessing the
same row to generate coalesced memory transactions, thereby maximizing
bandwidth.
@@ -196,14 +196,14 @@ bandwidth.
   // Ensure array width is multiple of warp size
   int width = ((actual_width + warpSize - 1) / warpSize) * warpSize;
   hipMalloc(&array, width * height * sizeof(float));

   // Access pattern
   int idx = x + width * y;  // width should be warp-aligned
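
The rounding used for the padded width can be checked on the host before any
device allocation. The sketch below is plain C++ with hypothetical helper names
(``padded_width``, ``padded_index``); the wavefront size is passed in as a
parameter rather than read from ``warpSize``, which is an assumption made so
that the arithmetic can be tested without a GPU.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>

   // Round actual_width up to the next multiple of wavefront_size.
   constexpr std::size_t padded_width(std::size_t actual_width,
                                      std::size_t wavefront_size) {
       return (actual_width + wavefront_size - 1) / wavefront_size * wavefront_size;
   }

   // Linear index of element (x, y) in a row-major array with padded rows.
   constexpr std::size_t padded_index(std::size_t x, std::size_t y,
                                      std::size_t width) {
       return x + width * y;
   }

For a logical width of 100 elements and a wavefront size of 64, the padded
width is 128, so each row carries 28 unused elements in exchange for aligned
row starts.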

**Coalesce memory accesses**

When consecutive threads in a wavefront access consecutive memory addresses,
the hardware combines these into a single wide transaction. Non-coalesced
patterns require multiple transactions, reducing effective bandwidth.

.. code-block:: cuda
@@ -211,7 +211,7 @@ patterns require multiple transactions, reducing effective bandwidth.
   // Good: consecutive threads access consecutive addresses
   int idx = threadIdx.x + blockIdx.x * blockDim.x;
   data[idx] = value;

   // Bad: strided access
   int idx = threadIdx.x * stride;  // Non-coalesced if stride > 1
   data[idx] = value;
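
To get a feel for how stride affects coalescing, the following host-side sketch
counts how many aligned memory segments one wavefront touches. It is a
simplified model, not a description of any specific GPU: ``segments_touched``
is a hypothetical helper, the segment size is a parameter (128 bytes is the
typical figure quoted earlier), and the base address is assumed to be
segment-aligned.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>
   #include <set>

   // Count the distinct aligned segments touched when each of `threads` lanes
   // accesses one element of `elem_bytes` at index lane * stride.
   inline std::size_t segments_touched(std::size_t threads, std::size_t stride,
                                       std::size_t elem_bytes,
                                       std::size_t segment_bytes) {
       std::set<std::size_t> segments;
       for (std::size_t lane = 0; lane < threads; ++lane) {
           const std::size_t byte_offset = lane * stride * elem_bytes;
           segments.insert(byte_offset / segment_bytes);
       }
       return segments.size();
   }

Under this model, 64 lanes reading consecutive ``float`` values touch 2
segments, while a stride of 32 elements makes every lane land in its own
segment.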
@@ -220,8 +220,8 @@ For understanding memory coalescing theory, see :ref:`memory_hierarchy_theory`.

**Use shared memory for data reuse**

Shared memory (LDS) provides low-latency on-chip storage shared across threads
in a block. Loading data into shared memory once and reusing it many times
reduces global memory traffic, particularly effective for tiled algorithms such
as matrix multiplication.
@@ -229,18 +229,18 @@ as matrix multiplication.

   __global__ void optimized_kernel(float* input, float* output) {
       __shared__ float tile[TILE_SIZE][TILE_SIZE];

       // Load data into shared memory
       tile[threadIdx.y][threadIdx.x] = input[...];
       __syncthreads();

       // Reuse data from fast shared memory
       float result = 0;
       for (int i = 0; i < TILE_SIZE; i++) {
           result += tile[threadIdx.y][i] * tile[i][threadIdx.x];
       }
       __syncthreads();

       output[...] = result;
   }
@@ -256,7 +256,7 @@ shifts addresses to avoid systematic conflicts.
   // Bad: power-of-2 stride causes conflicts
   __shared__ float data[32][32];
   float value = data[threadIdx.x][threadIdx.y];

   // Good: padding avoids conflicts
   __shared__ float data[32][33];  // Extra column
   float value = data[threadIdx.x][threadIdx.y];
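
Why the extra column helps can be seen by computing which bank each lane hits.
The sketch below is a host-side model under stated assumptions: shared memory
is treated as ``banks`` banks of one 4-byte word each, lanes read
``data[lane][0]``, and ``max_bank_conflict`` is a hypothetical helper rather
than anything provided by HIP.

.. code-block:: cpp

   #include <algorithm>
   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Largest number of lanes that map to the same bank when lane i reads
   // word index i * row_width from a row-major float array.
   inline std::size_t max_bank_conflict(std::size_t lanes, std::size_t row_width,
                                        std::size_t banks) {
       std::vector<std::size_t> hits(banks, 0);
       for (std::size_t lane = 0; lane < lanes; ++lane) {
           ++hits[(lane * row_width) % banks];
       }
       return *std::max_element(hits.begin(), hits.end());
   }

With 32 assumed banks, a row width of 32 sends all 32 lanes to the same bank,
whereas a row width of 33 spreads them over 32 different banks.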
@@ -265,8 +265,8 @@ For bank conflict theory, see :ref:`bank_conflicts_theory`.

**Use texture memory for 2D spatial access**

Texture memory provides hardware-accelerated 2D filtering and caching optimized
for spatial locality. It automatically handles boundary conditions and can
interpolate values, making it ideal for image processing and nearby-neighbor access patterns.

.. code-block:: cuda
@@ -274,7 +274,7 @@ interpolate values, making it ideal for image processing and nearby-neighbor acc
   // Create texture object
   hipTextureObject_t texObj;
   hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

   // Access in kernel
   float value = tex2D<float>(texObj, x, y);
@@ -288,8 +288,8 @@ Arithmetic instructions

**Use efficient operations**

Division requires many more hardware cycles than multiplication. Similarly,
bitwise operations (shifts, AND, OR) are single-cycle instructions on integer
units, making them far more efficient than equivalent arithmetic for power-of-two calculations.

.. code-block:: cuda
@@ -297,15 +297,15 @@ units, making them far more efficient than equivalent arithmetic for power-of-tw
   // Prefer multiplication over division
   float result = value * 0.5f;  // Fast
   float result = value / 2.0f;  // Slower

   // Use bitwise operations for powers of 2
   int index = threadIdx.x << 2;  // Multiply by 4
   int mask = (1 << n) - 1;       // Create bit mask
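
The shift and mask forms above are exact replacements only for unsigned values
and power-of-two factors. A minimal host-side sketch of the equivalences, using
hypothetical helper names, is shown below; note that compilers frequently apply
these rewrites themselves when the factor is a compile-time constant, so the
manual form matters mostly when the exponent is known only at run time.

.. code-block:: cpp

   #include <cassert>

   // x * 2^k, x / 2^k and x % 2^k for unsigned x, written with shifts and a mask.
   constexpr unsigned mul_pow2(unsigned x, unsigned k) { return x << k; }
   constexpr unsigned div_pow2(unsigned x, unsigned k) { return x >> k; }
   constexpr unsigned mod_pow2(unsigned x, unsigned k) {
       return x & ((1u << k) - 1u);
   }

For example, ``mod_pow2(13, 3)`` yields the same result as ``13 % 8``. Signed
negative operands do not follow the same identities for division and modulo, so
keep these helpers to unsigned indices.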

**Use single-precision when possible**

AMD GPUs have significantly higher throughput for single-precision (FP32)
operations compared to double-precision (FP64). Using single-precision math
functions can deliver substantial performance gains when FP64 accuracy is not required.

.. code-block:: cuda
@@ -313,7 +313,7 @@ functions can deliver substantial performance gains when FP64 accuracy is not re
   // Single-precision (faster)
   float result = sinf(x);
   float result = expf(x);

   // Double-precision (slower, use only when necessary)
   double result = sin(x);
   double result = exp(x);
@@ -340,8 +340,8 @@ Control flow optimization

**Minimize divergence**

When threads in a wavefront take different execution paths, the hardware
serializes both branches, executing each path with only the relevant threads
active. This reduces effective parallelism and wastes cycles on inactive threads.

.. code-block:: cuda
@@ -350,7 +350,7 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
   if (threadIdx.x < 32) {
       // All threads in first half-warp execute
   }

   // Bad: divergence within warp
   if (data[threadIdx.x] > threshold) {
       // Some threads execute, others don't
@@ -358,8 +358,8 @@ active. This reduces effective parallelism and wastes cycles on inactive threads

**Use branch hints for predictable conditions**

Providing hints about branch likelihood helps the compiler generate better
instruction ordering and can improve the branch predictor's accuracy, reducing
pipeline stalls when the prediction proves correct.

.. code-block:: cuda
@@ -367,7 +367,7 @@ pipeline stalls when the prediction proves correct.
   if (__builtin_expect(rare_condition, 0)) {
       // Unlikely branch
   }

   // C++20 attribute
   if (common_condition) [[likely]] {
       // Likely branch
@@ -375,9 +375,9 @@ pipeline stalls when the prediction proves correct.

**Avoid divergent warps**

When divergence is unavoidable, restructure the code to separate divergent paths
into different kernel launches or use predication (branchless programming) to
keep all threads active. Computing unnecessary values may be acceptable
if it avoids the serialization penalty.

.. code-block:: cuda
@@ -388,7 +388,7 @@ if it avoids the serialization penalty.
   } else {
       result = compute_odd();
   }

   // Consider separating into different kernels or using predication
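
The predication alternative mentioned in the comment can be sketched as
follows. This is host-side C++ meant only to show the shape of the rewrite:
``compute_even`` and ``compute_odd`` here are hypothetical stand-ins with
made-up bodies, and whether the compiler actually emits a select instruction
instead of a branch depends on the compiler and the cost of both paths.

.. code-block:: cpp

   #include <cassert>

   // Hypothetical stand-ins for the two divergent computations.
   inline float compute_even(float x) { return x * 2.0f; }
   inline float compute_odd(float x) { return x + 1.0f; }

   // Branchy form: lanes with different parity take different paths.
   inline float branchy(int lane, float x) {
       if (lane % 2 == 0) {
           return compute_even(x);
       }
       return compute_odd(x);
   }

   // Predicated form: every lane computes both values, then selects one.
   inline float predicated(int lane, float x) {
       const float even_value = compute_even(x);
       const float odd_value = compute_odd(x);
       return (lane % 2 == 0) ? even_value : odd_value;
   }

Both functions return the same value for every lane. The predicated form does
twice the arithmetic, so it tends to pay off only when both paths are cheap
compared with the cost of serializing them.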

Synchronization
@@ -396,32 +396,32 @@ Synchronization

**Use minimal synchronization**

Each synchronization point stalls all threads in a block until the slowest one
reaches the barrier. Minimize synchronizations by carefully analyzing data
dependencies: synchronize only when threads genuinely need to exchange data
through shared memory.

.. code-block:: cuda

   __global__ void kernel() {
       __shared__ float data[256];

       // Load phase
       data[threadIdx.x] = input[...];
       __syncthreads();  // Necessary sync

       // Compute phase - no sync needed if threads are independent
       float result = compute(data[...]);

       // Store phase - sync only if needed
       output[...] = result;
   }

**Use streams for async execution**

Streams enable concurrent execution of independent operations. Commands in
different streams can overlap in time, allowing kernel execution and memory
transfers to run simultaneously. This maximizes GPU utilization by keeping
multiple execution engines busy concurrently.

.. code-block:: cuda
@@ -429,11 +429,11 @@ multiple execution engines busy concurrently.
   hipStream_t stream1, stream2;
   hipStreamCreate(&stream1);
   hipStreamCreate(&stream2);

   // Overlap independent operations
   kernel1<<<grid, block, 0, stream1>>>(...);
   kernel2<<<grid, block, 0, stream2>>>(...);

   hipStreamSynchronize(stream1);
   hipStreamSynchronize(stream2);
@@ -444,9 +444,9 @@ High register usage can limit occupancy. Follow these steps:

**Minimize live variables**

The compiler allocates registers for every variable that must remain accessible.
Reducing the number of simultaneously live variables frees registers, allowing
more wavefronts to fit on each CU. Chaining function calls trades some redundant
computation for lower register usage.

.. code-block:: cuda
@@ -456,35 +456,35 @@ computation for lower register usage.
   float b = compute_b();
   float c = compute_c();
   float result = combine(a, b, c);

   // Recompute or chain operations
   float result = combine(compute_a(), compute_b(), compute_c());

**Use shared memory for temporary storage**

Per-thread arrays stored in registers consume valuable register space, limiting
occupancy. Moving temporary storage to shared memory trades register usage for
shared memory usage, often allowing higher occupancy since shared memory limits
are typically less restrictive.

.. code-block:: cuda

   // Instead of per-thread arrays (uses registers)
   float temp[100];

   // Use shared memory. A statically sized __shared__ array needs a
   // compile-time dimension, so BLOCK_SIZE stands for a constant equal to
   // the launch block size (blockDim.x is only known at run time).
   __shared__ float temp[BLOCK_SIZE][100];
   float* my_temp = temp[threadIdx.x];

**Adjust launch bounds**

The ``__launch_bounds__`` attribute provides hints to the compiler about expected
thread block size and minimum blocks per CU. This guides register allocation
decisions, potentially trading per-thread register count for higher occupancy.

.. code-block:: cuda

   __global__ void
   __launch_bounds__(256, 4)  // 256 threads, 4 blocks per CU
   my_kernel() {
       // Kernel code
@@ -492,8 +492,8 @@ decisions, potentially trading per-thread register count for higher occupancy.

**Check register usage during compilation**

The compiler can report per-kernel register usage statistics. Monitoring this
output helps identify kernels consuming excessive registers, guiding optimization
efforts toward reducing register pressure in the most impactful areas.

.. code-block:: bash
@@ -513,22 +513,22 @@ Use techniques from "Managing register pressure" above.

**Reduce shared memory usage per block**

Each CU has limited shared memory that must be divided among resident blocks.
Reducing per-block shared memory usage allows more blocks to reside simultaneously,
increasing occupancy and improving latency hiding through greater thread-level parallelism.

.. code-block:: cuda

   // Allocate only what's needed
   __shared__ float tile[TILE_SIZE][TILE_SIZE];

   // Or use dynamic allocation
   extern __shared__ float dynamic_shared[];
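
The effect of tile size on the number of resident blocks can be estimated with
simple division. The sketch below is a host-side approximation that looks at
shared memory only and ignores register and wavefront-slot limits;
``tile_bytes`` and ``blocks_limited_by_lds`` are hypothetical helpers, and the
64 KiB per-CU figure used in the example is an assumption to be replaced with
the value reported for your device.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>

   // Bytes of shared memory used by a square tile of `tile_size` elements per side.
   constexpr std::size_t tile_bytes(std::size_t tile_size, std::size_t elem_bytes) {
       return tile_size * tile_size * elem_bytes;
   }

   // Upper bound on resident blocks per CU imposed by shared memory alone.
   // Precondition: lds_per_block_bytes must be greater than zero.
   constexpr std::size_t blocks_limited_by_lds(std::size_t lds_per_cu_bytes,
                                               std::size_t lds_per_block_bytes) {
       return lds_per_cu_bytes / lds_per_block_bytes;
   }

With the assumed 64 KiB, a 32x32 ``float`` tile allows up to 16 blocks by this
measure, while a 64x64 tile allows only 4.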

**Optimize block size**

AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples
of 64 prevents partial wavefronts that waste execution slots. Larger blocks
(128-256 threads) typically achieve better occupancy and resource utilization.

.. code-block:: cuda
@@ -537,14 +537,14 @@ of 64 prevents partial wavefronts that waste execution slots. Larger blocks
   dim3 block(64);   // Good for AMD GPUs (wavefront=64)
   dim3 block(128);  // Common choice
   dim3 block(256);  // Good for high-occupancy kernels

   // Avoid very small blocks
   dim3 block(32);   // May waste resources
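
The cost of a block size that is not a multiple of the wavefront size can be
quantified as idle lanes. The following host-side sketch uses hypothetical
helper names and takes the wavefront size as a parameter, so the same
arithmetic applies if your target uses a different wavefront width.

.. code-block:: cpp

   #include <cassert>

   // Number of wavefronts needed to hold block_threads threads.
   constexpr unsigned wavefronts_per_block(unsigned block_threads,
                                           unsigned wavefront_size) {
       return (block_threads + wavefront_size - 1) / wavefront_size;
   }

   // Lanes that are allocated but never do useful work in the last wavefront.
   constexpr unsigned idle_lanes(unsigned block_threads, unsigned wavefront_size) {
       return wavefronts_per_block(block_threads, wavefront_size) * wavefront_size
              - block_threads;
   }

With a wavefront size of 64, a block of 32 threads leaves 32 lanes idle and a
block of 100 threads leaves 28 idle, while 64, 128, and 256 leave none.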

**Profile occupancy**

Profiling tools report the ratio of active wavefronts to maximum possible
wavefronts per CU. Low occupancy suggests resource constraints (registers or
shared memory) are limiting parallelism and may indicate opportunities for optimization.

.. code-block:: bash
@@ -561,8 +561,8 @@ allocation calls over time. To optimize:

**Allocate early, deallocate late**

Frequent allocation and deallocation causes memory fragmentation and increases
allocator overhead. Reusing allocations across iterations amortizes the cost
of memory management and maintains better memory locality.

.. code-block:: cuda
@@ -574,7 +574,7 @@ of memory management and maintains better memory locality.
       // Use temp
       hipFree(temp);
   }

   // Good: allocate once
   float* temp;
   hipMalloc(&temp, size);
@@ -585,22 +585,22 @@ of memory management and maintains better memory locality.

**Avoid allocating all available memory**

Reserving some memory headroom prevents allocation failures and system instability.
The driver and runtime need workspace for internal operations, and leaving a
safety margin ensures stable operation without unexpected out-of-memory errors.

.. code-block:: cuda

   size_t free, total;
   hipMemGetInfo(&free, &total);

   // Don't allocate all free memory
   size_t safe_size = free * 0.9;  // Leave some margin

**Use managed memory for oversubscription**

Managed memory automatically migrates data between host and device on demand,
allowing allocations larger than physical GPU memory. Prefetching hints help
the runtime optimize page placement, reducing migration overhead during kernel execution.

.. code-block:: cuda
@@ -608,7 +608,7 @@ the runtime optimize page placement, reducing migration overhead during kernel e
   // Allows exceeding physical memory
   float* data;
   hipMallocManaged(&data, large_size);

   // Optionally prefetch to device
   hipMemPrefetchAsync(data, size, device, stream);
@@ -623,5 +623,5 @@ Key optimization techniques:
* **Manage resources**: Balance registers, shared memory, and occupancy
* **Minimize divergence**: Structure control flow to keep warps coherent

For understanding the theory behind these techniques, refer to
:doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`.