Correct rocprofv3 usage instructions (#2925)
* Correct rocprofv3 usage * Apply suggestion from @SwRaw * Apply suggestion from @SwRaw * Update .gitignore
This commit is contained in:
@@ -1 +1,2 @@
|
|||||||
.cline_storage
|
.cline_storage
|
||||||
|
/projects/hip/_build
|
||||||
|
|||||||
@@ -10,10 +10,10 @@ Performance guidelines
|
|||||||
*******************************************************************************
|
*******************************************************************************
|
||||||
|
|
||||||
The AMD HIP performance guidelines provide practical, actionable techniques for
|
The AMD HIP performance guidelines provide practical, actionable techniques for
|
||||||
optimizing application performance on AMD GPUs. This guide focuses on
|
optimizing application performance on AMD GPUs. This guide focuses on
|
||||||
step-by-step instructions and best practices for improving performance.
|
step-by-step instructions and best practices for improving performance.
|
||||||
|
|
||||||
For theoretical foundations and performance concepts, see
|
For theoretical foundations and performance concepts, see
|
||||||
:doc:`../understand/performance_optimization`.
|
:doc:`../understand/performance_optimization`.
|
||||||
|
|
||||||
Optimization workflow
|
Optimization workflow
|
||||||
@@ -22,33 +22,33 @@ Optimization workflow
|
|||||||
Follow this systematic approach to optimize GPU performance:
|
Follow this systematic approach to optimize GPU performance:
|
||||||
|
|
||||||
1. **Profile and measure baseline**
|
1. **Profile and measure baseline**
|
||||||
|
|
||||||
Use ``rocprofv3`` to identify bottlenecks:
|
Use ``rocprofv3`` to identify bottlenecks:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
rocprofv3 --stats ./your_application
|
rocprofv3 --stats --<tracing_option> -- <application_path>
|
||||||
|
|
||||||
Collect metrics on kernel execution time, memory bandwidth, occupancy, and
|
Collect metrics on kernel execution time, memory bandwidth, occupancy, and
|
||||||
CU utilization.
|
CU utilization. For more details on using ``rocprofv3`` for application tracing and profiling, see :doc:`rocprofv3 documentation <rocprofiler-sdk:how-to/using-rocprofv3>`.
|
||||||
|
|
||||||
2. **Analyze metrics to identify bottlenecks**
|
2. **Analyze metrics to identify bottlenecks**
|
||||||
|
|
||||||
Determine if kernels are compute-bound or memory-bound. Check arithmetic
|
Determine if kernels are compute-bound or memory-bound. Check arithmetic
|
||||||
intensity, memory bandwidth achieved vs peak, and compute throughput.
|
intensity, memory bandwidth achieved vs peak, and compute throughput.
|
||||||
|
|
||||||
For understanding the roofline model, see :ref:`roofline_model`.
|
For understanding the roofline model, see :ref:`roofline_model`.
|
||||||
|
|
||||||
3. **Apply targeted optimizations**
|
3. **Apply targeted optimizations**
|
||||||
|
|
||||||
Based on identified bottlenecks, apply techniques from this guide.
|
Based on identified bottlenecks, apply techniques from this guide.
|
||||||
|
|
||||||
4. **Verify improvements**
|
4. **Verify improvements**
|
||||||
|
|
||||||
Re-profile to confirm performance gains.
|
Re-profile to confirm performance gains.
|
||||||
|
|
||||||
5. **Iterate**
|
5. **Iterate**
|
||||||
|
|
||||||
Repeat until performance goals are met.
|
Repeat until performance goals are met.
|
||||||
|
|
||||||
.. _parallel execution:
|
.. _parallel execution:
|
||||||
@@ -70,9 +70,9 @@ To enable parallel execution across the host and devices:
|
|||||||
|
|
||||||
For parallel workloads:
|
For parallel workloads:
|
||||||
|
|
||||||
* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for
|
* Use :cpp:func:`__syncthreads()` (see :ref:`synchronization_functions`) for
|
||||||
intra-block synchronization
|
intra-block synchronization
|
||||||
* Use global memory with separate kernel invocations for inter-block
|
* Use global memory with separate kernel invocations for inter-block
|
||||||
synchronization (has overhead, minimize when possible)
|
synchronization (has overhead, minimize when possible)
|
||||||
|
|
||||||
Device level
|
Device level
|
||||||
@@ -103,7 +103,7 @@ Memory throughput optimization
|
|||||||
The first step in maximizing memory throughput is to minimize low-bandwidth
|
The first step in maximizing memory throughput is to minimize low-bandwidth
|
||||||
data transfers between the host and the device.
|
data transfers between the host and the device.
|
||||||
|
|
||||||
Additionally, maximize the use of on-chip memory (shared memory and caches) and
|
Additionally, maximize the use of on-chip memory (shared memory and caches) and
|
||||||
minimize transfers with global memory.
|
minimize transfers with global memory.
|
||||||
|
|
||||||
.. _data transfer:
|
.. _data transfer:
|
||||||
@@ -130,14 +130,14 @@ effective bandwidth.
|
|||||||
for (int i = 0; i < n; i++) {
|
for (int i = 0; i < n; i++) {
|
||||||
hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...);
|
hipMemcpy(&d_data[i], &h_data[i], sizeof(float), ...);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Use a single large transfer
|
// Use a single large transfer
|
||||||
hipMemcpy(d_data, h_data, n * sizeof(float), ...);
|
hipMemcpy(d_data, h_data, n * sizeof(float), ...);
|
||||||
|
|
||||||
**Use page-locked memory for transfers**
|
**Use page-locked memory for transfers**
|
||||||
|
|
||||||
Page-locked (pinned) memory cannot be swapped to disk by the operating system,
|
Page-locked (pinned) memory cannot be swapped to disk by the operating system,
|
||||||
allowing the GPU to access it directly via DMA without CPU involvement. This
|
allowing the GPU to access it directly via DMA without CPU involvement. This
|
||||||
eliminates an extra copy through a staging buffer and achieves higher bandwidth.
|
eliminates an extra copy through a staging buffer and achieves higher bandwidth.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -149,8 +149,8 @@ eliminates an extra copy through a staging buffer and achieves higher bandwidth.
|
|||||||
|
|
||||||
**Use mapped memory on integrated systems**
|
**Use mapped memory on integrated systems**
|
||||||
|
|
||||||
On integrated GPUs (APUs), the CPU and GPU share the same physical memory.
|
On integrated GPUs (APUs), the CPU and GPU share the same physical memory.
|
||||||
Mapped page-locked memory allows zero-copy access, where the GPU reads directly
|
Mapped page-locked memory allows zero-copy access, where the GPU reads directly
|
||||||
from host memory without requiring an explicit transfer, eliminating redundant copies.
|
from host memory without requiring an explicit transfer, eliminating redundant copies.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -169,8 +169,8 @@ Device memory access
|
|||||||
|
|
||||||
**Ensure proper alignment**
|
**Ensure proper alignment**
|
||||||
|
|
||||||
Memory hardware loads data in aligned chunks (typically 128 bytes). Using
|
Memory hardware loads data in aligned chunks (typically 128 bytes). Using
|
||||||
naturally aligned data types ensures each access maps to a single memory
|
naturally aligned data types ensures each access maps to a single memory
|
||||||
transaction, maximizing bandwidth and avoiding split transactions.
|
transaction, maximizing bandwidth and avoiding split transactions.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -178,7 +178,7 @@ transaction, maximizing bandwidth and avoiding split transactions.
|
|||||||
// Use naturally aligned types
|
// Use naturally aligned types
|
||||||
float4 data; // 16-byte aligned
|
float4 data; // 16-byte aligned
|
||||||
float2 data; // 8-byte aligned
|
float2 data; // 8-byte aligned
|
||||||
|
|
||||||
// Ensure structure alignment
|
// Ensure structure alignment
|
||||||
struct __align__(16) MyStruct {
|
struct __align__(16) MyStruct {
|
||||||
float4 data;
|
float4 data;
|
||||||
@@ -186,8 +186,8 @@ transaction, maximizing bandwidth and avoiding split transactions.
|
|||||||
|
|
||||||
**Optimize 2D array access**
|
**Optimize 2D array access**
|
||||||
|
|
||||||
Padding 2D arrays to multiples of the wavefront size ensures each row starts
|
Padding 2D arrays to multiples of the wavefront size ensures each row starts
|
||||||
at an aligned memory boundary. This allows consecutive threads accessing the
|
at an aligned memory boundary. This allows consecutive threads accessing the
|
||||||
same row to generate coalesced memory transactions, thereby maximizing
|
same row to generate coalesced memory transactions, thereby maximizing
|
||||||
bandwidth.
|
bandwidth.
|
||||||
|
|
||||||
@@ -196,14 +196,14 @@ bandwidth.
|
|||||||
// Ensure array width is multiple of warp size
|
// Ensure array width is multiple of warp size
|
||||||
int width = ((actual_width + warpSize - 1) / warpSize) * warpSize;
|
int width = ((actual_width + warpSize - 1) / warpSize) * warpSize;
|
||||||
hipMalloc(&array, width * height * sizeof(float));
|
hipMalloc(&array, width * height * sizeof(float));
|
||||||
|
|
||||||
// Access pattern
|
// Access pattern
|
||||||
int idx = x + width * y; // width should be warp-aligned
|
int idx = x + width * y; // width should be warp-aligned
|
||||||
|
|
||||||
**Coalesce memory accesses**
|
**Coalesce memory accesses**
|
||||||
|
|
||||||
When consecutive threads in a wavefront access consecutive memory addresses,
|
When consecutive threads in a wavefront access consecutive memory addresses,
|
||||||
the hardware combines these into a single wide transaction. Non-coalesced
|
the hardware combines these into a single wide transaction. Non-coalesced
|
||||||
patterns require multiple transactions, reducing effective bandwidth.
|
patterns require multiple transactions, reducing effective bandwidth.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -211,7 +211,7 @@ patterns require multiple transactions, reducing effective bandwidth.
|
|||||||
// Good: consecutive threads access consecutive addresses
|
// Good: consecutive threads access consecutive addresses
|
||||||
int idx = threadIdx.x + blockIdx.x * blockDim.x;
|
int idx = threadIdx.x + blockIdx.x * blockDim.x;
|
||||||
data[idx] = value;
|
data[idx] = value;
|
||||||
|
|
||||||
// Bad: strided access
|
// Bad: strided access
|
||||||
int idx = threadIdx.x * stride; // Non-coalesced if stride > 1
|
int idx = threadIdx.x * stride; // Non-coalesced if stride > 1
|
||||||
data[idx] = value;
|
data[idx] = value;
|
||||||
@@ -220,8 +220,8 @@ For understanding memory coalescing theory, see :ref:`memory_hierarchy_theory`.
|
|||||||
|
|
||||||
**Use shared memory for data reuse**
|
**Use shared memory for data reuse**
|
||||||
|
|
||||||
Shared memory (LDS) provides low-latency on-chip storage shared across threads
|
Shared memory (LDS) provides low-latency on-chip storage shared across threads
|
||||||
in a block. Loading data into shared memory once and reusing it many times
|
in a block. Loading data into shared memory once and reusing it many times
|
||||||
reduces global memory traffic, particularly effective for tiled algorithms such
|
reduces global memory traffic, particularly effective for tiled algorithms such
|
||||||
as matrix multiplication.
|
as matrix multiplication.
|
||||||
|
|
||||||
@@ -229,18 +229,18 @@ as matrix multiplication.
|
|||||||
|
|
||||||
__global__ void optimized_kernel(float* input, float* output) {
|
__global__ void optimized_kernel(float* input, float* output) {
|
||||||
__shared__ float tile[TILE_SIZE][TILE_SIZE];
|
__shared__ float tile[TILE_SIZE][TILE_SIZE];
|
||||||
|
|
||||||
// Load data into shared memory
|
// Load data into shared memory
|
||||||
tile[threadIdx.y][threadIdx.x] = input[...];
|
tile[threadIdx.y][threadIdx.x] = input[...];
|
||||||
__syncthreads();
|
__syncthreads();
|
||||||
|
|
||||||
// Reuse data from fast shared memory
|
// Reuse data from fast shared memory
|
||||||
float result = 0;
|
float result = 0;
|
||||||
for (int i = 0; i < TILE_SIZE; i++) {
|
for (int i = 0; i < TILE_SIZE; i++) {
|
||||||
result += tile[threadIdx.y][i] * tile[i][threadIdx.x];
|
result += tile[threadIdx.y][i] * tile[i][threadIdx.x];
|
||||||
}
|
}
|
||||||
__syncthreads();
|
__syncthreads();
|
||||||
|
|
||||||
output[...] = result;
|
output[...] = result;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -256,7 +256,7 @@ shifts addresses to avoid systematic conflicts.
|
|||||||
// Bad: power-of-2 stride causes conflicts
|
// Bad: power-of-2 stride causes conflicts
|
||||||
__shared__ float data[32][32];
|
__shared__ float data[32][32];
|
||||||
float value = data[threadIdx.x][threadIdx.y];
|
float value = data[threadIdx.x][threadIdx.y];
|
||||||
|
|
||||||
// Good: padding avoids conflicts
|
// Good: padding avoids conflicts
|
||||||
__shared__ float data[32][33]; // Extra column
|
__shared__ float data[32][33]; // Extra column
|
||||||
float value = data[threadIdx.x][threadIdx.y];
|
float value = data[threadIdx.x][threadIdx.y];
|
||||||
@@ -265,8 +265,8 @@ For bank conflict theory, see :ref:`bank_conflicts_theory`.
|
|||||||
|
|
||||||
**Use texture memory for 2D spatial access**
|
**Use texture memory for 2D spatial access**
|
||||||
|
|
||||||
Texture memory provides hardware-accelerated 2D filtering and caching optimized
|
Texture memory provides hardware-accelerated 2D filtering and caching optimized
|
||||||
for spatial locality. It automatically handles boundary conditions and can
|
for spatial locality. It automatically handles boundary conditions and can
|
||||||
interpolate values, making it ideal for image processing and nearby-neighbor access patterns.
|
interpolate values, making it ideal for image processing and nearby-neighbor access patterns.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -274,7 +274,7 @@ interpolate values, making it ideal for image processing and nearby-neighbor acc
|
|||||||
// Create texture object
|
// Create texture object
|
||||||
hipTextureObject_t texObj;
|
hipTextureObject_t texObj;
|
||||||
hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
|
hipCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
|
||||||
|
|
||||||
// Access in kernel
|
// Access in kernel
|
||||||
float value = tex2D<float>(texObj, x, y);
|
float value = tex2D<float>(texObj, x, y);
|
||||||
|
|
||||||
@@ -288,8 +288,8 @@ Arithmetic instructions
|
|||||||
|
|
||||||
**Use efficient operations**
|
**Use efficient operations**
|
||||||
|
|
||||||
Division requires many more hardware cycles than multiplication. Similarly,
|
Division requires many more hardware cycles than multiplication. Similarly,
|
||||||
bitwise operations (shifts, AND, OR) are single-cycle instructions on integer
|
bitwise operations (shifts, AND, OR) are single-cycle instructions on integer
|
||||||
units, making them far more efficient than equivalent arithmetic for power-of-two calculations.
|
units, making them far more efficient than equivalent arithmetic for power-of-two calculations.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -297,15 +297,15 @@ units, making them far more efficient than equivalent arithmetic for power-of-tw
|
|||||||
// Prefer multiplication over division
|
// Prefer multiplication over division
|
||||||
float result = value * 0.5f; // Fast
|
float result = value * 0.5f; // Fast
|
||||||
float result = value / 2.0f; // Slower
|
float result = value / 2.0f; // Slower
|
||||||
|
|
||||||
// Use bitwise operations for powers of 2
|
// Use bitwise operations for powers of 2
|
||||||
int index = threadIdx.x << 2; // Multiply by 4
|
int index = threadIdx.x << 2; // Multiply by 4
|
||||||
int mask = (1 << n) - 1; // Create bit mask
|
int mask = (1 << n) - 1; // Create bit mask
|
||||||
|
|
||||||
**Use single-precision when possible**
|
**Use single-precision when possible**
|
||||||
|
|
||||||
AMD GPUs have significantly higher throughput for single-precision (FP32)
|
AMD GPUs have significantly higher throughput for single-precision (FP32)
|
||||||
operations compared to double-precision (FP64). Using single-precision math
|
operations compared to double-precision (FP64). Using single-precision math
|
||||||
functions can deliver substantial performance gains when FP64 accuracy is not required.
|
functions can deliver substantial performance gains when FP64 accuracy is not required.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -313,7 +313,7 @@ functions can deliver substantial performance gains when FP64 accuracy is not re
|
|||||||
// Single-precision (faster)
|
// Single-precision (faster)
|
||||||
float result = sinf(x);
|
float result = sinf(x);
|
||||||
float result = expf(x);
|
float result = expf(x);
|
||||||
|
|
||||||
// Double-precision (slower, use only when necessary)
|
// Double-precision (slower, use only when necessary)
|
||||||
double result = sin(x);
|
double result = sin(x);
|
||||||
double result = exp(x);
|
double result = exp(x);
|
||||||
@@ -340,8 +340,8 @@ Control flow optimization
|
|||||||
|
|
||||||
**Minimize divergence**
|
**Minimize divergence**
|
||||||
|
|
||||||
When threads in a wavefront take different execution paths, the hardware
|
When threads in a wavefront take different execution paths, the hardware
|
||||||
serializes both branches, executing each path with only the relevant threads
|
serializes both branches, executing each path with only the relevant threads
|
||||||
active. This reduces effective parallelism and wastes cycles on inactive threads.
|
active. This reduces effective parallelism and wastes cycles on inactive threads.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -350,7 +350,7 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
|
|||||||
if (threadIdx.x < 32) {
|
if (threadIdx.x < 32) {
|
||||||
// All threads in first half-warp execute
|
// All threads in first half-warp execute
|
||||||
}
|
}
|
||||||
|
|
||||||
// Bad: divergence within warp
|
// Bad: divergence within warp
|
||||||
if (data[threadIdx.x] > threshold) {
|
if (data[threadIdx.x] > threshold) {
|
||||||
// Some threads execute, others don't
|
// Some threads execute, others don't
|
||||||
@@ -358,8 +358,8 @@ active. This reduces effective parallelism and wastes cycles on inactive threads
|
|||||||
|
|
||||||
**Use branch hints for predictable conditions**
|
**Use branch hints for predictable conditions**
|
||||||
|
|
||||||
Providing hints about branch likelihood helps the compiler generate better
|
Providing hints about branch likelihood helps the compiler generate better
|
||||||
instruction ordering and can improve the branch predictor's accuracy, reducing
|
instruction ordering and can improve the branch predictor's accuracy, reducing
|
||||||
pipeline stalls when the prediction proves correct.
|
pipeline stalls when the prediction proves correct.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -367,7 +367,7 @@ pipeline stalls when the prediction proves correct.
|
|||||||
if (__builtin_expect(rare_condition, 0)) {
|
if (__builtin_expect(rare_condition, 0)) {
|
||||||
// Unlikely branch
|
// Unlikely branch
|
||||||
}
|
}
|
||||||
|
|
||||||
// C++20 attribute
|
// C++20 attribute
|
||||||
if (common_condition) [[likely]] {
|
if (common_condition) [[likely]] {
|
||||||
// Likely branch
|
// Likely branch
|
||||||
@@ -375,9 +375,9 @@ pipeline stalls when the prediction proves correct.
|
|||||||
|
|
||||||
**Avoid divergent warps**
|
**Avoid divergent warps**
|
||||||
|
|
||||||
When divergence is unavoidable, restructure the code to separate divergent paths
|
When divergence is unavoidable, restructure the code to separate divergent paths
|
||||||
into different kernel launches or use predication (branchless programming) to
|
into different kernel launches or use predication (branchless programming) to
|
||||||
keep all threads active, though computing unnecessary values may be acceptable
|
keep all threads active, though computing unnecessary values may be acceptable
|
||||||
if it avoids the serialization penalty.
|
if it avoids the serialization penalty.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -388,7 +388,7 @@ if it avoids the serialization penalty.
|
|||||||
} else {
|
} else {
|
||||||
result = compute_odd();
|
result = compute_odd();
|
||||||
}
|
}
|
||||||
|
|
||||||
// Consider separating into different kernels or using predication
|
// Consider separating into different kernels or using predication
|
||||||
|
|
||||||
Synchronization
|
Synchronization
|
||||||
@@ -396,32 +396,32 @@ Synchronization
|
|||||||
|
|
||||||
**Use minimal synchronization**
|
**Use minimal synchronization**
|
||||||
|
|
||||||
Each synchronization point stalls all threads in a block until the slowest one
|
Each synchronization point stalls all threads in a block until the slowest one
|
||||||
reaches the barrier. Minimize synchronizations by carefully analyzing data
|
reaches the barrier. Minimize synchronizations by carefully analyzing data
|
||||||
dependencies—only synchronize when threads genuinely need to exchange data
|
dependencies—only synchronize when threads genuinely need to exchange data
|
||||||
through shared memory.
|
through shared memory.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
|
|
||||||
__global__ void kernel() {
|
__global__ void kernel() {
|
||||||
__shared__ float data[256];
|
__shared__ float data[256];
|
||||||
|
|
||||||
// Load phase
|
// Load phase
|
||||||
data[threadIdx.x] = input[...];
|
data[threadIdx.x] = input[...];
|
||||||
__syncthreads(); // Necessary sync
|
__syncthreads(); // Necessary sync
|
||||||
|
|
||||||
// Compute phase - no sync needed if threads are independent
|
// Compute phase - no sync needed if threads are independent
|
||||||
float result = compute(data[...]);
|
float result = compute(data[...]);
|
||||||
|
|
||||||
// Store phase - sync only if needed
|
// Store phase - sync only if needed
|
||||||
output[...] = result;
|
output[...] = result;
|
||||||
}
|
}
|
||||||
|
|
||||||
**Use streams for async execution**
|
**Use streams for async execution**
|
||||||
|
|
||||||
Streams enable concurrent execution of independent operations. Commands in
|
Streams enable concurrent execution of independent operations. Commands in
|
||||||
different streams can overlap in time, allowing kernel execution and memory
|
different streams can overlap in time, allowing kernel execution and memory
|
||||||
transfers to run simultaneously. This maximizes GPU utilization by keeping
|
transfers to run simultaneously. This maximizes GPU utilization by keeping
|
||||||
multiple execution engines busy concurrently.
|
multiple execution engines busy concurrently.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -429,11 +429,11 @@ multiple execution engines busy concurrently.
|
|||||||
hipStream_t stream1, stream2;
|
hipStream_t stream1, stream2;
|
||||||
hipStreamCreate(&stream1);
|
hipStreamCreate(&stream1);
|
||||||
hipStreamCreate(&stream2);
|
hipStreamCreate(&stream2);
|
||||||
|
|
||||||
// Overlap independent operations
|
// Overlap independent operations
|
||||||
kernel1<<<grid, block, 0, stream1>>>(...);
|
kernel1<<<grid, block, 0, stream1>>>(...);
|
||||||
kernel2<<<grid, block, 0, stream2>>>(...);
|
kernel2<<<grid, block, 0, stream2>>>(...);
|
||||||
|
|
||||||
hipStreamSynchronize(stream1);
|
hipStreamSynchronize(stream1);
|
||||||
hipStreamSynchronize(stream2);
|
hipStreamSynchronize(stream2);
|
||||||
|
|
||||||
@@ -444,9 +444,9 @@ High register usage can limit occupancy. Follow these steps:
|
|||||||
|
|
||||||
**Minimize live variables**
|
**Minimize live variables**
|
||||||
|
|
||||||
The compiler allocates registers for every variable that must remain accessible.
|
The compiler allocates registers for every variable that must remain accessible.
|
||||||
Reducing the number of simultaneously live variables frees registers, allowing
|
Reducing the number of simultaneously live variables frees registers, allowing
|
||||||
more wavefronts to fit on each CU. Chaining function calls trades some redundant
|
more wavefronts to fit on each CU. Chaining function calls trades some redundant
|
||||||
computation for lower register usage.
|
computation for lower register usage.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -456,35 +456,35 @@ computation for lower register usage.
|
|||||||
float b = compute_b();
|
float b = compute_b();
|
||||||
float c = compute_c();
|
float c = compute_c();
|
||||||
float result = combine(a, b, c);
|
float result = combine(a, b, c);
|
||||||
|
|
||||||
// Recompute or chain operations
|
// Recompute or chain operations
|
||||||
float result = combine(compute_a(), compute_b(), compute_c());
|
float result = combine(compute_a(), compute_b(), compute_c());
|
||||||
|
|
||||||
**Use shared memory for temporary storage**
|
**Use shared memory for temporary storage**
|
||||||
|
|
||||||
Per-thread arrays stored in registers consume valuable register space, limiting
|
Per-thread arrays stored in registers consume valuable register space, limiting
|
||||||
occupancy. Moving temporary storage to shared memory trades register usage for
|
occupancy. Moving temporary storage to shared memory trades register usage for
|
||||||
shared memory usage, often allowing higher occupancy since shared memory limits
|
shared memory usage, often allowing higher occupancy since shared memory limits
|
||||||
are typically less restrictive.
|
are typically less restrictive.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
|
|
||||||
// Instead of per-thread arrays (uses registers)
|
// Instead of per-thread arrays (uses registers)
|
||||||
float temp[100];
|
float temp[100];
|
||||||
|
|
||||||
// Use shared memory
|
// Use shared memory
|
||||||
__shared__ float temp[blockDim.x][100];
|
__shared__ float temp[blockDim.x][100];
|
||||||
float* my_temp = temp[threadIdx.x];
|
float* my_temp = temp[threadIdx.x];
|
||||||
|
|
||||||
**Adjust launch bounds**
|
**Adjust launch bounds**
|
||||||
|
|
||||||
The ``__launch_bounds__`` attribute provides hints to the compiler about expected
|
The ``__launch_bounds__`` attribute provides hints to the compiler about expected
|
||||||
thread block size and minimum blocks per CU. This guides register allocation
|
thread block size and minimum blocks per CU. This guides register allocation
|
||||||
decisions, potentially trading per-thread register count for higher occupancy.
|
decisions, potentially trading per-thread register count for higher occupancy.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
|
|
||||||
__global__ void
|
__global__ void
|
||||||
__launch_bounds__(256, 4) // 256 threads, 4 blocks per CU
|
__launch_bounds__(256, 4) // 256 threads, 4 blocks per CU
|
||||||
my_kernel() {
|
my_kernel() {
|
||||||
// Kernel code
|
// Kernel code
|
||||||
@@ -492,8 +492,8 @@ decisions, potentially trading per-thread register count for higher occupancy.
|
|||||||
|
|
||||||
**Check register usage during compilation**
|
**Check register usage during compilation**
|
||||||
|
|
||||||
The compiler can report per-kernel register usage statistics. Monitoring this
|
The compiler can report per-kernel register usage statistics. Monitoring this
|
||||||
output helps identify kernels consuming excessive registers, guiding optimization
|
output helps identify kernels consuming excessive registers, guiding optimization
|
||||||
efforts toward reducing register pressure in the most impactful areas.
|
efforts toward reducing register pressure in the most impactful areas.
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
@@ -513,22 +513,22 @@ Use techniques from "Managing register pressure" above.
|
|||||||
|
|
||||||
**Reduce shared memory usage per block**
|
**Reduce shared memory usage per block**
|
||||||
|
|
||||||
Each CU has limited shared memory that must be divided among resident blocks.
|
Each CU has limited shared memory that must be divided among resident blocks.
|
||||||
Reducing per-block shared memory usage allows more blocks to reside simultaneously,
|
Reducing per-block shared memory usage allows more blocks to reside simultaneously,
|
||||||
increasing occupancy and improving latency hiding through greater thread-level parallelism.
|
increasing occupancy and improving latency hiding through greater thread-level parallelism.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
|
|
||||||
// Allocate only what's needed
|
// Allocate only what's needed
|
||||||
__shared__ float tile[TILE_SIZE][TILE_SIZE];
|
__shared__ float tile[TILE_SIZE][TILE_SIZE];
|
||||||
|
|
||||||
// Or use dynamic allocation
|
// Or use dynamic allocation
|
||||||
extern __shared__ float dynamic_shared[];
|
extern __shared__ float dynamic_shared[];
|
||||||
|
|
||||||
**Optimize block size**
|
**Optimize block size**
|
||||||
|
|
||||||
AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples
|
AMD GPUs execute threads in wavefronts of 64. Choosing block sizes as multiples
|
||||||
of 64 prevents partial wavefronts that waste execution slots. Larger blocks
|
of 64 prevents partial wavefronts that waste execution slots. Larger blocks
|
||||||
(128-256 threads) typically achieve better occupancy and resource utilization.
|
(128-256 threads) typically achieve better occupancy and resource utilization.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -537,14 +537,14 @@ of 64 prevents partial wavefronts that waste execution slots. Larger blocks
|
|||||||
dim3 block(64); // Good for AMD GPUs (wavefront=64)
|
dim3 block(64); // Good for AMD GPUs (wavefront=64)
|
||||||
dim3 block(128); // Common choice
|
dim3 block(128); // Common choice
|
||||||
dim3 block(256); // Good for high-occupancy kernels
|
dim3 block(256); // Good for high-occupancy kernels
|
||||||
|
|
||||||
// Avoid very small blocks
|
// Avoid very small blocks
|
||||||
dim3 block(32); // May waste resources
|
dim3 block(32); // May waste resources
|
||||||
|
|
||||||
**Profile occupancy**
|
**Profile occupancy**
|
||||||
|
|
||||||
Profiling tools report the ratio of active wavefronts to maximum possible
|
Profiling tools report the ratio of active wavefronts to maximum possible
|
||||||
wavefronts per CU. Low occupancy suggests resource constraints (registers or
|
wavefronts per CU. Low occupancy suggests resource constraints (registers or
|
||||||
shared memory) are limiting parallelism and may indicate opportunities for optimization.
|
shared memory) are limiting parallelism and may indicate opportunities for optimization.
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
@@ -561,8 +561,8 @@ allocation calls over time. To optimize:
|
|||||||
|
|
||||||
**Allocate early, deallocate late**
|
**Allocate early, deallocate late**
|
||||||
|
|
||||||
Frequent allocation and deallocation causes memory fragmentation and increases
|
Frequent allocation and deallocation causes memory fragmentation and increases
|
||||||
allocator overhead. Reusing allocations across iterations amortizes the cost
|
allocator overhead. Reusing allocations across iterations amortizes the cost
|
||||||
of memory management and maintains better memory locality.
|
of memory management and maintains better memory locality.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -574,7 +574,7 @@ of memory management and maintains better memory locality.
|
|||||||
// Use temp
|
// Use temp
|
||||||
hipFree(temp);
|
hipFree(temp);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Good: allocate once
|
// Good: allocate once
|
||||||
float* temp;
|
float* temp;
|
||||||
hipMalloc(&temp, size);
|
hipMalloc(&temp, size);
|
||||||
@@ -585,22 +585,22 @@ of memory management and maintains better memory locality.
|
|||||||
|
|
||||||
**Avoid allocating all available memory**
|
**Avoid allocating all available memory**
|
||||||
|
|
||||||
Reserving some memory headroom prevents allocation failures and system instability.
|
Reserving some memory headroom prevents allocation failures and system instability.
|
||||||
The driver and runtime need workspace for internal operations, and leaving a
|
The driver and runtime need workspace for internal operations, and leaving a
|
||||||
safety margin ensures stable operation without unexpected out-of-memory errors.
|
safety margin ensures stable operation without unexpected out-of-memory errors.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
|
|
||||||
size_t free, total;
|
size_t free, total;
|
||||||
hipMemGetInfo(&free, &total);
|
hipMemGetInfo(&free, &total);
|
||||||
|
|
||||||
// Don't allocate all free memory
|
// Don't allocate all free memory
|
||||||
size_t safe_size = free * 0.9; // Leave some margin
|
size_t safe_size = free * 0.9; // Leave some margin
|
||||||
|
|
||||||
**Use managed memory for oversubscription**
|
**Use managed memory for oversubscription**
|
||||||
|
|
||||||
Managed memory automatically migrates data between host and device on demand,
|
Managed memory automatically migrates data between host and device on demand,
|
||||||
allowing allocations larger than physical GPU memory. Prefetching hints help
|
allowing allocations larger than physical GPU memory. Prefetching hints help
|
||||||
the runtime optimize page placement, reducing migration overhead during kernel execution.
|
the runtime optimize page placement, reducing migration overhead during kernel execution.
|
||||||
|
|
||||||
.. code-block:: cuda
|
.. code-block:: cuda
|
||||||
@@ -608,7 +608,7 @@ the runtime optimize page placement, reducing migration overhead during kernel e
|
|||||||
// Allows exceeding physical memory
|
// Allows exceeding physical memory
|
||||||
float* data;
|
float* data;
|
||||||
hipMallocManaged(&data, large_size);
|
hipMallocManaged(&data, large_size);
|
||||||
|
|
||||||
// Optionally prefetch to device
|
// Optionally prefetch to device
|
||||||
hipMemPrefetchAsync(data, size, device, stream);
|
hipMemPrefetchAsync(data, size, device, stream);
|
||||||
|
|
||||||
@@ -623,5 +623,5 @@ Key optimization techniques:
|
|||||||
* **Manage resources**: Balance registers, shared memory, and occupancy
|
* **Manage resources**: Balance registers, shared memory, and occupancy
|
||||||
* **Minimize divergence**: Structure control flow to keep warps coherent
|
* **Minimize divergence**: Structure control flow to keep warps coherent
|
||||||
|
|
||||||
For understanding the theory behind these techniques, refer to
|
For understanding the theory behind these techniques, refer to
|
||||||
:doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`.
|
:doc:`../understand/performance_optimization` and :doc:`../understand/hardware_implementation`.
|
||||||
|
|||||||
Reference in New Issue
Block a user