2f6fb89c51
Update projects/hip/docs/tutorial/programming-patterns/atomic_operations_histogram.rst WIP Co-authored-by: Julia Jiang <56359287+jujiang-del@users.noreply.github.com>
311 lignes
10 KiB
ReStructuredText
311 lignes
10 KiB
ReStructuredText
.. meta::
|
|
:description: HIP atomic operations histogram tutorial
|
|
:keywords: AMD, ROCm, HIP, atomic operations, GPU programming, histogram, synchronization primitives
|
|
|
|
*******************************************************************************
|
|
Atomic operations: Histogram tutorial
|
|
*******************************************************************************
|
|
|
|
In GPU programming, a core design principle is to **avoid simultaneous writes to
|
|
the same memory address by multiple threads**. When multiple threads write to
|
|
the same location without proper synchronization, this creates a
|
|
**race condition**, where the final result depends on unpredictable thread
|
|
execution order.
|
|
|
|
Unlike CPUs, GPUs are designed for high-throughput parallel execution with
|
|
relaxed memory consistency models and limited cache coherence mechanisms. This
|
|
architectural choice maximizes bandwidth and scalability but introduces
|
|
challenges when multiple threads need to safely update shared state.
|
|
|
|
This tutorial demonstrates how to safely handle **concurrent memory updates**
|
|
using **atomic operations**, illustrated through the practical example of
|
|
computing an image brightness histogram on the GPU.
|
|
|
|
.. include:: ../prerequisites.rst
|
|
|
|
Race condition
|
|
==============
|
|
|
|
A **race condition** occurs when two or more threads attempt to
|
|
read-modify-write the same memory location concurrently without proper
|
|
synchronization. Because GPU threads execute asynchronously across multiple
|
|
cores (compute units), concurrent writes can interleave unpredictably,
|
|
leading to incorrect results.
|
|
|
|
For example, if two threads simultaneously attempt:
|
|
|
|
.. code-block:: c++
|
|
|
|
histogram[bin] = histogram[bin] + 1;
|
|
|
|
both may read the same old value before either writes back,
|
|
resulting in only one increment being reflected. This results in **lost updates**
|
|
and **nondeterministic output**, which must be avoided.
|
|
|
|
Histogram
|
|
=========
|
|
|
|
A **histogram** partitions continuous data into discrete intervals called
|
|
**bins** and counts how many data points fall into each bin. In image processing,
|
|
a histogram typically represents the **distribution of pixel intensities** for
|
|
example brightness or color channel values.
|
|
|
|
The histogram algorithm can be expressed as:
|
|
|
|
.. math::
|
|
|
|
H[b] = \sum_{i=1}^{N} \delta(b - \lfloor f(x_i) \rfloor)
|
|
|
|
where :math:`f(x_i)` maps each data value to its corresponding bin index
|
|
:math:`b`, and :math:`\delta()` is 1 when the value belongs to bin :math:`b` and
|
|
0 otherwise.
|
|
|
|
The basic computational steps are:
|
|
|
|
1. Iterate through all pixels (or data points).
|
|
2. Determine the appropriate bin for each value.
|
|
3. Increment that bin’s count.
|
|
|
|
In a serial CPU program, this is straightforward. On a GPU, thousands of threads
|
|
may attempt to increment the same bin concurrently, leading to **race
|
|
conditions** unless atomic synchronization is used.
|
|
|
|
The Challenge in parallel context
|
|
---------------------------------
|
|
|
|
When multiple threads attempt to increment the same bin:
|
|
|
|
* One thread’s update can overwrite another’s pending increment.
|
|
|
|
* Memory coherence cannot guarantee ordered visibility across thread blocks.
|
|
|
|
* The final result may be inconsistent or incorrect.
|
|
|
|
This necessitates synchronization mechanisms to ensure that updates occur in a
|
|
**mutually exclusive** manner without introducing high overhead.
|
|
|
|
Atomic operations
|
|
=================
|
|
|
|
An **atomic operation** ensures that a compound operation — typically a
|
|
read-modify-write sequence — executes as an **indivisible unit**. From the
|
|
programmer’s perspective, atomicity guarantees that no other thread can observe
|
|
a partially completed operation.
|
|
|
|
Formally, an operation :math:`O(x)` on shared variable :math:`x` is **atomic**
|
|
if its execution satisfies:
|
|
|
|
.. math::
|
|
|
|
\forall T_i, T_j, \text{ the effects of } O(x) \text{ appear serializable.}
|
|
|
|
That is, all threads observe results as if operations occurred in a single,
|
|
sequential order.
|
|
|
|
Mechanics
|
|
---------
|
|
|
|
Atomic operations on GPUs are implemented in hardware through a **memory
|
|
arbitration unit** that locks a cache line, performs the modification, and
|
|
releases the lock. This ensures correctness even under massive parallelism.
|
|
|
|
When a thread performs an atomic operation:
|
|
|
|
1. The target memory location is temporarily locked.
|
|
2. The value is fetched and updated.
|
|
3. The update is written back, and the lock is released.
|
|
|
|
No other thread can modify the same memory location during this sequence.
|
|
|
|
Atomic functions
|
|
----------------
|
|
|
|
HIP provides a wide set of atomic primitives to synchronize updates to shared
|
|
memory or global memory locations:
|
|
|
|
.. list-table::
|
|
:header-rows: 1
|
|
:widths: 20 80
|
|
|
|
* - Operation
|
|
- Description
|
|
* - ``atomicAdd``
|
|
- Atomically adds a value to a memory location and returns the old value.
|
|
* - ``atomicSub``
|
|
- Atomically subtracts a value.
|
|
* - ``atomicExch``
|
|
- Atomically exchanges values between a register and memory.
|
|
* - ``atomicCAS``
|
|
- Performs an atomic compare-and-swap; fundamental for implementing locks.
|
|
* - ``atomicMax`` / ``atomicMin``
|
|
- Updates to the maximum or minimum of two values.
|
|
* - ``atomicInc`` / ``atomicDec``
|
|
- Atomically increments or decrements a counter, wrapping at a boundary.
|
|
|
|
Atomic operations in kernels can operate on block scope (shared memory),
|
|
device scope (global memory), or system scope (system memory), depending on
|
|
:doc:`hardware support <rocm:reference/gpu-atomics-operation>`.
|
|
|
|
For more information, please check :ref:`atomic functions <atomic functions>`.
|
|
|
|
Image brightness histogram
|
|
==========================
|
|
|
|
We will compute a histogram that captures the **distribution of pixel
|
|
brightness** in an RGB image. The algorithm:
|
|
|
|
1. Reads image data in **channel-height-width** format.
|
|
2. Converts RGB values to grayscale brightness.
|
|
3. Maps brightness to a histogram bin.
|
|
4. Atomically increments the corresponding bin counter.
|
|
|
|
Kernel implementation
|
|
---------------------
|
|
|
|
.. code-block:: c++
|
|
|
|
__global__ void calculateHistogram(float* imageData, int* histogram,
|
|
int width, int height,
|
|
int channels, int numBins)
|
|
{
|
|
int x = blockIdx.x * blockDim.x + threadIdx.x;
|
|
int y = blockIdx.y * blockDim.y + threadIdx.y;
|
|
|
|
if (x >= width || y >= height)
|
|
return;
|
|
|
|
int idx = (y * width + x) * channels;
|
|
float brightness = 0.0f;
|
|
|
|
for (int c = 0; c < channels; ++c)
|
|
brightness += imageData[idx + c];
|
|
|
|
brightness /= channels; // Normalize to [0, 1]
|
|
int bin = static_cast<int>(brightness * numBins);
|
|
|
|
// Atomic increment to avoid race conditions
|
|
atomicAdd(&histogram[bin], 1);
|
|
}
|
|
|
|
Thread identification
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Each thread computes one pixel’s contribution using its 2D thread and block
|
|
indices:
|
|
|
|
.. code-block:: c++
|
|
|
|
int x = blockIdx.x * blockDim.x + threadIdx.x;
|
|
int y = blockIdx.y * blockDim.y + threadIdx.y;
|
|
|
|
This mapping provides a 1:1 correspondence between threads and pixels, making
|
|
the computation naturally parallel.
|
|
|
|
Brightness computation
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Each pixel’s brightness is computed as the arithmetic mean of its RGB channels:
|
|
|
|
.. math::
|
|
:nowrap:
|
|
|
|
\[
|
|
I'(x, y) = \frac{R + G + B}{3}
|
|
\]
|
|
|
|
This value is then normalized to [0, 1] and mapped to one of `numBins`
|
|
histogram intervals.
|
|
|
|
Safe histogram update
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The key step is:
|
|
|
|
.. code-block:: c++
|
|
|
|
atomicAdd(&histogram[bin], 1);
|
|
|
|
This ensures that even if thousands of threads map to the same bin, each
|
|
increment is serialized correctly, maintaining an accurate bin count.
|
|
|
|
Performance characteristics
|
|
===========================
|
|
|
|
Benefits
|
|
--------
|
|
|
|
* **Correctness under parallel updates:** Ensures race-free accumulation.
|
|
|
|
* **Simplified synchronization:** No explicit locks or barriers needed.
|
|
|
|
* **Hardware-level efficiency:** Implemented directly in the GPU memory
|
|
subsystem.
|
|
|
|
Limitations
|
|
-----------
|
|
|
|
While atomic operations guarantee correctness, they can **serialize execution**
|
|
when multiple threads target the same memory address. This causes contention and
|
|
reduces effective parallelism.
|
|
|
|
Typical performance degradation sources include:
|
|
|
|
* **Hot bins:** When many pixels fall into a small subset of bins.
|
|
|
|
* **Global memory atomics:** Global memory atomics are slower than shared memory
|
|
atomics due to higher access latency.
|
|
|
|
* **Warp serialization:** Threads within a warp waiting for the same atomic
|
|
target serialize.
|
|
|
|
Best practices
|
|
==============
|
|
|
|
1. **Apply atomic operations only where necessary**
|
|
|
|
Atomic instructions serialize access to a memory location and use can
|
|
diminish SIMT parallel efficiency and increase warp stalls. Restrict atomic
|
|
usage to code paths where data races cannot be eliminated through algorithmic
|
|
restructuring.
|
|
|
|
2. **Minimize contention**
|
|
|
|
High contention on a single address or a small set of addresses leads to
|
|
serialization. Distribute writes across independent memory locations.
|
|
|
|
3. **Leverage shared memory**
|
|
|
|
Use fast, low-latency shared memory to aggregate partial results within a
|
|
block before issuing a single atomic update to global memory.
|
|
|
|
4. **Validate correctness**
|
|
|
|
Validate the numerical and logical correctness of GPU kernels by comparing
|
|
against single-threaded or deterministic multi-threaded CPU baselines.
|
|
|
|
5. **Profile regularly**
|
|
|
|
GPU performance is highly sensitive to thread divergence, memory-access
|
|
patterns, and workload distribution. Regularly use profiling tools such as
|
|
:doc:`rocprofv3<rocprofiler-sdk:how-to/using-rocprofv3>` or
|
|
:doc:`ROCm compute profiler<rocprofiler-compute:how-to/profile/mode>` to
|
|
examine warp-level execution efficiency, memory-coalescing behavior,
|
|
occupancy, and atomic throughput bottlenecks.
|
|
|
|
Conclusion
|
|
==========
|
|
|
|
Atomic operations provide a low-level synchronization mechanism that allows
|
|
correct and deterministic parallel updates to shared data structures. In the
|
|
histogram example, :cpp:func:`atomicAdd` ensures that all threads safely
|
|
contribute to their corresponding bins, preventing race conditions.
|
|
|
|
While atomics incur some serialization overhead, they are indispensable for
|
|
algorithms that require concurrent accumulation or counting. By applying
|
|
techniques like privatization and reduction, developers can achieve both
|
|
**correctness** and **high performance** on modern GPUs.
|
|
|
|
Atomic operations form the foundation for more advanced synchronization
|
|
patterns, including parallel reductions, prefix sums, and graph traversal, and
|
|
are essential for developing scalable, data-parallel GPU algorithms.
|