Files
Istvan Kiss 2f6fb89c51 Add GPU programming patterns tutorials (#1918)
Update projects/hip/docs/tutorial/programming-patterns/atomic_operations_histogram.rst


WIP

Co-authored-by: Julia Jiang <56359287+jujiang-del@users.noreply.github.com>
2025-11-20 10:03:22 -08:00

485 lines
15 KiB
ReStructuredText
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
.. meta::
:description: Multi-kernel programming with breadth-first search (BFS)
:keywords: AMD, ROCm, HIP, GPU programming, multi-kernel, BFS, breadth-first search
*******************************************************************************
Multi-kernel programming: breadth-first search tutorial
*******************************************************************************
Many real-world GPU workloads involve multiple kernels cooperating to solve a
single problem. This tutorial explores **multi-kernel GPU programming** using
the breadth-first search (BFS) algorithm, a foundational graph traversal
method widely used in networking, path-finding, and social network analysis.
The implementation is adapted from the **Rodinia benchmark suite**, a
well-known collection of heterogeneous computing workloads that demonstrate
different parallel programming strategies.
.. include:: ../prerequisites.rst
Multi-kernel GPU programming
============================
In GPU computing, some algorithms cannot be efficiently expressed using a
single kernel due to synchronization or dependency constraints. Instead, they
are decomposed into multiple kernels that execute sequentially, with each
kernel responsible for a specific computation phase.
This approach, called **multi-kernel programming**, is essential when:
* Results from one kernel determine the input for the next.
* Global synchronization between thread blocks is required.
* Control flow depends on runtime conditions.
* The algorithm involves iterative or level-wise processing.
Breadth-first search (BFS)
==========================
Breadth-first search (BFS) is a **layered graph traversal algorithm** that
explores nodes level by level, starting from a root node. It guarantees finding
the shortest path (in edge count) to all reachable nodes in an unweighted
graph.
Applications of BFS include:
* **Path-finding**: Finding shortest paths between nodes.
* **Peer-to-peer networking**: Network topology discovery.
* **GPS navigation**: Route planning and optimization.
* **Social networks**: Friend recommendations and connection analysis.
* **Web crawling**: Systematic website exploration.
Algorithm characteristics
-------------------------
BFS is structured as a level-synchronous algorithm:
* Nodes in the same graph level are processed concurrently.
* A queue (or "frontier") tracks which nodes to explore next.
* Each node is visited once to prevent redundant processing.
Sequential BFS is straightforward but inherently serial due to the
level-by-level dependency between nodes. GPU parallelization requires
restructuring the traversal to exploit data parallelism across nodes
within the same frontier.
Sequential BFS algorithm
=========================
Let's first understand how BFS works sequentially before parallelizing it.
Example Graph
~~~~~~~~~~~~~
Consider a simple graph with four nodes:
.. code-block:: text
R (root)
/ \
A B
\ /
C
Step-by-step execution
~~~~~~~~~~~~~~~~~~~~~~
**Step 1**: Start at the root node ``R``
* Mark ``R`` as visited
* Enqueue ``R``
* Queue: [R]
**Step 2**: Process ``R``
* Dequeue ``R``
* Discover neighbors: ``A`` and ``B``
* Enqueue both, mark as visited
* Queue: [A, B]
**Step 3**: Process ``A``
* Dequeue ``A``
* Neighbors: ``R`` (visited) and ``C`` (new)
* Enqueue ``C``
* Queue: [B, C]
**Step 4**: Process ``B``
* Dequeue ``B``
* Neighbors: ``R`` (visited) and ``C`` (visited)
* Queue: [C]
**Step 5**: Process ``C``
* Dequeue ``C``
* All neighbors visited
* Queue becomes empty — traversal complete
Parallel BFS on GPU
===================
Unlike dense linear algebra, BFS is an **irregular** algorithm. The amount of
work per node varies, and the connectivity pattern of the graph drives
execution. The main challenges are:
1. **Data dependencies**: nodes in the next level depend on the previous level.
2. **Irregular parallelism**: each frontier may contain a very different number of nodes.
3. **Dynamic workload**: the size of the next frontier is unknown at runtime.
4. **Synchronization**: all nodes in one frontier must complete before the next begins.
The **frontier** is the set of nodes being processed at a given BFS level.
Parallel BFS executes all frontier nodes simultaneously, using one thread per
node to discover new neighbors and mark them for the next iteration.
Implementation strategy
-----------------------
The GPU implementation performs BFS using **two cooperating kernels**:
1. **Kernel 1**: processes all nodes in the current frontier.
2. **Kernel 2**: updates the next frontier and checks if work remains.
This design provides **implicit synchronization** between levels while avoiding
race conditions. The host (CPU) manages the iterative control loop, launching
kernels repeatedly until no more frontier nodes exist.
Data structures
===============
The graph is represented using adjacency lists stored in arrays:
.. code-block:: c++
struct Node {
int starting; // starting index in the edge list
int no_of_edges; // number of outgoing edges
};
**Main arrays:**
* ``g_graph_nodes``: node array storing offsets into the edge list.
* ``g_graph_edges``: flattened list of edge destinations.
* ``g_graph_mask``: boolean array indicating active frontier nodes.
* ``g_updating_graph_mask``: marks nodes to be added to the next frontier.
* ``g_graph_visited``: tracks which nodes were visited.
* ``g_graph_cost``: stores the distance (edge count) from the source node.
**Control flow flags:**
* ``g_over``: device-side flag indicating whether another iteration is needed.
* The host resets this flag each iteration and checks it after kernel execution.
The two-kernel approach
=======================
The two-kernel structure ensures correctness and efficient synchronization:
* **Exploration kernel (Kernel 1)** discovers new nodes.
* **Update kernel (Kernel 2)** finalizes state for the next iteration.
This separation:
* Avoids race conditions between threads of different levels.
* Provides synchronization between BFS levels.
* Keeps control logic simple on the host side.
Kernel 1: process current frontier
----------------------------------
Each thread processes one node from the current frontier, examining all of its
outgoing edges:
.. code-block:: c++
__global__ void Kernel1(
Node* g_graph_nodes,
int* g_graph_edges,
bool* g_graph_mask,
bool* g_updating_graph_mask,
bool* g_graph_visited,
int* g_graph_cost,
int no_of_nodes)
{
int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;
if (tid < no_of_nodes && g_graph_mask[tid]) {
g_graph_mask[tid] = false;
for (int i = g_graph_nodes[tid].starting;
i < g_graph_nodes[tid].starting + g_graph_nodes[tid].no_of_edges;
i++) {
int id = g_graph_edges[i];
if (!g_graph_visited[id]) {
g_graph_cost[id] = g_graph_cost[tid] + 1;
g_updating_graph_mask[id] = true;
}
}
}
}
**Kernel 1 responsibilities:**
* Clear the nodes mask (mark processed).
* Explore all edges.
* For each unvisited neighbor:
* Compute cost (distance).
* Add to the next frontier.
Kernel 2: update frontier
-------------------------
This kernel finalizes the next frontier:
.. code-block:: c++
__global__ void Kernel2(
bool* g_graph_mask,
bool* g_updating_graph_mask,
bool* g_graph_visited,
bool* g_over,
int no_of_nodes)
{
int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;
if (tid < no_of_nodes && g_updating_graph_mask[tid]) {
g_graph_mask[tid] = true;
g_graph_visited[tid] = true;
*g_over = true;
g_updating_graph_mask[tid] = false;
}
}
**Kernel 2 responsibilities:**
* Move newly discovered nodes into the active frontier.
* Mark them as visited.
* Signal continuation via ``*g_over``.
Host-side control loop
======================
.. code-block:: c++
do {
h_over = false;
hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice);
Kernel1<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
d_graph_nodes, d_graph_edges, d_graph_mask,
d_graph_updating_graph_mask, d_graph_visited,
d_graph_cost, no_of_nodes);
hipDeviceSynchronize();
Kernel2<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
d_graph_mask, d_graph_updating_graph_mask,
d_graph_visited, d_over, no_of_nodes);
hipDeviceSynchronize();
hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost);
} while (h_over);
The loop exits when no new nodes are discovered. ``g_over`` or ``h_over`` on
host side remains ``false`` after one full iteration.
Performance Characteristics
===========================
Parallelism Patterns
--------------------
**Within each iteration:**
- High parallelism: All frontier nodes processed simultaneously
- Work distribution: One thread per node
**Across iterations:**
- Sequential: Must complete one level before starting the next
- Variable parallelism: Different levels may have different numbers of nodes
Workload Characteristics
------------------------
.. list-table::
:header-rows: 1
:widths: 30 70
* - Characteristic
- Description
* - **Irregular**
- Frontier size varies dramatically across levels
* - **Data-dependent**
- Graph structure determines parallel work available
* - **Dynamic**
- Cannot predict workload statically
* - **Memory-bound**
- Many memory accesses per computation
Best practices
==============
This section outlines recommended practices for implementing an efficient
GPU-accelerated Breadth-First Search (BFS). It highlights design principles,
memory-management strategies, and debugging techniques that help ensure
correctness, maintainability, and high performance when mapping BFS onto modern
GPU architectures.
Design principles
-----------------
1. **Define clear kernel roles**
Decompose BFS into well-defined GPU kernels, each responsible for a specific
phase of computation. For example:
* **Kernel 1**: frontier expansion (discovering new nodes)
* **Kernel 2**: frontier update (marking next-level nodes)
This separation simplifies synchronization and ensures that each kernel
operates on independent data regions.
2. **Minimize hostdevice communication**
Keep graph data structures (nodes, edges, masks) resident on the GPU across
iterations. Only transfer lightweight control flags such as ``g_over`` to the
host each loop iteration to check termination conditions.
3. **Kernel boundaries as synchronization points**
Kernel launch boundaries on the same stream naturally enforce global
synchronization across all threads on the GPU. Each kernel invocation
completes before the next begins, ensuring that:
* All nodes in the current frontier are fully processed before updating the
next frontier.
* Memory updates to arrays like ``g_graph_cost`` or ``g_graph_mask`` are
visible to all threads in subsequent kernels.
This avoids the need for costly device-wide barriers or explicit
synchronization primitives within a single kernel. Leverage kernel sequencing
to structure iterative algorithms cleanly—each kernel represents one
computation phase per BFS level.
4. **Flag-based control**
Use device-side flags for dynamic termination and conditional control flow.
In BFS, the Boolean flag ``g_over`` serves as a device-to-host signal
indicating whether new nodes were discovered during the current iteration.
* Initialize ``g_over`` to ``false`` on the host at the start of each
iteration.
* Allow GPU threads in **Kernel 2** to set ``*g_over = true`` when adding new
nodes to the next frontier.
* After kernel completion, copy the flag back to the host using
:cpp:func:`hipMemcpy`. If ``g_over`` remains false, the traversal is
complete.
This mechanism avoids repeated host intervention and enables a tight
CPUGPU control loop that dynamically adapts to workload size without
transferring large data structures.
Memory strategy
---------------
1. **Persistent device allocations**
Allocate all required device buffers once prior to traversal. Reuse these
allocations across multiple BFS runs or multiple source nodes to minimize
the overhead of repeated :cpp:func:`hipMalloc` and :cpp:func:`hipFree`
calls.
2. **Minimize hostdevice communication**
Keep graph data structures (nodes, edges, masks) resident on the GPU across
iterations. Only transfer lightweight control flags such as ``g_over`` to the
host each loop iteration to check termination conditions.
3. **Use pinned host memory for control flags**
When copying ``g_over`` or other control signals between host and device,
allocate host memory using pinned (page-locked) buffers to accelerate DMA
transfers.
Debugging and validation
------------------------
1. **Frontier validation**
After each iteration, verify the number of nodes marked in
``g_graph_mask``. Unexpected empty or overfull frontiers often indicate
incorrect synchronization or uninitialized masks.
2. **Termination condition check**
Confirm that the host-side loop terminates when ``g_over`` remains false
for one iteration. If the loop never ends, ensure ``g_over`` is reset on the
host before each kernel launch.
3. **Result verification**
Compare computed distances in ``g_graph_cost`` against a CPU reference
implementation for small graphs to validate correctness.
4. **Profiling and bottleneck detection**
Use tools such as :doc:`rocprofv3<rocprofiler-sdk:how-to/using-rocprofv3>`
or :doc:`ROCm compute profiler<rocprofiler-compute:how-to/profile/mode>`
to measure per-kernel execution times, memory throughput, and
synchronization overhead.
5. **Logging and debug builds**
Enable optional logging for iteration counts, frontier sizes, and
synchronization states during development. Disable logging in production
builds to avoid performance impact.
Conclusion
==========
Multi-kernel GPU programming is essential for complex algorithms that require:
* Multiple phases of computation.
* Data dependencies between phases.
* Dynamic control flow based on intermediate results.
The BFS example demonstrates:
* How to decompose algorithms into multiple cooperating kernels.
* Techniques for managing frontiers and iterative processing.
* Strategies for handling irregular and dynamic parallelism.
* Proper synchronization between kernel launches.
Key takeaways:
1. **Kernel boundaries provide synchronization**: Use them strategically to ensure correctness.
2. **Separate exploration from update**: Prevents race conditions in level-based algorithms.
3. **Host controls iteration**: CPU manages the overall loop while GPU does heavy lifting.
4. **Flags enable dynamic control**: Device-side flags allow work-dependent termination.
Understanding multi-kernel patterns enables developers to implement
sophisticated algorithms like graph processing, dynamic programming, and
iterative refinement methods efficiently on GPUs.