2
0
Ficheiros
rocm-systems/projects/hip/docs/tutorial/programming-patterns/multikernel_bfs.rst
T
Istvan Kiss 2f6fb89c51 Add GPU programming patterns tutorials (#1918)
Update projects/hip/docs/tutorial/programming-patterns/atomic_operations_histogram.rst


WIP

Co-authored-by: Julia Jiang <56359287+jujiang-del@users.noreply.github.com>
2025-11-20 10:03:22 -08:00

485 linhas
15 KiB
ReStructuredText

Este ficheiro contém caracteres Unicode ambíguos
Este ficheiro contém caracteres Unicode que podem ser confundidos com outros caracteres. Se acha que é intencional, pode ignorar este aviso com segurança. Use o botão Revelar para os mostrar.
.. meta::
:description: Multi-kernel programming with breadth-first search (BFS)
:keywords: AMD, ROCm, HIP, GPU programming, multi-kernel, BFS, breadth-first search
*******************************************************************************
Multi-kernel programming: breadth-first search tutorial
*******************************************************************************
Many real-world GPU workloads involve multiple kernels cooperating to solve a
single problem. This tutorial explores **multi-kernel GPU programming** using
the breadth-first search (BFS) algorithm, a foundational graph traversal
method widely used in networking, path-finding, and social network analysis.
The implementation is adapted from the **Rodinia benchmark suite**, a
well-known collection of heterogeneous computing workloads that demonstrate
different parallel programming strategies.
.. include:: ../prerequisites.rst
Multi-kernel GPU programming
============================
In GPU computing, some algorithms cannot be efficiently expressed using a
single kernel due to synchronization or dependency constraints. Instead, they
are decomposed into multiple kernels that execute sequentially, with each
kernel responsible for a specific computation phase.
This approach, called **multi-kernel programming**, is essential when:
* Results from one kernel determine the input for the next.
* Global synchronization between thread blocks is required.
* Control flow depends on runtime conditions.
* The algorithm involves iterative or level-wise processing.
Breadth-first search (BFS)
==========================
Breadth-first search (BFS) is a **layered graph traversal algorithm** that
explores nodes level by level, starting from a root node. It guarantees finding
the shortest path (in edge count) to all reachable nodes in an unweighted
graph.
Applications of BFS include:
* **Path-finding**: Finding shortest paths between nodes.
* **Peer-to-peer networking**: Network topology discovery.
* **GPS navigation**: Route planning and optimization.
* **Social networks**: Friend recommendations and connection analysis.
* **Web crawling**: Systematic website exploration.
Algorithm characteristics
-------------------------
BFS is structured as a level-synchronous algorithm:
* Nodes in the same graph level are processed concurrently.
* A queue (or "frontier") tracks which nodes to explore next.
* Each node is visited once to prevent redundant processing.
Sequential BFS is straightforward but inherently serial due to the
level-by-level dependency between nodes. GPU parallelization requires
restructuring the traversal to exploit data parallelism across nodes
within the same frontier.
Sequential BFS algorithm
=========================
Let's first understand how BFS works sequentially before parallelizing it.
Example Graph
~~~~~~~~~~~~~
Consider a simple graph with four nodes:
.. code-block:: text
R (root)
/ \
A B
\ /
C
Step-by-step execution
~~~~~~~~~~~~~~~~~~~~~~
**Step 1**: Start at the root node ``R``
* Mark ``R`` as visited
* Enqueue ``R``
* Queue: [R]
**Step 2**: Process ``R``
* Dequeue ``R``
* Discover neighbors: ``A`` and ``B``
* Enqueue both, mark as visited
* Queue: [A, B]
**Step 3**: Process ``A``
* Dequeue ``A``
* Neighbors: ``R`` (visited) and ``C`` (new)
* Enqueue ``C``
* Queue: [B, C]
**Step 4**: Process ``B``
* Dequeue ``B``
* Neighbors: ``R`` (visited) and ``C`` (visited)
* Queue: [C]
**Step 5**: Process ``C``
* Dequeue ``C``
* All neighbors visited
* Queue becomes empty — traversal complete
Parallel BFS on GPU
===================
Unlike dense linear algebra, BFS is an **irregular** algorithm. The amount of
work per node varies, and the connectivity pattern of the graph drives
execution. The main challenges are:
1. **Data dependencies**: nodes in the next level depend on the previous level.
2. **Irregular parallelism**: each frontier may contain a very different number of nodes.
3. **Dynamic workload**: the size of the next frontier is unknown at runtime.
4. **Synchronization**: all nodes in one frontier must complete before the next begins.
The **frontier** is the set of nodes being processed at a given BFS level.
Parallel BFS executes all frontier nodes simultaneously, using one thread per
node to discover new neighbors and mark them for the next iteration.
Implementation strategy
-----------------------
The GPU implementation performs BFS using **two cooperating kernels**:
1. **Kernel 1**: processes all nodes in the current frontier.
2. **Kernel 2**: updates the next frontier and checks if work remains.
This design provides **implicit synchronization** between levels while avoiding
race conditions. The host (CPU) manages the iterative control loop, launching
kernels repeatedly until no more frontier nodes exist.
Data structures
===============
The graph is represented using adjacency lists stored in arrays:
.. code-block:: c++
struct Node {
int starting; // starting index in the edge list
int no_of_edges; // number of outgoing edges
};
**Main arrays:**
* ``g_graph_nodes``: node array storing offsets into the edge list.
* ``g_graph_edges``: flattened list of edge destinations.
* ``g_graph_mask``: boolean array indicating active frontier nodes.
* ``g_updating_graph_mask``: marks nodes to be added to the next frontier.
* ``g_graph_visited``: tracks which nodes were visited.
* ``g_graph_cost``: stores the distance (edge count) from the source node.
**Control flow flags:**
* ``g_over``: device-side flag indicating whether another iteration is needed.
* The host resets this flag each iteration and checks it after kernel execution.
The two-kernel approach
=======================
The two-kernel structure ensures correctness and efficient synchronization:
* **Exploration kernel (Kernel 1)** discovers new nodes.
* **Update kernel (Kernel 2)** finalizes state for the next iteration.
This separation:
* Avoids race conditions between threads of different levels.
* Provides synchronization between BFS levels.
* Keeps control logic simple on the host side.
Kernel 1: process current frontier
----------------------------------
Each thread processes one node from the current frontier, examining all of its
outgoing edges:
.. code-block:: c++
__global__ void Kernel1(
Node* g_graph_nodes,
int* g_graph_edges,
bool* g_graph_mask,
bool* g_updating_graph_mask,
bool* g_graph_visited,
int* g_graph_cost,
int no_of_nodes)
{
int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;
if (tid < no_of_nodes && g_graph_mask[tid]) {
g_graph_mask[tid] = false;
for (int i = g_graph_nodes[tid].starting;
i < g_graph_nodes[tid].starting + g_graph_nodes[tid].no_of_edges;
i++) {
int id = g_graph_edges[i];
if (!g_graph_visited[id]) {
g_graph_cost[id] = g_graph_cost[tid] + 1;
g_updating_graph_mask[id] = true;
}
}
}
}
**Kernel 1 responsibilities:**
* Clear the nodes mask (mark processed).
* Explore all edges.
* For each unvisited neighbor:
* Compute cost (distance).
* Add to the next frontier.
Kernel 2: update frontier
-------------------------
This kernel finalizes the next frontier:
.. code-block:: c++
__global__ void Kernel2(
bool* g_graph_mask,
bool* g_updating_graph_mask,
bool* g_graph_visited,
bool* g_over,
int no_of_nodes)
{
int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;
if (tid < no_of_nodes && g_updating_graph_mask[tid]) {
g_graph_mask[tid] = true;
g_graph_visited[tid] = true;
*g_over = true;
g_updating_graph_mask[tid] = false;
}
}
**Kernel 2 responsibilities:**
* Move newly discovered nodes into the active frontier.
* Mark them as visited.
* Signal continuation via ``*g_over``.
Host-side control loop
======================
.. code-block:: c++
do {
h_over = false;
hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice);
Kernel1<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
d_graph_nodes, d_graph_edges, d_graph_mask,
d_graph_updating_graph_mask, d_graph_visited,
d_graph_cost, no_of_nodes);
hipDeviceSynchronize();
Kernel2<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
d_graph_mask, d_graph_updating_graph_mask,
d_graph_visited, d_over, no_of_nodes);
hipDeviceSynchronize();
hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost);
} while (h_over);
The loop exits when no new nodes are discovered. ``g_over`` or ``h_over`` on
host side remains ``false`` after one full iteration.
Performance Characteristics
===========================
Parallelism Patterns
--------------------
**Within each iteration:**
- High parallelism: All frontier nodes processed simultaneously
- Work distribution: One thread per node
**Across iterations:**
- Sequential: Must complete one level before starting the next
- Variable parallelism: Different levels may have different numbers of nodes
Workload Characteristics
------------------------
.. list-table::
:header-rows: 1
:widths: 30 70
* - Characteristic
- Description
* - **Irregular**
- Frontier size varies dramatically across levels
* - **Data-dependent**
- Graph structure determines parallel work available
* - **Dynamic**
- Cannot predict workload statically
* - **Memory-bound**
- Many memory accesses per computation
Best practices
==============
This section outlines recommended practices for implementing an efficient
GPU-accelerated Breadth-First Search (BFS). It highlights design principles,
memory-management strategies, and debugging techniques that help ensure
correctness, maintainability, and high performance when mapping BFS onto modern
GPU architectures.
Design principles
-----------------
1. **Define clear kernel roles**
Decompose BFS into well-defined GPU kernels, each responsible for a specific
phase of computation. For example:
* **Kernel 1**: frontier expansion (discovering new nodes)
* **Kernel 2**: frontier update (marking next-level nodes)
This separation simplifies synchronization and ensures that each kernel
operates on independent data regions.
2. **Minimize hostdevice communication**
Keep graph data structures (nodes, edges, masks) resident on the GPU across
iterations. Only transfer lightweight control flags such as ``g_over`` to the
host each loop iteration to check termination conditions.
3. **Kernel boundaries as synchronization points**
Kernel launch boundaries on the same stream naturally enforce global
synchronization across all threads on the GPU. Each kernel invocation
completes before the next begins, ensuring that:
* All nodes in the current frontier are fully processed before updating the
next frontier.
* Memory updates to arrays like ``g_graph_cost`` or ``g_graph_mask`` are
visible to all threads in subsequent kernels.
This avoids the need for costly device-wide barriers or explicit
synchronization primitives within a single kernel. Leverage kernel sequencing
to structure iterative algorithms cleanly—each kernel represents one
computation phase per BFS level.
4. **Flag-based control**
Use device-side flags for dynamic termination and conditional control flow.
In BFS, the Boolean flag ``g_over`` serves as a device-to-host signal
indicating whether new nodes were discovered during the current iteration.
* Initialize ``g_over`` to ``false`` on the host at the start of each
iteration.
* Allow GPU threads in **Kernel 2** to set ``*g_over = true`` when adding new
nodes to the next frontier.
* After kernel completion, copy the flag back to the host using
:cpp:func:`hipMemcpy`. If ``g_over`` remains false, the traversal is
complete.
This mechanism avoids repeated host intervention and enables a tight
CPUGPU control loop that dynamically adapts to workload size without
transferring large data structures.
Memory strategy
---------------
1. **Persistent device allocations**
Allocate all required device buffers once prior to traversal. Reuse these
allocations across multiple BFS runs or multiple source nodes to minimize
the overhead of repeated :cpp:func:`hipMalloc` and :cpp:func:`hipFree`
calls.
2. **Minimize hostdevice communication**
Keep graph data structures (nodes, edges, masks) resident on the GPU across
iterations. Only transfer lightweight control flags such as ``g_over`` to the
host each loop iteration to check termination conditions.
3. **Use pinned host memory for control flags**
When copying ``g_over`` or other control signals between host and device,
allocate host memory using pinned (page-locked) buffers to accelerate DMA
transfers.
Debugging and validation
------------------------
1. **Frontier validation**
After each iteration, verify the number of nodes marked in
``g_graph_mask``. Unexpected empty or overfull frontiers often indicate
incorrect synchronization or uninitialized masks.
2. **Termination condition check**
Confirm that the host-side loop terminates when ``g_over`` remains false
for one iteration. If the loop never ends, ensure ``g_over`` is reset on the
host before each kernel launch.
3. **Result verification**
Compare computed distances in ``g_graph_cost`` against a CPU reference
implementation for small graphs to validate correctness.
4. **Profiling and bottleneck detection**
Use tools such as :doc:`rocprofv3<rocprofiler-sdk:how-to/using-rocprofv3>`
or :doc:`ROCm compute profiler<rocprofiler-compute:how-to/profile/mode>`
to measure per-kernel execution times, memory throughput, and
synchronization overhead.
5. **Logging and debug builds**
Enable optional logging for iteration counts, frontier sizes, and
synchronization states during development. Disable logging in production
builds to avoid performance impact.
Conclusion
==========
Multi-kernel GPU programming is essential for complex algorithms that require:
* Multiple phases of computation.
* Data dependencies between phases.
* Dynamic control flow based on intermediate results.
The BFS example demonstrates:
* How to decompose algorithms into multiple cooperating kernels.
* Techniques for managing frontiers and iterative processing.
* Strategies for handling irregular and dynamic parallelism.
* Proper synchronization between kernel launches.
Key takeaways:
1. **Kernel boundaries provide synchronization**: Use them strategically to ensure correctness.
2. **Separate exploration from update**: Prevents race conditions in level-based algorithms.
3. **Host controls iteration**: CPU manages the overall loop while GPU does heavy lifting.
4. **Flags enable dynamic control**: Device-side flags allow work-dependent termination.
Understanding multi-kernel patterns enables developers to implement
sophisticated algorithms like graph processing, dynamic programming, and
iterative refinement methods efficiently on GPUs.