2f6fb89c51
Update projects/hip/docs/tutorial/programming-patterns/atomic_operations_histogram.rst WIP Co-authored-by: Julia Jiang <56359287+jujiang-del@users.noreply.github.com>
485 linhas
15 KiB
ReStructuredText
485 linhas
15 KiB
ReStructuredText
.. meta::
|
||
:description: Multi-kernel programming with breadth-first search (BFS)
|
||
:keywords: AMD, ROCm, HIP, GPU programming, multi-kernel, BFS, breadth-first search
|
||
|
||
*******************************************************************************
|
||
Multi-kernel programming: breadth-first search tutorial
|
||
*******************************************************************************
|
||
|
||
Many real-world GPU workloads involve multiple kernels cooperating to solve a
|
||
single problem. This tutorial explores **multi-kernel GPU programming** using
|
||
the breadth-first search (BFS) algorithm, a foundational graph traversal
|
||
method widely used in networking, path-finding, and social network analysis.
|
||
|
||
The implementation is adapted from the **Rodinia benchmark suite**, a
|
||
well-known collection of heterogeneous computing workloads that demonstrate
|
||
different parallel programming strategies.
|
||
|
||
.. include:: ../prerequisites.rst
|
||
|
||
Multi-kernel GPU programming
|
||
============================
|
||
|
||
In GPU computing, some algorithms cannot be efficiently expressed using a
|
||
single kernel due to synchronization or dependency constraints. Instead, they
|
||
are decomposed into multiple kernels that execute sequentially, with each
|
||
kernel responsible for a specific computation phase.
|
||
|
||
This approach, called **multi-kernel programming**, is essential when:
|
||
|
||
* Results from one kernel determine the input for the next.
|
||
|
||
* Global synchronization between thread blocks is required.
|
||
|
||
* Control flow depends on runtime conditions.
|
||
|
||
* The algorithm involves iterative or level-wise processing.
|
||
|
||
Breadth-first search (BFS)
|
||
==========================
|
||
|
||
Breadth-first search (BFS) is a **layered graph traversal algorithm** that
|
||
explores nodes level by level, starting from a root node. It guarantees finding
|
||
the shortest path (in edge count) to all reachable nodes in an unweighted
|
||
graph.
|
||
|
||
Applications of BFS include:
|
||
|
||
* **Path-finding**: Finding shortest paths between nodes.
|
||
* **Peer-to-peer networking**: Network topology discovery.
|
||
* **GPS navigation**: Route planning and optimization.
|
||
* **Social networks**: Friend recommendations and connection analysis.
|
||
* **Web crawling**: Systematic website exploration.
|
||
|
||
Algorithm characteristics
|
||
-------------------------
|
||
|
||
BFS is structured as a level-synchronous algorithm:
|
||
|
||
* Nodes in the same graph level are processed concurrently.
|
||
* A queue (or "frontier") tracks which nodes to explore next.
|
||
* Each node is visited once to prevent redundant processing.
|
||
|
||
Sequential BFS is straightforward but inherently serial due to the
|
||
level-by-level dependency between nodes. GPU parallelization requires
|
||
restructuring the traversal to exploit data parallelism across nodes
|
||
within the same frontier.
|
||
|
||
Sequential BFS algorithm
|
||
=========================
|
||
|
||
Let's first understand how BFS works sequentially before parallelizing it.
|
||
|
||
Example Graph
|
||
~~~~~~~~~~~~~
|
||
|
||
Consider a simple graph with four nodes:
|
||
|
||
.. code-block:: text
|
||
|
||
R (root)
|
||
/ \
|
||
A B
|
||
\ /
|
||
C
|
||
|
||
Step-by-step execution
|
||
~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
**Step 1**: Start at the root node ``R``
|
||
|
||
* Mark ``R`` as visited
|
||
* Enqueue ``R``
|
||
* Queue: [R]
|
||
|
||
**Step 2**: Process ``R``
|
||
|
||
* Dequeue ``R``
|
||
* Discover neighbors: ``A`` and ``B``
|
||
* Enqueue both, mark as visited
|
||
* Queue: [A, B]
|
||
|
||
**Step 3**: Process ``A``
|
||
|
||
* Dequeue ``A``
|
||
* Neighbors: ``R`` (visited) and ``C`` (new)
|
||
* Enqueue ``C``
|
||
* Queue: [B, C]
|
||
|
||
**Step 4**: Process ``B``
|
||
|
||
* Dequeue ``B``
|
||
* Neighbors: ``R`` (visited) and ``C`` (visited)
|
||
* Queue: [C]
|
||
|
||
**Step 5**: Process ``C``
|
||
|
||
* Dequeue ``C``
|
||
* All neighbors visited
|
||
* Queue becomes empty — traversal complete
|
||
|
||
Parallel BFS on GPU
|
||
===================
|
||
|
||
Unlike dense linear algebra, BFS is an **irregular** algorithm. The amount of
|
||
work per node varies, and the connectivity pattern of the graph drives
|
||
execution. The main challenges are:
|
||
|
||
1. **Data dependencies**: nodes in the next level depend on the previous level.
|
||
|
||
2. **Irregular parallelism**: each frontier may contain a very different number of nodes.
|
||
|
||
3. **Dynamic workload**: the size of the next frontier is unknown at runtime.
|
||
|
||
4. **Synchronization**: all nodes in one frontier must complete before the next begins.
|
||
|
||
The **frontier** is the set of nodes being processed at a given BFS level.
|
||
Parallel BFS executes all frontier nodes simultaneously, using one thread per
|
||
node to discover new neighbors and mark them for the next iteration.
|
||
|
||
Implementation strategy
|
||
-----------------------
|
||
|
||
The GPU implementation performs BFS using **two cooperating kernels**:
|
||
|
||
1. **Kernel 1**: processes all nodes in the current frontier.
|
||
2. **Kernel 2**: updates the next frontier and checks if work remains.
|
||
|
||
This design provides **implicit synchronization** between levels while avoiding
|
||
race conditions. The host (CPU) manages the iterative control loop, launching
|
||
kernels repeatedly until no more frontier nodes exist.
|
||
|
||
Data structures
|
||
===============
|
||
|
||
The graph is represented using adjacency lists stored in arrays:
|
||
|
||
.. code-block:: c++
|
||
|
||
struct Node {
|
||
int starting; // starting index in the edge list
|
||
int no_of_edges; // number of outgoing edges
|
||
};
|
||
|
||
**Main arrays:**
|
||
|
||
* ``g_graph_nodes``: node array storing offsets into the edge list.
|
||
* ``g_graph_edges``: flattened list of edge destinations.
|
||
* ``g_graph_mask``: boolean array indicating active frontier nodes.
|
||
* ``g_updating_graph_mask``: marks nodes to be added to the next frontier.
|
||
* ``g_graph_visited``: tracks which nodes were visited.
|
||
* ``g_graph_cost``: stores the distance (edge count) from the source node.
|
||
|
||
**Control flow flags:**
|
||
|
||
* ``g_over``: device-side flag indicating whether another iteration is needed.
|
||
* The host resets this flag each iteration and checks it after kernel execution.
|
||
|
||
The two-kernel approach
|
||
=======================
|
||
|
||
The two-kernel structure ensures correctness and efficient synchronization:
|
||
|
||
* **Exploration kernel (Kernel 1)** discovers new nodes.
|
||
* **Update kernel (Kernel 2)** finalizes state for the next iteration.
|
||
|
||
This separation:
|
||
|
||
* Avoids race conditions between threads of different levels.
|
||
|
||
* Provides synchronization between BFS levels.
|
||
|
||
* Keeps control logic simple on the host side.
|
||
|
||
Kernel 1: process current frontier
|
||
----------------------------------
|
||
|
||
Each thread processes one node from the current frontier, examining all of its
|
||
outgoing edges:
|
||
|
||
.. code-block:: c++
|
||
|
||
__global__ void Kernel1(
|
||
Node* g_graph_nodes,
|
||
int* g_graph_edges,
|
||
bool* g_graph_mask,
|
||
bool* g_updating_graph_mask,
|
||
bool* g_graph_visited,
|
||
int* g_graph_cost,
|
||
int no_of_nodes)
|
||
{
|
||
int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;
|
||
|
||
if (tid < no_of_nodes && g_graph_mask[tid]) {
|
||
g_graph_mask[tid] = false;
|
||
|
||
for (int i = g_graph_nodes[tid].starting;
|
||
i < g_graph_nodes[tid].starting + g_graph_nodes[tid].no_of_edges;
|
||
i++) {
|
||
int id = g_graph_edges[i];
|
||
if (!g_graph_visited[id]) {
|
||
g_graph_cost[id] = g_graph_cost[tid] + 1;
|
||
g_updating_graph_mask[id] = true;
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
**Kernel 1 responsibilities:**
|
||
|
||
* Clear the node’s mask (mark processed).
|
||
|
||
* Explore all edges.
|
||
|
||
* For each unvisited neighbor:
|
||
|
||
* Compute cost (distance).
|
||
|
||
* Add to the next frontier.
|
||
|
||
Kernel 2: update frontier
|
||
-------------------------
|
||
|
||
This kernel finalizes the next frontier:
|
||
|
||
.. code-block:: c++
|
||
|
||
__global__ void Kernel2(
|
||
bool* g_graph_mask,
|
||
bool* g_updating_graph_mask,
|
||
bool* g_graph_visited,
|
||
bool* g_over,
|
||
int no_of_nodes)
|
||
{
|
||
int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;
|
||
|
||
if (tid < no_of_nodes && g_updating_graph_mask[tid]) {
|
||
g_graph_mask[tid] = true;
|
||
g_graph_visited[tid] = true;
|
||
*g_over = true;
|
||
g_updating_graph_mask[tid] = false;
|
||
}
|
||
}
|
||
|
||
**Kernel 2 responsibilities:**
|
||
|
||
* Move newly discovered nodes into the active frontier.
|
||
|
||
* Mark them as visited.
|
||
|
||
* Signal continuation via ``*g_over``.
|
||
|
||
Host-side control loop
|
||
======================
|
||
|
||
.. code-block:: c++
|
||
|
||
do {
|
||
h_over = false;
|
||
hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice);
|
||
|
||
Kernel1<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
|
||
d_graph_nodes, d_graph_edges, d_graph_mask,
|
||
d_graph_updating_graph_mask, d_graph_visited,
|
||
d_graph_cost, no_of_nodes);
|
||
hipDeviceSynchronize();
|
||
|
||
Kernel2<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
|
||
d_graph_mask, d_graph_updating_graph_mask,
|
||
d_graph_visited, d_over, no_of_nodes);
|
||
hipDeviceSynchronize();
|
||
|
||
hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost);
|
||
} while (h_over);
|
||
|
||
|
||
The loop exits when no new nodes are discovered. ``g_over`` or ``h_over`` on
|
||
host side remains ``false`` after one full iteration.
|
||
|
||
Performance Characteristics
|
||
===========================
|
||
|
||
Parallelism Patterns
|
||
--------------------
|
||
|
||
**Within each iteration:**
|
||
|
||
- High parallelism: All frontier nodes processed simultaneously
|
||
- Work distribution: One thread per node
|
||
|
||
**Across iterations:**
|
||
|
||
- Sequential: Must complete one level before starting the next
|
||
- Variable parallelism: Different levels may have different numbers of nodes
|
||
|
||
Workload Characteristics
|
||
------------------------
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
:widths: 30 70
|
||
|
||
* - Characteristic
|
||
- Description
|
||
* - **Irregular**
|
||
- Frontier size varies dramatically across levels
|
||
* - **Data-dependent**
|
||
- Graph structure determines parallel work available
|
||
* - **Dynamic**
|
||
- Cannot predict workload statically
|
||
* - **Memory-bound**
|
||
- Many memory accesses per computation
|
||
|
||
Best practices
|
||
==============
|
||
|
||
This section outlines recommended practices for implementing an efficient
|
||
GPU-accelerated Breadth-First Search (BFS). It highlights design principles,
|
||
memory-management strategies, and debugging techniques that help ensure
|
||
correctness, maintainability, and high performance when mapping BFS onto modern
|
||
GPU architectures.
|
||
|
||
Design principles
|
||
-----------------
|
||
|
||
1. **Define clear kernel roles**
|
||
|
||
Decompose BFS into well-defined GPU kernels, each responsible for a specific
|
||
phase of computation. For example:
|
||
|
||
* **Kernel 1**: frontier expansion (discovering new nodes)
|
||
* **Kernel 2**: frontier update (marking next-level nodes)
|
||
|
||
This separation simplifies synchronization and ensures that each kernel
|
||
operates on independent data regions.
|
||
|
||
2. **Minimize host–device communication**
|
||
|
||
Keep graph data structures (nodes, edges, masks) resident on the GPU across
|
||
iterations. Only transfer lightweight control flags such as ``g_over`` to the
|
||
host each loop iteration to check termination conditions.
|
||
|
||
3. **Kernel boundaries as synchronization points**
|
||
|
||
Kernel launch boundaries on the same stream naturally enforce global
|
||
synchronization across all threads on the GPU. Each kernel invocation
|
||
completes before the next begins, ensuring that:
|
||
|
||
* All nodes in the current frontier are fully processed before updating the
|
||
next frontier.
|
||
|
||
* Memory updates to arrays like ``g_graph_cost`` or ``g_graph_mask`` are
|
||
visible to all threads in subsequent kernels.
|
||
|
||
This avoids the need for costly device-wide barriers or explicit
|
||
synchronization primitives within a single kernel. Leverage kernel sequencing
|
||
to structure iterative algorithms cleanly—each kernel represents one
|
||
computation phase per BFS level.
|
||
|
||
4. **Flag-based control**
|
||
|
||
Use device-side flags for dynamic termination and conditional control flow.
|
||
In BFS, the Boolean flag ``g_over`` serves as a device-to-host signal
|
||
indicating whether new nodes were discovered during the current iteration.
|
||
|
||
* Initialize ``g_over`` to ``false`` on the host at the start of each
|
||
iteration.
|
||
|
||
* Allow GPU threads in **Kernel 2** to set ``*g_over = true`` when adding new
|
||
nodes to the next frontier.
|
||
|
||
* After kernel completion, copy the flag back to the host using
|
||
:cpp:func:`hipMemcpy`. If ``g_over`` remains false, the traversal is
|
||
complete.
|
||
|
||
This mechanism avoids repeated host intervention and enables a tight
|
||
CPU–GPU control loop that dynamically adapts to workload size without
|
||
transferring large data structures.
|
||
|
||
Memory strategy
|
||
---------------
|
||
|
||
1. **Persistent device allocations**
|
||
|
||
Allocate all required device buffers once prior to traversal. Reuse these
|
||
allocations across multiple BFS runs or multiple source nodes to minimize
|
||
the overhead of repeated :cpp:func:`hipMalloc` and :cpp:func:`hipFree`
|
||
calls.
|
||
|
||
2. **Minimize host–device communication**
|
||
|
||
Keep graph data structures (nodes, edges, masks) resident on the GPU across
|
||
iterations. Only transfer lightweight control flags such as ``g_over`` to the
|
||
host each loop iteration to check termination conditions.
|
||
|
||
3. **Use pinned host memory for control flags**
|
||
|
||
When copying ``g_over`` or other control signals between host and device,
|
||
allocate host memory using pinned (page-locked) buffers to accelerate DMA
|
||
transfers.
|
||
|
||
Debugging and validation
|
||
------------------------
|
||
|
||
1. **Frontier validation**
|
||
|
||
After each iteration, verify the number of nodes marked in
|
||
``g_graph_mask``. Unexpected empty or overfull frontiers often indicate
|
||
incorrect synchronization or uninitialized masks.
|
||
|
||
2. **Termination condition check**
|
||
|
||
Confirm that the host-side loop terminates when ``g_over`` remains false
|
||
for one iteration. If the loop never ends, ensure ``g_over`` is reset on the
|
||
host before each kernel launch.
|
||
|
||
3. **Result verification**
|
||
|
||
Compare computed distances in ``g_graph_cost`` against a CPU reference
|
||
implementation for small graphs to validate correctness.
|
||
|
||
4. **Profiling and bottleneck detection**
|
||
|
||
Use tools such as :doc:`rocprofv3<rocprofiler-sdk:how-to/using-rocprofv3>`
|
||
or :doc:`ROCm compute profiler<rocprofiler-compute:how-to/profile/mode>`
|
||
to measure per-kernel execution times, memory throughput, and
|
||
synchronization overhead.
|
||
|
||
5. **Logging and debug builds**
|
||
|
||
Enable optional logging for iteration counts, frontier sizes, and
|
||
synchronization states during development. Disable logging in production
|
||
builds to avoid performance impact.
|
||
|
||
Conclusion
|
||
==========
|
||
|
||
Multi-kernel GPU programming is essential for complex algorithms that require:
|
||
|
||
* Multiple phases of computation.
|
||
|
||
* Data dependencies between phases.
|
||
|
||
* Dynamic control flow based on intermediate results.
|
||
|
||
The BFS example demonstrates:
|
||
|
||
* How to decompose algorithms into multiple cooperating kernels.
|
||
|
||
* Techniques for managing frontiers and iterative processing.
|
||
|
||
* Strategies for handling irregular and dynamic parallelism.
|
||
|
||
* Proper synchronization between kernel launches.
|
||
|
||
Key takeaways:
|
||
|
||
1. **Kernel boundaries provide synchronization**: Use them strategically to ensure correctness.
|
||
2. **Separate exploration from update**: Prevents race conditions in level-based algorithms.
|
||
3. **Host controls iteration**: CPU manages the overall loop while GPU does heavy lifting.
|
||
4. **Flags enable dynamic control**: Device-side flags allow work-dependent termination.
|
||
|
||
Understanding multi-kernel patterns enables developers to implement
|
||
sophisticated algorithms like graph processing, dynamic programming, and
|
||
iterative refinement methods efficiently on GPUs.
|