.. meta::
  :description: Multi-kernel programming with breadth-first search (BFS)
  :keywords: AMD, ROCm, HIP, GPU programming, multi-kernel, BFS, breadth-first search

*******************************************************************************
Multi-kernel programming: breadth-first search tutorial
*******************************************************************************

Many real-world GPU workloads involve multiple kernels cooperating to solve a
single problem. This tutorial explores **multi-kernel GPU programming** using
the breadth-first search (BFS) algorithm, a foundational graph traversal
method widely used in networking, path-finding, and social network analysis.

The implementation is adapted from the **Rodinia benchmark suite**, a
well-known collection of heterogeneous computing workloads that demonstrate
different parallel programming strategies.

.. include:: ../prerequisites.rst

Multi-kernel GPU programming
============================

In GPU computing, some algorithms cannot be efficiently expressed using a
single kernel due to synchronization or dependency constraints. Instead, they
are decomposed into multiple kernels that execute sequentially, with each
kernel responsible for a specific computation phase.

This approach, called **multi-kernel programming**, is essential when:

* Results from one kernel determine the input for the next.

* Global synchronization between thread blocks is required.

* Control flow depends on runtime conditions.

* The algorithm involves iterative or level-wise processing.

Breadth-first search (BFS)
==========================

Breadth-first search (BFS) is a **layered graph traversal algorithm** that
explores nodes level by level, starting from a root node. It guarantees finding
the shortest path (in edge count) to all reachable nodes in an unweighted
graph.

Applications of BFS include:

* **Path-finding**: Finding shortest paths between nodes.
* **Peer-to-peer networking**: Network topology discovery.
* **GPS navigation**: Route planning and optimization.
* **Social networks**: Friend recommendations and connection analysis.
* **Web crawling**: Systematic website exploration.

Algorithm characteristics
-------------------------

BFS is structured as a level-synchronous algorithm:

* Nodes in the same graph level are processed concurrently.
* A queue (or "frontier") tracks which nodes to explore next.
* Each node is visited once to prevent redundant processing.

Sequential BFS is straightforward but inherently serial due to the
level-by-level dependency between nodes. GPU parallelization requires
restructuring the traversal to exploit data parallelism across nodes
within the same frontier.

Sequential BFS algorithm
=========================

Let's first understand how BFS works sequentially before parallelizing it.

Example Graph
~~~~~~~~~~~~~

Consider a simple graph with four nodes:

.. code-block:: text

           R (root)
          / \
         A   B
          \ /
           C

Step-by-step execution
~~~~~~~~~~~~~~~~~~~~~~

**Step 1**: Start at the root node ``R``

* Mark ``R`` as visited  
* Enqueue ``R``  
* Queue: [R]

**Step 2**: Process ``R``

* Dequeue ``R``  
* Discover neighbors: ``A`` and ``B``  
* Enqueue both, mark as visited  
* Queue: [A, B]

**Step 3**: Process ``A``

* Dequeue ``A``  
* Neighbors: ``R`` (visited) and ``C`` (new)  
* Enqueue ``C``  
* Queue: [B, C]

**Step 4**: Process ``B``

* Dequeue ``B``  
* Neighbors: ``R`` (visited) and ``C`` (visited)  
* Queue: [C]

**Step 5**: Process ``C``

* Dequeue ``C``  
* All neighbors visited  
* Queue becomes empty — traversal complete

Parallel BFS on GPU
===================

Unlike dense linear algebra, BFS is an **irregular** algorithm. The amount of
work per node varies, and the connectivity pattern of the graph drives
execution. The main challenges are:

1. **Data dependencies**: nodes in the next level depend on the previous level.

2. **Irregular parallelism**: each frontier may contain a very different number of nodes.

3. **Dynamic workload**: the size of the next frontier is unknown at runtime.

4. **Synchronization**: all nodes in one frontier must complete before the next begins.

The **frontier** is the set of nodes being processed at a given BFS level.
Parallel BFS executes all frontier nodes simultaneously, using one thread per
node to discover new neighbors and mark them for the next iteration.

Implementation strategy
-----------------------

The GPU implementation performs BFS using **two cooperating kernels**:

1. **Kernel 1**: processes all nodes in the current frontier.
2. **Kernel 2**: updates the next frontier and checks if work remains.

This design provides **implicit synchronization** between levels while avoiding
race conditions. The host (CPU) manages the iterative control loop, launching
kernels repeatedly until no more frontier nodes exist.

Data structures
===============

The graph is represented using adjacency lists stored in arrays:

.. code-block:: c++

    struct Node {
        int starting;       // starting index in the edge list
        int no_of_edges;    // number of outgoing edges
    };

**Main arrays:**

* ``g_graph_nodes``: node array storing offsets into the edge list.
* ``g_graph_edges``: flattened list of edge destinations.
* ``g_graph_mask``: boolean array indicating active frontier nodes.
* ``g_updating_graph_mask``: marks nodes to be added to the next frontier.
* ``g_graph_visited``: tracks which nodes were visited.
* ``g_graph_cost``: stores the distance (edge count) from the source node.

**Control flow flags:**

* ``g_over``: device-side flag indicating whether another iteration is needed.
* The host resets this flag each iteration and checks it after kernel execution.

The two-kernel approach
=======================

The two-kernel structure ensures correctness and efficient synchronization:

* **Exploration kernel (Kernel 1)** discovers new nodes.
* **Update kernel (Kernel 2)** finalizes state for the next iteration.

This separation:

* Avoids race conditions between threads of different levels.

* Provides synchronization between BFS levels.

* Keeps control logic simple on the host side.

Kernel 1: process current frontier
----------------------------------

Each thread processes one node from the current frontier, examining all of its
outgoing edges:

.. code-block:: c++

    __global__ void Kernel1(
        Node* g_graph_nodes,
        int* g_graph_edges,
        bool* g_graph_mask,
        bool* g_updating_graph_mask,
        bool* g_graph_visited,
        int* g_graph_cost,
        int no_of_nodes)
    {
        int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;

        if (tid < no_of_nodes && g_graph_mask[tid]) {
            g_graph_mask[tid] = false;

            for (int i = g_graph_nodes[tid].starting;
                 i < g_graph_nodes[tid].starting + g_graph_nodes[tid].no_of_edges;
                 i++) {
                int id = g_graph_edges[i];
                if (!g_graph_visited[id]) {
                    g_graph_cost[id] = g_graph_cost[tid] + 1;
                    g_updating_graph_mask[id] = true;
                }
            }
        }
    }

**Kernel 1 responsibilities:**

* Clear the node’s mask (mark processed).

* Explore all edges.

* For each unvisited neighbor:

  * Compute cost (distance).

  * Add to the next frontier.

Kernel 2: update frontier
-------------------------

This kernel finalizes the next frontier:

.. code-block:: c++

    __global__ void Kernel2(
        bool* g_graph_mask,
        bool* g_updating_graph_mask,
        bool* g_graph_visited,
        bool* g_over,
        int no_of_nodes)
    {
        int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x;

        if (tid < no_of_nodes && g_updating_graph_mask[tid]) {
            g_graph_mask[tid] = true;
            g_graph_visited[tid] = true;
            *g_over = true;
            g_updating_graph_mask[tid] = false;
        }
    }

**Kernel 2 responsibilities:**

* Move newly discovered nodes into the active frontier.

* Mark them as visited.

* Signal continuation via ``*g_over``.

Host-side control loop
======================

.. code-block:: c++

    do {
        h_over = false;
        hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice);

        Kernel1<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
            d_graph_nodes, d_graph_edges, d_graph_mask,
            d_graph_updating_graph_mask, d_graph_visited,
            d_graph_cost, no_of_nodes);
        hipDeviceSynchronize();

        Kernel2<<<num_blocks, MAX_THREADS_PER_BLOCK>>>(
            d_graph_mask, d_graph_updating_graph_mask,
            d_graph_visited, d_over, no_of_nodes);
        hipDeviceSynchronize();

        hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost);
    } while (h_over);


The loop exits when no new nodes are discovered. ``g_over`` or ``h_over`` on
host side remains ``false`` after one full iteration.

Performance Characteristics
===========================

Parallelism Patterns
--------------------

**Within each iteration:**

- High parallelism: All frontier nodes processed simultaneously
- Work distribution: One thread per node

**Across iterations:**

- Sequential: Must complete one level before starting the next
- Variable parallelism: Different levels may have different numbers of nodes

Workload Characteristics
------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Characteristic
     - Description
   * - **Irregular**
     - Frontier size varies dramatically across levels
   * - **Data-dependent**
     - Graph structure determines parallel work available
   * - **Dynamic**
     - Cannot predict workload statically
   * - **Memory-bound**
     - Many memory accesses per computation

Best practices
==============

This section outlines recommended practices for implementing an efficient
GPU-accelerated Breadth-First Search (BFS). It highlights design principles,
memory-management strategies, and debugging techniques that help ensure
correctness, maintainability, and high performance when mapping BFS onto modern
GPU architectures.

Design principles
-----------------

1. **Define clear kernel roles**

   Decompose BFS into well-defined GPU kernels, each responsible for a specific
   phase of computation. For example:

   * **Kernel 1**: frontier expansion (discovering new nodes)
   * **Kernel 2**: frontier update (marking next-level nodes)

   This separation simplifies synchronization and ensures that each kernel
   operates on independent data regions.

2. **Minimize host–device communication**

   Keep graph data structures (nodes, edges, masks) resident on the GPU across
   iterations. Only transfer lightweight control flags such as ``g_over`` to the
   host each loop iteration to check termination conditions.

3. **Kernel boundaries as synchronization points**

   Kernel launch boundaries on the same stream naturally enforce global
   synchronization across all threads on the GPU. Each kernel invocation
   completes before the next begins, ensuring that:

   * All nodes in the current frontier are fully processed before updating the
     next frontier.

   * Memory updates to arrays like ``g_graph_cost`` or ``g_graph_mask`` are
     visible to all threads in subsequent kernels.

   This avoids the need for costly device-wide barriers or explicit
   synchronization primitives within a single kernel. Leverage kernel sequencing
   to structure iterative algorithms cleanly—each kernel represents one
   computation phase per BFS level.

4. **Flag-based control**

   Use device-side flags for dynamic termination and conditional control flow.
   In BFS, the Boolean flag ``g_over`` serves as a device-to-host signal
   indicating whether new nodes were discovered during the current iteration.

   * Initialize ``g_over`` to ``false`` on the host at the start of each
     iteration.

   * Allow GPU threads in **Kernel 2** to set ``*g_over = true`` when adding new
     nodes to the next frontier.

   * After kernel completion, copy the flag back to the host using
     :cpp:func:`hipMemcpy`. If ``g_over`` remains false, the traversal is
     complete.

   This mechanism avoids repeated host intervention and enables a tight
   CPU–GPU control loop that dynamically adapts to workload size without
   transferring large data structures.

Memory strategy
---------------

1. **Persistent device allocations**

   Allocate all required device buffers once prior to traversal. Reuse these
   allocations across multiple BFS runs or multiple source nodes to minimize
   the overhead of repeated :cpp:func:`hipMalloc` and :cpp:func:`hipFree`
   calls.

2. **Minimize host–device communication**

   Keep graph data structures (nodes, edges, masks) resident on the GPU across
   iterations. Only transfer lightweight control flags such as ``g_over`` to the
   host each loop iteration to check termination conditions.

3. **Use pinned host memory for control flags**

   When copying ``g_over`` or other control signals between host and device,
   allocate host memory using pinned (page-locked) buffers to accelerate DMA
   transfers.

Debugging and validation
------------------------

1. **Frontier validation**

   After each iteration, verify the number of nodes marked in
   ``g_graph_mask``. Unexpected empty or overfull frontiers often indicate
   incorrect synchronization or uninitialized masks.

2. **Termination condition check**

   Confirm that the host-side loop terminates when ``g_over`` remains false
   for one iteration. If the loop never ends, ensure ``g_over`` is reset on the
   host before each kernel launch.

3. **Result verification**

   Compare computed distances in ``g_graph_cost`` against a CPU reference
   implementation for small graphs to validate correctness.

4. **Profiling and bottleneck detection**

   Use tools such as :doc:`rocprofv3<rocprofiler-sdk:how-to/using-rocprofv3>`
   or :doc:`ROCm compute profiler<rocprofiler-compute:how-to/profile/mode>`
   to measure per-kernel execution times, memory throughput, and
   synchronization overhead.

5. **Logging and debug builds**

   Enable optional logging for iteration counts, frontier sizes, and
   synchronization states during development. Disable logging in production
   builds to avoid performance impact.

Conclusion
==========

Multi-kernel GPU programming is essential for complex algorithms that require:

* Multiple phases of computation.

* Data dependencies between phases.

* Dynamic control flow based on intermediate results.

The BFS example demonstrates:

* How to decompose algorithms into multiple cooperating kernels.

* Techniques for managing frontiers and iterative processing.

* Strategies for handling irregular and dynamic parallelism.

* Proper synchronization between kernel launches.

Key takeaways:

1. **Kernel boundaries provide synchronization**: Use them strategically to ensure correctness.
2. **Separate exploration from update**: Prevents race conditions in level-based algorithms.
3. **Host controls iteration**: CPU manages the overall loop while GPU does heavy lifting.
4. **Flags enable dynamic control**: Device-side flags allow work-dependent termination.

Understanding multi-kernel patterns enables developers to implement 
sophisticated algorithms like graph processing, dynamic programming, and 
iterative refinement methods efficiently on GPUs.