.. meta:: :description: Multi-kernel programming with breadth-first search (BFS) :keywords: AMD, ROCm, HIP, GPU programming, multi-kernel, BFS, breadth-first search ******************************************************************************* Multi-kernel programming: breadth-first search tutorial ******************************************************************************* Many real-world GPU workloads involve multiple kernels cooperating to solve a single problem. This tutorial explores **multi-kernel GPU programming** using the breadth-first search (BFS) algorithm, a foundational graph traversal method widely used in networking, path-finding, and social network analysis. The implementation is adapted from the **Rodinia benchmark suite**, a well-known collection of heterogeneous computing workloads that demonstrate different parallel programming strategies. .. include:: ../prerequisites.rst Multi-kernel GPU programming ============================ In GPU computing, some algorithms cannot be efficiently expressed using a single kernel due to synchronization or dependency constraints. Instead, they are decomposed into multiple kernels that execute sequentially, with each kernel responsible for a specific computation phase. This approach, called **multi-kernel programming**, is essential when: * Results from one kernel determine the input for the next. * Global synchronization between thread blocks is required. * Control flow depends on runtime conditions. * The algorithm involves iterative or level-wise processing. Breadth-first search (BFS) ========================== Breadth-first search (BFS) is a **layered graph traversal algorithm** that explores nodes level by level, starting from a root node. It guarantees finding the shortest path (in edge count) to all reachable nodes in an unweighted graph. Applications of BFS include: * **Path-finding**: Finding shortest paths between nodes. * **Peer-to-peer networking**: Network topology discovery. * **GPS navigation**: Route planning and optimization. * **Social networks**: Friend recommendations and connection analysis. * **Web crawling**: Systematic website exploration. Algorithm characteristics ------------------------- BFS is structured as a level-synchronous algorithm: * Nodes in the same graph level are processed concurrently. * A queue (or "frontier") tracks which nodes to explore next. * Each node is visited once to prevent redundant processing. Sequential BFS is straightforward but inherently serial due to the level-by-level dependency between nodes. GPU parallelization requires restructuring the traversal to exploit data parallelism across nodes within the same frontier. Sequential BFS algorithm ========================= Let's first understand how BFS works sequentially before parallelizing it. Example Graph ~~~~~~~~~~~~~ Consider a simple graph with four nodes: .. code-block:: text R (root) / \ A B \ / C Step-by-step execution ~~~~~~~~~~~~~~~~~~~~~~ **Step 1**: Start at the root node ``R`` * Mark ``R`` as visited * Enqueue ``R`` * Queue: [R] **Step 2**: Process ``R`` * Dequeue ``R`` * Discover neighbors: ``A`` and ``B`` * Enqueue both, mark as visited * Queue: [A, B] **Step 3**: Process ``A`` * Dequeue ``A`` * Neighbors: ``R`` (visited) and ``C`` (new) * Enqueue ``C`` * Queue: [B, C] **Step 4**: Process ``B`` * Dequeue ``B`` * Neighbors: ``R`` (visited) and ``C`` (visited) * Queue: [C] **Step 5**: Process ``C`` * Dequeue ``C`` * All neighbors visited * Queue becomes empty — traversal complete Parallel BFS on GPU =================== Unlike dense linear algebra, BFS is an **irregular** algorithm. The amount of work per node varies, and the connectivity pattern of the graph drives execution. The main challenges are: 1. **Data dependencies**: nodes in the next level depend on the previous level. 2. **Irregular parallelism**: each frontier may contain a very different number of nodes. 3. **Dynamic workload**: the size of the next frontier is unknown at runtime. 4. **Synchronization**: all nodes in one frontier must complete before the next begins. The **frontier** is the set of nodes being processed at a given BFS level. Parallel BFS executes all frontier nodes simultaneously, using one thread per node to discover new neighbors and mark them for the next iteration. Implementation strategy ----------------------- The GPU implementation performs BFS using **two cooperating kernels**: 1. **Kernel 1**: processes all nodes in the current frontier. 2. **Kernel 2**: updates the next frontier and checks if work remains. This design provides **implicit synchronization** between levels while avoiding race conditions. The host (CPU) manages the iterative control loop, launching kernels repeatedly until no more frontier nodes exist. Data structures =============== The graph is represented using adjacency lists stored in arrays: .. code-block:: c++ struct Node { int starting; // starting index in the edge list int no_of_edges; // number of outgoing edges }; **Main arrays:** * ``g_graph_nodes``: node array storing offsets into the edge list. * ``g_graph_edges``: flattened list of edge destinations. * ``g_graph_mask``: boolean array indicating active frontier nodes. * ``g_updating_graph_mask``: marks nodes to be added to the next frontier. * ``g_graph_visited``: tracks which nodes were visited. * ``g_graph_cost``: stores the distance (edge count) from the source node. **Control flow flags:** * ``g_over``: device-side flag indicating whether another iteration is needed. * The host resets this flag each iteration and checks it after kernel execution. The two-kernel approach ======================= The two-kernel structure ensures correctness and efficient synchronization: * **Exploration kernel (Kernel 1)** discovers new nodes. * **Update kernel (Kernel 2)** finalizes state for the next iteration. This separation: * Avoids race conditions between threads of different levels. * Provides synchronization between BFS levels. * Keeps control logic simple on the host side. Kernel 1: process current frontier ---------------------------------- Each thread processes one node from the current frontier, examining all of its outgoing edges: .. code-block:: c++ __global__ void Kernel1( Node* g_graph_nodes, int* g_graph_edges, bool* g_graph_mask, bool* g_updating_graph_mask, bool* g_graph_visited, int* g_graph_cost, int no_of_nodes) { int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x; if (tid < no_of_nodes && g_graph_mask[tid]) { g_graph_mask[tid] = false; for (int i = g_graph_nodes[tid].starting; i < g_graph_nodes[tid].starting + g_graph_nodes[tid].no_of_edges; i++) { int id = g_graph_edges[i]; if (!g_graph_visited[id]) { g_graph_cost[id] = g_graph_cost[tid] + 1; g_updating_graph_mask[id] = true; } } } } **Kernel 1 responsibilities:** * Clear the node’s mask (mark processed). * Explore all edges. * For each unvisited neighbor: * Compute cost (distance). * Add to the next frontier. Kernel 2: update frontier ------------------------- This kernel finalizes the next frontier: .. code-block:: c++ __global__ void Kernel2( bool* g_graph_mask, bool* g_updating_graph_mask, bool* g_graph_visited, bool* g_over, int no_of_nodes) { int tid = hipBlockIdx_x * MAX_THREADS_PER_BLOCK + hipThreadIdx_x; if (tid < no_of_nodes && g_updating_graph_mask[tid]) { g_graph_mask[tid] = true; g_graph_visited[tid] = true; *g_over = true; g_updating_graph_mask[tid] = false; } } **Kernel 2 responsibilities:** * Move newly discovered nodes into the active frontier. * Mark them as visited. * Signal continuation via ``*g_over``. Host-side control loop ====================== .. code-block:: c++ do { h_over = false; hipMemcpy(d_over, &h_over, sizeof(bool), hipMemcpyHostToDevice); Kernel1<<>>( d_graph_nodes, d_graph_edges, d_graph_mask, d_graph_updating_graph_mask, d_graph_visited, d_graph_cost, no_of_nodes); hipDeviceSynchronize(); Kernel2<<>>( d_graph_mask, d_graph_updating_graph_mask, d_graph_visited, d_over, no_of_nodes); hipDeviceSynchronize(); hipMemcpy(&h_over, d_over, sizeof(bool), hipMemcpyDeviceToHost); } while (h_over); The loop exits when no new nodes are discovered. ``g_over`` or ``h_over`` on host side remains ``false`` after one full iteration. Performance Characteristics =========================== Parallelism Patterns -------------------- **Within each iteration:** - High parallelism: All frontier nodes processed simultaneously - Work distribution: One thread per node **Across iterations:** - Sequential: Must complete one level before starting the next - Variable parallelism: Different levels may have different numbers of nodes Workload Characteristics ------------------------ .. list-table:: :header-rows: 1 :widths: 30 70 * - Characteristic - Description * - **Irregular** - Frontier size varies dramatically across levels * - **Data-dependent** - Graph structure determines parallel work available * - **Dynamic** - Cannot predict workload statically * - **Memory-bound** - Many memory accesses per computation Best practices ============== This section outlines recommended practices for implementing an efficient GPU-accelerated Breadth-First Search (BFS). It highlights design principles, memory-management strategies, and debugging techniques that help ensure correctness, maintainability, and high performance when mapping BFS onto modern GPU architectures. Design principles ----------------- 1. **Define clear kernel roles** Decompose BFS into well-defined GPU kernels, each responsible for a specific phase of computation. For example: * **Kernel 1**: frontier expansion (discovering new nodes) * **Kernel 2**: frontier update (marking next-level nodes) This separation simplifies synchronization and ensures that each kernel operates on independent data regions. 2. **Minimize host–device communication** Keep graph data structures (nodes, edges, masks) resident on the GPU across iterations. Only transfer lightweight control flags such as ``g_over`` to the host each loop iteration to check termination conditions. 3. **Kernel boundaries as synchronization points** Kernel launch boundaries on the same stream naturally enforce global synchronization across all threads on the GPU. Each kernel invocation completes before the next begins, ensuring that: * All nodes in the current frontier are fully processed before updating the next frontier. * Memory updates to arrays like ``g_graph_cost`` or ``g_graph_mask`` are visible to all threads in subsequent kernels. This avoids the need for costly device-wide barriers or explicit synchronization primitives within a single kernel. Leverage kernel sequencing to structure iterative algorithms cleanly—each kernel represents one computation phase per BFS level. 4. **Flag-based control** Use device-side flags for dynamic termination and conditional control flow. In BFS, the Boolean flag ``g_over`` serves as a device-to-host signal indicating whether new nodes were discovered during the current iteration. * Initialize ``g_over`` to ``false`` on the host at the start of each iteration. * Allow GPU threads in **Kernel 2** to set ``*g_over = true`` when adding new nodes to the next frontier. * After kernel completion, copy the flag back to the host using :cpp:func:`hipMemcpy`. If ``g_over`` remains false, the traversal is complete. This mechanism avoids repeated host intervention and enables a tight CPU–GPU control loop that dynamically adapts to workload size without transferring large data structures. Memory strategy --------------- 1. **Persistent device allocations** Allocate all required device buffers once prior to traversal. Reuse these allocations across multiple BFS runs or multiple source nodes to minimize the overhead of repeated :cpp:func:`hipMalloc` and :cpp:func:`hipFree` calls. 2. **Minimize host–device communication** Keep graph data structures (nodes, edges, masks) resident on the GPU across iterations. Only transfer lightweight control flags such as ``g_over`` to the host each loop iteration to check termination conditions. 3. **Use pinned host memory for control flags** When copying ``g_over`` or other control signals between host and device, allocate host memory using pinned (page-locked) buffers to accelerate DMA transfers. Debugging and validation ------------------------ 1. **Frontier validation** After each iteration, verify the number of nodes marked in ``g_graph_mask``. Unexpected empty or overfull frontiers often indicate incorrect synchronization or uninitialized masks. 2. **Termination condition check** Confirm that the host-side loop terminates when ``g_over`` remains false for one iteration. If the loop never ends, ensure ``g_over`` is reset on the host before each kernel launch. 3. **Result verification** Compare computed distances in ``g_graph_cost`` against a CPU reference implementation for small graphs to validate correctness. 4. **Profiling and bottleneck detection** Use tools such as :doc:`rocprofv3` or :doc:`ROCm compute profiler` to measure per-kernel execution times, memory throughput, and synchronization overhead. 5. **Logging and debug builds** Enable optional logging for iteration counts, frontier sizes, and synchronization states during development. Disable logging in production builds to avoid performance impact. Conclusion ========== Multi-kernel GPU programming is essential for complex algorithms that require: * Multiple phases of computation. * Data dependencies between phases. * Dynamic control flow based on intermediate results. The BFS example demonstrates: * How to decompose algorithms into multiple cooperating kernels. * Techniques for managing frontiers and iterative processing. * Strategies for handling irregular and dynamic parallelism. * Proper synchronization between kernel launches. Key takeaways: 1. **Kernel boundaries provide synchronization**: Use them strategically to ensure correctness. 2. **Separate exploration from update**: Prevents race conditions in level-based algorithms. 3. **Host controls iteration**: CPU manages the overall loop while GPU does heavy lifting. 4. **Flags enable dynamic control**: Device-side flags allow work-dependent termination. Understanding multi-kernel patterns enables developers to implement sophisticated algorithms like graph processing, dynamic programming, and iterative refinement methods efficiently on GPUs.