
.. meta::
  :description: HIP graph API tutorial
  :keywords: AMD, ROCm, HIP, graph API, tutorial

.. _hip_graph_api_tutorial:

*******************************************************************************
HIP Graph API Tutorial
*******************************************************************************

**Time to complete**: 60 minutes | **Difficulty**: Intermediate | **Domain**: Medical Imaging

Introduction
============

Imagine you are directing a movie. In traditional GPU programming with streams, you are like a director who must call
"action!" for every single shot, waiting between each take. With HIP graphs, you pre-plan the entire scene sequence and
then call "action!" just once to film everything in one go. This tutorial will show you how to transform your GPU
applications from repeated direction to choreographed performance.

Modeling dependencies between GPU operations
--------------------------------------------

Most movies in the world follow a plot where certain scenes must happen before the following scenes; otherwise the
movie might not make much sense. If a scene *A* must happen before scenes *B* and *C*, *B* and *C* depend on *A*. If
*B* and *C* contain different stories that (at this point) are unrelated to each other, *B* and *C* are independent and
can be shown to the audience in any order. However, both scenes might be a prerequisite for the final scene *D*, so *D*
depends on both of them. When you represent scenes as *nodes* and dependencies as *edges*, you can create a graph, and
the graph representing your imaginary movie script will have a diamond-like shape:

.. figure:: ../data/tutorial/graph_api/diamond.svg
  :alt: Diagram showing a graph with diamond-like shape. Nodes represent movie scenes and edges represent dependencies
        between scenes.
  :align: center
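
The diamond can also be written down as a list of edges. The following sketch is plain C++ and does not use the HIP
API; the function name ``topologicalOrder`` and the numbering of the scenes (*A* = 0, *B* = 1, *C* = 2, *D* = 3) are
assumptions made for illustration. It derives one valid order with Kahn's algorithm, which repeatedly picks a node
whose dependencies are all fulfilled:

.. code-block:: cpp

  #include <cassert>
  #include <queue>
  #include <utility>
  #include <vector>

  // Returns one valid execution order for a dependency graph with numNodes nodes.
  // Each edge (from, to) means "from must finish before to starts".
  // If the graph contains a cycle, the result has fewer than numNodes entries.
  std::vector<int> topologicalOrder(int numNodes, const std::vector<std::pair<int, int>>& edges)
  {
      std::vector<std::vector<int>> successors(numNodes);
      std::vector<int> pendingDeps(numNodes, 0);
      for (const auto& edge : edges)
      {
          assert(edge.first >= 0 && edge.first < numNodes);
          assert(edge.second >= 0 && edge.second < numNodes);
          successors[edge.first].push_back(edge.second);
          ++pendingDeps[edge.second];
      }

      // Nodes without dependencies can start right away.
      std::queue<int> ready;
      for (int node = 0; node < numNodes; ++node)
          if (pendingDeps[node] == 0) ready.push(node);

      std::vector<int> order;
      while (!ready.empty())
      {
          const int node = ready.front();
          ready.pop();
          order.push_back(node);
          // Finishing this node may unblock its successors.
          for (int next : successors[node])
              if (--pendingDeps[next] == 0) ready.push(next);
      }
      return order;
  }

For the diamond, ``topologicalOrder(4, {{0, 1}, {0, 2}, {1, 3}, {2, 3}})`` starts with *A* and ends with *D*. *B* and
*C* could appear in either order without violating a dependency.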

You can think about GPU operations in a similar way. For example, most kernels require at least one data buffer to work
with, so they will depend on a preceding copy or ``memset`` operation. Others might process the results of preceding
kernels. Real-world applications typically involve multiple GPU operations with dependencies between them. HIP offers
two ways to think about and model these dependencies: streams and graphs.

Streams
^^^^^^^

Streams are HIP's default model for organizing and launching GPU operations on the device. They are sequential sets of
operations, similar to CPU threads. Adding operation *A* before operation *B* to a stream ensures *A* happens before
*B*, regardless of any interdependencies (or lack thereof) between them. A stream can be thought of as a first-in,
first-out (FIFO) queue of operations.

Multiple streams operate independently, and manual synchronization is required when dependencies cross stream
boundaries. Additionally, each operation in a stream is scheduled independently. Depending on the complexity of the
enqueued operation, this can lead to noticeable CPU launch overhead and kernel dispatch latency, especially for
workloads with many small kernels. On the other hand, streams are well suited to workloads that are dynamic and
unpredictable.

For more information about HIP streams, see :ref:`asynchronous_how-to`.

Graphs
^^^^^^

HIP graphs model operations and their dependencies as a directed acyclic graph. Each node in the graph represents an
operation, and each edge represents a dependency between two nodes. If two nodes are not connected by a path of edges,
they are independent and can execute in any order.

Because dependency information is built into the graph, the HIP runtime automatically inserts the necessary
synchronization points. Launching all operations in a graph requires only a single API call, so launch overhead and
dispatch latency are paid once per graph instead of once per operation. This is especially beneficial for workloads
with many small kernels, where launch overhead can dominate overall execution time.

Graphs must be defined once before use, making them ideal for fixed workflows that run repeatedly. While node
parameters can be updated between executions, the graph structure itself cannot change after instantiation. This
structural immutability is the primary trade-off compared to the flexibility of streams.

For more information about HIP graphs, see :ref:`how_to_HIP_graph`.

When to use graphs
^^^^^^^^^^^^^^^^^^

This table shows when to use graphs in your application.

.. list-table::
  :header-rows: 1
  :class: decision-matrix

  * - ✅ **Use Graphs When**
    - ❌ **Avoid Graphs When**
  * - Workflow is fixed and repetitive
    - Workflow changes dynamically
  * - Same kernels execute many times
    - One-shot operations
  * - Launch overhead is significant (many small kernels)
    - Kernels are long-running

Transitioning a CT reconstruction pipeline
------------------------------------------

In this tutorial, you will modify an existing GPU-accelerated, stream-based image processing pipeline that
reconstructs computed tomography (CT) data of the classic Shepp-Logan phantom [ShLo74]_. The pipeline transforms raw
X-ray projections into clear cross-sectional images used in medical diagnosis.

.. figure:: ../data/tutorial/graph_api/ct_reconstruction_overview.png
  :alt: Diagram showing raw projection data being transformed into a reconstructed CT slice
  :align: center

.. note::
  The tutorial application first generates a phantom volume and its forward projections. This GPU-accelerated step
  uses multiple streams and appears in the traces. You can ignore the dataset generation because it is not relevant to
  this tutorial.

The reconstruction pipeline consists of:

1. **Load** projection data into GPU memory
2. **Preprocess** the projection through six stages:

  a. Logarithmic transformation (convert X-ray intensities)
  b. Pixel weighting (correct for cone-beam geometry)
  c. Forward FFT (transform to frequency domain)
  d. Shepp-Logan filtering (enhance edges and improve contrast)
  e. Inverse FFT (return to spatial domain)
  f. Normalization (account for unnormalized FFT)

3. **Reconstruct** the 3D volume using the Feldkamp-Davis-Kress (FDK) algorithm [FeDK84]_

**Why HIP graphs?** CT scanners process hundreds of projections per scan. By capturing this fixed workflow as a graph,
you reduce the number of API calls required to launch the workflow on a GPU to one per projection, which greatly
reduces launch overhead and dispatch latency.

What you will learn
-------------------

After completing this tutorial, you will be able to:

* Convert a stream-based HIP application to a graph-based application via stream capturing
* Create graphs manually for fine-grained control
* Integrate graph-safe libraries like hipFFT into your graphs
* Understand when graphs provide performance benefits
* Apply graph concepts to your own workflows

Before you begin
----------------

Required knowledge
^^^^^^^^^^^^^^^^^^

You should be comfortable writing and debugging HIP kernels, understand basic GPU memory management concepts like
device allocation and host-to-device transfers, be familiar with HIP streams and events, and have experience using
CMake to build C++ projects. This tutorial assumes you have written at least a few HIP programs before and understand
concepts like grid dimensions and thread blocks.

Hardware and software requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Your system needs ROCm 6.2 or later with the hipFFT library installed. The tutorial works on
all :doc:`supported AMD GPUs <rocm-install-on-linux:reference/system-requirements>`, though at least 4 GiB of GPU
memory are recommended for comfortable performance with the reconstruction workload. You will also need
`git <https://git-scm.com/>`__ to check out the code repository, `CMake <https://www.cmake.org>`__ 3.21 or later to
build the code, along with a CMake generator that supports the HIP language such as GNU Make or Ninja.

.. note::
  Visual Studio generators currently do not support HIP. The (optional) ``rocprofv3`` tool is currently supported on
  Linux only.

To save the output volume, you need a recent version of `libTIFF <https://libtiff.gitlab.io/libtiff/>`__. If CMake
cannot find libTIFF on your system, it automatically downloads and builds it.

To view both the input projections and the output volume produced by this tutorial, install a scientific image viewer
that can display 16-bit and 32-bit grayscale data, such as `Fiji <https://imagej.net/software/fiji/downloads>`__.
Standard image viewers may be unable to correctly display the output.

Optional knowledge
^^^^^^^^^^^^^^^^^^

While not required, familiarity with Fast Fourier Transform (FFT) operations will help you understand the filtering
steps. Similarly, knowledge of medical imaging or CT reconstruction is helpful for understanding the application
context. If you have worked with signal processing or image filtering before, you will recognize some of the applied
concepts.

.. note::
  You can skip the reconstruction algorithm and concentrate on the stream and graph implementations in the files
  prefixed with ``main_``.

Step 1: Build the tutorial code
===============================

The full code for this tutorial is part of the `ROCm examples repository <https://github.com/ROCm/rocm-examples>`__.
Check out the repository:

.. code-block:: bash

  git clone https://github.com/ROCm/rocm-examples.git

Then navigate to ``rocm-examples/HIP-Doc/Tutorials/graph_api/``. The code can be found in the ``src`` subdirectory.

Create a separate ``build`` directory inside ``rocm-examples/HIP-Doc/Tutorials/graph_api/``. Then
configure the project (adjust ``CMAKE_HIP_ARCHITECTURES`` to match your GPU):

.. code-block:: bash

  mkdir -p build
  cd build
  cmake -DCMAKE_PREFIX_PATH=/opt/rocm \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_HIP_ARCHITECTURES=gfx1100 \
        -DCMAKE_HIP_PLATFORM=amd \
        -DCMAKE_CXX_COMPILER=amdclang++ \
        -DCMAKE_C_COMPILER=amdclang \
        -DCMAKE_HIP_COMPILER=amdclang++ \
        ..

Now you can build the three variants of the tutorial code:

.. code-block:: bash

  cmake --build . --target hip_graph_api_tutorial_streams hip_graph_api_tutorial_graph_capture hip_graph_api_tutorial_graph_creation

.. note::
  The ``graph_capture`` variant is currently not supported on Windows and the build target is therefore unavailable.

Step 2: Examining the stream-based baseline application
=======================================================

Open ``src/main_streams.hip`` in your editor. You will explore how this application processes data.

Understanding batched processing
--------------------------------

The application processes multiple projections simultaneously to maximize GPU utilization.

Determining parallel capacity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At the beginning of ``main()``, the program queries the GPU's number of asynchronous engines. This number indicates how
many data transfer or compute operations can run in parallel, and the program uses it to decide how many streams to
create.

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-async-engine-start]
  :end-before: // [sphinx-async-engine-end]
  :language: cuda
  :dedent:

.. tip::
  Each asynchronous engine executes operations independently. More engines mean more parallelism.


Processing projections in batches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the ``MAIN LOOP`` comment. Here the application groups projections into parallel batches:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-batch-start]
  :end-before: // [sphinx-batch-end]
  :language: cuda
  :dedent:

Notice how each batch size equals the stream count — this ensures every stream stays busy.
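
The batching scheme can be sketched in plain C++ as follows. This is not the tutorial's code: ``Batch`` and
``makeBatches`` are hypothetical names, and the sketch only computes which projections belong to which batch.

.. code-block:: cpp

  #include <algorithm>
  #include <cassert>
  #include <cstddef>
  #include <vector>

  // A contiguous range of projections that is processed together.
  struct Batch
  {
      std::size_t first; // index of the first projection in the batch
      std::size_t count; // number of projections in the batch
  };

  // Splits numProjections into batches of at most streamCount projections.
  // Only the last batch can be smaller than streamCount.
  std::vector<Batch> makeBatches(std::size_t numProjections, std::size_t streamCount)
  {
      assert(streamCount > 0);
      std::vector<Batch> batches;
      for (std::size_t first = 0; first < numProjections; first += streamCount)
      {
          batches.push_back({first, std::min(streamCount, numProjections - first)});
      }
      return batches;
  }

With 10 projections and 4 streams, ``makeBatches(10, 4)`` yields batches of 4, 4, and 2 projections. The smaller last
batch appears whenever the number of projections is not a multiple of the stream count.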

Synchronization
^^^^^^^^^^^^^^^

Each projection is processed independently, so you only need to synchronize once at the end. The
:cpp:func:`hipStreamWaitEvent` function makes the first stream wait for all other streams to complete.

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-sync-start]
  :end-before: // [sphinx-sync-end]
  :language: cuda
  :dedent:

Exploring the processing pipeline
---------------------------------

Next, examine what happens to each projection. Find the ``START HERE`` comment to see the reconstruction pipeline's
first steps:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-preprocessing-start]
  :end-before: // [sphinx-preprocessing-end]
  :language: cuda
  :dedent:

This is a typical pattern found across many HIP applications: multiple kernels executing in sequence with data
dependencies. In the next step, the weighted projections need to be transformed into Fourier space and filtered. A 1D
FFT typically performs best when the row length is a power of two. Therefore, copy the weighted projection to another
buffer whose row length is the smallest power of two that is equal to or larger than the projection's row length:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-proj-to-expanded-start]
  :end-before: // [sphinx-proj-to-expanded-end]
  :language: cuda
  :dedent:
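
The size computation behind this step can be sketched on the host in plain C++. The names ``nextPowerOfTwo`` and
``expandRow`` are hypothetical, and filling the extra elements with zeros is an assumption of this sketch rather than
something taken from the tutorial's kernel:

.. code-block:: cpp

  #include <algorithm>
  #include <cassert>
  #include <cstddef>
  #include <limits>
  #include <vector>

  // Returns the smallest power of two that is greater than or equal to n (and 1 for n == 0).
  std::size_t nextPowerOfTwo(std::size_t n)
  {
      // Guard against overflow: the result must be representable in std::size_t.
      assert(n <= std::numeric_limits<std::size_t>::max() / 2 + 1);
      std::size_t result = 1;
      while (result < n) result *= 2;
      return result;
  }

  // Copies one projection row into a row whose length is a power of two.
  // Assumption: the additional elements are set to zero.
  std::vector<float> expandRow(const std::vector<float>& row)
  {
      std::vector<float> expanded(nextPowerOfTwo(row.size()), 0.0f);
      std::copy(row.begin(), row.end(), expanded.begin());
      return expanded;
  }

A row of 512 pixels keeps its length, whereas a row of 513 pixels would be expanded to 1024 elements.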

Next, transform the expanded projection into Fourier space for filtering:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-forward-start]
  :end-before: // [sphinx-forward-end]
  :language: cuda
  :dedent:

.. tip::
  Some hipFFT operations are graph-safe: as long as they are issued on the capturing stream, they are captured into the
  graph as well. Refer to :ref:`hipFFT's documentation <hipfft:hipfft-api-usage>` for more information on its
  graph-safe operations.

In Fourier space, apply the Shepp-Logan filter, then transform back:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-filter-start]
  :end-before: // [sphinx-filter-end]
  :language: cuda
  :dedent:

Shrink to original size and normalize the FFT output:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-expanded-to-proj-start]
  :end-before: // [sphinx-expanded-to-proj-end]
  :language: cuda
  :dedent:
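
To see why the normalization is needed, consider a forward transform followed by an inverse transform, both without a
scaling factor. The following plain C++ sketch uses a naive discrete Fourier transform instead of hipFFT; the names
``Signal``, ``unnormalizedDft``, and ``roundTrip`` are hypothetical. The round trip returns the input multiplied by the
transform length :math:`N`, so the pipeline has to divide by :math:`N` afterwards:

.. code-block:: cpp

  #include <cassert>
  #include <cmath>
  #include <complex>
  #include <cstddef>
  #include <vector>

  using Signal = std::vector<std::complex<double>>;

  // Naive O(N^2) discrete Fourier transform without any scaling factor.
  // sign = -1 computes the forward transform, sign = +1 the inverse transform.
  Signal unnormalizedDft(const Signal& input, int sign)
  {
      assert(sign == 1 || sign == -1);
      const double pi = std::acos(-1.0);
      const std::size_t n = input.size();
      Signal output(n);
      for (std::size_t k = 0; k < n; ++k)
      {
          std::complex<double> sum{0.0, 0.0};
          for (std::size_t j = 0; j < n; ++j)
          {
              const double angle = sign * 2.0 * pi * static_cast<double>(k * j) / static_cast<double>(n);
              sum += input[j] * std::polar(1.0, angle);
          }
          output[k] = sum;
      }
      return output;
  }

  // Forward transform followed by inverse transform: the result is the input scaled by N.
  Signal roundTrip(const Signal& input)
  {
      return unnormalizedDft(unnormalizedDft(input, -1), +1);
  }

For a signal of length 4, every element of ``roundTrip(x)`` equals four times the corresponding element of ``x``, up
to floating-point rounding.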

Finally, back-project the filtered projection into the 3D volume using ``atomicAdd`` operations to accumulate voxel
values from multiple kernels:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
  :start-after: // [sphinx-bp-start]
  :end-before: // [sphinx-bp-end]
  :language: cuda
  :dedent:

.. note::
  The preprocessing kernels process 512 × 512 pixels (:math:`\mathcal{O}(n^2)`), while the back-projection kernel
  processes 512 × 512 × 512 voxels (:math:`\mathcal{O}(n^3)`). This cubic complexity makes back-projection the
  computational bottleneck.

Creating a trace file
^^^^^^^^^^^^^^^^^^^^^

Inside the ``build`` directory you will now generate a trace:

.. code-block:: bash

  rocprofv3 -o streams -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_streams

.. note::
  For more information on the ``rocprofv3`` tool, please refer to its
  :ref:`documentation <rocprofiler-sdk:using-rocprofv3>`.

Analyzing the trace
^^^^^^^^^^^^^^^^^^^

Open the trace file to see what is really happening:

1. Navigate to your ``build/outDir`` directory
2. Open ``streams_results.pftrace`` in `Perfetto <https://ui.perfetto.dev>`__
3. Click the arrow next to your executable name under ``System``
4. Focus on the kernel execution pattern on the right

.. figure:: ../data/tutorial/graph_api/streams_trace.png
  :alt: Stream execution showing gaps between kernel launches
  :align: center

Although the projections are processed in parallel, there are visible gaps between operations. These gaps represent
overhead caused by scheduling and launching the operations. In the next step, you will eliminate these gaps by
capturing the streams into a graph.

Step 3: Converting to graphs via stream capture
===============================================

Stream capture allows you to record a sequence of GPU operations (kernel launches, memory copies, and so on) into a HIP
graph, which can later be executed as a single unit. Open the file ``src/main_graph_capture.hip``, which contains the
code from the previous step with a few changes that allow you to capture the streams into a single graph.

Before the main loop, declare graph-specific variables:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-graph-vars-start]
  :end-before: // [sphinx-graph-vars-end]
  :language: cuda
  :dedent:

``graphExec`` and ``graphExecFinal`` will be instances of the graph template that you will create in the following
steps. You will typically instantiate a graph template once and update its parameters for repeated launches. If the
graph topology changes, you will need a new instance. The ``graphStream`` will launch the final graph instances.

Inside the main loop, activate capture mode on the first stream:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-begin-capture-start]
  :end-before: // [sphinx-begin-capture-end]
  :language: cuda
  :dedent:

.. admonition:: What happens during capture?

  When :cpp:func:`hipStreamBeginCapture` is called, the stream stops executing operations immediately. Instead, it
  records operations into a graph template (``graph`` in the code shown here).

To capture multiple streams, use events to implement the fork-join pattern:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-fork-start]
  :end-before: // [sphinx-fork-end]
  :language: cuda
  :dedent:

This creates dependencies between streams, activating capture mode on the additional streams and ensuring they are all
part of the same graph.

**The processing pipeline itself remains unchanged.**

After recording all operations of the current batch, join the streams:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-join-start]
  :end-before: // [sphinx-join-end]
  :language: cuda
  :dedent:

Then stop capturing:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-stop-capture-start]
  :end-before: // [sphinx-stop-capture-end]
  :language: cuda
  :dedent:

The graph template is now complete. In order to execute the recorded operations, you need to instantiate the graph
and execute it on the ``graphStream``. The graph template can be safely destroyed after instantiating:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-graph-instantiate-start]
  :end-before: // [sphinx-graph-instantiate-end]
  :language: cuda
  :dedent:

.. tip::
  Use :cpp:func:`hipGraphDebugDotPrint` to save a graph's topology into a ``*.dot`` file. The resulting file
  contains a `DOT <https://graphviz.org/doc/info/lang.html>`__ description which can be processed with
  `Graphviz <https://graphviz.org/>`__ or visualized with several tools. For example:

  .. code-block:: bash

    dot -Tpng graph_capture.dot -o graph_capture.png
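
If you want to try the Graphviz command before running the tutorial application, you can write a small DOT file by
hand. The following plain C++ sketch builds such a description from a list of named edges; ``toDot`` is a hypothetical
helper, and the files written by :cpp:func:`hipGraphDebugDotPrint` will likely contain more detail than this minimal
form:

.. code-block:: cpp

  #include <string>
  #include <utility>
  #include <vector>

  // Builds a DOT description of a directed graph from a list of named edges.
  std::string toDot(const std::string& graphName,
                    const std::vector<std::pair<std::string, std::string>>& edges)
  {
      std::string dot = "digraph " + graphName + " {\n";
      for (const auto& edge : edges)
      {
          dot += "  " + edge.first + " -> " + edge.second + ";\n";
      }
      dot += "}\n";
      return dot;
  }

Calling ``toDot("diamond", {{"A", "B"}, {"A", "C"}, {"B", "D"}, {"C", "D"}})`` produces the diamond-shaped graph from
the introduction. Saved to a file, the result can be rendered with the ``dot`` command shown above.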

Instantiating a graph is a relatively costly operation, yet the parameters change whenever a new batch is processed.
Because the graph templates have the same topology for all batches, it is sufficient to update the existing graph
instance's parameters instead of creating a new instance:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-graph-update-start]
  :end-before: // [sphinx-graph-update-end]
  :language: cuda
  :dedent:

Should the graph's topology change between iterations, it is necessary to create a new graph instance. In your
application's case, this can happen when the number of projections is not evenly divisible by the number of
asynchronous engines:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
  :start-after: // [sphinx-graph-final-start]
  :end-before: // [sphinx-graph-final-end]
  :language: cuda
  :dedent:

Creating a trace
----------------

Now you have successfully converted the processing pipeline into an executable graph. You can examine the effects of
this change and generate another trace:

.. code-block:: bash

  rocprofv3 -o graph_capture -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_graph_capture

Analyzing the trace
-------------------

Opening the resulting trace file ``outDir/graph_capture_results.pftrace`` with Perfetto shows a significant change:

.. figure:: ../data/tutorial/graph_api/capture_trace.png
  :alt: Diagram showing a trace of the capturing variant.
  :align: center

The gaps have disappeared! By capturing all operations of a batch into a single graph, you have successfully
eliminated the launching and scheduling overhead previously observed in the stream-based variant.

A limitation of stream capture is that it preserves stream ordering even when unnecessary. Operations that could run in
parallel still execute sequentially. Another approach to graphs is manual construction. This is quite verbose but also
offers much more control over dependencies and parallelism.
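
One way to reason about this difference is the length of the longest dependency chain in a graph. The sketch below is
plain C++ with hypothetical names; it only models dependencies and makes no statement about how the HIP runtime
actually schedules nodes. Four operations recorded on a single stream form a chain of length four, while the same
four operations arranged as a diamond have a longest chain of three:

.. code-block:: cpp

  #include <algorithm>
  #include <cassert>
  #include <utility>
  #include <vector>

  // Returns the number of nodes on the longest dependency chain of a graph.
  // Assumes nodes are numbered so that every edge (from, to) satisfies from < to.
  int criticalPathLength(int numNodes, const std::vector<std::pair<int, int>>& edges)
  {
      std::vector<std::vector<int>> successors(numNodes);
      for (const auto& edge : edges)
      {
          assert(edge.first >= 0 && edge.first < edge.second && edge.second < numNodes);
          successors[edge.first].push_back(edge.second);
      }

      // Every node on its own is a chain of length one.
      std::vector<int> depth(numNodes, 1);
      for (int node = 0; node < numNodes; ++node)
          for (int next : successors[node])
              depth[next] = std::max(depth[next], depth[node] + 1);

      return numNodes == 0 ? 0 : *std::max_element(depth.begin(), depth.end());
  }

Here ``criticalPathLength(4, {{0, 1}, {1, 2}, {2, 3}})`` returns 4 for the chain, and
``criticalPathLength(4, {{0, 1}, {0, 2}, {1, 3}, {2, 3}})`` returns 3 for the diamond.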

Step 4: Manual graph creation (advanced)
========================================

Open ``src/main_graph_creation.hip`` and find the main loop. The code here differs from the other variants: rather than
capturing streams into graphs, you will build the graph manually. Consider how the weighting kernel is invoked through
a kernel node:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-weighting-node-start]
  :end-before: // [sphinx-weighting-node-end]
  :language: cuda
  :dedent:

You create an array of ``void*`` pointers containing the kernel parameters. Next, configure the kernel launch
parameters: grid and block dimensions, the kernel function pointer, and the dynamic shared memory size. Finally, add
the kernel node to the graph template. Note the ``&logTransformationKernelNode, 1`` part: this is how you specify a
dependency from the preceding log transformation kernel node to the weighting kernel node.
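
The parameter array deserves a closer look: each entry points to one argument, not to the data the argument refers to.
The following plain C++ sketch mimics this convention without the HIP API. ``scaleFromPackedArgs`` is a hypothetical
stand-in for a kernel that takes ``(float* data, float scale, std::size_t count)``:

.. code-block:: cpp

  #include <cassert>
  #include <cstddef>

  // Stand-in for a kernel with the signature (float* data, float scale, std::size_t count).
  // It receives an array with one pointer per argument, in declaration order.
  void scaleFromPackedArgs(void** args)
  {
      assert(args != nullptr);
      float* data = *static_cast<float**>(args[0]);
      const float scale = *static_cast<float*>(args[1]);
      const std::size_t count = *static_cast<std::size_t*>(args[2]);
      for (std::size_t i = 0; i < count; ++i) data[i] *= scale;
  }

  // Packs three arguments into a void* array and passes them on.
  void scaleInPlace(float* data, float scale, std::size_t count)
  {
      // Each entry is the address of the argument itself:
      // &data is a float**, even though data is already a pointer.
      void* args[] = {&data, &scale, &count};
      scaleFromPackedArgs(args);
  }

A common mistake is to store ``data`` instead of ``&data`` in the array, which makes the receiving side interpret the
buffer contents as a pointer.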

.. note::
  For specifying multiple dependencies, you would pass an array of :cpp:type:`hipGraphNode_t` objects and the number of
  nodes inside the array to :cpp:func:`hipGraphAddKernelNode`.

The HIP graph API supports multiple different node types. For example, this is how a ``memset`` node is set up:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-memset-node-start]
  :end-before: // [sphinx-memset-node-end]
  :language: cuda
  :dedent:

.. note::
  Despite the different construction method, graph instantiation and updates
  work exactly as before. You can find the same patterns at the loop's end.

Adding hipFFT nodes
-------------------

While hipFFT provides graph-safe functionality, it does not support manual node creation. Integrating hipFFT into the
graph requires a workaround using stream capture with additional bookkeeping.

You capture the graph state before and after hipFFT operations, then identify the nodes hipFFT added:

Step 1: Save existing nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Record all current graph nodes in a sorted ``std::set``:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-before-forward-start]
  :end-before: // [sphinx-before-forward-end]
  :language: cuda
  :dedent:

Step 2: Capture hipFFT operations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-hipfft-start]
  :end-before: // [sphinx-hipfft-end]
  :language: cuda
  :dedent:

Step 3: Get updated node list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-after-forward-start]
  :end-before: // [sphinx-after-forward-end]
  :language: cuda
  :dedent:

Step 4: Find new nodes
^^^^^^^^^^^^^^^^^^^^^^

Compute the difference between both node sets:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-node-difference-start]
  :end-before: // [sphinx-node-difference-end]
  :language: cuda
  :dedent:
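
The idea can be tried in isolation with plain C++. In this sketch, ``NodeHandle`` is a hypothetical stand-in for
:cpp:type:`hipGraphNode_t` (which is also a pointer type), and ``newNodes`` is a hypothetical helper. It assumes that
nodes are only added, never removed, between the two snapshots:

.. code-block:: cpp

  #include <algorithm>
  #include <cassert>
  #include <functional>
  #include <iterator>
  #include <set>
  #include <vector>

  // Opaque node handle; a stand-in for a graph node handle.
  using NodeHandle = const void*;

  // Returns all handles that are in `after` but not in `before`.
  std::vector<NodeHandle> newNodes(const std::set<NodeHandle>& before, const std::set<NodeHandle>& after)
  {
      // std::less gives a well-defined total order for pointers, matching the order used by std::set.
      const std::less<NodeHandle> order{};
      assert(std::includes(after.begin(), after.end(), before.begin(), before.end(), order));

      std::vector<NodeHandle> added;
      std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                          std::back_inserter(added), order);
      return added;
  }

Both ranges passed to ``std::set_difference`` must be sorted with the same ordering, which is why storing the handles
in a ``std::set`` is convenient here.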

Step 5: Identify the leaf node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find hipFFT's final node for dependency tracking:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-find-leaf-start]
  :end-before: // [sphinx-find-leaf-end]
  :language: cuda
  :dedent:

The leaf detection logic checks if a node has no outgoing edges:

.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
  :start-after: // [sphinx-is-leaf-start]
  :end-before: // [sphinx-is-leaf-end]
  :language: cuda
  :dedent:
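
The same check can be illustrated on a simple adjacency list in plain C++. The names ``Successors``, ``isLeaf``, and
``findLeaves`` are hypothetical. With the HIP API, the nodes that depend on a given node can be queried with
:cpp:func:`hipGraphNodeGetDependentNodes`; this sketch replaces that query with a lookup in a vector:

.. code-block:: cpp

  #include <cassert>
  #include <cstddef>
  #include <vector>

  // successors[n] lists the nodes that depend on node n.
  using Successors = std::vector<std::vector<std::size_t>>;

  // A node is a leaf if no other node depends on it.
  bool isLeaf(const Successors& successors, std::size_t node)
  {
      assert(node < successors.size());
      return successors[node].empty();
  }

  // Returns the leaves among `candidates`, for example the nodes added by a library call.
  std::vector<std::size_t> findLeaves(const Successors& successors, const std::vector<std::size_t>& candidates)
  {
      std::vector<std::size_t> leaves;
      for (std::size_t node : candidates)
          if (isLeaf(successors, node)) leaves.push_back(node);
      return leaves;
  }

For a chain of four nodes where nodes 2 and 3 were newly added, ``findLeaves`` returns only node 3, which is the node
that subsequent operations should depend on.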

With hipFFT integrated and its leaf node identified, subsequent nodes can establish proper dependencies.

.. note::
  You can also capture hipFFT operations into a separate graph template, then add it to the main graph as a child graph
  using :cpp:func:`hipGraphAddChildGraphNode`. The approach above adds hipFFT nodes directly to the main graph as
  first-class nodes. A child graph acts as a single node that expands recursively into its components. The scheduler
  may handle these approaches differently, potentially affecting performance.

Creating a trace
----------------

Now you have manually implemented the processing pipeline with the graph API. You can examine the result by generating
another trace:

.. code-block:: bash

  rocprofv3 -o graph_creation -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_graph_creation

Analyzing the trace
-------------------

Opening the resulting trace file ``outDir/graph_creation_results.pftrace`` with Perfetto shows a similar trace to what
you achieved with the capture variant:

.. figure:: ../data/tutorial/graph_api/creation_trace.png
  :alt: Diagram showing a trace of the creation variant.
  :align: center

Like before, the kernels are executed *en bloc*. By creating nodes for all operations in the processing pipeline, you
avoided the launching and scheduling overhead you previously observed in the stream-based variant.

Updating individual nodes
-------------------------

The code presented in this tutorial updates the entire graph instance for each new batch. For applications that only
need to update a small subset of nodes, this can cause unnecessary overhead. For these cases, the HIP graph API
provides the following functions for updating individual nodes of a graph instance:

* :cpp:func:`hipGraphExecChildGraphNodeSetParams`
* :cpp:func:`hipGraphExecEventRecordNodeSetEvent`
* :cpp:func:`hipGraphExecEventWaitNodeSetEvent`
* :cpp:func:`hipGraphExecExternalSemaphoresSignalNodeSetParams`
* :cpp:func:`hipGraphExecExternalSemaphoresWaitNodeSetParams`
* :cpp:func:`hipGraphExecHostNodeSetParams`
* :cpp:func:`hipGraphExecKernelNodeSetParams`
* :cpp:func:`hipGraphExecMemcpyNodeSetParams`
* :cpp:func:`hipGraphExecMemcpyNodeSetParams1D`
* :cpp:func:`hipGraphExecMemcpyNodeSetParamsFromSymbol`
* :cpp:func:`hipGraphExecMemcpyNodeSetParamsToSymbol`
* :cpp:func:`hipGraphExecMemsetNodeSetParams`
* :cpp:func:`hipGraphExecNodeSetParams`

Conclusion
==========

When an application has predictable, repetitive workflows, transitioning from streams to graphs can significantly
reduce launch overhead and improve performance. HIP provides two approaches for creating graphs: stream capture and
explicit graph construction.

**Stream capture** converts existing stream-based code into a graph by recording the operations between start and stop
capture calls. This approach minimizes code changes and works well when your application already has a graph-like
structure with clear dependencies.

**Explicit graph construction** involves manually creating nodes and defining edges between them using the graph API.
While this approach requires more code changes and is more verbose, it provides fine-grained control over dependencies
and allows for optimizations that might not be possible with stream capture. This method is ideal when you need precise
control over the graph topology or when working with complex dependency patterns.

.. tip::
  Choose stream capture for quick conversions of existing code with minimal changes. Choose explicit construction when
  you need maximum control and optimization opportunities.

Resources
=========

* :ref:`HIP Programming Guide's section on HIP graphs <how_to_HIP_graph>`
* :ref:`HIP graph API reference <graph_management_reference>`

.. rubric:: References

.. [FeDK84] L.A. Feldkamp, L.C. Davis and J.W. Kress: "Practical cone-beam algorithm". In *Journal of the Optical Society of America A*, vol. 1, no. 6, pp. 612-619, June 1984, DOI `10.1364/JOSAA.1.000612 <https://dx.doi.org/10.1364/JOSAA.1.000612>`__.
.. [ShLo74] L.A. Shepp and B.F. Logan: "The Fourier reconstruction of a head section". In *IEEE Transactions on Nuclear Science*, vol. 21, no. 3, pp. 21-43, June 1974, DOI `10.1109/TNS.1974.6499235 <https://dx.doi.org/10.1109/TNS.1974.6499235>`__.