rocm-systems/projects/hip/docs/how-to/hip_runtime_api/hipgraph.rst

.. meta::
    :description: This chapter describes how to use HIP graphs and highlights their use cases.
    :keywords: ROCm, HIP, graph, stream

.. _how_to_HIP_graph:

********************************************************************************
HIP graphs
********************************************************************************

HIP graphs are an alternative way of executing tasks on a GPU that can provide
performance benefits over launching kernels using the standard
method via streams. A HIP graph is made up of nodes and edges. The nodes of a
HIP graph represent the operations performed, while the edges mark dependencies
between those operations.

.. hint::
    The :ref:`HIP Graph API tutorial <hip_graph_api_tutorial>` demonstrates how
    to use HIP graphs in a real-world application.

The nodes can be one of the following:

- empty nodes
- nested graphs
- kernel launches
- host-side function calls
- HIP memory functions (copy, memset, ...)
- HIP events
- signalling or waiting on external semaphores

.. note::
    The available node types are specified by :cpp:enum:`hipGraphNodeType`.

The following figure visualizes the concept of graphs, compared to using streams.

.. figure:: ../../data/how-to/hip_runtime_api/hipgraph/hip_graph.svg
    :alt: Diagram depicting the difference between using streams to execute
          kernels with dependencies, resolved by explicitly synchronizing,
          or using graphs, where the edges denote the dependencies.

The standard method of launching kernels incurs a small overhead for each
iteration of the operation involved. That overhead is negligible, when the
kernel is launched directly with the HIP C/C++ API, but depending on the
framework used, there can be several levels of redirection, until the actual
kernel is launched by the HIP runtime, leading to significant overhead.
Especially for some AI frameworks, a GPU kernel might run faster than the time
it takes for the framework to set up and launch the kernel, and so the overhead
of repeatedly launching kernels can have a significant impact on performance.

HIP graphs are designed to address this issue, by predefining the HIP API calls
and their dependencies with a graph, and performing most of the initialization
beforehand. Launching a graph only requires a single call, after which the
HIP runtime takes care of executing the operations within the graph.
Graphs can provide additional performance benefits, by enabling optimizations
that are only possible when knowing the dependencies between the operations.

.. figure:: ../../data/how-to/hip_runtime_api/hipgraph/hip_graph_speedup.svg
    :alt: Diagram depicting the speed up achievable with HIP graphs compared to
          HIP streams when launching many short-running kernels.

    Qualitative presentation of the execution time of many short-running kernels
    when launched using HIP stream versus HIP graph. This does not include the
    time needed to set up the graph.

Using HIP graphs
================================================================================

There are two different ways of creating graphs: Capturing kernel launches from
a stream, or explicitly creating graphs. The difference between the two
approaches is explained later in this chapter.

The general flow for using HIP graphs includes the following steps.

#. Create a :cpp:type:`hipGraph_t` graph template using one of the two approaches described in this chapter
#. Create a :cpp:type:`hipGraphExec_t` executable instance of the graph template using :cpp:func:`hipGraphInstantiate`
#. Use :cpp:func:`hipGraphLaunch` to launch the executable graph to a stream
#. After execution completes free and destroy graph resources

The first two steps are the initial setup and only need to be executed once. First
step is the definition of the operations (nodes) and the dependencies (edges)
between them. The second step is the instantiation of the graph. This takes care
of validating and initializing the graph, to reduce the overhead when executing
the graph. The third step is the execution of the graph, which takes care of
launching all the kernels and executing the operations while respecting their
dependencies and necessary synchronizations as specified.

Because HIP graphs require some setup and initialization overhead before their
first execution, graphs only provide a benefit for workloads that require
many iterations to complete.

In both methods the :cpp:type:`hipGraph_t` template for a graph is used to define the graph.
In order to actually launch a graph, the template needs to be instantiated using
:cpp:func:`hipGraphInstantiate`, which results in an executable graph of type :cpp:type:`hipGraphExec_t`.
This executable graph can then be launched with :cpp:func:`hipGraphLaunch`, replaying the
operations within the graph. Note, that launching graphs is fundamentally no
different to executing other HIP functions on a stream, except for the fact,
that scheduling the operations within the graph encompasses less overhead and
can enable some optimizations, but they still need to be associated with a stream for execution.

Memory management
--------------------------------------------------------------------------------

Memory that is used by operations in graphs can either be pre-allocated or
managed within the graph. Graphs can contain nodes that take care of allocating
memory on the device or copying memory between the host and the device.
Whether you want to pre-allocate the memory or manage it within the graph
depends on the use-case. If the graph is executed in a tight loop the
performance is usually better when the memory is preallocated, so that it
does not need to be reallocated in every iteration.

The same rules as for normal memory allocations apply for memory allocated and
freed by nodes, meaning that the nodes that access memory allocated in a graph
must be ordered after allocation and before freeing.

Memory management within the graph enables the runtime to take care of memory reuse and optimizations.
The lifetime of memory managed in a graph begins when the execution reaches the
node allocating the memory, and ends when either reaching the corresponding
free node within the graph, or after graph execution when a corresponding
:cpp:func:`hipFreeAsync` or :cpp:func:`hipFree` call is reached.
The memory can also be freed with a free node in a different graph that is
associated with the same memory address.

Unlike device memory that is not associated with a graph, this does not necessarily
mean that the freed memory is returned back to the operating system immediately.
Graphs can retain a memory pool for quickly reusing memory within the graph.
This can be especially useful when memory is freed and reallocated later on
within a graph, as that memory doesn't have to be requested from the operating system.
It also potentially reduces the total memory footprint of the graph, by reusing the same memory.

The amount of memory allocated for graph memory pools on a specific device can
be queried using :cpp:func:`hipDeviceGetGraphMemAttribute`.
In order to return the freed memory :cpp:func:`hipDeviceGraphMemTrim` can be used.
This will return any memory that is not in active use by graphs.

These memory allocations can also be set up to allow access from multiple GPUs,
just like normal allocations. HIP then takes care of allocating and mapping the
memory to the GPUs. When capturing a graph from a stream, the node sets the
accessibility according to :cpp:func:`hipMemPoolSetAccess` at the time of capturing.


Capture graphs from a stream
================================================================================

The easy way to integrate HIP graphs into already existing code is to use
:cpp:func:`hipStreamBeginCapture` and :cpp:func:`hipStreamEndCapture` to obtain a :cpp:type:`hipGraph_t`
graph template that includes the captured operations.

When starting to capture operations for a graph using :cpp:func:`hipStreamBeginCapture`,
the operations assigned to the stream are captured into a graph instead of being
executed. The associated graph is returned when calling :cpp:func:`hipStreamEndCapture`, which
also stops capturing operations.
In order to capture to an already existing graph use :cpp:func:`hipStreamBeginCaptureToGraph`.

The functions assigned to the capturing stream are not executed, but instead are
captured and defined as nodes in the graph, to be run when the instantiated
graph is launched.

Functions must be associated with a stream in order to be captured.
This means that non-HIP API functions are not captured by default, but are
executed as standard functions when encountered and not added to the graph.
In order to assign host functions to a stream use
:cpp:func:`hipLaunchHostFunc`, as shown in the following code example.
They will then be captured and defined as a host node in the resulting graph,
and won't be executed when encountered.

Synchronous HIP API calls that are implicitly assigned to the default stream are
not permitted while capturing a stream  and will return an error. This is
because they implicitly synchronize and cause a dependency that can not be
captured within the stream. This includes functions like :cpp:func:`hipMalloc`,
:cpp:func:`hipMemcpy` and :cpp:func:`hipFree`. In order to capture these to the stream, replace
them with the corresponding asynchronous calls like :cpp:func:`hipMallocAsync`, :cpp:func:`hipMemcpyAsync` or :cpp:func:`hipFreeAsync`.

The general flow for using stream capture to create a graph template is:

#. Create a stream from which to capture the operations

#. Call :cpp:func:`hipStreamBeginCapture` before the first operation to be captured

#. Call :cpp:func:`hipStreamEndCapture` after the last operation to be captured

   #. Define a :cpp:type:`hipGraph_t` graph template to which :cpp:func:`hipStreamEndCapture`
      passes the captured graph

The following code is an example of how to use the HIP graph API to capture a
graph from a stream.

.. literalinclude:: ../../tools/example_codes/graph_capture.hip
    :start-after: // [sphinx-start]
    :end-before: // [sphinx-end]
    :language: cpp

Explicit graph creation
================================================================================

Graphs can also be created directly using the HIP graph API, giving more
fine-grained control over the graph. In this case, the graph nodes are created
explicitly, together with their parameters and dependencies, which specify the
edges of the graph, thereby forming the graph structure.

The nodes are represented by the generic :cpp:type:`hipGraphNode_t` type. The actual
node type is implicitly defined by the specific function used to add the node to
the graph, for example :cpp:func:`hipGraphAddKernelNode` See the
:ref:`HIP graph API documentation<graph_management_reference>` for the
available functions, they are of type ``hipGraphAdd{Type}Node``. Each type of
node also has a predefined set of parameters depending on the operation, for
example :cpp:class:`hipKernelNodeParams` for a kernel launch. See the
:doc:`documentation for the general hipGraphNodeParams type<../../doxygen/html/structhip_graph_node_params>`
for a list of available parameter types and their members.

The general flow for explicitly creating a graph is usually:

#. Create a graph :cpp:type:`hipGraph_t`

#. Create the nodes and their parameters and add them to the graph

   #. Define a :cpp:type:`hipGraphNode_t`

   #. Define the parameter struct for the desired operation, by explicitly setting the appropriate struct's members.

   #. Use the appropriate ``hipGraphAdd{Type}Node`` function to add the node to the graph.

      #. The dependencies can be defined when adding the node to the graph, or afterwards by using :cpp:func:`hipGraphAddDependencies`

The following code example demonstrates how to explicitly create nodes in order to create a graph.

.. literalinclude:: ../../tools/example_codes/graph_creation.hip
    :start-after: // [sphinx-start]
    :end-before: // [sphinx-end]
    :language: cpp