Sync HIP documentation 2025-10-20 (#1258)

* Add examples to tools folder
* Correct P2P memory access section
* Sync porting guide
* Add HIP Graph tutorial
* Add hint about using amdgpu-dkms for IPC API
* Add a few more env variables
2025-10-29 07:42:06 +01:00
@@ -103,66 +103,10 @@ The kernel arguments are listed after the configuration parameters.

 .. code-block:: cpp

-  #include <hip/hip_runtime.h>
-  #include <iostream>
-
-  #define HIP_CHECK(expression)                                \
-  {                                                            \
-      const hipError_t err = expression;                       \
-      if(err != hipSuccess){                                   \
-          std::cerr << "HIP error: " << hipGetErrorString(err) \
-              << " at " << __LINE__ << "\n";                   \
-      }                                                        \
-  }
-
-  // Performs a simple initialization of an array with the thread's index variables.
-  // This function is only available in device code.
-  __device__ void init_array(float * const a, const unsigned int arraySize){
-    // globalIdx uniquely identifies a thread in a 1D launch configuration.
-    const int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
-    // Each thread initializes a single element of the array.
-    if(globalIdx < arraySize){
-      a[globalIdx] = globalIdx;
-    }
-  }
-
-  // Rounds a value up to the next multiple.
-  // This function is available in host and device code.
-  __host__ __device__ constexpr int round_up_to_nearest_multiple(int number, int multiple){
-    return (number + multiple - 1)/multiple;
-  }
-
-  __global__ void example_kernel(float * const a, const unsigned int N)
-  {
-    // Initialize array.
-    init_array(a, N);
-    // Perform additional work:
-    // - work with the array
-    // - use the array in a different kernel
-    // - ...
-  }
-
-  int main()
-  {
-    constexpr int N = 100000000; // problem size
-    constexpr int blockSize = 256; //configurable block size
-
-    //needed number of blocks for the given problem size
-    constexpr int gridSize = round_up_to_nearest_multiple(N, blockSize);
-
-    float *a;
-    // allocate memory on the GPU
-    HIP_CHECK(hipMalloc(&a, sizeof(*a) * N));
-
-    std::cout << "Launching kernel." << std::endl;
-    example_kernel<<<dim3(gridSize), dim3(blockSize), 0/*example doesn't use shared memory*/, 0/*default stream*/>>>(a, N);
-    // make sure kernel execution is finished by synchronizing. The CPU can also
-    // execute other instructions during that time
-    HIP_CHECK(hipDeviceSynchronize());
-    std::cout << "Kernel execution finished." << std::endl;
-
-    HIP_CHECK(hipFree(a));
-  }
+  .. literalinclude:: ../tools/example_codes/calling_global_functions.hip
+      :start-after: // [sphinx-start]
+      :end-before: // [sphinx-end]
+      :language: cpp

 Inline qualifiers
 --------------------------------------------------------------------------------
@@ -321,28 +265,10 @@ launch has to specify the needed amount of ``extern`` shared memory in the launc
 configuration. The statically allocated shared memory is allocated without this
 parameter.

-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-
-  extern __shared__ int shared_array[];
-
-  __global__ void kernel(){
-    // initialize shared memory
-    shared_array[threadIdx.x] = threadIdx.x;
-    // use shared memory - synchronize to make sure, that all threads of the
-    // block see all changes to shared memory
-    __syncthreads();
-  }
-
-  int main(){
-    //shared memory in this case depends on the configurable block size
-    constexpr int blockSize = 256;
-    constexpr int sharedMemSize = blockSize * sizeof(int);
-    constexpr int gridSize = 2;
-
-    kernel<<<dim3(gridSize), dim3(blockSize), sharedMemSize, 0>>>();
-  }
+.. literalinclude:: ../tools/example_codes/extern_shared_memory.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 __managed__
 --------------------------------------------------------------------------------
@@ -735,22 +661,18 @@ with the actual frequency.

 The difference between the returned values represents the cycles used.

-.. code-block:: cpp
-
-  __global void kernel(){
-    long long int start = clock64();
-    // kernel code
-    long long int stop = clock64();
-    long long int cycles = stop - start;
-  }
+.. literalinclude:: ../tools/example_codes/timer.hip
+    :start-after: // [sphinx-kernel-start]
+    :end-before: // [sphinx-kernel-end]
+    :language: cpp

 ``long long int wall_clock64()`` returns the wall clock time on the device, with a constant, fixed frequency.
 The frequency is device dependent and can be queried using:

-.. code-block:: cpp
-
-  int wallClkRate = 0; //in kilohertz
-  hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId);
+.. literalinclude:: ../tools/example_codes/timer.hip
+    :start-after: // [sphinx-query-start]
+    :end-before: // [sphinx-query-end]
+    :language: cpp

 .. _atomic functions:

@@ -1,649 +0,0 @@
-.. meta::
-  :description: This chapter presents how to port the CUDA driver API and showcases equivalent operations in HIP.
-  :keywords: AMD, ROCm, HIP, CUDA, driver API, porting, port
-
-.. _porting_driver_api:
-
-*******************************************************************************
-Porting CUDA driver API
-*******************************************************************************
-
-CUDA provides separate driver and runtime APIs. The two APIs generally provide
-the similar functionality and mostly can be used interchangeably, however the
-driver API allows for more fine-grained control over the kernel level
-initialization, contexts and module management. This is all taken care of
-implicitly by the runtime API.
-
-* Driver API calls begin with the prefix ``cu``, while runtime API calls begin
-  with the prefix ``cuda``. For example, the driver API contains
-  ``cuEventCreate``, while the runtime API contains ``cudaEventCreate``, which
-  has similar functionality.
-
-* The driver API offers two additional low-level functionalities not exposed by
-  the runtime API: module management ``cuModule*`` and context management
-  ``cuCtx*`` APIs.
-
-HIP does not explicitly provide two different APIs, the corresponding functions
-for the CUDA driver API are available in the HIP runtime API, and are usually
-prefixed with ``hipDrv``. The module and context functionality is available with
-the ``hipModule`` and ``hipCtx`` prefix.
-
-cuModule API
-================================================================================
-
-The Module section of the driver API provides additional control over how and
-when accelerator code objects are loaded. For example, the driver API enables
-code objects to load from files or memory pointers. Symbols for kernels or
-global data are extracted from the loaded code objects. In contrast, the runtime
-API loads automatically and, if necessary, compiles all the kernels from an
-executable binary when it runs. In this mode, kernel code must be compiled using
-NVCC so that automatic loading can function correctly.
-
-The Module features are useful in an environment that generates the code objects
-directly, such as a new accelerator language front end. NVCC is not used here.
-Instead, the environment might have a different kernel language or compilation
-flow. Other environments have many kernels and don't want all of them to be
-loaded automatically. The Module functions load the generated code objects and
-launch kernels. Similar to the cuModule API, HIP defines a hipModule API that
-provides similar explicit control over code object management.
-
-.. _context_driver_api:
-
-cuCtx API
-================================================================================
-
-The driver API defines "Context" and "Devices" as separate entities.
-Contexts contain a single device, and a device can theoretically have multiple contexts.
-Each context contains a set of streams and events specific to the context.
-Historically, contexts also defined a unique address space for the GPU. This might no longer be the case in unified memory platforms, because the CPU and all the devices in the same process share a single unified address space.
-The Context APIs also provide a mechanism to switch between devices, which enables a single CPU thread to send commands to different GPUs.
-HIP and recent versions of the CUDA Runtime provide other mechanisms to accomplish this feat, for example, using streams or ``cudaSetDevice``.
-
-The CUDA runtime API unifies the Context API with the Device API. This simplifies the APIs and has little loss of functionality. This is because each context can contain a single device, and the benefits of multiple contexts have been replaced with other interfaces.
-HIP provides a Context API to facilitate easy porting from existing Driver code.
-In HIP, the ``Ctx`` functions largely provide an alternate syntax for changing the active device.
-
-Most new applications preferentially use ``hipSetDevice`` or the stream APIs. Therefore, HIP has marked the ``hipCtx`` APIs as **deprecated**. Support for these APIs might not be available in future releases. For more details on deprecated APIs, see :doc:`../reference/deprecated_api_list`.
-
-HIP module and Ctx APIs
-================================================================================
-
-Rather than present two separate APIs, HIP extends the HIP API with new APIs for
-modules and ``Ctx`` control.
-
-hipModule API
--------------------------------------------------------------------------------
-
-Like the CUDA driver API, the Module API provides additional control over how
-code is loaded, including options to load code from files or from in-memory
-pointers.
-NVCC and HIP-Clang target different architectures and use different code object
-formats. NVCC supports ``cubin`` or ``ptx`` files, while the HIP-Clang path uses
-the ``hsaco`` format.
-The external compilers which generate these code objects are responsible for
-generating and loading the correct code object for each platform.
-Notably, there is no fat binary format that can contain code for both NVCC and
-HIP-Clang platforms. The following table summarizes the formats used on each
-platform:
-
-.. list-table:: Module formats
-   :header-rows: 1
-
-   * - Format
-     - APIs
-     - NVCC
-     - HIP-CLANG
-   * - Code object
-     - ``hipModuleLoad``, ``hipModuleLoadData``
-     - ``.cubin`` or PTX text
-     - ``.hsaco``
-   * - Fat binary
-     - ``hipModuleLoadFatBin``
-     - ``.fatbin``
-     - ``.hip_fatbin``
-
-``hipcc`` uses HIP-Clang or NVCC to compile host code. Both of these compilers can embed code objects into the final executable. These code objects are automatically loaded when the application starts.
-The ``hipModule`` API can be used to load additional code objects. When used this way, it extends the capability of the automatically loaded code objects.
-HIP-Clang enables both of these capabilities to be used together. Of course, it is possible to create a program with no kernels and no automatic loading.
-
-For module API reference, visit :ref:`module_management_reference`.
-
-hipCtx API
--------------------------------------------------------------------------------
-
-HIP provides a ``Ctx`` API as a thin layer over the existing device functions. The ``Ctx`` API can be used to set the current context or to query properties of the device associated with the context.
-The current context is implicitly used by other APIs, such as ``hipStreamCreate``.
-
-For context reference, visit :ref:`context_management_reference`.
-
-HIPIFY translation of CUDA driver API
-================================================================================
-
-The HIPIFY tools convert CUDA driver APIs such as streams, events, modules,
-devices, memory management, context, and the profiler to the equivalent HIP
-calls. For example, ``cuEventCreate`` is translated to :cpp:func:`hipEventCreate`.
-HIPIFY tools also convert error codes from the driver namespace and coding
-conventions to the equivalent HIP error code. HIP unifies the APIs for these
-common functions.
-
-The memory copy API requires additional explanation. The CUDA driver includes
-the memory direction in the name of the API (``cuMemcpyHtoD``), while the CUDA
-runtime API provides a single memory copy API with a parameter that specifies
-the direction. It also supports a "default" direction where the runtime
-determines the direction automatically.
-HIP provides both versions, for example, :cpp:func:`hipMemcpyHtoD` as well as
-:cpp:func:`hipMemcpy`. The first version might be faster in some cases because
-it avoids any host overhead to detect the different memory directions.
-
-HIP defines a single error space and uses camel case for all errors (i.e. ``hipErrorInvalidValue``).
-
-For further information, visit the :doc:`hipify:index`.
-
-Address spaces
--------------------------------------------------------------------------------
-
-HIP-Clang defines a process-wide address space where the CPU and all devices
-allocate addresses from a single unified pool.
-This means addresses can be shared between contexts. Unlike the original CUDA
-implementation, a new context does not create a new address space for the device.
-
-Using hipModuleLaunchKernel
--------------------------------------------------------------------------------
-
-Both CUDA driver and runtime APIs define a function for launching kernels,
-called ``cuLaunchKernel`` or ``cudaLaunchKernel``. The equivalent API in HIP is
-``hipModuleLaunchKernel``.
-The kernel arguments and the execution configuration (grid dimensions, group
-dimensions, dynamic shared memory, and stream) are passed as arguments to the
-launch function.
-The runtime API additionally provides the ``<<< >>>`` syntax for launching
-kernels, which resembles a special function call and is easier to use than the
-explicit launch API, especially when handling kernel arguments.
-However, this syntax is not standard C++ and is available only when NVCC is used
-to compile the host code.
-
-Additional information
--------------------------------------------------------------------------------
-
-HIP-Clang creates a primary context when the HIP API is called. So, in pure
-driver API code, HIP-Clang creates a primary context while HIP/NVCC has an empty
-context stack. HIP-Clang pushes the primary context to the context stack when it
-is empty. This can lead to subtle differences in applications which mix the
-runtime and driver APIs.
-
-HIP-Clang implementation notes
-================================================================================
-
-.hip_fatbin
--------------------------------------------------------------------------------
-
-HIP-Clang links device code from different translation units together. For each
-device target, it generates a code object. ``clang-offload-bundler`` bundles
-code objects for different device targets into one fat binary, which is embedded
-as the global symbol ``__hip_fatbin`` in the ``.hip_fatbin`` section of the ELF
-file of the executable or shared object.
-
-Initialization and termination functions
--------------------------------------------------------------------------------
-
-HIP-Clang generates initialization and termination functions for each
-translation unit for host code compilation. The initialization functions call
-``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file.
-They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register
-kernel functions and device-side global variables. The termination functions
-call ``__hipUnregisterFatBinary``.
-HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**``
-with ``linkonce`` linkage and an initial value of 0 for each host translation
-unit. Each initialization function checks ``__hip_gpubin_handle`` and registers
-the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value
-of ``__hip_gpubin_handle`` to ``__hip_gpubin_handle``. This ensures that the fat
-binary is registered once. A similar check is performed in the termination
-functions.
-
-Kernel launching
--------------------------------------------------------------------------------
-
-HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax,
-``hipLaunchKernel``, or ``hipLaunchKernelGGL``. The last option is a macro which
-expands to the CUDA ``<<<>>>`` syntax by default. It can also be turned into a
-template by defining ``HIP_TEMPLATE_KERNEL_LAUNCH``.
-
-When the executable or shared library is loaded by the dynamic linker, the
-initialization functions are called. In the initialization functions, the code
-objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is
-called. When ``__hipRegisterFunction`` is called, the stub functions are
-associated with the corresponding kernels in the code objects.
-
-HIP-Clang implements two sets of APIs for launching kernels.
-By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code,
-it first calls ``hipConfigureCall`` to set up the threads and grids. It then
-calls the stub function with the given arguments. The stub function calls
-``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr``
-with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual
-kernel associated with the stub function is launched.
-
-NVCC implementation notes
-================================================================================
-
-Interoperation between HIP and CUDA driver
--------------------------------------------------------------------------------
-
-CUDA applications might want to mix CUDA driver code with HIP code (see the
-example below). This table shows the equivalence between CUDA and HIP types
-required to implement this interaction.
-
-.. list-table:: Equivalence table between HIP and CUDA types
-   :header-rows: 1
-
-   * - HIP type
-     - CU Driver type
-     - CUDA Runtime type
-   * - ``hipModule_t``
-     - ``CUmodule``
-     -
-   * - ``hipFunction_t``
-     - ``CUfunction``
-     -
-   * - ``hipCtx_t``
-     - ``CUcontext``
-     -
-   * - ``hipDevice_t``
-     - ``CUdevice``
-     -
-   * - ``hipStream_t``
-     - ``CUstream``
-     - ``cudaStream_t``
-   * - ``hipEvent_t``
-     - ``CUevent``
-     - ``cudaEvent_t``
-   * - ``hipArray``
-     - ``CUarray``
-     - ``cudaArray``
-
-Compilation options
--------------------------------------------------------------------------------
-
-The ``hipModule_t`` interface does not support the ``cuModuleLoadDataEx`` function, which is used to control PTX compilation options.
-HIP-Clang does not use PTX, so it does not support these compilation options.
-In fact, HIP-Clang code objects contain fully compiled code for a device-specific instruction set and don't require additional compilation as a part of the load step.
-The corresponding HIP function ``hipModuleLoadDataEx`` behaves like ``hipModuleLoadData`` on the HIP-Clang path (where compilation options are not used) and like ``cuModuleLoadDataEx`` on the NVCC path.
-
-For example:
-
-.. tab-set::
-
-    .. tab-item:: HIP
-
-        .. code-block:: cpp
-
-            hipModule_t module;
-            void *imagePtr = ...; // Somehow populate data pointer with code object
-
-            const int numOptions = 1;
-            hipJitOption options[numOptions];
-            void *optionValues[numOptions];
-
-            options[0] = hipJitOptionMaxRegisters;
-            unsigned maxRegs = 15;
-            optionValues[0] = (void *)(&maxRegs);
-
-            // hipModuleLoadData(module, imagePtr) will be called on HIP-Clang path, JIT
-            // options will not be used, and cupModuleLoadDataEx(module, imagePtr,
-            // numOptions, options, optionValues) will be called on NVCC path
-            hipModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);
-
-            hipFunction_t k;
-            hipModuleGetFunction(&k, module, "myKernel");
-
-    .. tab-item:: CUDA
-
-        .. code-block:: cpp
-
-            CUmodule module;
-            void *imagePtr = ...; // Somehow populate data pointer with code object
-
-            const int numOptions = 1;
-            CUJit_option options[numOptions];
-            void *optionValues[numOptions];
-
-            options[0] = CU_JIT_MAX_REGISTERS;
-            unsigned maxRegs = 15;
-            optionValues[0] = (void *)(&maxRegs);
-
-            cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);
-
-            CUfunction k;
-            cuModuleGetFunction(&k, module, "myKernel");
-
-The sample below shows how to use ``hipModuleGetFunction``.
-
-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <hip/hip_runtime_api.h>
-
-    #include <vector>
-
-    int main() {
-
-        size_t elements = 64*1024;
-        size_t size_bytes = elements * sizeof(float);
-
-        std::vector<float> A(elements), B(elements);
-
-        // On NVIDIA platforms the driver runtime needs to be initiated
-        #ifdef __HIP_PLATFORM_NVIDIA__
-        hipInit(0);
-        hipDevice_t device;
-        hipCtx_t context;
-        HIPCHECK(hipDeviceGet(&device, 0));
-        HIPCHECK(hipCtxCreate(&context, 0, device));
-        #endif
-
-        // Allocate device memory
-        hipDeviceptr_t d_A, d_B;
-        HIPCHECK(hipMalloc(&d_A, size_bytes));
-        HIPCHECK(hipMalloc(&d_B, size_bytes));
-
-        // Copy data to device
-        HIPCHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
-        HIPCHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
-
-        // Load module
-        hipModule_t Module;
-        // For AMD the module file has to contain architecture specific object codee
-        // For NVIDIA the module file has to contain PTX, found in e.g. "vcpy_isa.ptx"
-        HIPCHECK(hipModuleLoad(&Module, "vcpy_isa.co"));
-        // Get kernel function from the module via its name
-        hipFunction_t Function;
-        HIPCHECK(hipModuleGetFunction(&Function, Module, "hello_world"));
-
-        // Create buffer for kernel arguments
-        std::vector<void*> argBuffer{&d_A, &d_B};
-        size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
-
-        // Create configuration passed to the kernel as arguments
-        void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
-                          HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes, HIP_LAUNCH_PARAM_END};
-
-        int threads_per_block = 128;
-        int blocks = (elements + threads_per_block - 1) / threads_per_block;
-
-        // Actually launch kernel
-        HIPCHECK(hipModuleLaunchKernel(Function, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
-
-        HIPCHECK(hipMemcpyDtoH(A.data(), d_A, elements));
-        HIPCHECK(hipMemcpyDtoH(B.data(), d_B, elements));
-
-        #ifdef __HIP_PLATFORM_NVIDIA__
-        HIPCHECK(hipCtxDetach(context));
-        #endif
-
-        HIPCHECK(hipFree(d_A));
-        HIPCHECK(hipFree(d_B));
-
-        return 0;
-    }
-
-HIP module and texture Driver API
-================================================================================
-
-HIP supports texture driver APIs. However, texture references must be declared
-within the host scope. The following code demonstrates the use of texture
-references for the ``__HIP_PLATFORM_AMD__`` platform.
-
-.. code-block:: cpp
-
-    // Code to generate code object
-
-    #include "hip/hip_runtime.h"
-    extern texture<float, 2, hipReadModeElementType> tex;
-
-    __global__ void tex2dKernel(hipLaunchParm lp, float *outputData, int width,
-                                int height) {
-        int x = blockIdx.x * blockDim.x + threadIdx.x;
-        int y = blockIdx.y * blockDim.y + threadIdx.y;
-        outputData[y * width + x] = tex2D(tex, x, y);
-    }
-
-.. code-block:: cpp
-
-  // Host code:
-
-  texture<float, 2, hipReadModeElementType> tex;
-
-    void myFunc ()
-    {
-        // ...
-
-        textureReference* texref;
-        hipModuleGetTexRef(&texref, Module1, "tex");
-        hipTexRefSetAddressMode(texref, 0, hipAddressModeWrap);
-        hipTexRefSetAddressMode(texref, 1, hipAddressModeWrap);
-        hipTexRefSetFilterMode(texref, hipFilterModePoint);
-        hipTexRefSetFlags(texref, 0);
-        hipTexRefSetFormat(texref, HIP_AD_FORMAT_FLOAT, 1);
-        hipTexRefSetArray(texref, array, HIP_TRSA_OVERRIDE_FORMAT);
-
-      // ...
-    }
-
-Driver entry point access
-================================================================================
-
-Starting from HIP version 6.2.0, support for Driver Entry Point Access is
-available when using CUDA 12.0 or newer. This feature allows developers to
-directly interact with the CUDA driver API, providing more control over GPU
-operations.
-
-Driver Entry Point Access provides several features:
-
-* Retrieving the address of a runtime function
-* Requesting the default stream version on a per-thread basis
-* Accessing new HIP features on older toolkits with a newer driver
-
-For driver entry point access reference, visit :cpp:func:`hipGetProcAddress`.
-
-Address retrieval
--------------------------------------------------------------------------------
-
-The :cpp:func:`hipGetProcAddress` function can be used to obtain the address of
-a runtime function. This is demonstrated in the following example:
-
-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-  #include <hip/hip_runtime_api.h>
-
-  #include <iostream>
-
-  typedef hipError_t (*hipInit_t)(unsigned int);
-
-  int main() {
-      // Initialize the HIP runtime
-      hipError_t res = hipInit(0);
-      if (res != hipSuccess) {
-          std::cerr << "Failed to initialize HIP runtime." << std::endl;
-          return 1;
-      }
-
-      // Get the address of the hipInit function
-      hipInit_t hipInitFunc;
-      int hipVersion = HIP_VERSION; // Use the HIP version defined in hip_runtime_api.h
-      uint64_t flags = 0; // No special flags
-      hipDriverProcAddressQueryResult symbolStatus;
-
-      res = hipGetProcAddress("hipInit", (void**)&hipInitFunc, hipVersion, flags, &symbolStatus);
-      if (res != hipSuccess) {
-          std::cerr << "Failed to get address of hipInit()." << std::endl;
-          return 1;
-      }
-
-      // Call the hipInit function using the obtained address
-      res = hipInitFunc(0);
-      if (res == hipSuccess) {
-          std::cout << "HIP runtime initialized successfully using hipGetProcAddress()." << std::endl;
-      } else {
-          std::cerr << "Failed to initialize HIP runtime using hipGetProcAddress()." << std::endl;
-      }
-
-      return 0;
-  }
-
-Per-thread default stream version request
-================================================================================
-
-HIP offers functionality similar to CUDA for managing streams on a per-thread
-basis. By using ``hipStreamPerThread``, each thread can independently manage its
-default stream, simplifying operations. The following example demonstrates how
-this feature enhances performance by reducing contention and improving
-efficiency.
-
-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-
-  #include <iostream>
-
-  int main() {
-      // Initialize the HIP runtime
-      hipError_t res = hipInit(0);
-      if (res != hipSuccess) {
-          std::cerr << "Failed to initialize HIP runtime." << std::endl;
-          return 1;
-      }
-
-      // Get the per-thread default stream
-      hipStream_t stream = hipStreamPerThread;
-
-      // Use the stream for some operation
-      // For example, allocate memory on the device
-      void* d_ptr;
-      size_t size = 1024;
-      res = hipMalloc(&d_ptr, size);
-      if (res != hipSuccess) {
-          std::cerr << "Failed to allocate memory." << std::endl;
-          return 1;
-      }
-
-      // Perform some operation using the stream
-      // For example, set memory on the device
-      res = hipMemsetAsync(d_ptr, 0, size, stream);
-      if (res != hipSuccess) {
-          std::cerr << "Failed to set memory." << std::endl;
-          return 1;
-      }
-
-      // Synchronize the stream
-      res = hipStreamSynchronize(stream);
-      if (res != hipSuccess) {
-          std::cerr << "Failed to synchronize stream." << std::endl;
-          return 1;
-      }
-
-      std::cout << "Operation completed successfully using per-thread default stream." << std::endl;
-
-      // Free the allocated memory
-      hipFree(d_ptr);
-
-      return 0;
-  }
-
-Accessing new HIP features with a newer driver
-================================================================================
-
-HIP is designed to be forward compatible, allowing newer features to be utilized
-with older toolkits, provided a compatible driver is present. Feature support
-can be verified through runtime API functions and version checks. This approach
-ensures that applications can benefit from new features and improvements in the
-HIP runtime without needing to be recompiled with a newer toolkit. The function
-:cpp:func:`hipGetProcAddress` enables dynamic querying and the use of newer
-functions offered by the HIP runtime, even if the application was built with an
-older toolkit.
-
-An example is provided for a hypothetical ``foo()`` function.
-
-.. code-block:: cpp
-
-  // Get the address of the foo function
-  foo_t fooFunc;
-  int hipVersion = 60300000; // Use an own HIP version number (e.g. 6.3.0)
-  uint64_t flags = 0; // No special flags
-  hipDriverProcAddressQueryResult symbolStatus;
-
-  res = hipGetProcAddress("foo", (void**)&fooFunc, hipVersion, flags, &symbolStatus);
-
-The HIP version number is defined as an integer:
-
-.. code-block:: cpp
-
-  HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
-
-CU_POINTER_ATTRIBUTE_MEMORY_TYPE
-================================================================================
-
-To get the pointer's memory type in HIP, developers should use
-:cpp:func:`hipPointerGetAttributes`. First parameter of the function is
-`hipPointerAttribute_t`. Its ``type`` member variable indicates whether the
-memory pointed to is allocated on the device or the host.
-
-For example:
-
-.. code-block:: cpp
-
-  double * ptr;
-  hipMalloc(&ptr, sizeof(double));
-  hipPointerAttribute_t attr;
-  hipPointerGetAttributes(&attr, ptr); /*attr.type is hipMemoryTypeDevice*/
-  if(attr.type == hipMemoryTypeDevice)
-    std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
-
-  double* ptrHost;
-  hipHostMalloc(&ptrHost, sizeof(double));
-  hipPointerAttribute_t attr;
-  hipPointerGetAttributes(&attr, ptrHost); /*attr.type is hipMemoryTypeHost*/
-  if(attr.type == hipMemorTypeHost)
-    std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
-
-Note that ``hipMemoryType`` enum values are different from the
-``cudaMemoryType`` enum values.
-
-For example, on AMD platform, `hipMemoryType` is defined in `hip_runtime_api.h`,
-
-.. code-block:: cpp
-
-  typedef enum hipMemoryType {
-      hipMemoryTypeHost = 0,    ///< Memory is physically located on host
-      hipMemoryTypeDevice = 1,  ///< Memory is physically located on device. (see deviceId for specific device)
-      hipMemoryTypeArray = 2,   ///< Array memory, physically located on device. (see deviceId for specific device)
-      hipMemoryTypeUnified = 3, ///< Not used currently
-      hipMemoryTypeManaged = 4  ///< Managed memory, automaticallly managed by the unified memory system
-  } hipMemoryType;
-
-Looking into CUDA toolkit, it defines `cudaMemoryType` as following,
-
-.. code-block:: cpp
-
-  enum cudaMemoryType
-  {
-    cudaMemoryTypeUnregistered = 0, // Unregistered memory.
-    cudaMemoryTypeHost = 1, // Host memory.
-    cudaMemoryTypeDevice = 2, // Device memory.
-    cudaMemoryTypeManaged = 3, // Managed memory
-  };
-
-In this case, memory type translation for ``hipPointerGetAttributes`` needs to
-be handled properly on the NVIDIA platform to get the correct memory type in
-CUDA, which is done in the file ``nvidia_hip_runtime_api.h``.
-
-In any HIP application that uses HIP APIs involving memory types, developers
-should use ``#ifdef`` to assign the correct enum values depending on whether
-the target is the NVIDIA or AMD platform.
-
-For an example, see `hipMemcpyParam2D.cc <https://github.com/ROCm/hip-tests/tree/develop/catch/unit/memory/hipMemcpyParam2D.cc>`_ in the hip-tests repository.
-
-With the ``#ifdef`` condition, HIP APIs work as expected on both AMD and NVIDIA
-platforms.
-
-Note that ``cudaMemoryTypeUnregistered`` is currently not supported as a
-``hipMemoryType`` enum value, to keep HIP functionality backward compatible.
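The enum mismatch described above can be made concrete with a small, HIP-independent sketch. The two enums below are hypothetical stand-ins (``HipMemType`` and ``CudaMemType``) that only mirror the numeric values quoted above; they are not the real ``hipMemoryType`` or ``cudaMemoryType`` definitions. ``toHipMemType`` is likewise an illustrative helper, not the translation implemented in ``nvidia_hip_runtime_api.h``.

```cpp
// Hypothetical mirror enums: the values are copied from the listings above,
// the type and enumerator names are assumptions made for this sketch.
enum class HipMemType  { Host = 0, Device = 1, Array = 2, Unified = 3, Managed = 4 };
enum class CudaMemType { Unregistered = 0, Host = 1, Device = 2, Managed = 3 };

// Translates a CUDA-style value to the HIP-style value by enumerator name.
// Returns false for Unregistered, which has no HIP-style counterpart here,
// and leaves `out` untouched in that case.
inline bool toHipMemType(CudaMemType in, HipMemType& out)
{
    switch (in) {
        case CudaMemType::Host:         out = HipMemType::Host;    return true;
        case CudaMemType::Device:       out = HipMemType::Device;  return true;
        case CudaMemType::Managed:      out = HipMemType::Managed; return true;
        case CudaMemType::Unregistered: return false;
    }
    return false; // unreachable for valid enumerators
}
```

Comparing the raw integers would be wrong here: the CUDA-style host value (1) equals the HIP-style device value (1), which is why the sketch matches enumerators by name instead of by number.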
@@ -207,319 +207,24 @@ The example codes

    .. tab-item:: Sequential

-      .. code-block:: cpp
-
-        #include <hip/hip_runtime.h>
-        #include <vector>
-        #include <iostream>
-
-        #define HIP_CHECK(expression)                \
-        {                                            \
-            const hipError_t status = expression;    \
-            if(status != hipSuccess){                \
-                    std::cerr << "HIP error "        \
-                        << status << ": "            \
-                        << hipGetErrorString(status) \
-                        << " at " << __FILE__ << ":" \
-                        << __LINE__ << std::endl;    \
-            }                                        \
-        }
-
-        // GPU Kernels
-        __global__ void kernelA(double* arrayA, size_t size){
-            const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-            if(x < size){arrayA[x] += 1.0;}
-        };
-        __global__ void kernelB(double* arrayA, double* arrayB, size_t size){
-            const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-            if(x < size){arrayB[x] += arrayA[x] + 3.0;}
-        };
-
-        int main()
-        {
-            constexpr int numOfBlocks = 1 << 20;
-            constexpr int threadsPerBlock = 1024;
-            constexpr int numberOfIterations = 50;
-            // The array size is kept smaller to avoid a relatively short kernel launch compared to the memory copies
-            constexpr size_t arraySize = 1U << 25;
-            double *d_dataA;
-            double *d_dataB;
-
-            double initValueA = 0.0;
-            double initValueB = 2.0;
-
-            std::vector<double> vectorA(arraySize, initValueA);
-            std::vector<double> vectorB(arraySize, initValueB);
-            // Allocate device memory
-            HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
-            HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
-            for(int iteration = 0; iteration < numberOfIterations; iteration++)
-            {
-                // Host to Device copies
-                HIP_CHECK(hipMemcpy(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice));
-                HIP_CHECK(hipMemcpy(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice));
-                // Launch the GPU kernels
-                hipLaunchKernelGGL(kernelA, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_dataA, arraySize);
-                hipLaunchKernelGGL(kernelB, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_dataA, d_dataB, arraySize);
-                // Device to Host copies
-                HIP_CHECK(hipMemcpy(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost));
-                HIP_CHECK(hipMemcpy(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost));
-            }
-            // Wait for all operations to complete
-            HIP_CHECK(hipDeviceSynchronize());
-
-            // Verify results
-            const double expectedA = (double)numberOfIterations;
-            const double expectedB =
-                initValueB + (3.0 * numberOfIterations) +
-                (expectedA * (expectedA + 1.0)) / 2.0;
-            bool passed = true;
-            for(size_t i = 0; i < arraySize; ++i){
-                if(vectorA[i] != expectedA){
-                    passed = false;
-                    std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
-                    break;
-                }
-                if(vectorB[i] != expectedB){
-                    passed = false;
-                    std::cerr << "Validation failed! Expected " << expectedB << " got " <<  vectorB[i] << " at index: " << i << std::endl;
-                    break;
-                }
-            }
-
-            if(passed){
-                std::cout << "Sequential execution completed successfully." << std::endl;
-            }else{
-                std::cerr << "Sequential execution failed." << std::endl;
-            }
-
-            // Cleanup
-            HIP_CHECK(hipFree(d_dataA));
-            HIP_CHECK(hipFree(d_dataB));
-
-            return 0;
-        }
+      .. literalinclude:: ../../tools/example_codes/sequential_kernel_execution.hip
+          :start-after: // [sphinx-start]
+          :end-before: // [sphinx-end]
+          :language: cpp

    .. tab-item:: Asynchronous

-      .. code-block:: cpp
-
-        #include <hip/hip_runtime.h>
-        #include <vector>
-        #include <iostream>
-
-        #define HIP_CHECK(expression)                \
-        {                                            \
-            const hipError_t status = expression;    \
-            if(status != hipSuccess){                \
-                    std::cerr << "HIP error "        \
-                        << status << ": "            \
-                        << hipGetErrorString(status) \
-                        << " at " << __FILE__ << ":" \
-                        << __LINE__ << std::endl;    \
-            }                                        \
-        }
-
-        // GPU Kernels
-        __global__ void kernelA(double* arrayA, size_t size){
-            const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-            if(x < size){arrayA[x] += 1.0;}
-        };
-        __global__ void kernelB(double* arrayA, double* arrayB, size_t size){
-            const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-            if(x < size){arrayB[x] += arrayA[x] + 3.0;}
-        };
-
-        int main()
-        {
-            constexpr int numOfBlocks = 1 << 20;
-            constexpr int threadsPerBlock = 1024;
-            constexpr int numberOfIterations = 50;
-            // The array size is kept smaller to avoid a relatively short kernel launch compared to the memory copies
-            constexpr size_t arraySize = 1U << 25;
-            double *d_dataA;
-            double *d_dataB;
-
-            double initValueA = 0.0;
-            double initValueB = 2.0;
-
-            std::vector<double> vectorA(arraySize, initValueA);
-            std::vector<double> vectorB(arraySize, initValueB);
-            // Allocate device memory
-            HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
-            HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
-            // Create streams
-            hipStream_t streamA, streamB;
-            HIP_CHECK(hipStreamCreate(&streamA));
-            HIP_CHECK(hipStreamCreate(&streamB));
-            for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
-            {
-                // Stream 1: Host to Device 1
-                HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
-                // Stream 2: Host to Device 2
-                HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
-                // Stream 1: Kernel 1
-                hipLaunchKernelGGL(kernelA, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamA, d_dataA, arraySize);
-                // Wait for streamA finish
-                HIP_CHECK(hipStreamSynchronize(streamA));
-                // Stream 2: Kernel 2
-                hipLaunchKernelGGL(kernelB, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamB, d_dataA, d_dataB, arraySize);
-                // Stream 1: Device to Host 2 (after Kernel 1)
-                HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
-                // Stream 2: Device to Host 2 (after Kernel 2)
-                HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
-            }
-            // Wait for all operations in both streams to complete
-            HIP_CHECK(hipStreamSynchronize(streamA));
-            HIP_CHECK(hipStreamSynchronize(streamB));
-            // Verify results
-            double expectedA = (double)numberOfIterations;
-            double expectedB =
-                initValueB + (3.0 * numberOfIterations) +
-                (expectedA * (expectedA + 1.0)) / 2.0;
-            bool passed = true;
-            for(size_t i = 0; i < arraySize; ++i){
-                if(vectorA[i] != expectedA){
-                    passed = false;
-                    std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
-                    break;
-                }
-                if(vectorB[i] != expectedB){
-                    passed = false;
-                    std::cerr << "Validation failed! Expected " << expectedB << " got " <<  vectorB[i] << " at index: " << i << std::endl;
-                    break;
-                }
-            }
-            if(passed){
-                std::cout << "Asynchronous execution completed successfully." << std::endl;
-            }else{
-                std::cerr << "Asynchronous execution failed." << std::endl;
-            }
-
-            // Cleanup
-            HIP_CHECK(hipStreamDestroy(streamA));
-            HIP_CHECK(hipStreamDestroy(streamB));
-            HIP_CHECK(hipFree(d_dataA));
-            HIP_CHECK(hipFree(d_dataB));
-
-            return 0;
-        }
+      .. literalinclude:: ../../tools/example_codes/async_kernel_execution.hip
+          :start-after: // [sphinx-start]
+          :end-before: // [sphinx-end]
+          :language: cpp

    .. tab-item:: hipStreamWaitEvent

-      .. code-block:: cpp
-
-        #include <hip/hip_runtime.h>
-        #include <vector>
-        #include <iostream>
-
-        #define HIP_CHECK(expression)                \
-        {                                            \
-            const hipError_t status = expression;    \
-            if(status != hipSuccess){                \
-                    std::cerr << "HIP error "        \
-                        << status << ": "            \
-                        << hipGetErrorString(status) \
-                        << " at " << __FILE__ << ":" \
-                        << __LINE__ << std::endl;    \
-            }                                        \
-        }
-
-        // GPU Kernels
-        __global__ void kernelA(double* arrayA, size_t size){
-            const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-            if(x < size){arrayA[x] += 1.0;}
-        };
-        __global__ void kernelB(double* arrayA, double* arrayB, size_t size){
-            const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-            if(x < size){arrayB[x] += arrayA[x] + 3.0;}
-        };
-
-        int main()
-        {
-            constexpr int numOfBlocks = 1 << 20;
-            constexpr int threadsPerBlock = 1024;
-            constexpr int numberOfIterations = 50;
-            // The array size is kept smaller to avoid a relatively short kernel launch compared to the memory copies
-            constexpr size_t arraySize = 1U << 25;
-            double *d_dataA;
-            double *d_dataB;
-            double initValueA = 0.0;
-            double initValueB = 2.0;
-
-            std::vector<double> vectorA(arraySize, initValueA);
-            std::vector<double> vectorB(arraySize, initValueB);
-            // Allocate device memory
-            HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
-            HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
-            // Create streams
-            hipStream_t streamA, streamB;
-            HIP_CHECK(hipStreamCreate(&streamA));
-            HIP_CHECK(hipStreamCreate(&streamB));
-            // Create events
-            hipEvent_t event, eventA, eventB;
-            HIP_CHECK(hipEventCreate(&event));
-            HIP_CHECK(hipEventCreate(&eventA));
-            HIP_CHECK(hipEventCreate(&eventB));
-            for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
-            {
-                // Stream 1: Host to Device 1
-                HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
-                // Stream 2: Host to Device 2
-                HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
-                // Stream 1: Kernel 1
-                hipLaunchKernelGGL(kernelA, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamA, d_dataA, arraySize);
-                // Record event after the GPU kernel in Stream 1
-                HIP_CHECK(hipEventRecord(event, streamA));
-                // Stream 2: Wait for event before starting Kernel 2
-                HIP_CHECK(hipStreamWaitEvent(streamB, event, 0));
-                // Stream 2: Kernel 2
-                hipLaunchKernelGGL(kernelB, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamB, d_dataA, d_dataB, arraySize);
-                // Stream 1: Device to Host 2 (after Kernel 1)
-                HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
-                // Stream 2: Device to Host 2 (after Kernel 2)
-                HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
-                // Wait on the host for all operations in both streams to complete
-                HIP_CHECK(hipEventRecord(eventA, streamA));
-                HIP_CHECK(hipEventRecord(eventB, streamB));
-                HIP_CHECK(hipEventSynchronize(eventA));
-                HIP_CHECK(hipEventSynchronize(eventB));
-            }
-            // Verify results
-            double expectedA = (double)numberOfIterations;
-            double expectedB =
-                initValueB + (3.0 * numberOfIterations) +
-                (expectedA * (expectedA + 1.0)) / 2.0;
-            bool passed = true;
-            for(size_t i = 0; i < arraySize; ++i){
-                if(vectorA[i] != expectedA){
-                    passed = false;
-                    std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << std::endl;
-                    break;
-                }
-                if(vectorB[i] != expectedB){
-                    passed = false;
-                    std::cerr << "Validation failed! Expected " << expectedB << " got " <<  vectorB[i] << std::endl;
-                    break;
-                }
-            }
-            if(passed){
-                std::cout << "Asynchronous execution with events completed successfully." << std::endl;
-            }else{
-                std::cerr << "Asynchronous execution with events failed." << std::endl;
-            }
-
-            // Cleanup
-            HIP_CHECK(hipEventDestroy(event));
-            HIP_CHECK(hipEventDestroy(eventA));
-            HIP_CHECK(hipEventDestroy(eventB));
-            HIP_CHECK(hipStreamDestroy(streamA));
-            HIP_CHECK(hipStreamDestroy(streamB));
-            HIP_CHECK(hipFree(d_dataA));
-            HIP_CHECK(hipFree(d_dataB));
-
-            return 0;
-        }
+      .. literalinclude:: ../../tools/example_codes/event_based_synchronization.hip
+          :start-after: // [sphinx-start]
+          :end-before: // [sphinx-end]
+          :language: cpp

 HIP Graphs
 ===============================================================================
@@ -33,38 +33,10 @@ You can adjust the call stack size as shown in the following example, allowing
 fine-tuning based on specific kernel requirements. This helps prevent stack
 overflow errors by ensuring sufficient stack memory is allocated.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)                \
-    {                                            \
-        const hipError_t status = expression;    \
-        if(status != hipSuccess){                \
-                std::cerr << "HIP error "        \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-        }                                        \
-    }
-
-    int main()
-    {
-        size_t stackSize;
-        HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
-        std::cout << "Default stack size: " << stackSize << " bytes" << std::endl;
-
-        // Set a new stack size
-        size_t newStackSize = 1024 * 8; // 8 KiB
-        HIP_CHECK(hipDeviceSetLimit(hipLimitStackSize, newStackSize));
-
-        HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
-        std::cout << "Updated stack size: " << stackSize << " bytes" << std::endl;
-
-        return 0;
-    }
+.. literalinclude:: ../../tools/example_codes/call_stack_management.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 Depending on the GPU model, at full occupancy, it can consume a significant
 amount of memory. For instance, an MI300X with 304 compute units (CU) and up to
@@ -81,49 +53,7 @@ needed for the call stack due to the GPUs inherent parallelism. This can be
 achieved by increasing stack size or optimizing code to reduce stack usage. To
 detect stack overflow, add proper error handling or use debugging tools.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)                \
-    {                                            \
-        const hipError_t status = expression;    \
-        if(status != hipSuccess){                \
-                std::cerr << "HIP error "        \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-        }                                        \
-    }
-
-    __device__ unsigned long long fibonacci(unsigned long long n)
-    {
-        if (n == 0 || n == 1)
-        {
-            return n;
-        }
-        return fibonacci(n - 1) + fibonacci(n - 2);
-    }
-
-    __global__ void kernel(unsigned long long n)
-    {
-        unsigned long long result = fibonacci(n);
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-
-        if (x == 0)
-            printf("fibonacci(%llu) = %llu\n", n, result);
-    }
-
-    int main()
-    {
-        kernel<<<1, 1>>>(10);
-        HIP_CHECK(hipDeviceSynchronize());
-
-        // With the -O0 optimization option, the following launch hits the stack limit
-        // kernel<<<1, 256>>>(2048);
-        // HIP_CHECK(hipDeviceSynchronize());
-
-        return 0;
-    }
+.. literalinclude:: ../../tools/example_codes/device_recursion.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
@@ -68,70 +68,7 @@ Complete example
 A complete example that demonstrates error handling with a simple kernel that
 adds two values:

-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-  #include <vector>
-  #include <iostream>
-
-  #define HIP_CHECK(expression)                  \
-  {                                              \
-      const hipError_t status = expression;      \
-      if(status != hipSuccess){                  \
-          std::cerr << "HIP error "              \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-      }                                          \
-  }
-
-  // Addition of two values.
-  __global__ void add(int *a, int *b, int *c, size_t size) {
-      const size_t index = threadIdx.x + blockDim.x * blockIdx.x;
-      if(index < size) {
-          c[index] = a[index] + b[index];
-      }
-  }
-
-  int main() {
-      constexpr int numOfBlocks = 256;
-      constexpr int threadsPerBlock = 256;
-      constexpr size_t arraySize = 1U << 16;
-
-      std::vector<int> a(arraySize), b(arraySize), c(arraySize);
-      int *d_a, *d_b, *d_c;
-
-      // Setup input values.
-      std::fill(a.begin(), a.end(), 1);
-      std::fill(b.begin(), b.end(), 2);
-
-      // Allocate device copies of a, b and c.
-      HIP_CHECK(hipMalloc(&d_a, arraySize * sizeof(*d_a)));
-      HIP_CHECK(hipMalloc(&d_b, arraySize * sizeof(*d_b)));
-      HIP_CHECK(hipMalloc(&d_c, arraySize * sizeof(*d_c)));
-
-      // Copy input values to device.
-      HIP_CHECK(hipMemcpy(d_a, a.data(), arraySize * sizeof(*d_a), hipMemcpyHostToDevice));
-      HIP_CHECK(hipMemcpy(d_b, b.data(), arraySize * sizeof(*d_b), hipMemcpyHostToDevice));
-
-      // Launch add() kernel on GPU.
-      hipLaunchKernelGGL(add, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_a, d_b, d_c, arraySize);
-      // Check the kernel launch
-      HIP_CHECK(hipGetLastError());
-      // Check for kernel execution error
-      HIP_CHECK(hipDeviceSynchronize());
-
-      // Copy the result back to the host.
-      HIP_CHECK(hipMemcpy(c.data(), d_c, arraySize * sizeof(*d_c), hipMemcpyDeviceToHost));
-
-      // Cleanup allocated memory.
-      HIP_CHECK(hipFree(d_a));
-      HIP_CHECK(hipFree(d_b));
-      HIP_CHECK(hipFree(d_c));
-
-      // Print the result.
-      std::cout << a[0] << " + " << b[0] << " = " << c[0] << std::endl;
-
-      return 0;
-  }
+.. literalinclude:: ../../tools/example_codes/error_handling.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
@@ -14,6 +14,10 @@ method via streams. A HIP graph is made up of nodes and edges. The nodes of a
 HIP graph represent the operations performed, while the edges mark dependencies
 between those operations.

+.. hint::
+    The :ref:`HIP Graph API tutorial <hip_graph_api_tutorial>` demonstrates how
+    to use HIP graphs in a real-world application.
+
 The nodes can be one of the following:

 - empty nodes
@@ -180,124 +184,10 @@ The general flow for using stream capture to create a graph template is:
 The following code is an example of how to use the HIP graph API to capture a
 graph from a stream.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <vector>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)                \
-    {                                            \
-        const hipError_t status = expression;    \
-        if(status != hipSuccess){                \
-                std::cerr << "HIP error "        \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-        }                                        \
-    }
-
-
-    __global__ void kernelA(double* arrayA, size_t size){
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-        if(x < size){arrayA[x] *= 2.0;}
-    };
-    __global__ void kernelB(int* arrayB, size_t size){
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-        if(x < size){arrayB[x] = 3;}
-    };
-    __global__ void kernelC(double* arrayA, const int* arrayB, size_t size){
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-        if(x < size){arrayA[x] += arrayB[x];}
-    };
-
-    struct set_vector_args{
-        std::vector<double>& h_array;
-        double value;
-    };
-
-    void set_vector(void* args){
-        set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
-
-        std::vector<double>& vec{h_args.h_array};
-        vec.assign(vec.size(), h_args.value);
-    }
-
-    int main(){
-        constexpr int numOfBlocks = 1024;
-        constexpr int threadsPerBlock = 1024;
-        constexpr size_t arraySize = 1U << 20;
-
-        // This example assumes that kernelA operates on data that needs to be initialized on
-        // and copied from the host, while kernelB initializes the array that is passed to it.
-        // Both arrays are then used as input to kernelC, where arrayA is also used as
-        // output, which is copied back to the host, while arrayB is only read from and not modified.
-
-        double* d_arrayA;
-        int* d_arrayB;
-        std::vector<double> h_array(arraySize);
-        constexpr double initValue = 2.0;
-
-        hipStream_t captureStream;
-        HIP_CHECK(hipStreamCreate(&captureStream));
-
-        // Start capturing the operations assigned to the stream
-        HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
-
-        // hipMallocAsync and hipMemcpyAsync are needed so that the operations can be assigned to a stream
-        HIP_CHECK(hipMallocAsync(&d_arrayA, arraySize*sizeof(double), captureStream));
-        HIP_CHECK(hipMallocAsync(&d_arrayB, arraySize*sizeof(int), captureStream));
-
-        // Assign host function to the stream
-        // Needs a custom struct to pass the arguments
-        set_vector_args args{h_array, initValue};
-        HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));
-
-        HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));
-
-        kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
-        kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
-        kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);
-
-        HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));
-
-        HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
-        HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));
-
-        // Stop capturing
-        hipGraph_t graph;
-        HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
-
-        // Create an executable graph from the captured graph
-        hipGraphExec_t graphExec;
-        HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
-
-        // The graph template can be deleted after the instantiation if it's not needed for later use
-        HIP_CHECK(hipGraphDestroy(graph));
-
-        // Actually launch the graph. The stream does not have
-        // to be the same as the one used for capturing.
-        HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
-        HIP_CHECK(hipStreamSynchronize(captureStream)); // wait before reading h_array
-        // Verify results
-        constexpr double expected = initValue * 2.0 + 3;
-        bool passed = true;
-        for(size_t i = 0; i < arraySize; ++i){
-                if(h_array[i] != expected){
-                        passed = false;
-                        std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
-                        break;
-                }
-        }
-        if(passed){
-                std::cout << "Validation passed." << std::endl;
-        }
-
-        // Free graph and stream resources after usage
-        HIP_CHECK(hipGraphExecDestroy(graphExec));
-        HIP_CHECK(hipStreamDestroy(captureStream));
-    }
+.. literalinclude:: ../../tools/example_codes/graph_capture.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 Explicit graph creation
 ================================================================================
@@ -333,178 +223,7 @@ The general flow for explicitly creating a graph is usually:

 The following code example demonstrates how to explicitly create nodes in order to create a graph.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <vector>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)                \
-    {                                            \
-        const hipError_t status = expression;    \
-        if(status != hipSuccess){                \
-                std::cerr << "HIP error "        \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-        }                                        \
-    }
-
-    __global__ void kernelA(double* arrayA, size_t size){
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-        if(x < size){arrayA[x] *= 2.0;}
-    };
-    __global__ void kernelB(int* arrayB, size_t size){
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-        if(x < size){arrayB[x] = 3;}
-    };
-    __global__ void kernelC(double* arrayA, const int* arrayB, size_t size){
-        const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
-        if(x < size){arrayA[x] += arrayB[x];}
-    };
-
-    struct set_vector_args{
-        std::vector<double>& h_array;
-        double value;
-    };
-
-    void set_vector(void* args){
-        set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
-
-        std::vector<double>& vec{h_args.h_array};
-        vec.assign(vec.size(), h_args.value);
-    }
-
-    int main(){
-        constexpr int numOfBlocks = 1024;
-        constexpr int threadsPerBlock = 1024;
-        size_t arraySize = 1U << 20;
-
-        // The pointers to the device memory don't need to be declared here,
-        // they are contained within the hipMemAllocNodeParams as the dptr member
-        std::vector<double> h_array(arraySize);
-        constexpr double initValue = 2.0;
-
-        // Create an empty graph
-        hipGraph_t graph;
-        HIP_CHECK(hipGraphCreate(&graph, 0));
-
-        // Parameters to allocate arrays
-        hipMemAllocNodeParams allocArrayAParams{};
-        allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
-        allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
-        allocArrayAParams.poolProps.location.id = 0; // GPU on which memory resides
-        allocArrayAParams.bytesize = arraySize * sizeof(double);
-
-        hipMemAllocNodeParams allocArrayBParams{};
-        allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
-        allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
-        allocArrayBParams.poolProps.location.id = 0; // GPU on which memory resides
-        allocArrayBParams.bytesize = arraySize * sizeof(int);
-
-        // Add the allocation nodes to the graph. They don't have any dependencies
-        hipGraphNode_t allocNodeA, allocNodeB;
-        HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
-        HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));
-
-        // Parameters for the host function
-        // Needs custom struct to pass the arguments
-        set_vector_args args{h_array, initValue};
-        hipHostNodeParams hostParams{};
-        hostParams.fn = set_vector;
-        hostParams.userData = static_cast<void*>(&args);
-
-        // Add the host node that initializes the host array. It also doesn't have any dependencies
-        hipGraphNode_t hostNode;
-        HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));
-
-        // Add memory copy node, that copies the initialized host array to the device.
-        // It has to wait for the host array to be initialized and the device memory to be allocated
-        hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
-        hipGraphNode_t cpyToDevNode;
-        HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));
-
-        // Parameters for kernelA
-        hipKernelNodeParams kernelAParams;
-        void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
-        kernelAParams.func = reinterpret_cast<void*>(kernelA);
-        kernelAParams.gridDim = numOfBlocks;
-        kernelAParams.blockDim = threadsPerBlock;
-        kernelAParams.sharedMemBytes = 0;
-        kernelAParams.kernelParams = kernelAArgs;
-        kernelAParams.extra = nullptr;
-
-        // Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
-        hipGraphNode_t kernelANode;
-        HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));
-
-        // Parameters for kernelB
-        hipKernelNodeParams kernelBParams;
-        void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
-        kernelBParams.func = reinterpret_cast<void*>(kernelB);
-        kernelBParams.gridDim = numOfBlocks;
-        kernelBParams.blockDim = threadsPerBlock;
-        kernelBParams.sharedMemBytes = 0;
-        kernelBParams.kernelParams = kernelBArgs;
-        kernelBParams.extra = nullptr;
-
-        //  Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
-        hipGraphNode_t kernelBNode;
-        HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));
-
-        // Parameters for kernelC
-        hipKernelNodeParams kernelCParams;
-        void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
-        kernelCParams.func = reinterpret_cast<void*>(kernelC);
-        kernelCParams.gridDim = numOfBlocks;
-        kernelCParams.blockDim = threadsPerBlock;
-        kernelCParams.sharedMemBytes = 0;
-        kernelCParams.kernelParams = kernelCArgs;
-        kernelCParams.extra = nullptr;
-
-        // Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
-        hipGraphNode_t kernelCNode;
-        hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
-        HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));
-
-        // Copy the results back to the host. Has to wait for kernelC to finish.
-        hipGraphNode_t cpyToHostNode;
-        HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost));
-
-        // Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it.
-        hipGraphNode_t freeNodeA;
-        HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr));
-        // Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host.
-        hipGraphNode_t freeNodeB;
-        HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr));
-
-        // Instantiate the graph in order to execute it
-        hipGraphExec_t graphExec;
-        HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
-
-        // The graph can be freed after the instantiation if it's not needed for other purposes
-        HIP_CHECK(hipGraphDestroy(graph));
-
-        // Actually launch the graph
-        hipStream_t graphStream;
-        HIP_CHECK(hipStreamCreate(&graphStream));
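-        HIP_CHECK(hipGraphLaunch(graphExec, graphStream));
-
-        // The launch is asynchronous: wait for the graph to finish before reading the results
-        HIP_CHECK(hipStreamSynchronize(graphStream));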
-
-        // Verify results
-        constexpr double expected = initValue * 2.0 + 3;
-        bool passed = true;
-        for(size_t i = 0; i < arraySize; ++i){
-                if(h_array[i] != expected){
-                        passed = false;
-                        std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
-                        break;
-                }
-        }
-        if(passed){
-                std::cerr << "Validation passed." << std::endl;
-        }
-
-        HIP_CHECK(hipGraphExecDestroy(graphExec));
-        HIP_CHECK(hipStreamDestroy(graphStream));
-    }
+.. literalinclude:: ../../tools/example_codes/graph_creation.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
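The dependency structure of the graph in this example can be checked without a GPU. The following is a minimal host-only sketch in plain C++ (no HIP calls). The node names mirror the example, while ``Node``, ``topological_order`` and ``example_graph`` are hypothetical helpers introduced only for illustration. It uses Kahn's algorithm to compute one valid execution order; the HIP runtime is free to choose a different order, or to run independent nodes concurrently.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical host-only model of a task graph: each node lists the
// indices of the nodes it depends on.
struct Node {
    std::string name;
    std::vector<std::size_t> deps;
};

// Kahn's algorithm: returns one valid execution order, or an empty
// vector if the graph contains a cycle.
std::vector<std::size_t> topological_order(const std::vector<Node>& nodes) {
    std::vector<std::size_t> remaining(nodes.size());
    std::vector<std::vector<std::size_t>> dependents(nodes.size());
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        remaining[i] = nodes[i].deps.size();
        for (std::size_t d : nodes[i].deps) {
            dependents[d].push_back(i);
        }
    }

    // Nodes without dependencies can run immediately.
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (remaining[i] == 0) {
            ready.push(i);
        }
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t n = ready.front();
        ready.pop();
        order.push_back(n);
        // Finishing a node may unblock the nodes that depend on it.
        for (std::size_t m : dependents[n]) {
            if (--remaining[m] == 0) {
                ready.push(m);
            }
        }
    }
    if (order.size() != nodes.size()) {
        order.clear();  // cycle detected
    }
    return order;
}

// The dependencies of the graph creation example, written as indices.
std::vector<Node> example_graph() {
    return {
        {"allocA", {}},        // 0
        {"allocB", {}},        // 1
        {"host", {}},          // 2
        {"cpyToDev", {0, 2}},  // 3: needs the allocation and the host init
        {"kernelA", {3}},      // 4
        {"kernelB", {1}},      // 5
        {"kernelC", {4, 5}},   // 6: needs both kernels
        {"cpyToHost", {6}},    // 7
        {"freeA", {7}},        // 8
        {"freeB", {6}},        // 9
    };
}
```

Nodes with two dependencies (``cpyToDev`` and ``kernelC``) are the ones where the dependency count passed to the corresponding ``hipGraphAdd*Node`` call has to be 2.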
@@ -66,24 +66,10 @@ which can be used to loop over the available GPUs.

 Example code of querying GPUs:

-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-  #include <iostream>
-
-  int main() {
-
-      int deviceCount;
-      if (hipGetDeviceCount(&deviceCount) == hipSuccess){
-          for (int i = 0; i < deviceCount; ++i){
-              hipDeviceProp_t prop;
-              if ( hipGetDeviceProperties(&prop, i) == hipSuccess)
-                  std::cout << "Device" << i << prop.name << std::endl;
-          }
-      }
-
-      return 0;
-  }
+.. literalinclude:: ../../tools/example_codes/simple_device_query.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 Setting the GPU
 --------------------------------------------------------------------------------
@@ -47,61 +47,10 @@ C++ application.

 **Example:** Using pageable host memory in HIP

-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-  #include <cstring>
-  #include <iostream>
-
-  #define HIP_CHECK(expression)                  \
-  {                                              \
-      const hipError_t status = expression;      \
-      if(status != hipSuccess){                  \
-          std::cerr << "HIP error "              \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-      }                                          \
-  }
-
-  int main()
-  {
-      const int element_number = 100;
-
-      int *host_input, *host_output;
-      // Host allocation
-      host_input  = new int[element_number];
-      host_output = new int[element_number];
-
-      // Host data preparation
-      for (int i = 0; i < element_number; i++) {
-          host_input[i] = i;
-      }
-      memset(host_output, 0, element_number * sizeof(int));
-
-      int *device_input, *device_output;
-
-      // Device allocation
-      HIP_CHECK(hipMalloc((int **)&device_input,  element_number * sizeof(int)));
-      HIP_CHECK(hipMalloc((int **)&device_output, element_number * sizeof(int)));
-
-      // Device data preparation
-      HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
-      HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
-
-      // Run the kernel
-      // ...
-
-      // Copy the result back to the host
-      HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
-
-      // Free host memory
-      delete[] host_input;
-      delete[] host_output;
-
-      // Free device memory
-      HIP_CHECK(hipFree(device_input));
-      HIP_CHECK(hipFree(device_output));
-  }
+.. literalinclude:: ../../../tools/example_codes/pageable_host_memory.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 .. note::

@@ -133,61 +82,10 @@ processes, which can negatively impact the overall performance of the host.

 **Example:** Using pinned memory in HIP

-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-  #include <iostream>
-
-  #define HIP_CHECK(expression)                  \
-  {                                              \
-      const hipError_t status = expression;      \
-      if(status != hipSuccess){                  \
-          std::cerr << "HIP error "              \
-                    << status << ": "            \
-                    << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;    \
-      }                                          \
-  }
-
-  int main()
-  {
-      const int element_number = 100;
-
-      int *host_input, *host_output;
-      // Host allocation
-      HIP_CHECK(hipHostMalloc((int **)&host_input, element_number * sizeof(int)));
-      HIP_CHECK(hipHostMalloc((int **)&host_output, element_number * sizeof(int)));
-
-      // Host data preparation
-      for (int i = 0; i < element_number; i++) {
-          host_input[i] = i;
-      }
-      memset(host_output, 0, element_number * sizeof(int));
-
-      int *device_input, *device_output;
-
-      // Device allocation
-      HIP_CHECK(hipMalloc((int **)&device_input,  element_number * sizeof(int)));
-      HIP_CHECK(hipMalloc((int **)&device_output, element_number * sizeof(int)));
-
-      // Device data preparation
-      HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
-      HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
-
-      // Run the kernel
-      // ...
-
-      // Copy the result back to the host
-      HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
-
-      // Free host memory
-      HIP_CHECK(hipHostFree(host_input));
-      HIP_CHECK(hipHostFree(host_output));
-
-      // Free device memory
-      HIP_CHECK(hipFree(device_input));
-      HIP_CHECK(hipFree(device_output));
-  }
+.. literalinclude:: ../../../tools/example_codes/pinned_host_memory.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 .. _memory_allocation_flags:

@@ -37,102 +37,17 @@ Here is how to use stream ordered memory allocation:
 .. tab-set::
  .. tab-item:: Stream Ordered Memory Allocation

-    .. code-block:: cpp
-
-      #include <iostream>
-      #include <hip/hip_runtime.h>
-
-      // Kernel to perform some computation on allocated memory.
-      __global__ void myKernel(int* data, size_t numElements) {
-          int tid = threadIdx.x + blockIdx.x * blockDim.x;
-          if (tid < numElements) {
-              data[tid] = tid * 2;
-          }
-      }
-
-      int main() {
-          // Initialize HIP.
-          hipInit(0);
-
-          // Stream 0.
-          constexpr hipStream_t streamId = 0;
-
-          // Allocate memory with stream ordered semantics.
-          constexpr size_t numElements = 1024;
-          int* devData;
-          hipMallocAsync(&devData, numElements * sizeof(*devData), streamId);
-
-          // Launch the kernel to perform computation.
-          dim3 blockSize(256);
-          dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
-          myKernel<<<gridSize, blockSize>>>(devData, numElements);
-
-          // Copy data back to host.
-          int* hostData = new int[numElements];
-          hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost);
-
-          // Print the array.
-          for (size_t i = 0; i < numElements; ++i) {
-              std::cout << "Element " << i << ": " << hostData[i] << std::endl;
-          }
-
-          // Free memory with stream ordered semantics.
-          hipFreeAsync(devData, streamId);
-          delete[] hostData;
-
-          // Synchronize to ensure completion.
-          hipDeviceSynchronize();
-
-          return 0;
-      }
+    .. literalinclude:: ../../../tools/example_codes/stream_ordered_memory_allocation.hip
+        :start-after: // [sphinx-start]
+        :end-before: // [sphinx-end]
+        :language: cpp

  .. tab-item:: Ordinary Allocation

-    .. code-block:: cpp
-
-      #include <iostream>
-      #include <hip/hip_runtime.h>
-
-      // Kernel to perform some computation on allocated memory.
-      __global__ void myKernel(int* data, size_t numElements) {
-          int tid = threadIdx.x + blockIdx.x * blockDim.x;
-          if (tid < numElements) {
-              data[tid] = tid * 2;
-          }
-      }
-
-      int main() {
-          // Initialize HIP.
-          hipInit(0);
-
-          // Allocate memory.
-          constexpr size_t numElements = 1024;
-          int* devData;
-          hipMalloc(&devData, numElements * sizeof(*devData));
-
-          // Launch the kernel to perform computation.
-          dim3 blockSize(256);
-          dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
-          myKernel<<<gridSize, blockSize>>>(devData, numElements);
-
-          // Copy data back to host.
-          int* hostData = new int[numElements];
-          hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost);
-
-          // Print the array.
-          for (size_t i = 0; i < numElements; ++i) {
-              std::cout << "Element " << i << ": " << hostData[i] << std::endl;
-          }
-
-          // Free memory.
-          hipFree(devData);
-          delete[] hostData;
-
-          // Synchronize to ensure completion.
-          hipDeviceSynchronize();
-
-          return 0;
-      }
+    .. literalinclude:: ../../../tools/example_codes/ordinary_memory_allocation.hip
+        :start-after: // [sphinx-start]
+        :end-before: // [sphinx-end]
+        :language: cpp

 For more details, see :ref:`stream_ordered_memory_allocator_reference`.

@@ -148,121 +63,29 @@ The ``hipMallocAsync()`` function uses the current memory pool and also provides

 Unlike NVIDIA CUDA, where stream-ordered memory allocation can be implicit, ROCm HIP is explicit. This requires managing memory allocation for each stream in HIP while ensuring precise control over memory usage and synchronization.

-.. code-block:: cpp
-
-    #include <iostream>
-    #include <hip/hip_runtime.h>
-
-    // Kernel to perform some computation on allocated memory.
-    __global__ void myKernel(int* data, size_t numElements) {
-        int tid = threadIdx.x + blockIdx.x * blockDim.x;
-        if (tid < numElements) {
-            data[tid] = tid * 2;
-        }
-    }
-
-    int main() {
-        // Create a stream.
-        hipStream_t stream;
-        hipStreamCreate(&stream);
-
-        // Create a memory pool with default properties.
-        hipMemPoolProps poolProps = {};
-        poolProps.allocType = hipMemAllocationTypePinned;
-        poolProps.handleTypes = hipMemHandleTypePosixFileDescriptor;
-        poolProps.location.type = hipMemLocationTypeDevice;
-        poolProps.location.id = 0; // Assuming device 0.
-
-        hipMemPool_t memPool;
-        hipMemPoolCreate(&memPool, &poolProps);
-
-        // Allocate memory from the pool asynchronously.
-        constexpr size_t numElements = 1024;
-        int* devData = nullptr;
-        hipMallocFromPoolAsync(&devData, numElements * sizeof(*devData), memPool, stream);
-
-        // Define grid and block sizes.
-        dim3 blockSize(256);
-        dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
-
-        // Launch the kernel to perform computation.
-        myKernel<<<gridSize, blockSize, 0, stream>>>(devData, numElements);
-
-        // Synchronize the stream.
-        hipStreamSynchronize(stream);
-
-        // Copy data back to host.
-        int* hostData = new int[numElements];
-        hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost);
-
-        // Print the array.
-        for (size_t i = 0; i < numElements; ++i) {
-            std::cout << "Element " << i << ": " << hostData[i] << std::endl;
-        }
-
-        // Free the allocated memory.
-        hipFreeAsync(devData, stream);
-
-        // Synchronize the stream again to ensure all operations are complete.
-        hipStreamSynchronize(stream);
-
-        // Destroy the memory pool and stream.
-        hipMemPoolDestroy(memPool);
-        hipStreamDestroy(stream);
-
-        // Free host memory.
-        delete[] hostData;
-
-        return 0;
-    }
+.. literalinclude:: ../../../tools/example_codes/memory_pool.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 Trim pools
 ----------

 The memory allocator allows you to allocate and free memory in stream order. To control memory usage, set the release threshold attribute using ``hipMemPoolAttrReleaseThreshold``. This threshold specifies the amount of reserved memory, in bytes, that the pool holds onto.

-.. code-block:: cpp
-
-    uint64_t threshold = UINT64_MAX;
-    hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold);
+.. literalinclude:: ../../../tools/example_codes/memory_pool_threshold.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 When the amount of memory held in the memory pool exceeds the threshold, the allocator tries to release memory back to the operating system during the next call to stream, event, or context synchronization.

 To improve performance, it is a good practice to adjust the memory pool size using ``hipMemPoolTrimTo()``. It helps to reclaim memory from an excessive memory pool, which optimizes memory usage for your application.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    int main() {
-        hipMemPool_t memPool;
-        hipDevice_t device = 0; // Specify the device index.
-
-        // Initialize the device.
-        hipSetDevice(device);
-
-        // Get the default memory pool for the device.
-        hipDeviceGetDefaultMemPool(&memPool, device);
-
-        // Allocate memory from the pool (e.g., 1 MB) in stream order.
-        size_t allocSize = 1 * 1024 * 1024;
-        void* ptr;
-        hipMallocAsync(&ptr, allocSize, 0);
-
-        // Free the allocated memory and wait for the free to complete.
-        hipFreeAsync(ptr, 0);
-        hipStreamSynchronize(0);
-
-        // Trim the memory pool to a specific size (e.g., 512 KB).
-        size_t newSize = 512 * 1024;
-        hipMemPoolTrimTo(memPool, newSize);
-
-        // No cleanup is needed: the default memory pool is owned by the
-        // runtime and must not be destroyed.
-
-        std::cout << "Memory pool trimmed to " << newSize << " bytes." << std::endl;
-        return 0;
-    }
+.. literalinclude:: ../../../tools/example_codes/memory_pool_trim.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
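The interaction between the release threshold and an explicit trim can be illustrated with a small bookkeeping model. The following is a simplified host-only sketch in plain C++ and not the HIP implementation: ``ToyPool`` and its members are hypothetical names, and the rules it encodes are only those stated in the text above (freed memory stays reserved, memory above the threshold is released at synchronization, and a trim shrinks the reservation to a requested size).

```cpp
#include <algorithm>
#include <cstdint>

// Simplified, host-only model of a memory pool's bookkeeping.
// This is NOT the HIP implementation; it only illustrates the rules
// described in the text.
struct ToyPool {
    std::uint64_t reserved = 0;          // bytes the pool holds from the OS
    std::uint64_t used = 0;              // bytes handed out to the application
    std::uint64_t releaseThreshold = 0;  // bytes the pool may keep at synchronization

    void allocate(std::uint64_t bytes) {
        used += bytes;
        reserved = std::max(reserved, used);  // grow only when needed
    }

    // Freed memory stays reserved so that later allocations can reuse it.
    void free(std::uint64_t bytes) { used -= bytes; }

    // At a synchronization point, unused memory above the threshold is released.
    void synchronize() {
        if (reserved > releaseThreshold) {
            reserved = std::max(used, releaseThreshold);
        }
    }

    // Explicit trim: shrink the reservation to minBytesToKeep,
    // but never below the memory that is still in use.
    void trimTo(std::uint64_t minBytesToKeep) {
        if (reserved > minBytesToKeep) {
            reserved = std::max(used, minBytesToKeep);
        }
    }
};
```

In this model, a threshold of ``UINT64_MAX`` keeps all freed memory reserved across synchronization points, so only an explicit ``trimTo()`` call returns it.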

 Resource usage statistics
 -------------------------
@@ -276,81 +99,10 @@ Resource usage statistics help in optimization. Here is the list of pool attribu

 To reset the high-watermark attributes (``hipMemPoolAttrReservedMemHigh`` and ``hipMemPoolAttrUsedMemHigh``) to the current value, use ``hipMemPoolSetAttribute()``.

-.. code-block:: cpp
-
-    #include <iostream>
-    #include <hip/hip_runtime.h>
-
-    // Sample helper functions for getting the usage statistics in bulk.
-    struct usageStatistics {
-        uint64_t reservedMemCurrent;
-        uint64_t reservedMemHigh;
-        uint64_t usedMemCurrent;
-        uint64_t usedMemHigh;
-    };
-
-    void getUsageStatistics(hipMemPool_t memPool, struct usageStatistics *statistics) {
-        hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemCurrent, &statistics->reservedMemCurrent);
-        hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &statistics->reservedMemHigh);
-        hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemCurrent, &statistics->usedMemCurrent);
-        hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &statistics->usedMemHigh);
-    }
-
-    // Resetting the watermarks resets them to the current value.
-    void resetStatistics(hipMemPool_t memPool) {
-        uint64_t value = 0;
-        hipMemPoolSetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &value);
-        hipMemPoolSetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &value);
-    }
-
-    int main() {
-        hipMemPool_t memPool;
-        hipDevice_t device = 0; // Specify the device index.
-
-        // Initialize the device.
-        hipSetDevice(device);
-
-        // Get the default memory pool for the device.
-        hipDeviceGetDefaultMemPool(&memPool, device);
-
-        // Allocate memory from the pool (e.g., 1 MB) in stream order.
-        size_t allocSize = 1 * 1024 * 1024;
-        void* ptr;
-        hipMallocAsync(&ptr, allocSize, 0);
-
-        // Free the allocated memory and wait for the free to complete.
-        hipFreeAsync(ptr, 0);
-        hipStreamSynchronize(0);
-
-        // Trim the memory pool to a specific size (e.g., 512 KB).
-        size_t newSize = 512 * 1024;
-        hipMemPoolTrimTo(memPool, newSize);
-
-        // Get and print usage statistics before resetting.
-        usageStatistics statsBefore;
-        getUsageStatistics(memPool, &statsBefore);
-        std::cout << "Before resetting statistics:" << std::endl;
-        std::cout << "Reserved Memory Current: " << statsBefore.reservedMemCurrent << " bytes" << std::endl;
-        std::cout << "Reserved Memory High: " << statsBefore.reservedMemHigh << " bytes" << std::endl;
-        std::cout << "Used Memory Current: " << statsBefore.usedMemCurrent << " bytes" << std::endl;
-        std::cout << "Used Memory High: " << statsBefore.usedMemHigh << " bytes" << std::endl;
-
-        // Reset the statistics.
-        resetStatistics(memPool);
-
-        // Get and print usage statistics after resetting.
-        usageStatistics statsAfter;
-        getUsageStatistics(memPool, &statsAfter);
-        std::cout << "After resetting statistics:" << std::endl;
-        std::cout << "Reserved Memory Current: " << statsAfter.reservedMemCurrent << " bytes" << std::endl;
-        std::cout << "Reserved Memory High: " << statsAfter.reservedMemHigh << " bytes" << std::endl;
-        std::cout << "Used Memory Current: " << statsAfter.usedMemCurrent << " bytes" << std::endl;
-        std::cout << "Used Memory High: " << statsAfter.usedMemHigh << " bytes" << std::endl;
-
-        // No cleanup is needed: the default memory pool is owned by the
-        // runtime and must not be destroyed.
-
-        return 0;
-    }
+.. literalinclude:: ../../../tools/example_codes/memory_pool_resource_usage_statistics.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
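The relationship between a "current" attribute and its "high" watermark can be sketched in a few lines of plain C++. This is a conceptual host-only model under the assumption stated in the text (resetting a watermark sets it to the current value); ``WatermarkCounter`` and its members are hypothetical names, not part of the HIP API.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of one "current"/"high" attribute pair,
// for example used memory and its high watermark.
struct WatermarkCounter {
    std::uint64_t current = 0;  // value right now
    std::uint64_t high = 0;     // largest value seen since the last reset

    void add(std::uint64_t bytes) {
        current += bytes;
        high = std::max(high, current);
    }

    // Releasing memory lowers the current value but not the watermark.
    void sub(std::uint64_t bytes) { current -= bytes; }

    // Resetting the watermark sets it to the current value, not to zero.
    void resetHigh() { high = current; }
};
```

Resetting the watermark before a phase of the application and reading it afterwards gives the peak usage of that phase alone.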

 Memory reuse policies
 ---------------------
@@ -369,6 +121,11 @@ Allocations are initially accessible from the device where they reside.
 Interprocess memory handling
 =============================

+.. attention::
+    IPC API calls are only supported on systems with an active ``amdgpu-dkms`` driver. Please refer to the
+    `AMDGPU documentation <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/index.html>`__ for information
+    on how to install ``amdgpu-dkms``.
+
 Interprocess capable (IPC) memory pools facilitate efficient and secure sharing of GPU memory between processes.

 To achieve interprocess memory sharing, you can use either :ref:`device pointer <device-pointer>` or :ref:`shareable handle <shareable-handle>`. Both provide allocator (export) and consumer (import) interfaces.
@@ -303,207 +303,35 @@ explicit memory management example is presented in the last tab.

    .. tab-item:: hipMallocManaged()

-        .. code-block:: cpp
+        .. literalinclude:: ../../../tools/example_codes/dynamic_unified_memory.hip
+            :start-after: // [sphinx-start]
+            :end-before: // [sphinx-end]
            :emphasize-lines: 22-25
-
-            #include <hip/hip_runtime.h>
-            #include <iostream>
-
-            #define HIP_CHECK(expression)              \
-            {                                          \
-                const hipError_t err = expression;     \
-                if(err != hipSuccess){                 \
-                    std::cerr << "HIP error: "         \
-                        << hipGetErrorString(err)      \
-                        << " at " << __LINE__ << "\n"; \
-                }                                      \
-            }
-
-            // Addition of two values.
-            __global__ void add(int *a, int *b, int *c) {
-                *c = *a + *b;
-            }
-
-            int main() {
-                int *a, *b, *c;
-
-                // Allocate memory for a, b and c that is accessible to both device and host codes.
-                HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
-                HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
-                HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
-
-                // Setup input values.
-                *a = 1;
-                *b = 2;
-
-                // Launch add() kernel on GPU.
-                hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
-                // Wait for GPU to finish before accessing on host.
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Print the result.
-                std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
-                // Cleanup allocated memory.
-                HIP_CHECK(hipFree(a));
-                HIP_CHECK(hipFree(b));
-                HIP_CHECK(hipFree(c));
-
-                return 0;
-            }
+            :language: cpp

    .. tab-item:: __managed__

-        .. code-block:: cpp
+        .. literalinclude:: ../../../tools/example_codes/static_unified_memory.hip
+            :start-after: // [sphinx-start]
+            :end-before: // [sphinx-end]
            :emphasize-lines: 19-20
-
-            #include <hip/hip_runtime.h>
-            #include <iostream>
-
-            #define HIP_CHECK(expression)              \
-            {                                          \
-                const hipError_t err = expression;     \
-                if(err != hipSuccess){                 \
-                    std::cerr << "HIP error: "         \
-                        << hipGetErrorString(err)      \
-                        << " at " << __LINE__ << "\n"; \
-                }                                      \
-            }
-
-            // Addition of two values.
-            __global__ void add(int *a, int *b, int *c) {
-                *c = *a + *b;
-            }
-
-            // Declare a, b and c as static variables.
-            __managed__ int a, b, c;
-
-            int main() {
-                // Setup input values.
-                a = 1;
-                b = 2;
-
-                // Launch add() kernel on GPU.
-                hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, &a, &b, &c);
-
-                // Wait for GPU to finish before accessing on host.
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Prints the result.
-                std::cout << a << " + " << b << " = " << c << std::endl;
-
-                return 0;
-            }
+            :language: cpp

    .. tab-item:: new

-        .. code-block:: cpp
+        .. literalinclude:: ../../../tools/example_codes/standard_unified_memory.hip
+            :start-after: // [sphinx-start]
+            :end-before: // [sphinx-end]
            :emphasize-lines: 21-24
-
-            #include <hip/hip_runtime.h>
-            #include <iostream>
-            #include <new>
-
-            #define HIP_CHECK(expression)              \
-            {                                          \
-                const hipError_t err = expression;     \
-                if(err != hipSuccess){                 \
-                    std::cerr << "HIP error: "         \
-                        << hipGetErrorString(err)      \
-                        << " at " << __LINE__ << "\n"; \
-                }                                      \
-            }
-
-            // Addition of two values.
-            __global__ void add(int* a, int* b, int* c) {
-                *c = *a + *b;
-            }
-
-            // This example requires HMM support and the environment variable HSA_XNACK needs to be set to 1
-            int main() {
-                // Allocate memory with proper alignment for performance
-                int *a = new(std::align_val_t(128)) int[1];
-                int *b = new(std::align_val_t(128)) int[1];
-                int *c = new(std::align_val_t(128)) int[1];
-
-                // Setup input values.
-                *a = 1;
-                *b = 2;
-
-                // Launch add() kernel on GPU.
-                hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
-                // Wait for GPU to finish before accessing on host.
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Prints the result.
-                std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
-                // Cleanup allocated memory with matching aligned delete.
-                ::operator delete[](a, std::align_val_t(128));
-                ::operator delete[](b, std::align_val_t(128));
-                ::operator delete[](c, std::align_val_t(128));
-
-                return 0;
-            }
+            :language: cpp

    .. tab-item:: Explicit Memory Management

-        .. code-block:: cpp
+        .. literalinclude:: ../../../tools/example_codes/explicit_memory.hip
+            :start-after: // [sphinx-start]
+            :end-before: // [sphinx-end]
            :emphasize-lines: 27-34, 39-40
-
-            #include <hip/hip_runtime.h>
-            #include <iostream>
-
-            #define HIP_CHECK(expression)              \
-            {                                          \
-                const hipError_t err = expression;     \
-                if(err != hipSuccess){                 \
-                    std::cerr << "HIP error: "         \
-                        << hipGetErrorString(err)      \
-                        << " at " << __LINE__ << "\n"; \
-                }                                      \
-            }
-
-            // Addition of two values.
-            __global__ void add(int *a, int *b, int *c) {
-                *c = *a + *b;
-            }
-
-            int main() {
-                int a, b, c;
-                int *d_a, *d_b, *d_c;
-
-                // Setup input values.
-                a = 1;
-                b = 2;
-
-                // Allocate device copies of a, b and c.
-                HIP_CHECK(hipMalloc(&d_a, sizeof(*d_a)));
-                HIP_CHECK(hipMalloc(&d_b, sizeof(*d_b)));
-                HIP_CHECK(hipMalloc(&d_c, sizeof(*d_c)));
-
-                // Copy input values to device.
-                HIP_CHECK(hipMemcpy(d_a, &a, sizeof(*d_a), hipMemcpyHostToDevice));
-                HIP_CHECK(hipMemcpy(d_b, &b, sizeof(*d_b), hipMemcpyHostToDevice));
-
-                // Launch add() kernel on GPU.
-                hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, d_a, d_b, d_c);
-
-                // Copy the result back to the host.
-                HIP_CHECK(hipMemcpy(&c, d_c, sizeof(*d_c), hipMemcpyDeviceToHost));
-
-                // Cleanup allocated memory.
-                HIP_CHECK(hipFree(d_a));
-                HIP_CHECK(hipFree(d_b));
-                HIP_CHECK(hipFree(d_c));
-
-                // Prints the result.
-                std::cout << a << " + " << b << " = " << c << std::endl;
-
-                return 0;
-            }
+            :language: cpp

 .. _using unified memory:

@@ -559,65 +387,11 @@ Data prefetching is a technique used to improve the performance of your
 application by moving data to the desired device before it's actually
 needed. ``hipCpuDeviceId`` is a special constant to specify the CPU as target.

-.. code-block:: cpp
+.. literalinclude:: ../../../tools/example_codes/data_prefetching.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
    :emphasize-lines: 33-36,41-42
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)              \
-    {                                          \
-        const hipError_t err = expression;     \
-        if(err != hipSuccess){                 \
-            std::cerr << "HIP error: "         \
-                << hipGetErrorString(err)      \
-                << " at " << __LINE__ << "\n"; \
-        }                                      \
-    }
-
-    // Addition of two values.
-    __global__ void add(int *a, int *b, int *c) {
-        *c = *a + *b;
-    }
-
-    int main() {
-        int *a, *b, *c;
-        int deviceId;
-        HIP_CHECK(hipGetDevice(&deviceId)); // Get the current device ID
-
-        // Allocate memory for a, b and c that is accessible to both device and host codes.
-        HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
-        HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
-        HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
-
-        // Setup input values.
-        *a = 1;
-        *b = 2;
-
-        // Prefetch the data to the GPU device.
-        HIP_CHECK(hipMemPrefetchAsync(a, sizeof(*a), deviceId, 0));
-        HIP_CHECK(hipMemPrefetchAsync(b, sizeof(*b), deviceId, 0));
-        HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), deviceId, 0));
-
-        // Launch add() kernel on GPU.
-        hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
-        // Prefetch the result back to the CPU.
-        HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), hipCpuDeviceId, 0));
-
-        // Wait for the prefetch operations to complete.
-        HIP_CHECK(hipDeviceSynchronize());
-
-        // Prints the result.
-        std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
-        // Cleanup allocated memory.
-        HIP_CHECK(hipFree(a));
-        HIP_CHECK(hipFree(b));
-        HIP_CHECK(hipFree(c));
-
-        return 0;
-    }
+    :language: cpp

 Memory advice
 --------------------------------------------------------------------------------
@@ -642,71 +416,11 @@ impact on performance can vary based on the specific use case and the system.
 The following is the updated version of the example above with memory advice
 instead of prefetching.

-.. code-block:: cpp
+.. literalinclude:: ../../../tools/example_codes/unified_memory_advice.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
    :emphasize-lines: 29-41
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)              \
-    {                                          \
-        const hipError_t err = expression;     \
-        if(err != hipSuccess){                 \
-            std::cerr << "HIP error: "         \
-                << hipGetErrorString(err)      \
-                << " at " << __LINE__ << "\n"; \
-        }                                      \
-    }
-
-    // Addition of two values.
-    __global__ void add(int *a, int *b, int *c) {
-        *c = *a + *b;
-    }
-
-    int main() {
-        int deviceId;
-        HIP_CHECK(hipGetDevice(&deviceId));
-        int *a, *b, *c;
-
-        // Allocate memory for a, b, and c accessible to both device and host codes.
-        HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
-        HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
-        HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
-
-        // Set memory advice for a and b to be read, located on and accessed by the GPU.
-        HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetPreferredLocation, deviceId));
-        HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetAccessedBy, deviceId));
-        HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
-
-        HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetPreferredLocation, deviceId));
-        HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetAccessedBy, deviceId));
-        HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetReadMostly, deviceId));
-
-        // Set memory advice for c to be read, located on and accessed by the CPU.
-        HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetPreferredLocation, hipCpuDeviceId));
-        HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetAccessedBy, hipCpuDeviceId));
-        HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetReadMostly, hipCpuDeviceId));
-
-        // Setup input values.
-        *a = 1;
-        *b = 2;
-
-        // Launch add() kernel on GPU.
-        hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
-        // Wait for GPU to finish before accessing on host.
-        HIP_CHECK(hipDeviceSynchronize());
-
-        // Prints the result.
-        std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
-        // Cleanup allocated memory.
-        HIP_CHECK(hipFree(a));
-        HIP_CHECK(hipFree(b));
-        HIP_CHECK(hipFree(c));
-
-        return 0;
-    }
+    :language: cpp

 Memory range attributes
 --------------------------------------------------------------------------------
@@ -714,70 +428,11 @@ Memory range attributes
 :cpp:func:`hipMemRangeGetAttribute()` allows you to query attributes of a given
 memory range. The attributes are given in :cpp:enum:`hipMemRangeAttribute`.

-.. code-block:: cpp
+.. literalinclude:: ../../../tools/example_codes/memory_range_attributes.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
    :emphasize-lines: 44-49
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)              \
-    {                                          \
-        const hipError_t err = expression;     \
-        if(err != hipSuccess){                 \
-            std::cerr << "HIP error: "         \
-                << hipGetErrorString(err)      \
-                << " at " << __LINE__ << "\n"; \
-        }                                      \
-    }
-
-    // Addition of two values.
-    __global__ void add(int *a, int *b, int *c) {
-        *c = *a + *b;
-    }
-
-    int main() {
-        int *a, *b, *c;
-        unsigned int attributeValue;
-        constexpr size_t attributeSize = sizeof(attributeValue);
-
-        int deviceId;
-        HIP_CHECK(hipGetDevice(&deviceId));
-
-        // Allocate memory for a, b and c that is accessible to both device and host codes.
-        HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
-        HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
-        HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
-
-        // Setup input values.
-        *a = 1;
-        *b = 2;
-
-        HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
-
-        // Launch add() kernel on GPU.
-        hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
-        // Wait for GPU to finish before accessing on host.
-        HIP_CHECK(hipDeviceSynchronize());
-
-        // Query an attribute of the memory range.
-        HIP_CHECK(hipMemRangeGetAttribute(&attributeValue,
-                                attributeSize,
-                                hipMemRangeAttributeReadMostly,
-                                a,
-                                sizeof(*a)));
-
-        // Prints the result.
-        std::cout << *a << " + " << *b << " = " << *c << std::endl;
-        std::cout << "The array a is" << (attributeValue == 1 ? "" : " NOT") << " set to hipMemRangeAttributeReadMostly" << std::endl;
-
-        // Cleanup allocated memory.
-        HIP_CHECK(hipFree(a));
-        HIP_CHECK(hipFree(b));
-        HIP_CHECK(hipFree(c));
-
-        return 0;
-    }
+    :language: cpp

 Asynchronously attach memory to a stream
 --------------------------------------------------------------------------------
@@ -22,43 +22,10 @@ dynamic selections during runtime to ensure optimal performance.

 If the application does not define a specific GPU, device 0 is selected.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    int main()
-    {
-        int deviceCount;
-        hipGetDeviceCount(&deviceCount);
-        std::cout << "Number of devices: " << deviceCount << std::endl;
-
-        for (int deviceId = 0; deviceId < deviceCount; ++deviceId)
-        {
-            hipDeviceProp_t deviceProp;
-            hipGetDeviceProperties(&deviceProp, deviceId);
-            std::cout << "Device " << deviceId << std::endl << " Properties:" << std::endl;
-            std::cout << "  Name: " << deviceProp.name << std::endl;
-            std::cout << "  Total Global Memory: " << deviceProp.totalGlobalMem / (1024 * 1024) << " MiB" << std::endl;
-            std::cout << "  Shared Memory per Block: " << deviceProp.sharedMemPerBlock / 1024 << " KiB" << std::endl;
-            std::cout << "  Registers per Block: " << deviceProp.regsPerBlock << std::endl;
-            std::cout << "  Warp Size: " << deviceProp.warpSize << std::endl;
-            std::cout << "  Max Threads per Block: " << deviceProp.maxThreadsPerBlock << std::endl;
-            std::cout << "  Max Threads per Multiprocessor: " << deviceProp.maxThreadsPerMultiProcessor << std::endl;
-            std::cout << "  Number of Multiprocessors: " << deviceProp.multiProcessorCount << std::endl;
-            std::cout << "  Max Threads Dimensions: ["
-                    << deviceProp.maxThreadsDim[0] << ", "
-                    << deviceProp.maxThreadsDim[1] << ", "
-                    << deviceProp.maxThreadsDim[2] << "]" << std::endl;
-            std::cout << "  Max Grid Size: ["
-                    << deviceProp.maxGridSize[0] << ", "
-                    << deviceProp.maxGridSize[1] << ", "
-                    << deviceProp.maxGridSize[2] << "]" << std::endl;
-            std::cout << std::endl;
-        }
-
-        return 0;
-    }
+.. literalinclude:: ../../tools/example_codes/device_enumeration.cpp
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 .. _multi_device_selection:

@@ -72,71 +39,10 @@ different GPUs might have different capabilities or workloads. By selecting the
 appropriate device, you ensure that the computational tasks are directed to the
 correct GPU, optimizing performance and resource utilization.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    #define HIP_CHECK(expression)                \
-    {                                            \
-        const hipError_t status = expression;    \
-        if (status != hipSuccess) {              \
-            std::cerr << "HIP error " << status  \
-                    << ": " << hipGetErrorString(status) \
-                    << " at " << __FILE__ << ":" \
-                    << __LINE__ << std::endl;  \
-            exit(status);                        \
-        }                                        \
-    }
-
-    __global__ void simpleKernel(double *data)
-    {
-        int idx = blockIdx.x * blockDim.x + threadIdx.x;
-        data[idx] = idx * 2.0;
-    }
-
-    int main()
-    {
-        double* deviceData0;
-        double* deviceData1;
-        size_t  size = 1024 * sizeof(*deviceData0);
-
-        int deviceId0 = 0;
-        int deviceId1 = 1;
-
-        // Set device 0 and perform operations
-        HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
-        HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
-        simpleKernel<<<1000, 128>>>(deviceData0); // Launch kernel on device 0
-        HIP_CHECK(hipDeviceSynchronize());
-
-        // Set device 1 and perform operations
-        HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
-        HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
-        simpleKernel<<<1000, 128>>>(deviceData1); // Launch kernel on device 1
-        HIP_CHECK(hipDeviceSynchronize());
-
-        // Copy result from device 0
-        double hostData0[1024];
-        HIP_CHECK(hipSetDevice(deviceId0));
-        HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
-
-        // Copy result from device 1
-        double hostData1[1024];
-        HIP_CHECK(hipSetDevice(deviceId1));
-        HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
-
-        // Display results from both devices
-        std::cout << "Device 0 data: " << hostData0[0] << std::endl;
-        std::cout << "Device 1 data: " << hostData1[0] << std::endl;
-
-        // Free device memory
-        HIP_CHECK(hipFree(deviceData0));
-        HIP_CHECK(hipFree(deviceData1));
-
-        return 0;
-    }
-
+.. literalinclude:: ../../tools/example_codes/device_selection.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 Stream and event behavior
 ===============================================================================
@@ -151,100 +57,10 @@ conditions and optimizes data flow in multi-GPU systems. Together, streams and
 events maximize performance by enabling parallel execution, load balancing, and
 effective resource utilization across heterogeneous hardware.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-
-    __global__ void simpleKernel(double *data)
-    {
-        int idx = blockIdx.x * blockDim.x + threadIdx.x;
-        data[idx] = idx * 2.0;
-    }
-
-    int main()
-    {
-        int numDevices;
-        hipGetDeviceCount(&numDevices);
-
-        if (numDevices < 2) {
-            std::cerr << "This example requires at least two GPUs." << std::endl;
-            return -1;
-        }
-
-        double *deviceData0, *deviceData1;
-        size_t size = 1024 * sizeof(*deviceData0);
-
-        // Create streams and events for each device
-        hipStream_t stream0, stream1;
-        hipEvent_t startEvent0, stopEvent0, startEvent1, stopEvent1;
-
-        // Initialize device 0
-        hipSetDevice(0);
-        hipStreamCreate(&stream0);
-        hipEventCreate(&startEvent0);
-        hipEventCreate(&stopEvent0);
-        hipMalloc(&deviceData0, size);
-
-        // Initialize device 1
-        hipSetDevice(1);
-        hipStreamCreate(&stream1);
-        hipEventCreate(&startEvent1);
-        hipEventCreate(&stopEvent1);
-        hipMalloc(&deviceData1, size);
-
-        // Record the start event on device 0
-        hipSetDevice(0);
-        hipEventRecord(startEvent0, stream0);
-
-        // Launch the kernel asynchronously on device 0
-        simpleKernel<<<1000, 128, 0, stream0>>>(deviceData0);
-
-        // Record the stop event on device 0
-        hipEventRecord(stopEvent0, stream0);
-
-        // Wait for the stop event on device 0 to complete
-        hipEventSynchronize(stopEvent0);
-
-        // Record the start event on device 1
-        hipSetDevice(1);
-        hipEventRecord(startEvent1, stream1);
-
-        // Launch the kernel asynchronously on device 1
-        simpleKernel<<<1000, 128, 0, stream1>>>(deviceData1);
-
-        // Record the stop event on device 1
-        hipEventRecord(stopEvent1, stream1);
-
-        // Wait for the stop event on device 1 to complete
-        hipEventSynchronize(stopEvent1);
-
-        // Calculate elapsed time between the events for both devices
-        float milliseconds0 = 0, milliseconds1 = 0;
-        hipEventElapsedTime(&milliseconds0, startEvent0, stopEvent0);
-        hipEventElapsedTime(&milliseconds1, startEvent1, stopEvent1);
-
-        std::cout << "Elapsed time on GPU 0: " << milliseconds0 << " ms" << std::endl;
-        std::cout << "Elapsed time on GPU 1: " << milliseconds1 << " ms" << std::endl;
-
-        // Cleanup for device 0
-        hipSetDevice(0);
-        hipEventDestroy(startEvent0);
-        hipEventDestroy(stopEvent0);
-        hipStreamSynchronize(stream0);
-        hipStreamDestroy(stream0);
-        hipFree(deviceData0);
-
-        // Cleanup for device 1
-        hipSetDevice(1);
-        hipEventDestroy(startEvent1);
-        hipEventDestroy(stopEvent1);
-        hipStreamSynchronize(stream1);
-        hipStreamDestroy(stream1);
-        hipFree(deviceData1);
-
-        return 0;
-    }
+.. literalinclude:: ../../tools/example_codes/multi_device_synchronization.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 Peer-to-peer memory access
 ===============================================================================
@@ -257,164 +73,25 @@ applications that require frequent data exchange between GPUs, as it eliminates
 the need to transfer data through the host memory.

 By adding peer-to-peer access to the example referenced in
-:ref:`multi_device_selection`, data can be copied between devices:
+:ref:`multi_device_selection`, data can be efficiently copied between devices.
+If peer-to-peer access is not enabled, the call to :cpp:func:`hipMemcpy`
+still works but internally uses a staging buffer in host memory, which incurs a
+performance penalty.

 .. tab-set::

    .. tab-item:: with peer-to-peer

-        .. code-block:: cpp
-            :emphasize-lines: 31-37, 51-55
-
-            #include <hip/hip_runtime.h>
-            #include <iostream>
-
-            #define HIP_CHECK(expression)                        \
-            {                                                    \
-                const hipError_t status = expression;            \
-                if (status != hipSuccess) {                      \
-                    std::cerr << "HIP error " << status          \
-                            << ": " << hipGetErrorString(status) \
-                            << " at " << __FILE__ << ":"         \
-                            << __LINE__ << std::endl;            \
-                    exit(status);                                \
-                }                                                \
-            }
-
-            __global__ void simpleKernel(double *data)
-            {
-                int idx   = blockIdx.x * blockDim.x + threadIdx.x;
-                data[idx] = idx * 2.0;
-            }
-
-            int main()
-            {
-                double* deviceData0;
-                double* deviceData1;
-                size_t  size = 1024 * sizeof(*deviceData0);
-
-                int deviceId0 = 0;
-                int deviceId1 = 1;
-
-                // Enable peer access to the memory (allocated and future) on the peer device.
-                // Ensure the device is active before enabling peer access.
-                hipSetDevice(deviceId0);
-                hipDeviceEnablePeerAccess(deviceId1, 0);
-
-                hipSetDevice(deviceId1);
-                hipDeviceEnablePeerAccess(deviceId0, 0);
-
-                // Set device 0 and perform operations
-                HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
-                HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
-                simpleKernel<<<1000, 128>>>(deviceData0); // Launch kernel on device 0
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Set device 1 and perform operations
-                HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
-                HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
-                simpleKernel<<<1000, 128>>>(deviceData1); // Launch kernel on device 1
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Use peer-to-peer access
-                hipSetDevice(deviceId0);
-
-                // Now device 0 can access memory allocated on device 1
-                hipMemcpy(deviceData0, deviceData1, size, hipMemcpyDeviceToDevice);
-
-                // Copy result from device 0
-                double hostData0[1024];
-                HIP_CHECK(hipSetDevice(deviceId0));
-                HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
-
-                // Copy result from device 1
-                double hostData1[1024];
-                HIP_CHECK(hipSetDevice(deviceId1));
-                HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
-
-                // Display results from both devices
-                std::cout << "Device 0 data: " << hostData0[0] << std::endl;
-                std::cout << "Device 1 data: " << hostData1[0] << std::endl;
-
-                // Free device memory
-                HIP_CHECK(hipFree(deviceData0));
-                HIP_CHECK(hipFree(deviceData1));
-
-                return 0;
-            }
+        .. literalinclude:: ../../tools/example_codes/p2p_memory_access.hip
+            :start-after: // [sphinx-start]
+            :end-before: // [sphinx-end]
+            :emphasize-lines: 43-49, 63-67
+            :language: cpp

    .. tab-item:: without peer-to-peer

-        .. code-block:: cpp
-            :emphasize-lines: 43-49, 53, 58
-
-            #include <hip/hip_runtime.h>
-            #include <iostream>
-
-            #define HIP_CHECK(expression)                        \
-            {                                                    \
-                const hipError_t status = expression;            \
-                if (status != hipSuccess) {                      \
-                    std::cerr << "HIP error " << status          \
-                            << ": " << hipGetErrorString(status) \
-                            << " at " << __FILE__ << ":"         \
-                            << __LINE__ << std::endl;            \
-                    exit(status);                                \
-                }                                                \
-            }
-
-            __global__ void simpleKernel(double *data)
-            {
-                int idx   = blockIdx.x * blockDim.x + threadIdx.x;
-                data[idx] = idx * 2.0;
-            }
-
-            int main()
-            {
-                double* deviceData0;
-                double* deviceData1;
-                size_t  size = 1024 * sizeof(*deviceData0);
-
-                int deviceId0 = 0;
-                int deviceId1 = 1;
-
-                // Set device 0 and perform operations
-                HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
-                HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
-                simpleKernel<<<1000, 128>>>(deviceData0); // Launch kernel on device 0
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Set device 1 and perform operations
-                HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
-                HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
-                simpleKernel<<<1000, 128>>>(deviceData1); // Launch kernel on device 1
-                HIP_CHECK(hipDeviceSynchronize());
-
-                // Attempt to use deviceData0 on device 1 (This will not work as deviceData0 is allocated on device 0)
-                HIP_CHECK(hipSetDevice(deviceId1));
-                hipError_t err = hipMemcpy(deviceData1, deviceData0, size, hipMemcpyDeviceToDevice); // This should fail
-                if (err != hipSuccess)
-                {
-                    std::cout << "Error: Cannot access deviceData0 from device 1, deviceData0 is on device 0" << std::endl;
-                }
-
-                // Copy result from device 0
-                double hostData0[1024];
-                HIP_CHECK(hipSetDevice(deviceId0));
-                HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
-
-                // Copy result from device 1
-                double hostData1[1024];
-                HIP_CHECK(hipSetDevice(deviceId1));
-                HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
-
-                // Display results from both devices
-                std::cout << "Device 0 data: " << hostData0[0] << std::endl;
-                std::cout << "Device 1 data: " << hostData1[0] << std::endl;
-
-                // Free device memory
-                HIP_CHECK(hipFree(deviceData0));
-                HIP_CHECK(hipFree(deviceData1));
-
-                return 0;
-            }
+        .. literalinclude:: ../../tools/example_codes/p2p_memory_access_host_staging.hip
+            :start-after: // [sphinx-start]
+            :end-before: // [sphinx-end]
+            :emphasize-lines: 55-57
+            :language: cpp
@@ -38,8 +38,7 @@ The HIP documentation is organized into the following categories:
 * {doc}`./how-to/hip_runtime_api`
 * {doc}`./how-to/hip_cpp_language_extensions`
 * {doc}`./how-to/kernel_language_cpp_support`
-* [HIP porting guide](./how-to/hip_porting_guide)
-* [HIP porting: driver API guide](./how-to/hip_porting_driver_api)
+* {doc}`./how-to/hip_porting_guide`
 * {doc}`./how-to/hip_rtc`
 * {doc}`./understand/amd_clr`

@@ -66,6 +65,7 @@ The HIP documentation is organized into the following categories:
 * [SAXPY tutorial](./tutorial/saxpy)
 * [Reduction tutorial](./tutorial/reduction)
 * [Cooperative groups tutorial](./tutorial/cooperative_groups_tutorial)
+* [HIP Graph API tutorial](./tutorial/graph_api)

 :::

@@ -11,92 +11,10 @@ example and comparison table. For a complete list of mappings, visit :ref:`HIPIF

 The following CUDA code example illustrates several CUDA API syntaxes.

-.. code-block:: cpp
-
-  #include <iostream>
-  #include <vector>
-  #include <cuda_runtime.h>
-
-  __global__ void block_reduction(const float* input, float* output, int num_elements)
-  {
-      extern __shared__ float s_data[];
-
-      int tid = threadIdx.x;
-      int global_id = blockDim.x * blockIdx.x + tid;
-
-      if (global_id < num_elements)
-      {
-          s_data[tid] = input[global_id];
-      }
-      else
-      {
-          s_data[tid] = 0.0f;
-      }
-      __syncthreads();
-
-      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
-      {
-          if (tid < stride)
-          {
-              s_data[tid] += s_data[tid + stride];
-          }
-          __syncthreads();
-      }
-
-      if (tid == 0)
-      {
-          output[blockIdx.x] = s_data[0];
-      }
-  }
-
-  int main()
-  {
-      int threads = 256;
-      const int num_elements = 50000;
-
-      std::vector<float> h_a(num_elements);
-      std::vector<float> h_b((num_elements + threads - 1) / threads);
-
-      for (int i = 0; i < num_elements; ++i)
-      {
-          h_a[i] = rand() / static_cast<float>(RAND_MAX);
-      }
-
-      float *d_a, *d_b;
-      cudaMalloc(&d_a, h_a.size() * sizeof(float));
-      cudaMalloc(&d_b, h_b.size() * sizeof(float));
-
-      cudaStream_t stream;
-      cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
-
-      cudaEvent_t start_event, stop_event;
-      cudaEventCreate(&start_event);
-      cudaEventCreate(&stop_event);
-
-      cudaMemcpyAsync(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
-
-      cudaEventRecord(start_event, stream);
-
-      int blocks = (num_elements + threads - 1) / threads;
-      block_reduction<<<blocks, threads, threads * sizeof(float), stream>>>(d_a, d_b, num_elements);
-
-      cudaMemcpyAsync(h_b.data(), d_b, h_b.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
-
-      cudaEventRecord(stop_event, stream);
-      cudaEventSynchronize(stop_event);
-
-      cudaEventElapsedTime(&milliseconds, start_event, stop_event);
-      std::cout << "Kernel execution time: " << milliseconds << " ms\n";
-
-      cudaFree(d_a);
-      cudaFree(d_b);
-
-      cudaEventDestroy(start_event);
-      cudaEventDestroy(stop_event);
-      cudaStreamDestroy(stream);
-
-      return 0;
-  }
+.. literalinclude:: ../tools/example_codes/block_reduction.cu
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp

 The following table maps CUDA API functions to corresponding HIP API functions, as demonstrated in the
 preceding code examples.
@@ -337,117 +337,7 @@ The kernel function ``computeDFT`` shows various HIP complex math operations in
 The example also demonstrates proper use of complex number handling on both host and device, including
 memory allocation, transfer, and validation of results between CPU and GPU implementations.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <hip/hip_complex.h>
-    #include <iostream>
-    #include <vector>
-    #include <cmath>
-
-    #define HIP_CHECK(expression)              \
-        {                                      \
-            const hipError_t err = expression; \
-            if (err != hipSuccess) {           \
-                std::cerr << "HIP error: "     \
-                        << hipGetErrorString(err) \
-                        << " at " << __LINE__ << "\n"; \
-                exit(EXIT_FAILURE);            \
-            }                                  \
-        }
-
-    // Kernel to compute DFT
-    __global__ void computeDFT(const float* input,
-                            hipFloatComplex* output,
-                            const int N)
-    {
-        int k = blockIdx.x * blockDim.x + threadIdx.x;
-        if (k >= N) return;
-
-        hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
-
-        for (int n = 0; n < N; n++) {
-            float angle = -2.0f * M_PI * k * n / N;
-            hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
-            hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
-            sum = hipCaddf(sum, hipCmulf(x, w));
-        }
-
-        output[k] = sum;
-    }
-
-    // CPU implementation of DFT for verification
-    std::vector<hipFloatComplex> cpuDFT(const std::vector<float>& input) {
-        const int N = input.size();
-        std::vector<hipFloatComplex> result(N);
-
-        for (int k = 0; k < N; k++) {
-            hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
-            for (int n = 0; n < N; n++) {
-                float angle = -2.0f * M_PI * k * n / N;
-                hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
-                hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
-                sum = hipCaddf(sum, hipCmulf(x, w));
-            }
-            result[k] = sum;
-        }
-        return result;
-    }
-
-    int main() {
-        const int N = 256;  // Signal length
-        const int blockSize = 256;
-
-        // Generate input signal: sum of two sine waves
-        std::vector<float> signal(N);
-        for (int i = 0; i < N; i++) {
-            float t = static_cast<float>(i) / N;
-            signal[i] = sinf(2.0f * M_PI * 10.0f * t) +  // 10 Hz component
-                    0.5f * sinf(2.0f * M_PI * 20.0f * t);  // 20 Hz component
-        }
-
-        // Compute reference solution on CPU
-        std::vector<hipFloatComplex> cpu_output = cpuDFT(signal);
-
-        // Allocate device memory
-        float* d_signal;
-        hipFloatComplex* d_output;
-        HIP_CHECK(hipMalloc(&d_signal, N * sizeof(float)));
-        HIP_CHECK(hipMalloc(&d_output, N * sizeof(hipFloatComplex)));
-
-        // Copy input to device
-        HIP_CHECK(hipMemcpy(d_signal, signal.data(), N * sizeof(float),
-                        hipMemcpyHostToDevice));
-
-        // Launch kernel
-        dim3 grid((N + blockSize - 1) / blockSize);
-        dim3 block(blockSize);
-        computeDFT<<<grid, block>>>(d_signal, d_output, N);
-        HIP_CHECK(hipGetLastError());
-
-        // Get GPU results
-        std::vector<hipFloatComplex> gpu_output(N);
-        HIP_CHECK(hipMemcpy(gpu_output.data(), d_output, N * sizeof(hipFloatComplex),
-                        hipMemcpyDeviceToHost));
-
-        // Verify results
-        bool passed = true;
-        const float tolerance = 1e-5f;  // Adjust based on precision requirements
-
-        for (int i = 0; i < N; i++) {
-            float diff_real = std::abs(hipCrealf(gpu_output[i]) - hipCrealf(cpu_output[i]));
-            float diff_imag = std::abs(hipCimagf(gpu_output[i]) - hipCimagf(cpu_output[i]));
-
-            if (diff_real > tolerance || diff_imag > tolerance) {
-                passed = false;
-                break;
-            }
-        }
-
-        std::cout << "DFT Verification: " << (passed ? "PASSED" : "FAILED") << "\n";
-
-        // Cleanup
-        HIP_CHECK(hipFree(d_signal));
-        HIP_CHECK(hipFree(d_output));
-        return passed ? 0 : 1;
-    }
+.. literalinclude:: ../tools/example_codes/complex_math.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
@@ -27,7 +27,7 @@ and :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`.
      -

    * - | ``AMD_LOG_MASK``
-        | Specifies HIP log filters. Here is the ` complete list of log masks <https://github.com/ROCm/clr/blob/develop/rocclr/utils/debug.hpp#L40>`_.
+        | Specifies HIP log filters. Here is the `complete list of log masks <https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/utils/debug.hpp#L48>`_.
      - ``0x7FFFFFFF``
      - | 0x1: Log API calls.
        | 0x2: Kernel and copy commands and barriers.
@@ -49,8 +49,16 @@ and :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`.
        | 0x20000: Memory allocation.
        | 0x40000: Memory pool allocation, including memory in graphs.
        | 0x80000: Timestamp details.
+        | 0x100000: Comgr path information print.
         | 0xFFFFFFFF: Always log, even if the mask flag is zero.

+    * - | ``HIP_FORCE_DEV_KERNARG``
+        | Forces kernel arguments to be stored in device memory to reduce latency.
+        | Can reduce latency by 2-3 µs for some kernels.
+      - ``1``
+      - | 0: Disable
+        | 1: Enable
+
    * - | ``HIP_LAUNCH_BLOCKING``
        | Used for serialization on kernel execution.
      - ``0``
@@ -14,7 +14,7 @@ environment variables in HIP are collected in the following table.
    * - | ``ROCR_VISIBLE_DEVICES``
        | A list of device indices or UUIDs that will be exposed to applications.
      - :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`Setting the number of compute units <rocm:how-to/setting-cus>`
-      - Example: ``0,GPU-DEADBEEFDEADBEEF``
+      - Example: ``0,GPU-4b2c1a9f-8d3e-6f7a-b5c9-2e4d8a1f6c3b``

    * - | ``GPU_DEVICE_ORDINAL``
         | Device indices exposed to OpenCL and HIP applications.
@@ -25,3 +25,11 @@ environment variables in HIP are collected in the following table.
        | Device indices exposed to HIP applications.
      - :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`HIP debugging <hip:how-to/debugging>`
      - Example: ``0,2``
+
+.. admonition:: Recommendation
+
+  * On Linux, use ``ROCR_VISIBLE_DEVICES``.
+
+  * On Windows, use ``HIP_VISIBLE_DEVICES``.
+
+  * For portability across different vendors, use ``CUDA_VISIBLE_DEVICES``.
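
As a rough illustration of how an application might inspect one of these variables itself, the sketch below reads a variable with ``std::getenv`` and splits its comma-separated value into entries. The helper names ``split_device_list`` and ``read_device_list`` are hypothetical and not part of HIP. The runtime performs its own parsing, which may differ, for example in how it treats whitespace or invalid entries.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not a HIP API): splits a comma-separated device list
// such as "0,2" into its entries. Empty entries are skipped.
std::vector<std::string> split_device_list(const std::string& value)
{
    std::vector<std::string> entries;
    std::string current;
    for (const char c : value)
    {
        if (c == ',')
        {
            if (!current.empty())
            {
                entries.push_back(current);
            }
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    if (!current.empty())
    {
        entries.push_back(current);
    }
    return entries;
}

// Hypothetical helper (not a HIP API): returns the entries of the named
// environment variable, or an empty vector if the variable is unset.
std::vector<std::string> read_device_list(const char* variable_name)
{
    assert(variable_name != nullptr);
    const char* value = std::getenv(variable_name);
    return value ? split_device_list(value) : std::vector<std::string>{};
}
```

Calling ``read_device_list("HIP_VISIBLE_DEVICES")`` would then return the indices or UUIDs as strings. Interpreting them is left to the caller, because an entry can be either a device index or a UUID.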
@@ -55,6 +55,12 @@ pages:
      - | 0: Disable
        | 1: Enable

+    * - | ``GPU_SINGLE_ALLOC_PERCENT``
+        | Limits the maximum size of a single memory allocation as a percentage of GPU memory.
+      - ``100``
+      - | Unit: Percentage
+        | Prevents single allocations from consuming all available GPU memory.
+
    * - | ``GPU_MAX_HEAP_SIZE``
        | Set maximum size of the GPU heap to % of board memory.
      - ``100``
@@ -16,19 +16,19 @@ different features in HIP.
      - ``--gpu-architecture=gfx906:sramecc+:xnack``, ``-fgpu-rdc``

    * - | ``AMD_COMGR_SAVE_TEMPS``
-        | Controls the deletion of temporary files generated during the compilation of COMGR. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory.
+        | Controls the deletion of temporary files generated during the compilation of Comgr. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory.
      - Unset by default.
      - | 0: Temporary files are deleted automatically.
         | Non-zero integer: Temporary files are not deleted.

    * - | ``AMD_COMGR_EMIT_VERBOSE_LOGS``
-        | Sets logging of COMGR to include additional Comgr-specific informational messages.
+        | Sets logging of Comgr to include additional Comgr-specific informational messages.
      - Unset by default.
      - | 0: Verbose log disabled.
         | Non-zero integer: Verbose log enabled.

    * - | ``AMD_COMGR_REDIRECT_LOGS``
-        | Controls redirect logs of COMGR.
+        | Controls where Comgr logs are redirected.
      - Unset by default.
       - | ``stdout`` / ``-``: Redirected to the standard output.
         | ``stderr``: Redirected to the error stream.
@@ -24,88 +24,10 @@ The following C++ example shows a simplified method for computing ULP difference
 HIP and standard C++ math functions by first finding where the maximum absolute error
 occurs.

-.. code-block:: cpp
-
-    #include <hip/hip_runtime.h>
-    #include <iostream>
-    #include <vector>
-    #include <cmath>
-    #include <limits>
-
-    #define HIP_CHECK(expression)              \
-        {                                      \
-            const hipError_t err = expression; \
-            if (err != hipSuccess) {           \
-                std::cerr << "HIP error: "     \
-                          << hipGetErrorString(err) \
-                          << " at " << __LINE__ << "\n"; \
-                exit(EXIT_FAILURE);            \
-            }                                  \
-        }
-
-    // Simple ULP difference calculator
-    int64_t ulp_diff(float a, float b) {
-        if (a == b) return 0;
-        union { float f; int32_t i; } ua{a}, ub{b};
-
-        // For negative values, convert to a positive-based representation
-        if (ua.i < 0) ua.i = std::numeric_limits<int32_t>::max() - ua.i;
-        if (ub.i < 0) ub.i = std::numeric_limits<int32_t>::max() - ub.i;
-
-        return std::abs((int64_t)ua.i - (int64_t)ub.i);
-    }
-
-    // Test kernel
-    __global__ void test_sin(float* out, int n) {
-        int i = blockIdx.x * blockDim.x + threadIdx.x;
-        if (i < n) {
-            float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
-            out[i] = sin(x);
-        }
-    }
-
-    int main() {
-        const int n = 1000000;
-        const int blocksize = 256;
-        std::vector<float> outputs(n);
-        float* d_out;
-
-        HIP_CHECK(hipMalloc(&d_out, n * sizeof(float)));
-        dim3 threads(blocksize);
-        dim3 blocks((n + blocksize - 1) / blocksize);  // Fixed grid calculation
-        test_sin<<<blocks, threads>>>(d_out, n);
-        HIP_CHECK(hipPeekAtLastError());
-        HIP_CHECK(hipMemcpy(outputs.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost));
-
-        // Step 1: Find the maximum absolute error
-        double max_abs_error = 0.0;
-        float max_error_output = 0.0;
-        float max_error_expected = 0.0;
-
-        for (int i = 0; i < n; i++) {
-            float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
-            float expected = std::sin(x);
-            double abs_error = std::abs(outputs[i] - expected);
-
-            if (abs_error > max_abs_error) {
-                max_abs_error = abs_error;
-                max_error_output = outputs[i];
-                max_error_expected = expected;
-            }
-        }
-
-        // Step 2: Compute ULP difference based on the max absolute error pair
-        int64_t max_ulp = ulp_diff(max_error_output, max_error_expected);
-
-        // Output results
-        std::cout << "Max Absolute Error: " << max_abs_error << std::endl;
-        std::cout << "Max ULP Difference: " << max_ulp << std::endl;
-        std::cout << "Max Error Values -> Got: " << max_error_output
-                  << ", Expected: " << max_error_expected << std::endl;
-
-        HIP_CHECK(hipFree(d_out));
-        return 0;
-    }
+.. literalinclude:: ../tools/example_codes/math.hip
+    :start-after: // [sphinx-start]
+    :end-before: // [sphinx-end]
+    :language: cpp
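
The ULP distance itself can be sketched on the host alone, without any HIP calls. The block below is a minimal illustration and not the code of the included example: it assumes IEEE 754 single precision, uses ``std::memcpy`` to read the bit pattern, and maps each float onto a signed integer scale so that neighboring representable values differ by exactly one. The names ``ordered_key`` and ``ulp_distance`` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Maps a float onto an integer scale on which adjacent representable values
// differ by exactly 1 and both +0.0f and -0.0f map to 0.
std::int64_t ordered_key(float value)
{
    static_assert(sizeof(float) == sizeof(std::int32_t), "expects 32-bit float");
    std::int32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // well-defined way to read the bits
    // Negative floats have the sign bit set, so their bit patterns grow as the
    // value decreases. Mirror them onto the negative half of the scale.
    return bits < 0 ? static_cast<std::int64_t>(INT32_MIN) - bits : bits;
}

// Distance between a and b measured in steps between representable floats.
// NaN inputs are not supported by this sketch.
std::int64_t ulp_distance(float a, float b)
{
    assert(!std::isnan(a) && !std::isnan(b));
    const std::int64_t difference = ordered_key(a) - ordered_key(b);
    return difference < 0 ? -difference : difference;
}
```

Doing the subtraction in 64-bit arithmetic avoids signed overflow when the two inputs have opposite signs and large magnitudes.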

 Standard mathematical functions
 ===============================
@@ -58,7 +58,6 @@ subtrees:
  - file: how-to/hip_cpp_language_extensions
  - file: how-to/kernel_language_cpp_support
  - file: how-to/hip_porting_guide
-  - file: how-to/hip_porting_driver_api
  - file: how-to/hip_rtc
  - file: understand/amd_clr

@@ -127,6 +126,7 @@ subtrees:
  - file: tutorial/saxpy
  - file: tutorial/reduction
  - file: tutorial/cooperative_groups_tutorial
+  - file: tutorial/graph_api

 - caption: About
  entries:
@@ -0,0 +1,95 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "example_utils.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+#include <numeric>
+#include <vector>
+
+///\brief Calculates \p a[i] = \p a[i] + \p b[i] where \p i stands for the thread's index in the grid.
+// [sphinx-kernel-start]
+__global__ void AddKernel(float* a, const float* b)
+{
+    int global_idx = threadIdx.x + blockIdx.x * blockDim.x;
+
+    a[global_idx] += b[global_idx];
+}
+// [sphinx-kernel-end]
+
+int main()
+{
+    // The number of float elements in each vector.
+    constexpr unsigned int size = 1 << 20; // == 1'048'576 elements
+
+    // Bytes to allocate for each device vector.
+    constexpr size_t size_bytes = size * sizeof(float);
+
+    // Number of threads per kernel block.
+    constexpr unsigned int threads_per_block = 256;
+
+    // Number of blocks per kernel grid. The expression below calculates ceil(size / threads_per_block).
+    constexpr unsigned int number_of_blocks = ceiling_div(size, threads_per_block);
+
+    // Allocate vector a and fill it with an increasing sequence (i.e. 1, 2, 3, 4, ...)
+    std::vector<float> h_a(size);
+    std::iota(h_a.begin(), h_a.end(), 1.f);
+
+    // Allocate vector b and fill it with a decreasing sequence (i.e. 1'048'576, 1'048'575, ..., 3, 2, 1)
+    std::vector<float> h_b(size);
+    std::iota(h_b.rbegin(), h_b.rend(), 1.f);
+
+    // Allocate and copy vectors to device memory.
+    float* d_a{};
+    float* d_b{};
+    HIP_CHECK(hipMalloc(&d_a, size_bytes));
+    HIP_CHECK(hipMalloc(&d_b, size_bytes));
+    HIP_CHECK(hipMemcpy(d_a, h_a.data(), size_bytes, hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(d_b, h_b.data(), size_bytes, hipMemcpyHostToDevice));
+
+    std::cout << "Calculating a[i] = a[i] + b[i] over " << size << " elements." << std::endl;
+
+    // Launch the kernel on the default stream.
+    // [sphinx-kernel-launch-start]
+    AddKernel<<<number_of_blocks, threads_per_block>>>(d_a, d_b);
+    // [sphinx-kernel-launch-end]
+
+    // Check if the kernel launch was successful.
+    HIP_CHECK(hipGetLastError());
+
+    // Copy the results back to the host. This call blocks the host's execution until the copy is finished.
+    HIP_CHECK(hipMemcpy(h_a.data(), d_a, size_bytes, hipMemcpyDeviceToHost));
+
+    // Free device memory.
+    HIP_CHECK(hipFree(d_b));
+    HIP_CHECK(hipFree(d_a));
+
+    // Print the first few elements of the results:
+    constexpr size_t elements_to_print = 10;
+    std::cout << "First " << elements_to_print << " elements of the results: "
+              << format_range(h_a.begin(), h_a.begin() + elements_to_print) << std::endl;
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,142 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+// GPU Kernels
+__global__ void kernelA(double* arrayA, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] += 1.0;
+    }
+}
+
+__global__ void kernelB(double* arrayA, double* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayB[x] += arrayA[x] + 3.0;
+    }
+}
+
+int main()
+{
+    constexpr int numOfBlocks = 1 << 20;
+    constexpr int threadsPerBlock = 1024;
+    constexpr int numberOfIterations = 50;
+    // The array size is kept small so that the kernel run time is not too short relative to the memory copies
+    constexpr std::size_t arraySize = 1U << 25;
+    double *d_dataA;
+    double *d_dataB;
+
+    double initValueA = 0.0;
+    double initValueB = 2.0;
+
+    std::vector<double> vectorA(arraySize, initValueA);
+    std::vector<double> vectorB(arraySize, initValueB);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
+    HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
+    // Create streams
+    hipStream_t streamA, streamB;
+    HIP_CHECK(hipStreamCreate(&streamA));
+    HIP_CHECK(hipStreamCreate(&streamB));
+    for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
+    {
+        // Stream 1: Host to Device 1
+        HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
+        // Stream 2: Host to Device 2
+        HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
+        // Stream 1: Kernel 1
+        kernelA<<<numOfBlocks, threadsPerBlock, 0, streamA>>>(d_dataA, arraySize);
+        // Wait for streamA to finish
+        HIP_CHECK(hipStreamSynchronize(streamA));
+        // Stream 2: Kernel 2
+        kernelB<<<numOfBlocks, threadsPerBlock, 0, streamB>>>(d_dataA, d_dataB, arraySize);
+        // Stream 1: Device to Host 1 (after Kernel 1)
+        HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
+        // Stream 2: Device to Host 2 (after Kernel 2)
+        HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
+    }
+    // Wait for all operations in both streams to complete
+    HIP_CHECK(hipStreamSynchronize(streamA));
+    HIP_CHECK(hipStreamSynchronize(streamB));
+    // Verify results
+    double expectedA = (double)numberOfIterations;
+    double expectedB = initValueB + (3.0 * numberOfIterations) + (expectedA * (expectedA + 1.0)) / 2.0;
+    bool passed = true;
+    for(std::size_t i = 0; i < arraySize; ++i)
+    {
+        if(vectorA[i] != expectedA)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
+            break;
+        }
+        if(vectorB[i] != expectedB)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expectedB << " got " <<  vectorB[i] << " at index: " << i << std::endl;
+            break;
+        }
+    }
+
+    if(passed)
+    {
+        std::cout << "Asynchronous execution completed successfully." << std::endl;
+    }
+    else
+    {
+        std::cerr << "Asynchronous execution failed." << std::endl;
+    }
+
+    // Cleanup
+    HIP_CHECK(hipStreamDestroy(streamA));
+    HIP_CHECK(hipStreamDestroy(streamB));
+    HIP_CHECK(hipFree(d_dataA));
+    HIP_CHECK(hipFree(d_dataB));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,110 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <cuda_runtime.h>
+
+#include <iostream>
+#include <vector>
+
+__global__ void block_reduction(const float* input, float* output, int num_elements)
+{
+    extern __shared__ float s_data[];
+
+    int tid = threadIdx.x;
+    int global_id = blockDim.x * blockIdx.x + tid;
+
+    if (global_id < num_elements)
+    {
+        s_data[tid] = input[global_id];
+    }
+    else
+    {
+        s_data[tid] = 0.0f;
+    }
+    __syncthreads();
+
+    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
+    {
+        if (tid < stride)
+        {
+            s_data[tid] += s_data[tid + stride];
+        }
+        __syncthreads();
+    }
+
+    if (tid == 0)
+    {
+        output[blockIdx.x] = s_data[0];
+    }
+}
+
+int main()
+{
+    int threads = 256;
+    const int num_elements = 50000;
+
+    std::vector<float> h_a(num_elements);
+    std::vector<float> h_b((num_elements + threads - 1) / threads);
+
+    for (int i = 0; i < num_elements; ++i)
+    {
+        h_a[i] = rand() / static_cast<float>(RAND_MAX);
+    }
+
+    float *d_a, *d_b;
+    cudaMalloc(&d_a, h_a.size() * sizeof(float));
+    cudaMalloc(&d_b, h_b.size() * sizeof(float));
+
+    cudaStream_t stream;
+    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
+
+    cudaEvent_t start_event, stop_event;
+    cudaEventCreate(&start_event);
+    cudaEventCreate(&stop_event);
+
+    cudaMemcpyAsync(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
+
+    cudaEventRecord(start_event, stream);
+
+    int blocks = (num_elements + threads - 1) / threads;
+    block_reduction<<<blocks, threads, threads * sizeof(float), stream>>>(d_a, d_b, num_elements);
+
+    cudaMemcpyAsync(h_b.data(), d_b, h_b.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
+
+    cudaEventRecord(stop_event, stream);
+    cudaEventSynchronize(stop_event);
+
+    float milliseconds = 0.f;
+    cudaEventElapsedTime(&milliseconds, start_event, stop_event);
+    std::cout << "Kernel and device-to-host copy time: " << milliseconds << " ms\n";
+
+    cudaFree(d_a);
+    cudaFree(d_b);
+
+    cudaEventDestroy(start_event);
+    cudaEventDestroy(stop_event);
+    cudaStreamDestroy(stream);
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,58 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+int main()
+{
+    std::size_t stackSize;
+    HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
+    std::cout << "Default stack size: " << stackSize << " bytes" << std::endl;
+
+    // Set a new stack size
+    std::size_t newStackSize = 1024 * 8; // 8 KiB
+    HIP_CHECK(hipDeviceSetLimit(hipLimitStackSize, newStackSize));
+
+    HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
+    std::cout << "Updated stack size: " << stackSize << " bytes" << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,89 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if(err != hipSuccess)                                    \
+    {                                                        \
+        std::cerr << "HIP error: " << hipGetErrorString(err) \
+            << " at " << __LINE__ << "\n";                   \
+    }                                                        \
+}
+
+// Performs a simple initialization of an array with the thread's index variables.
+// This function is only available in device code.
+__device__ void init_array(float * const a, const unsigned int arraySize)
+{
+    // globalIdx uniquely identifies a thread in a 1D launch configuration.
+    const int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
+    // Each thread initializes a single element of the array.
+    if(globalIdx < arraySize)
+    {
+        a[globalIdx] = globalIdx;
+    }
+}
+
+// Computes ceil(number / multiple), i.e. how many chunks of size `multiple` are needed to cover `number`.
+// This function is available in host and device code.
+__host__ __device__ constexpr int round_up_to_nearest_multiple(int number, int multiple)
+{
+    return (number + multiple - 1)/multiple;
+}
+
+__global__ void example_kernel(float * const a, const unsigned int N)
+{
+    // Initialize array.
+    init_array(a, N);
+    // Perform additional work:
+    // - work with the array
+    // - use the array in a different kernel
+    // - ...
+}
+
+int main()
+{
+    constexpr int N = 100000000; // problem size
+    constexpr int blockSize = 256; //configurable block size
+
+    //needed number of blocks for the given problem size
+    constexpr int gridSize = round_up_to_nearest_multiple(N, blockSize);
+
+    float *a;
+    // allocate memory on the GPU
+    HIP_CHECK(hipMalloc(&a, sizeof(*a) * N));
+
+    std::cout << "Launching kernel." << std::endl;
+    example_kernel<<<dim3(gridSize), dim3(blockSize), 0/*example doesn't use shared memory*/, 0/*default stream*/>>>(a, N);
+    // make sure kernel execution is finished by synchronizing. The CPU can also
+    // execute other instructions during that time
+    HIP_CHECK(hipDeviceSynchronize());
+    std::cout << "Kernel execution finished." << std::endl;
+
+    HIP_CHECK(hipFree(a));
+}
+// [sphinx-end]
@@ -0,0 +1,165 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <hip/hiprtc.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#define CHECK_RET_CODE(call, ret_code)                                                             \
+{                                                                                                  \
+    if ((call) != ret_code)                                                                        \
+    {                                                                                              \
+        std::cout << "Failed in call: " << #call << std::endl;                                     \
+        std::abort();                                                                              \
+    }                                                                                              \
+}
+#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+// source code for hiprtc
+static constexpr auto kernel_source{
+    R"(
+    extern "C"
+    __global__ void vector_add(float* output, float* input1, float* input2, size_t size)
+    {
+        int i = threadIdx.x;
+        if (i < size)
+        {
+            output[i] = input1[i] + input2[i];
+        }
+    }
+)"};
+
+int main()
+{
+    hiprtcProgram prog;
+    auto rtc_ret_code = hiprtcCreateProgram(&prog,            // HIPRTC program handle
+                                            kernel_source,    // kernel source string
+                                            "vector_add.cpp", // Name of the file
+                                            0,                // Number of headers
+                                            nullptr,          // Header sources
+                                            nullptr);         // Name of header file
+
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to create program" << std::endl;
+        std::abort();
+    }
+
+    hipDeviceProp_t props;
+    int device = 0;
+    HIP_CHECK(hipGetDeviceProperties(&props, device));
+    auto sarg = std::string{"--gpu-architecture="} + props.gcnArchName;  // device for which binary is to be generated
+
+    const char* options[] = {sarg.c_str()};
+
+    rtc_ret_code = hiprtcCompileProgram(prog,      // hiprtcProgram
+                                        1,         // Number of options
+                                        options);  // Clang Options
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to compile program" << std::endl;
+        std::abort();
+    }
+
+    std::size_t logSize;
+    HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
+
+    if (logSize)
+    {
+        std::string log(logSize, '\0');
+        HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
+        std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
+        std::abort();
+    }
+
+    std::size_t codeSize;
+    HIPRTC_CHECK(hiprtcGetCodeSize(prog, &codeSize));
+
+    std::vector<char> kernel_binary(codeSize);
+    HIPRTC_CHECK(hiprtcGetCode(prog, kernel_binary.data()));
+
+    HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+    hipModule_t module;
+    hipFunction_t kernel;
+
+    HIP_CHECK(hipModuleLoadData(&module, kernel_binary.data()));
+    HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
+
+    constexpr std::size_t ele_size = 256;  // total number of items to add
+    std::vector<float> hinput, output;
+    hinput.reserve(ele_size);
+    output.reserve(ele_size);
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        hinput.push_back(static_cast<float>(i + 1));
+        output.push_back(0.0f);
+    }
+
+    float *dinput1, *dinput2, *doutput;
+    HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
+
+    HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+
+    struct
+    {
+        float* output;
+        float* input1;
+        float* input2;
+        std::size_t size;
+    } args{doutput, dinput1, dinput2, ele_size};
+
+    auto size = sizeof(args);
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
+                      HIP_LAUNCH_PARAM_END};
+
+    HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
+
+    HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
+
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        if ((hinput[i] + hinput[i]) != output[i])
+        {
+            std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
+            std::abort();
+        }
+    }
+    std::cout << "Passed" << std::endl;
+
+    HIP_CHECK(hipFree(dinput1));
+    HIP_CHECK(hipFree(dinput2));
+    HIP_CHECK(hipFree(doutput));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-stop]
@@ -0,0 +1,142 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <hip/hip_complex.h>
+
+#include <cmath>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                      \
+    {                                              \
+        const hipError_t err = expression;         \
+        if (err != hipSuccess) {                   \
+            std::cerr << "HIP error: "             \
+                    << hipGetErrorString(err)      \
+                    << " at " << __LINE__ << "\n"; \
+            exit(EXIT_FAILURE);                    \
+        }                                          \
+    }
+
+// Kernel to compute DFT
+__global__ void computeDFT(const float* input, hipFloatComplex* output, const int N)
+{
+    int k = blockIdx.x * blockDim.x + threadIdx.x;
+    if (k >= N) return;
+
+    hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
+
+    for (int n = 0; n < N; n++)
+    {
+        float angle = -2.0f * M_PI * k * n / N;
+        hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
+        hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
+        sum = hipCaddf(sum, hipCmulf(x, w));
+    }
+
+    output[k] = sum;
+}
+
+// CPU implementation of DFT for verification
+std::vector<hipFloatComplex> cpuDFT(const std::vector<float>& input)
+{
+    const int N = input.size();
+    std::vector<hipFloatComplex> result(N);
+
+    for (int k = 0; k < N; k++)
+    {
+        hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
+        for (int n = 0; n < N; n++)
+        {
+            float angle = -2.0f * M_PI * k * n / N;
+            hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
+            hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
+            sum = hipCaddf(sum, hipCmulf(x, w));
+        }
+        result[k] = sum;
+    }
+    return result;
+}
+
+int main()
+{
+    const int N = 256;  // Signal length
+    const int blockSize = 256;
+
+    // Generate input signal: sum of two sine waves
+    std::vector<float> signal(N);
+    for (int i = 0; i < N; i++)
+    {
+        float t = static_cast<float>(i) / N;
+        signal[i] = sinf(2.0f * M_PI * 10.0f * t) +  // 10 Hz component
+                0.5f * sinf(2.0f * M_PI * 20.0f * t);  // 20 Hz component
+    }
+
+    // Compute reference solution on CPU
+    std::vector<hipFloatComplex> cpu_output = cpuDFT(signal);
+
+    // Allocate device memory
+    float* d_signal;
+    hipFloatComplex* d_output;
+    HIP_CHECK(hipMalloc(&d_signal, N * sizeof(float)));
+    HIP_CHECK(hipMalloc(&d_output, N * sizeof(hipFloatComplex)));
+
+    // Copy input to device
+    HIP_CHECK(hipMemcpy(d_signal, signal.data(), N * sizeof(float), hipMemcpyHostToDevice));
+
+    // Launch kernel
+    dim3 grid((N + blockSize - 1) / blockSize);
+    dim3 block(blockSize);
+    computeDFT<<<grid, block>>>(d_signal, d_output, N);
+    HIP_CHECK(hipGetLastError());
+
+    // Get GPU results
+    std::vector<hipFloatComplex> gpu_output(N);
+    HIP_CHECK(hipMemcpy(gpu_output.data(), d_output, N * sizeof(hipFloatComplex), hipMemcpyDeviceToHost));
+
+    // Verify results
+    bool passed = true;
+    const float tolerance = 1e-5f;  // Adjust based on precision requirements
+
+    for (int i = 0; i < N; i++)
+    {
+        float diff_real = std::abs(hipCrealf(gpu_output[i]) - hipCrealf(cpu_output[i]));
+        float diff_imag = std::abs(hipCimagf(gpu_output[i]) - hipCimagf(cpu_output[i]));
+
+        if (diff_real > tolerance || diff_imag > tolerance)
+        {
+            passed = false;
+            break;
+        }
+    }
+
+    std::cout << "DFT Verification: " << (passed ? "PASSED" : "FAILED") << "\n";
+
+    // Cleanup
+    HIP_CHECK(hipFree(d_signal));
+    HIP_CHECK(hipFree(d_output));
+
+    return passed ? EXIT_SUCCESS : EXIT_FAILURE;
+}
+// [sphinx-end]
@@ -0,0 +1,75 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "example_utils.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+// [sphinx-start]
+constexpr std::size_t const_array_size = 32;
+__constant__ double const_array[const_array_size];
+
+void set_constant_memory(double* values)
+{
+    HIP_CHECK(hipMemcpyToSymbol(const_array, values, const_array_size * sizeof(double)));
+}
+
+__global__ void kernel_using_const_memory(double* array)
+{
+    int warpIdx = threadIdx.x / warpSize;
+    // all threads of a warp read the same const_array element, which gives the best performance
+    array[blockIdx.x] *= const_array[warpIdx];
+}
+// [sphinx-end]
+
+int main()
+{
+    std::size_t elements = 32;
+    std::size_t size_bytes = elements * sizeof(double);
+
+    // allocate host array (value-initialized to 0.0 so that defined values are copied)
+    double *host_array = new double[elements]();
+
+    // allocate device array
+    double *device_array = nullptr;
+    HIP_CHECK(hipMalloc((double**) &device_array, size_bytes));
+
+    // copy from host to the device
+    set_constant_memory(host_array);
+
+    kernel_using_const_memory<<<32, 32>>>(device_array);
+
+    // copy from device to host, to e.g. get results from the kernel
+    HIP_CHECK(hipMemcpy(host_array, device_array, size_bytes, hipMemcpyDeviceToHost));
+
+    // free memory when not needed any more
+    HIP_CHECK(hipFree(device_array));
+    delete[] host_array;
+
+    std::cout << "Success!" << std::endl;
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,84 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+int main()
+{
+    int *a, *b, *c;
+    int deviceId;
+    HIP_CHECK(hipGetDevice(&deviceId)); // Get the current device ID
+
+    // Allocate memory for a, b and c that is accessible to both device and host code.
+    HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
+    HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
+    HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
+
+    // Setup input values.
+    *a = 1;
+    *b = 2;
+
+    // Prefetch the data to the GPU device.
+    HIP_CHECK(hipMemPrefetchAsync(a, sizeof(*a), deviceId, 0));
+    HIP_CHECK(hipMemPrefetchAsync(b, sizeof(*b), deviceId, 0));
+    HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), deviceId, 0));
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(a, b, c);
+
+    // Prefetch the result back to the CPU.
+    HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), hipCpuDeviceId, 0));
+
+    // Wait for the prefetch operations to complete.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Print the result.
+    std::cout << *a << " + " << *b << " = " << *c << std::endl;
+
+    // Cleanup allocated memory.
+    HIP_CHECK(hipFree(a));
+    HIP_CHECK(hipFree(b));
+    HIP_CHECK(hipFree(c));
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,61 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if (err != hipSuccess)                                   \
+    {                                                        \
+        std::cout << "HIP Error: " << hipGetErrorString(err) \
+              << " at line " << __LINE__ << std::endl;       \
+        std::exit(EXIT_FAILURE);                             \
+    }                                                        \
+}
+
+__global__ void test_kernel()
+{
+    // [sphinx-start]
+//#if __CUDA_ARCH__ >= 130 // does not state which feature is required; not portable
+#if __HIP_ARCH_HAS_DOUBLES__ == 1 // explicitly states which feature is required; portable between AMD and NVIDIA GPUs
+    // device code
+#endif
+    // [sphinx-end]
+
+#if __HIP_ARCH_HAS_DOUBLES__ == 1
+    printf("Device has double-precision support.\n");
+#else
+    printf("Device does not have double-precision support.\n");
+#endif
+}
+
+int main()
+{
+    test_kernel<<<1, 1, 0, 0>>>();
+    HIP_CHECK(hipDeviceSynchronize());
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,74 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+int main()
+{
+    int deviceCount;
+    HIP_CHECK(hipGetDeviceCount(&deviceCount));
+    std::cout << "Number of devices: " << deviceCount << std::endl;
+
+    for (int deviceId = 0; deviceId < deviceCount; ++deviceId)
+    {
+        hipDeviceProp_t deviceProp;
+        HIP_CHECK(hipGetDeviceProperties(&deviceProp, deviceId));
+        std::cout << "Device " << deviceId << std::endl << " Properties:" << std::endl;
+        std::cout << "  Name: " << deviceProp.name << std::endl;
+        std::cout << "  Total Global Memory: " << deviceProp.totalGlobalMem / (1024 * 1024) << " MiB" << std::endl;
+        std::cout << "  Shared Memory per Block: " << deviceProp.sharedMemPerBlock / 1024 << " KiB" << std::endl;
+        std::cout << "  Registers per Block: " << deviceProp.regsPerBlock << std::endl;
+        std::cout << "  Warp Size: " << deviceProp.warpSize << std::endl;
+        std::cout << "  Max Threads per Block: " << deviceProp.maxThreadsPerBlock << std::endl;
+        std::cout << "  Max Threads per Multiprocessor: " << deviceProp.maxThreadsPerMultiProcessor << std::endl;
+        std::cout << "  Number of Multiprocessors: " << deviceProp.multiProcessorCount << std::endl;
+        std::cout << "  Max Threads Dimensions: ["
+                << deviceProp.maxThreadsDim[0] << ", "
+                << deviceProp.maxThreadsDim[1] << ", "
+                << deviceProp.maxThreadsDim[2] << "]" << std::endl;
+        std::cout << "  Max Grid Size: ["
+                << deviceProp.maxGridSize[0] << ", "
+                << deviceProp.maxGridSize[1] << ", "
+                << deviceProp.maxGridSize[2] << "]" << std::endl;
+        std::cout << std::endl;
+    }
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,72 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+__device__ unsigned long long fibonacci(unsigned long long n)
+{
+    if (n == 0 || n == 1)
+    {
+        return n;
+    }
+    return fibonacci(n - 1) + fibonacci(n - 2);
+}
+
+__global__ void kernel(unsigned long long n)
+{
+    unsigned long long result = fibonacci(n);
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+
+    if (x == 0)
+        printf("fibonacci(%llu) = %llu\n", n, result);
+}
+
+int main()
+{
+    kernel<<<1, 1>>>(10);
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // With the -O0 optimization option, the following launch hits the stack limit:
+    // kernel<<<1, 256>>>(2048);
+    // HIP_CHECK(hipDeviceSynchronize());
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,100 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+__global__ void simpleKernel(double *data, std::size_t elems)
+{
+    const std::size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if(idx < elems)
+        data[idx] = idx * 2.0;
+}
+
+int main()
+{
+    int deviceCount;
+    HIP_CHECK(hipGetDeviceCount(&deviceCount));
+    if(deviceCount < 2)
+    {
+        std::cout << "This example requires at least two HIP devices." << std::endl;
+        return EXIT_SUCCESS;
+    }
+
+    double* deviceData0;
+    double* deviceData1;
+    constexpr std::size_t elems = 1024;
+    constexpr std::size_t size = elems * sizeof(double);
+
+    int deviceId0 = 0;
+    int deviceId1 = 1;
+
+    // Set device 0 and perform operations
+    HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
+    HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
+    simpleKernel<<<8, 128>>>(deviceData0, elems); // Launch kernel on device 0
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Set device 1 and perform operations
+    HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
+    HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
+    simpleKernel<<<8, 128>>>(deviceData1, elems); // Launch kernel on device 1
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Copy result from device 0
+    double hostData0[elems];
+    HIP_CHECK(hipSetDevice(deviceId0));
+    HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
+
+    // Copy result from device 1
+    double hostData1[elems];
+    HIP_CHECK(hipSetDevice(deviceId1));
+    HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
+
+    // Display results from both devices
+    std::cout << "Device 0 data: " << hostData0[0] << std::endl;
+    std::cout << "Device 1 data: " << hostData1[0] << std::endl;
+
+    // Free device memory
+    HIP_CHECK(hipFree(deviceData0));
+    HIP_CHECK(hipFree(deviceData1));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,64 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "example_utils.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+// [sphinx-start]
+extern __shared__ int dynamic_shared[];
+
+__global__ void kernel(int array1SizeX, int array1SizeY, int array2Size)
+{
+    // at least (array1SizeX * array1SizeY + array2Size) * sizeof(int) bytes of
+    // dynamic shared memory must be allocated when the kernel is launched
+    int* array1 = dynamic_shared;
+    // array1 is interpreted as a 2D array of size array1SizeX * array1SizeY:
+    int array1Size = array1SizeX * array1SizeY;
+
+    int* array2 = &(array1[array1Size]);
+
+    if(threadIdx.x < array1SizeX && threadIdx.y < array1SizeY)
+    {
+        // access array1 with threadIdx.x + threadIdx.y * array1SizeX
+    }
+    if(threadIdx.x < array2Size)
+    {
+        // access array2 with threadIdx.x
+    }
+}
+// [sphinx-end]
+
+int main()
+{
+    std::size_t shared_memory_bytes = (512 * 1 + 512) * sizeof(int);
+    kernel<<<64, 512, shared_memory_bytes>>>(512, 1, 512);
+    HIP_CHECK(hipPeekAtLastError());
+    HIP_CHECK(hipDeviceSynchronize());
+
+    std::cout << "Success!" << std::endl;
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,74 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+int main()
+{
+    int *a, *b, *c;
+
+    // Allocate memory for a, b and c that is accessible to both device and host code.
+    HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
+    HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
+    HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
+
+    // Setup input values.
+    *a = 1;
+    *b = 2;
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(a, b, c);
+
+    // Wait for GPU to finish before accessing on host.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Print the result.
+    std::cout << *a << " + " << *b << " = " << *c << std::endl;
+
+    // Cleanup allocated memory.
+    HIP_CHECK(hipFree(a));
+    HIP_CHECK(hipFree(b));
+    HIP_CHECK(hipFree(c));
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,97 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <algorithm>
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                  \
+{                                              \
+    const hipError_t status = expression;      \
+    if(status != hipSuccess)                   \
+    {                                          \
+        std::cerr << "HIP error "              \
+                  << status << ": "            \
+                  << hipGetErrorString(status) \
+                  << " at " << __FILE__ << ":" \
+                  << __LINE__ << std::endl;    \
+    }                                          \
+}
+
+// Element-wise addition of two arrays.
+__global__ void add(int *a, int *b, int *c, std::size_t size)
+{
+    const std::size_t index = threadIdx.x + blockDim.x * blockIdx.x;
+    if(index < size)
+    {
+        c[index] = a[index] + b[index];
+    }
+}
+
+int main()
+{
+    constexpr int numOfBlocks = 256;
+    constexpr int threadsPerBlock = 256;
+    constexpr std::size_t arraySize = 1U << 16;
+
+    std::vector<int> a(arraySize), b(arraySize), c(arraySize);
+    int *d_a, *d_b, *d_c;
+
+    // Setup input values.
+    std::fill(a.begin(), a.end(), 1);
+    std::fill(b.begin(), b.end(), 2);
+
+    // Allocate device copies of a, b and c.
+    HIP_CHECK(hipMalloc(&d_a, arraySize * sizeof(int)));
+    HIP_CHECK(hipMalloc(&d_b, arraySize * sizeof(int)));
+    HIP_CHECK(hipMalloc(&d_c, arraySize * sizeof(int)));
+
+    // Copy input values to device.
+    HIP_CHECK(hipMemcpy(d_a, a.data(), arraySize * sizeof(int), hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(d_b, b.data(), arraySize * sizeof(int), hipMemcpyHostToDevice));
+
+    // Launch add() kernel on GPU.
+    add<<<numOfBlocks, threadsPerBlock>>>(d_a, d_b, d_c, arraySize);
+    // Check the kernel launch
+    HIP_CHECK(hipGetLastError());
+    // Check for kernel execution error
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Copy the result back to the host.
+    HIP_CHECK(hipMemcpy(c.data(), d_c, arraySize * sizeof(int), hipMemcpyDeviceToHost));
+
+    // Cleanup allocated memory.
+    HIP_CHECK(hipFree(d_a));
+    HIP_CHECK(hipFree(d_b));
+    HIP_CHECK(hipFree(d_c));
+
+    // Print the result.
+    std::cout << a[0] << " + " << b[0] << " = " << c[0] << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,153 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+// GPU Kernels
+__global__ void kernelA(double* arrayA, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] += 1.0;
+    }
+}
+
+__global__ void kernelB(double* arrayA, double* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayB[x] += arrayA[x] + 3.0;
+    }
+}
+
+int main()
+{
+    constexpr int numOfBlocks = 1 << 20;
+    constexpr int threadsPerBlock = 1024;
+    constexpr int numberOfIterations = 50;
+    // The array size is kept smaller so that the kernel run time is not too short relative to the memory copies
+    constexpr std::size_t arraySize = 1U << 25;
+    double *d_dataA;
+    double *d_dataB;
+    double initValueA = 0.0;
+    double initValueB = 2.0;
+
+    std::vector<double> vectorA(arraySize, initValueA);
+    std::vector<double> vectorB(arraySize, initValueB);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
+    HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
+    // Create streams
+    hipStream_t streamA, streamB;
+    HIP_CHECK(hipStreamCreate(&streamA));
+    HIP_CHECK(hipStreamCreate(&streamB));
+    // Create events
+    hipEvent_t event, eventA, eventB;
+    HIP_CHECK(hipEventCreate(&event));
+    HIP_CHECK(hipEventCreate(&eventA));
+    HIP_CHECK(hipEventCreate(&eventB));
+    for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
+    {
+        // Stream 1: Host to Device 1
+        HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
+        // Stream 2: Host to Device 2
+        HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
+        // Stream 1: Kernel 1
+        kernelA<<<numOfBlocks, threadsPerBlock, 0, streamA>>>(d_dataA, arraySize);
+        // Record event after the GPU kernel in Stream 1
+        HIP_CHECK(hipEventRecord(event, streamA));
+        // Stream 2: Wait for event before starting Kernel 2
+        HIP_CHECK(hipStreamWaitEvent(streamB, event, 0));
+        // Stream 2: Kernel 2
+        kernelB<<<numOfBlocks, threadsPerBlock, 0, streamB>>>(d_dataA, d_dataB, arraySize);
+        // Stream 1: Device to Host 1 (after Kernel 1)
+        HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
+        // Stream 2: Device to Host 2 (after Kernel 2)
+        HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
+        // Block the host until all operations in both streams have completed
+        HIP_CHECK(hipEventRecord(eventA, streamA));
+        HIP_CHECK(hipEventRecord(eventB, streamB));
+        HIP_CHECK(hipEventSynchronize(eventA));
+        HIP_CHECK(hipEventSynchronize(eventB));
+    }
+    // Verify results
+    double expectedA = (double)numberOfIterations;
+    double expectedB = initValueB + (3.0 * numberOfIterations) + (expectedA * (expectedA + 1.0)) / 2.0;
+    bool passed = true;
+    for(std::size_t i = 0; i < arraySize; ++i)
+    {
+        if(vectorA[i] != expectedA)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << std::endl;
+            break;
+        }
+        if(vectorB[i] != expectedB)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expectedB << " got " <<  vectorB[i] << std::endl;
+            break;
+        }
+    }
+    
+    if(passed)
+    {
+        std::cout << "Asynchronous execution with events completed successfully." << std::endl;
+    }
+    else
+    {
+        std::cerr << "Asynchronous execution with events failed." << std::endl;
+    }
+
+    // Cleanup
+    HIP_CHECK(hipEventDestroy(event));
+    HIP_CHECK(hipEventDestroy(eventA));
+    HIP_CHECK(hipEventDestroy(eventB));
+    HIP_CHECK(hipStreamDestroy(streamA));
+    HIP_CHECK(hipStreamDestroy(streamB));
+    HIP_CHECK(hipFree(d_dataA));
+    HIP_CHECK(hipFree(d_dataB));
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,58 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "example_utils.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+int main()
+{
+    // [sphinx-start]
+    std::size_t elements = 1 << 20;
+    std::size_t size_bytes = elements * sizeof(int);
+
+    // allocate host and device memory
+    int *host_pointer = new int[elements];
+    int *device_input, *device_result;
+    HIP_CHECK(hipMalloc(&device_input, size_bytes));
+    HIP_CHECK(hipMalloc(&device_result, size_bytes));
+
+    // copy from host to the device
+    HIP_CHECK(hipMemcpy(device_input, host_pointer, size_bytes, hipMemcpyHostToDevice));
+
+    // Use memory on the device, e.g. execute kernels
+
+    // copy from device to host, e.g. to get results from the kernel
+    HIP_CHECK(hipMemcpy(host_pointer, device_result, size_bytes, hipMemcpyDeviceToHost));
+
+    // free memory when not needed any more
+    HIP_CHECK(hipFree(device_result));
+    HIP_CHECK(hipFree(device_input));
+    delete[] host_pointer;
+    // [sphinx-end]
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,79 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+int main()
+{
+    int a, b, c;
+    int *d_a, *d_b, *d_c;
+
+    // Setup input values.
+    a = 1;
+    b = 2;
+
+    // Allocate device copies of a, b and c.
+    HIP_CHECK(hipMalloc(&d_a, sizeof(*d_a)));
+    HIP_CHECK(hipMalloc(&d_b, sizeof(*d_b)));
+    HIP_CHECK(hipMalloc(&d_c, sizeof(*d_c)));
+
+    // Copy input values to device.
+    HIP_CHECK(hipMemcpy(d_a, &a, sizeof(*d_a), hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(d_b, &b, sizeof(*d_b), hipMemcpyHostToDevice));
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(d_a, d_b, d_c);
+
+    // Copy the result back to the host.
+    HIP_CHECK(hipMemcpy(&c, d_c, sizeof(*d_c), hipMemcpyDeviceToHost));
+
+    // Cleanup allocated memory.
+    HIP_CHECK(hipFree(d_a));
+    HIP_CHECK(hipFree(d_b));
+    HIP_CHECK(hipFree(d_c));
+
+    // Prints the result.
+    std::cout << a << " + " << b << " = " << c << std::endl;
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,53 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+extern __shared__ int shared_array[];
+
+__global__ void kernel()
+{
+    // Initialize shared memory.
+    shared_array[threadIdx.x] = threadIdx.x;
+    // Synchronize so that all threads of the block see every change made to
+    // shared memory before any thread uses it.
+    __syncthreads();
+}
+
+int main()
+{
+    // The shared memory size in this case depends on the configurable block size.
+    constexpr int blockSize = 256;
+    constexpr int sharedMemSize = blockSize * sizeof(int);
+    constexpr int gridSize = 2;
+
+    kernel<<<dim3(gridSize), dim3(blockSize), sharedMemSize, 0>>>();
+    if(auto err = hipDeviceSynchronize(); err != hipSuccess)
+        std::cerr << "HIP error " << err << ": " << hipGetErrorString(err) << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,606 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "backprojection.hpp"
+#include "filtering.hpp"
+#include "log_transform.hpp"
+#include "normalization.hpp"
+#include "phantom.hpp"
+#include "projection.hpp"
+#include "utility.hpp"
+#include "weighting.hpp"
+#include "volume.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <hipfft/hipfft.h>
+
+#include <algorithm>
+#include <chrono>
+#include <cmath>
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <iostream>
+#include <numbers>
+#include <ostream>
+#include <set>
+#include <stdexcept>
+#include <vector>
+
+auto main() -> int
+{
+    try
+    {
+        auto hasTextures = int{0};
+        hip_check(hipDeviceGetAttribute(&hasTextures, hipDeviceAttributeImageSupport, 0));
+
+        // Fetch device properties
+        auto devProps = hipDeviceProp_t{};
+        hip_check(hipGetDeviceProperties(&devProps, 0));
+
+        auto const numStreams = devProps.asyncEngineCount;
+
+        std::cout << "Device has " << numStreams << " asynchronous engines; preprocessing will use "
+                  << numStreams << " parallel streams." << std::endl;
+
+        auto streams = std::vector<hipStream_t>{};
+        streams.resize(numStreams);
+        for(auto&& stream : streams)
+            hip_check(hipStreamCreate(&stream));
+
+        auto r = static_cast<float*>(nullptr);
+        auto R = static_cast<hipfftComplex*>(nullptr);
+
+        auto forwardPlans = std::vector<hipfftHandle>{};
+        auto forwardSizes = std::vector<std::size_t>{};
+        auto backwardPlans = std::vector<hipfftHandle>{};
+        auto backwardSizes = std::vector<std::size_t>{};
+        forwardPlans.resize(numStreams);
+        forwardSizes.resize(numStreams);
+        backwardPlans.resize(numStreams);
+        backwardSizes.resize(numStreams);
+
+        auto projections = std::vector<float*>{};
+        auto projectionPitches = std::vector<std::size_t>{};
+        auto expandedProjections = std::vector<float*>{};
+        auto expandedPitches = std::vector<std::size_t>{};
+        auto transformedProjections = std::vector<hipfftComplex*>{};
+        auto transformedPitches = std::vector<std::size_t>{};
+        auto textureProjections = std::vector<hipTextureObject_t>{};
+
+        auto projGeom = phantom::make_projectionGeometry();
+        auto volGeom = phantom::make_volumeGeometry();
+        auto phantomProjections = phantom::make_projections(projGeom, volGeom, streams);
+        
+        std::cout << "Initializing... " << std::flush;
+
+        auto stream = streams.at(0);
+
+        // Create filter kernel
+        hip_check(hipMalloc(reinterpret_cast<void**>(&r), projGeom.dimFFT.x * sizeof(float)));
+        auto const creationBlocks = std::max((projGeom.dimFFT.x / 1024u), 1u);
+        filter_creation_kernel<<<creationBlocks, 1024, 0, stream>>>(r, projGeom.s_dimFFT.x, projGeom.pixelDim.x);
+
+        hip_check(hipMalloc(reinterpret_cast<void**>(&R), projGeom.dimTrans.x * sizeof(hipfftComplex)));
+        auto filterPlan = hipfftHandle{};
+        hipfft_check(hipfftPlan1d(&filterPlan, projGeom.dimFFT.x, HIPFFT_R2C, 1));
+        hipfft_check(hipfftSetStream(filterPlan, stream));
+        hipfft_check(hipfftExecR2C(filterPlan, r, R));
+
+        auto absoluteBlocks = (projGeom.dimTrans.x / 1024u) + 1u;
+        filter_absolute_kernel<<<absoluteBlocks, 1024, 0, stream>>>(R, projGeom.dimTrans.x, projGeom.pixelDim.x);
+
+        hip_check(hipStreamSynchronize(stream));
+        
+        hipfft_check(hipfftDestroy(filterPlan));
+        hip_check(hipFree(r));
+
+        auto const inputProjSingle = projGeom.dim.x * projGeom.dim.y * sizeof(std::uint16_t);
+        auto const inputProjTotal = inputProjSingle * numStreams;
+        auto const projSingle = projGeom.dim.x * projGeom.dim.y * sizeof(float);
+        auto const projTotal = projSingle * numStreams;
+        auto const expandedSingle = projGeom.dimFFT.x * projGeom.dimFFT.y * sizeof(float);
+        auto const expandedTotal = expandedSingle * numStreams;
+        auto const transformedSingle = projGeom.dimTrans.x * projGeom.dimTrans.y * sizeof(hipfftComplex);
+        auto const transformedTotal = transformedSingle * numStreams;
+        auto const volumeTotal = volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float);
+        auto const memTotal = inputProjTotal + projTotal + expandedTotal + transformedTotal + volumeTotal;
+        auto devMemFree = std::size_t{};
+        auto devMemTotal = std::size_t{};
+        hip_check(hipMemGetInfo(&devMemFree, &devMemTotal));
+
+        auto memRequired = static_cast<std::size_t>(memTotal);
+        if(memRequired > devMemFree)
+        {
+            std::cerr << "Not enough device memory. Required: " << memRequired 
+                      << ", available: " << devMemFree << std::endl;
+            return EXIT_FAILURE;
+        }
+
+        std::cout << "Done!" << std::endl;
+        std::cout << "Volume dimensions: " << volGeom.dim.x << " x "
+                                           << volGeom.dim.y << " x "
+                                           << volGeom.dim.z << std::endl;
+
+        // Initialize per-stream data
+        for(auto streamIdx = 0u; streamIdx < streams.size(); ++streamIdx)
+        {
+            std::cout << "Initializing stream " << streamIdx << "... " << std::flush;
+            auto stream = streams.at(streamIdx);
+
+            auto proj = static_cast<float*>(nullptr);
+            auto projPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&proj), &projPitch, projGeom.dim.x * sizeof(float), projGeom.dim.y
+            ));
+            projections.push_back(proj);
+            projectionPitches.push_back(projPitch);
+
+            auto expanded = static_cast<float*>(nullptr);
+            auto expandedPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&expanded),
+                &expandedPitch,
+                projGeom.dimFFT.x * sizeof(float),
+                projGeom.dimFFT.y
+            ));
+            expandedProjections.push_back(expanded);
+            expandedPitches.push_back(expandedPitch);
+            
+            auto transformed = static_cast<hipfftComplex*>(nullptr);
+            auto transformedPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&transformed),
+                &transformedPitch,
+                projGeom.dimTrans.x * sizeof(hipfftComplex),
+                projGeom.dimTrans.y
+            ));
+            transformedProjections.push_back(transformed);
+            transformedPitches.push_back(transformedPitch);
+
+            auto& forward = forwardPlans.at(streamIdx);
+            auto& forwardSize = forwardSizes.at(streamIdx);
+            auto fw_inembed = static_cast<int>(expandedPitch / sizeof(float));
+            auto fw_istride = 1;
+            auto fw_idist = fw_inembed;
+            auto fw_onembed = static_cast<int>(transformedPitch / sizeof(hipfftComplex));
+            auto fw_ostride = 1;
+            auto fw_odist = fw_onembed;
+            hipfft_check(hipfftCreate(&forward));
+            hipfft_check(hipfftMakePlanMany(forward, 1, &projGeom.s_dimFFT.x,
+                                            &fw_inembed, fw_istride, fw_idist,
+                                            &fw_onembed, fw_ostride, fw_odist,
+                                            HIPFFT_R2C, projGeom.s_dimFFT.y, &forwardSize));
+            hipfft_check(hipfftSetStream(forward, stream));
+
+            auto& backward = backwardPlans.at(streamIdx);
+            auto& backwardSize = backwardSizes.at(streamIdx);
+            auto bw_inembed = fw_onembed;
+            auto bw_istride = fw_ostride;
+            auto bw_idist = fw_odist;
+            auto bw_onembed = fw_inembed;
+            auto bw_ostride = fw_istride;
+            auto bw_odist = fw_idist;
+            hipfft_check(hipfftCreate(&backward));
+            hipfft_check(hipfftMakePlanMany(backward, 1, &projGeom.s_dimFFT.x,
+                                            &bw_inembed, bw_istride, bw_idist,
+                                            &bw_onembed, bw_ostride, bw_odist,
+                                            HIPFFT_C2R, projGeom.s_dimFFT.y, &backwardSize));
+            hipfft_check(hipfftSetStream(backward, stream));
+
+            if(hasTextures)
+            {
+                // create a HIP texture from the projection
+                auto resDesc = hipResourceDesc{};
+                resDesc.resType = hipResourceTypePitch2D;
+                resDesc.res.pitch2D.desc = hipCreateChannelDesc<float>();
+                resDesc.res.pitch2D.devPtr = static_cast<void*>(proj);
+                resDesc.res.pitch2D.width = projGeom.dim.x;
+                resDesc.res.pitch2D.height = projGeom.dim.y;
+                resDesc.res.pitch2D.pitchInBytes = projPitch;
+
+                auto texDesc = hipTextureDesc{};
+                texDesc.addressMode[0] = hipAddressModeBorder;
+                texDesc.addressMode[1] = hipAddressModeBorder;
+                texDesc.readMode = hipReadModeElementType;
+                texDesc.borderColor[0] = 0.f;
+                texDesc.borderColor[1] = 0.f;
+                texDesc.filterMode = hipFilterModeLinear;
+                texDesc.normalizedCoords = 0;
+
+                auto& projTex = textureProjections.emplace_back();
+                hip_check(hipCreateTextureObject(&projTex, &resDesc, &texDesc, nullptr));
+            }
+
+            std::cout << "Done!" << std::endl;
+        }
+
+        create_volume("volume.tif");
+        auto hostVolPtr = static_cast<float*>(nullptr);
+        hip_check(hipHostMalloc(
+            reinterpret_cast<void**>(&hostVolPtr),
+            volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float),
+            hipHostMallocDefault
+        ));
+        auto hostVol = make_hipPitchedPtr(
+            hostVolPtr, volGeom.dim.x * sizeof(float), volGeom.dim.x, volGeom.dim.y
+        );    
+        auto vol = hipPitchedPtr{};
+        auto volExt = make_hipExtent(volGeom.dim.x * sizeof(float), volGeom.dim.y, volGeom.dim.z);
+        hip_check(hipMalloc3D(&vol, volExt));
+        hip_check(hipMemset3D(vol, 0, volExt));
+
+        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+        // MAIN LOOP
+        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+        // [sphinx-graph-vars-start]
+        auto graphCreated = false;
+        auto graphExec = hipGraphExec_t{};
+        auto graphFinalCreated = false;
+        auto graphExecFinal = hipGraphExec_t{};
+        auto graphStream = hipStream_t{};
+        hip_check(hipStreamCreate(&graphStream));
+        // [sphinx-graph-vars-end]
+
+        auto start = std::chrono::steady_clock::now();
+            
+        auto projIdx = 0u;
+        while(projIdx < projGeom.numProj)
+        {
+            // [sphinx-begin-capture-start]
+            // Capture the current batch into a graph template
+            auto graph = hipGraph_t{};
+            hip_check(hipStreamBeginCapture(streams.at(0), hipStreamCaptureModeGlobal));
+            // [sphinx-begin-capture-end]
+
+            auto batchSize = std::min(numStreams, static_cast<int>(projGeom.numProj - projIdx));
+
+            // [sphinx-fork-start]
+            // Fork: Record events on stream 0, then have other streams wait
+            for(auto streamIdx = 1; streamIdx < batchSize; ++streamIdx)
+            {
+                auto forkEvent = hipEvent_t{};
+                hip_check(hipEventCreate(&forkEvent));
+                hip_check(hipEventRecord(forkEvent, streams.at(0)));
+                hip_check(hipStreamWaitEvent(streams.at(streamIdx), forkEvent, 0));
+                hip_check(hipEventDestroy(forkEvent)); // Can destroy after wait is recorded
+            }
+            // [sphinx-fork-end]
+
+            // Launch batch in parallel streams
+            for(auto streamIdx = 0; streamIdx < batchSize; ++streamIdx, ++projIdx)
+            {
+                auto stream = streams.at(streamIdx);
+
+                auto threadsPerBlock = dim3{32, 32, 1};
+                auto blocksPerGrid = dim3{
+                    (projGeom.dim.x / threadsPerBlock.x) + 1, (projGeom.dim.y / threadsPerBlock.y) + 1, 1
+                };
+
+                auto inputPitchedPtr = phantomProjections.at(projIdx);
+                auto input = static_cast<std::uint16_t*>(inputPitchedPtr.ptr);
+                auto inputPitch = inputPitchedPtr.pitch;
+
+                auto proj = projections.at(streamIdx);
+                auto projPitch = projectionPitches.at(streamIdx);
+                normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    input, inputPitch, proj, projPitch, projGeom.dim, projGeom.bps
+                );
+
+                log_transformation_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(proj, projPitch, projGeom.dim);
+
+                weighting_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    proj,
+                    projPitch,
+                    projGeom.dim,
+                    projGeom.d_sd,
+                    projGeom.d_so,
+                    projGeom.minCoord,
+                    projGeom.pixelDim
+                );
+
+                // Expand projection to filter length
+                auto expanded = expandedProjections.at(streamIdx);
+                auto expandedPitch = expandedPitches.at(streamIdx);
+                hip_check(hipMemset2DAsync(
+                    expanded, expandedPitch, 0, projGeom.dimFFT.x * sizeof(float), projGeom.dimFFT.y, stream
+                ));
+                hip_check(hipMemcpy2DAsync(
+                    expanded,
+                    expandedPitch,
+                    proj,
+                    projPitch,
+                    projGeom.dim.x * sizeof(float),
+                    projGeom.dim.y,
+                    hipMemcpyDeviceToDevice,
+                    stream
+                ));
+
+                // R2C Fourier-transform projection
+                auto transformed = transformedProjections.at(streamIdx);
+                auto transformedPitch = transformedPitches.at(streamIdx);
+                hip_check(hipMemset2DAsync(
+                    transformed,
+                    transformedPitch,
+                    0,
+                    projGeom.dimTrans.x * sizeof(hipfftComplex),
+                    projGeom.dimTrans.y,
+                    stream
+                ));
+                auto& forward = forwardPlans.at(streamIdx);
+                hipfft_check(hipfftExecR2C(forward, expanded, transformed));
+                
+                // Apply filter
+                auto filterBlocksPerGrid = dim3{
+                    (projGeom.dimTrans.x / threadsPerBlock.x) + 1,
+                    (projGeom.dimTrans.y / threadsPerBlock.y) + 1,
+                    1
+                };
+                filter_application_kernel<<<filterBlocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    transformed, transformedPitch, R, projGeom.dimTrans
+                );
+                
+                auto& backward = backwardPlans.at(streamIdx);
+                hipfft_check(hipfftExecC2R(backward, transformed, expanded));
+
+                // Shrink projection to original size and normalize
+                hip_check(hipMemcpy2DAsync(
+                    proj,
+                    projPitch,
+                    expanded,
+                    expandedPitch,
+                    projGeom.dim.x * sizeof(float),
+                    projGeom.dim.y,
+                    hipMemcpyDeviceToDevice,
+                    stream
+                ));
+                
+                filter_normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    proj, projPitch, projGeom.dimFFT.x, projGeom.dim
+                );
+
+                // Backprojection
+                auto thetaDeg = projGeom.thetaSign * projGeom.thetaStep * projIdx; // Current angle
+                auto thetaRad = thetaDeg * std::numbers::pi_v<float> / 180.f; // Convert to radians
+                auto sinTheta = std::sin(thetaRad);
+                auto cosTheta = std::cos(thetaRad);
+
+                auto bpBlockSize = dim3{32u, 8u, 4u};
+                auto bpBlocks = dim3{
+                    static_cast<std::uint32_t>(volGeom.dim.x / bpBlockSize.x + 1),
+                    static_cast<std::uint32_t>(volGeom.dim.y / bpBlockSize.y + 1),
+                    static_cast<std::uint32_t>(volGeom.dim.z / bpBlockSize.z + 1)
+                };
+
+                if(hasTextures)
+                {
+                    auto& projTex = textureProjections.at(streamIdx);
+                    backprojection_kernel<<<bpBlocks, bpBlockSize, 0, stream>>>(
+                        static_cast<float*>(vol.ptr),
+                        vol.pitch,
+                        volGeom.dim,
+                        volGeom.voxelDim,
+                        projTex,
+                        projGeom.minCoord,
+                        sinTheta,
+                        cosTheta,
+                        projGeom.pixelDim,
+                        projGeom.d_sd,
+                        projGeom.d_so
+                    );
+                }
+                else
+                {
+                    // Fallback for devices without support for texture instructions
+                    backprojection_kernel_no_tex<<<bpBlocks, bpBlockSize, 0, stream>>>(
+                        static_cast<float*>(vol.ptr),
+                        vol.pitch,
+                        volGeom.dim,
+                        volGeom.voxelDim,
+                        proj,
+                        projPitch,
+                        projGeom.dim,
+                        projGeom.minCoord,
+                        sinTheta,
+                        cosTheta,
+                        projGeom.pixelDim,
+                        projGeom.d_sd,
+                        projGeom.d_so
+                    );
+                }
+            }
+
+            // [sphinx-join-start]
+            // Join: Record events on all streams except stream 0, then have stream 0 wait
+            for(auto streamIdx = 1; streamIdx < batchSize; ++streamIdx)
+            {
+                auto joinEvent = hipEvent_t{};
+                hip_check(hipEventCreate(&joinEvent));
+                hip_check(hipEventRecord(joinEvent, streams.at(streamIdx)));
+                hip_check(hipStreamWaitEvent(streams.at(0), joinEvent, 0));
+                hip_check(hipEventDestroy(joinEvent)); // Can destroy after wait is recorded
+            }
+            // [sphinx-join-end]
+
+            // [sphinx-stop-capture-start]
+            // Stop capturing -- this will stop capturing on all streams
+            hip_check(hipStreamEndCapture(streams.at(0), &graph));
+            // [sphinx-stop-capture-end]
+
+            // Instantiate and launch the graph
+            if(batchSize == numStreams)
+            {
+                // [sphinx-graph-instantiate-start]
+                if(!graphCreated)
+                {
+                    hip_check(hipGraphDebugDotPrint(graph, "graph_capture.dot", hipGraphDebugDotFlagsVerbose));
+                    
+                    hip_check(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+                    hip_check(hipGraphDestroy(graph));
+                    hip_check(hipGraphLaunch(graphExec, graphStream));
+                    graphCreated = true;
+                }
+                // [sphinx-graph-instantiate-end]
+                // [sphinx-graph-update-start]
+                else
+                {
+                    // Update existing executable graph after each iteration with new input data
+                    auto result = hipGraphExecUpdateResult{};
+                    auto errorNode = hipGraphNode_t{};
+                    hip_check(hipGraphExecUpdate(graphExec, graph, &errorNode, &result));
+                    if(result != hipGraphExecUpdateSuccess)
+                    {
+                        auto msg = std::string{"Failed to update graph: "};
+                        switch(result)
+                        {
+                        case hipGraphExecUpdateError:
+                            msg += "Update failed for an unexpected reason.";
+                            break;
+                        case hipGraphExecUpdateErrorFunctionChanged:
+                            msg += "Function of kernel node changed.";
+                            break;
+                        case hipGraphExecUpdateErrorNodeTypeChanged:
+                            msg += "Type of node changed.";
+                            break;
+                        case hipGraphExecUpdateErrorNotSupported:
+                            msg += "Something about the node is not supported.";
+                            break;
+                        case hipGraphExecUpdateErrorParametersChanged:
+                            msg += "Unsupported parameter change.";
+                            break;
+                        case hipGraphExecUpdateErrorTopologyChanged:
+                            msg += "Graph topology changed.";
+                            break;
+                        case hipGraphExecUpdateErrorUnsupportedFunctionChange:
+                            msg += "Unsupported change of kernel node function.";
+                            break;
+                        default:
+                            msg += "Unknown error.";
+                            break;
+                        }
+                        throw std::runtime_error{msg};
+                    }
+                    hip_check(hipGraphDestroy(graph));
+                    hip_check(hipGraphLaunch(graphExec, graphStream));
+                }
+                // [sphinx-graph-update-end]
+            }
+            else
+            {
+                // [sphinx-graph-final-start]
+                hip_check(hipGraphDebugDotPrint(graph, "graph_capture_final.dot", hipGraphDebugDotFlagsVerbose));
+
+                // Incomplete batch: topology changed, must instantiate new executable graph
+                hip_check(hipGraphInstantiate(&graphExecFinal, graph, nullptr, nullptr, 0));
+                hip_check(hipGraphDestroy(graph));
+                hip_check(hipGraphLaunch(graphExecFinal, graphStream));
+                // [sphinx-graph-final-end]
+                
+                graphFinalCreated = true;
+            }
+        }
+        
+        // Obtain reconstruction time before copying back the result
+        auto stop = std::chrono::steady_clock::time_point{};
+        hip_check(hipLaunchHostFunc(graphStream, [](void* data)
+        {
+            auto& stop = *(static_cast<std::chrono::steady_clock::time_point*>(data));
+            stop = std::chrono::steady_clock::now();
+        }, static_cast<void*>(&stop)));
+
+        // Copy volume back to host and save
+        auto memcpyParams = hipMemcpy3DParms{};
+        std::memset(&memcpyParams, 0, sizeof(hipMemcpy3DParms));
+        memcpyParams.dstPos = make_hipPos(0, 0, 0);
+        memcpyParams.dstPtr = hostVol;
+        memcpyParams.srcPos = make_hipPos(0, 0, 0);
+        memcpyParams.srcPtr = vol;
+        memcpyParams.extent = volExt;
+        memcpyParams.kind = hipMemcpyDeviceToHost;
+        hip_check(hipMemcpy3DAsync(&memcpyParams, graphStream));
+            
+        auto saveVolArgs = new save_volume_args
+        {
+            "volume.tif",
+            hostVolPtr,
+            volGeom.dim.x, volGeom.dim.y, volGeom.dim.z,
+            volGeom.voxelDim.x, volGeom.voxelDim.y
+        };
+        hip_check(hipLaunchHostFunc(graphStream, save_volume, saveVolArgs));
+
+        std::cout << "All work items enqueued, waiting for completion... " << std::flush;
+        hip_check(hipStreamSynchronize(graphStream));
+        std::cout << "Done!" << std::endl;
+        
+        auto const elapsed = std::chrono::duration<double>{stop - start};
+        std::cout << "Reconstruction time: " << elapsed.count() << 's' << std::endl;
+
+        // Cleanup
+        if(graphFinalCreated)
+            hip_check(hipGraphExecDestroy(graphExecFinal));
+        hip_check(hipGraphExecDestroy(graphExec));
+        hip_check(hipStreamDestroy(graphStream));
+
+        hip_check(hipFree(vol.ptr));
+        hip_check(hipFreeHost(hostVolPtr));
+
+        if(hasTextures)
+        {
+            for(auto&& tex : textureProjections)
+                hip_check(hipDestroyTextureObject(tex));
+        }
+
+        for(auto&& plan : backwardPlans)
+            hipfft_check(hipfftDestroy(plan));
+
+        for(auto&& plan : forwardPlans)
+            hipfft_check(hipfftDestroy(plan));
+
+        for(auto&& p : transformedProjections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : expandedProjections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : projections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : phantomProjections)
+            hip_check(hipFree(p.ptr));
+
+        hip_check(hipFree(R));
+
+        for(auto&& stream : streams)
+            hip_check(hipStreamDestroy(stream));
+
+        hip_check(hipDeviceSynchronize());
+
+        return EXIT_SUCCESS;
+    }
+    catch(std::runtime_error const& e)
+    {
+        std::cerr << "Caught runtime error: " << e.what() << std::endl;
+        return EXIT_FAILURE;
+    }
+}
@@ -0,0 +1,804 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "backprojection.hpp"
+#include "filtering.hpp"
+#include "log_transform.hpp"
+#include "normalization.hpp"
+#include "phantom.hpp"
+#include "projection.hpp"
+#include "utility.hpp"
+#include "weighting.hpp"
+#include "volume.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <hipfft/hipfft.h>
+
+#include <algorithm>
+#include <chrono>
+#include <cmath>
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <iostream>
+#include <iterator>
+#include <numbers>
+#include <ostream>
+#include <set>
+#include <stdexcept>
+#include <vector>
+
+auto main() -> int
+{
+    try
+    {
+        auto hasTextures = int{0};
+        hip_check(hipDeviceGetAttribute(&hasTextures, hipDeviceAttributeImageSupport, 0));
+
+        // Fetch device properties
+        auto devProps = hipDeviceProp_t{};
+        hip_check(hipGetDeviceProperties(&devProps, 0));
+
+        auto const numBranches = devProps.asyncEngineCount;
+
+        std::cout << "Device supports " << numBranches << " asynchronous engines; preprocessing will use "
+                  << numBranches << " parallel branches." << std::endl;
+        
+        // For interoperability with hipFFT we require streams and events
+        auto streams = std::vector<hipStream_t>{};
+        streams.resize(numBranches);
+        for(auto&& stream : streams)
+            hip_check(hipStreamCreate(&stream));
+
+        auto r = static_cast<float*>(nullptr);
+        auto R = static_cast<hipfftComplex*>(nullptr);
+
+        auto forwardPlans = std::vector<hipfftHandle>{};
+        auto forwardSizes = std::vector<std::size_t>{};
+        auto backwardPlans = std::vector<hipfftHandle>{};
+        auto backwardSizes = std::vector<std::size_t>{};
+        forwardPlans.resize(numBranches);
+        forwardSizes.resize(numBranches);
+        backwardPlans.resize(numBranches);
+        backwardSizes.resize(numBranches);
+
+        auto projections = std::vector<float*>{};
+        auto projectionPitches = std::vector<std::size_t>{};
+        auto expandedProjections = std::vector<float*>{};
+        auto expandedPitches = std::vector<std::size_t>{};
+        auto transformedProjections = std::vector<hipfftComplex*>{};
+        auto transformedPitches = std::vector<std::size_t>{};
+        auto textureProjections = std::vector<hipTextureObject_t>{};
+
+        auto projGeom = phantom::make_projectionGeometry();
+        auto volGeom = phantom::make_volumeGeometry();
+        auto phantomProjections = phantom::make_projections(projGeom, volGeom, streams);
+        
+        std::cout << "Initializing... " << std::flush;
+
+        auto stream = streams.at(0);
+
+        // Create filter kernel
+        hip_check(hipMalloc(reinterpret_cast<void**>(&r), projGeom.dimFFT.x * sizeof(float)));
+        auto const creationBlocks = std::max((projGeom.dimFFT.x / 1024u), 1u);
+        filter_creation_kernel<<<creationBlocks, 1024, 0, stream>>>(r, projGeom.s_dimFFT.x, projGeom.pixelDim.x);
+
+        hip_check(hipMalloc(reinterpret_cast<void**>(&R), projGeom.dimTrans.x * sizeof(hipfftComplex)));
+        auto filterPlan = hipfftHandle{};
+        hipfft_check(hipfftPlan1d(&filterPlan, projGeom.dimFFT.x, HIPFFT_R2C, 1));
+        hipfft_check(hipfftSetStream(filterPlan, stream));
+        hipfft_check(hipfftExecR2C(filterPlan, r, R));
+
+        auto absoluteBlocks = (projGeom.dimTrans.x / 1024u) + 1u;
+        filter_absolute_kernel<<<absoluteBlocks, 1024, 0, stream>>>(R, projGeom.dimTrans.x, projGeom.pixelDim.x);
+
+        hip_check(hipStreamSynchronize(stream));
+            
+        hipfft_check(hipfftDestroy(filterPlan));
+        hip_check(hipFree(r));
+
+        auto const inputProjSingle = projGeom.dim.x * projGeom.dim.y * sizeof(std::uint16_t);
+        auto const inputProjTotal = inputProjSingle * numBranches;
+        auto const projSingle = projGeom.dim.x * projGeom.dim.y * sizeof(float);
+        auto const projTotal = projSingle * numBranches;
+        auto const expandedSingle = projGeom.dimFFT.x * projGeom.dimFFT.y * sizeof(float);
+        auto const expandedTotal = expandedSingle * numBranches;
+        auto const transformedSingle = projGeom.dimTrans.x * projGeom.dimTrans.y * sizeof(hipfftComplex);
+        auto const transformedTotal = transformedSingle * numBranches;
+        auto const volumeTotal = volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float);
+        auto const memTotal = inputProjTotal + projTotal + expandedTotal + transformedTotal + volumeTotal;
+        auto devMemFree = std::size_t{};
+        auto devMemTotal = std::size_t{};
+        hip_check(hipMemGetInfo(&devMemFree, &devMemTotal));
+
+        auto memRequired = static_cast<std::size_t>(memTotal);
+        if(memRequired > devMemFree)
+        {
+            std::cerr << "Not enough device memory. Required: " << memRequired 
+                      << ", available: " << devMemFree << std::endl;
+            return EXIT_FAILURE;
+        }
+
+        std::cout << "Done!" << std::endl;
+        std::cout << "Volume dimensions: " << volGeom.dim.x << " x "
+                                           << volGeom.dim.y << " x "
+                                           << volGeom.dim.z << std::endl;
+
+        // Initialize per-branch data
+        for(auto branchIdx = 0; branchIdx < numBranches; ++branchIdx)
+        {
+            std::cout << "Initializing branch #" << branchIdx << "... " << std::flush;
+            auto stream = streams.at(branchIdx);
+
+            auto proj = static_cast<float*>(nullptr);
+            auto projPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&proj), &projPitch, projGeom.dim.x * sizeof(float), projGeom.dim.y
+            ));
+            projections.push_back(proj);
+            projectionPitches.push_back(projPitch);
+
+            auto expanded = static_cast<float*>(nullptr);
+            auto expandedPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&expanded),
+                &expandedPitch,
+                projGeom.dimFFT.x * sizeof(float),
+                projGeom.dimFFT.y
+            ));
+            expandedProjections.push_back(expanded);
+            expandedPitches.push_back(expandedPitch);
+
+            auto transformed = static_cast<hipfftComplex*>(nullptr);
+            auto transformedPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&transformed),
+                &transformedPitch,
+                projGeom.dimTrans.x * sizeof(hipfftComplex),
+                projGeom.dimTrans.y
+            ));
+            transformedProjections.push_back(transformed);
+            transformedPitches.push_back(transformedPitch);
+
+            auto& forward = forwardPlans.at(branchIdx);
+            auto& forwardSize = forwardSizes.at(branchIdx);
+            auto fw_inembed = static_cast<int>(expandedPitch / sizeof(float));
+            auto fw_istride = 1;
+            auto fw_idist = fw_inembed;
+            auto fw_onembed = static_cast<int>(transformedPitch / sizeof(hipfftComplex));
+            auto fw_ostride = 1;
+            auto fw_odist = fw_onembed;
+            hipfft_check(hipfftCreate(&forward));
+            hipfft_check(hipfftMakePlanMany(forward, 1, &projGeom.s_dimFFT.x,
+                                            &fw_inembed, fw_istride, fw_idist,
+                                            &fw_onembed, fw_ostride, fw_odist,
+                                            HIPFFT_R2C, projGeom.s_dimFFT.y, &forwardSize));
+            hipfft_check(hipfftSetStream(forward, stream));
+
+            auto& backward = backwardPlans.at(branchIdx);
+            auto& backwardSize = backwardSizes.at(branchIdx);
+            auto bw_inembed = fw_onembed;
+            auto bw_istride = fw_ostride;
+            auto bw_idist = fw_odist;
+            auto bw_onembed = fw_inembed;
+            auto bw_ostride = fw_istride;
+            auto bw_odist = fw_idist;
+            hipfft_check(hipfftCreate(&backward));
+            hipfft_check(hipfftMakePlanMany(backward, 1, &projGeom.s_dimFFT.x,
+                                            &bw_inembed, bw_istride, bw_idist,
+                                            &bw_onembed, bw_ostride, bw_odist,
+                                            HIPFFT_C2R, projGeom.s_dimFFT.y, &backwardSize));
+            hipfft_check(hipfftSetStream(backward, stream));
+
+            if(hasTextures)
+            {
+                // Create a HIP texture object from the projection
+                auto resDesc = hipResourceDesc{};
+                resDesc.resType = hipResourceTypePitch2D;
+                resDesc.res.pitch2D.desc = hipCreateChannelDesc<float>();
+                resDesc.res.pitch2D.devPtr = static_cast<void*>(proj);
+                resDesc.res.pitch2D.width = projGeom.dim.x;
+                resDesc.res.pitch2D.height = projGeom.dim.y;
+                resDesc.res.pitch2D.pitchInBytes = projPitch;
+
+                auto texDesc = hipTextureDesc{};
+                texDesc.addressMode[0] = hipAddressModeBorder;
+                texDesc.addressMode[1] = hipAddressModeBorder;
+                texDesc.readMode = hipReadModeElementType;
+                texDesc.borderColor[0] = 0.f;
+                texDesc.borderColor[1] = 0.f;
+                texDesc.filterMode = hipFilterModeLinear;
+                texDesc.normalizedCoords = 0;
+
+                auto& projTex = textureProjections.emplace_back();
+                hip_check(hipCreateTextureObject(&projTex, &resDesc, &texDesc, nullptr));
+            }
+
+            std::cout << "Done!" << std::endl;
+        }
+
+        create_volume("volume.tif");
+        auto hostVolPtr = static_cast<float*>(nullptr);
+        hip_check(hipHostMalloc(
+            reinterpret_cast<void**>(&hostVolPtr),
+            volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float),
+            hipHostMallocDefault
+        ));
+        auto hostVol = make_hipPitchedPtr(
+            hostVolPtr, volGeom.dim.x * sizeof(float), volGeom.dim.x, volGeom.dim.y
+        );    
+        auto vol = hipPitchedPtr{};
+        auto volExt = make_hipExtent(volGeom.dim.x * sizeof(float), volGeom.dim.y, volGeom.dim.z);
+        hip_check(hipMalloc3D(&vol, volExt));
+        hip_check(hipMemset3D(vol, 0, volExt));
+
+        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+        // MAIN LOOP
+        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+        auto graphCreated = false;
+        auto graphExec = hipGraphExec_t{};
+        auto graphFinalCreated = false;
+        auto graphExecFinal = hipGraphExec_t{};
+        auto graphStream = hipStream_t{};
+        hip_check(hipStreamCreate(&graphStream));
+
+        auto start = std::chrono::steady_clock::now();
+
+        auto projIdx = 0u;
+        while(projIdx < projGeom.numProj)
+        {
+            auto batchSize = std::min(numBranches, static_cast<int>(projGeom.numProj - projIdx));
+
+            // Create graph for current batch
+            auto graph = hipGraph_t{};
+            hip_check(hipGraphCreate(&graph, 0));
+
+            // Add nodes for each projection in batch
+            for(auto branchIdx = 0; branchIdx < batchSize; ++branchIdx, ++projIdx)
+            {
+                auto stream = streams.at(branchIdx);
+
+                auto threadsPerBlock = dim3{32, 32, 1};
+                auto blocksPerGrid = dim3{
+                    (projGeom.dim.x / threadsPerBlock.x) + 1, (projGeom.dim.y / threadsPerBlock.y) + 1, 1
+                };
+
+                auto inputPitchedPtr = phantomProjections.at(projIdx);
+                auto input = static_cast<std::uint16_t*>(inputPitchedPtr.ptr);
+                auto inputPitch = inputPitchedPtr.pitch;
+
+                auto proj = projections.at(branchIdx);
+                auto projPitch = projectionPitches.at(branchIdx);
+                void* normalizationKernelParams[] =
+                {
+                    static_cast<void*>(&input),
+                    static_cast<void*>(&inputPitch),
+                    static_cast<void*>(&proj),
+                    static_cast<void*>(&projPitch),
+                    static_cast<void*>(&projGeom.dim),
+                    static_cast<void*>(&projGeom.bps)
+                };
+
+                auto normalizationKernelNodeParams = hipKernelNodeParams{};
+                normalizationKernelNodeParams.blockDim = threadsPerBlock;
+                normalizationKernelNodeParams.extra = nullptr;
+                normalizationKernelNodeParams.func = reinterpret_cast<void*>(normalization_kernel);
+                normalizationKernelNodeParams.gridDim = blocksPerGrid;
+                normalizationKernelNodeParams.kernelParams = normalizationKernelParams;
+                normalizationKernelNodeParams.sharedMemBytes = 0;
+                auto normalizationKernelNode = hipGraphNode_t{};
+                hip_check(hipGraphAddKernelNode(
+                    &normalizationKernelNode, graph, nullptr, 0, &normalizationKernelNodeParams
+                ));
+
+                void* logTransformationKernelParams[] =
+                {
+                    static_cast<void*>(&proj),
+                    static_cast<void*>(&projPitch),
+                    static_cast<void*>(&projGeom.dim)
+                };
+                auto logTransformationKernelNodeParams = hipKernelNodeParams{};
+                logTransformationKernelNodeParams.blockDim = threadsPerBlock;
+                logTransformationKernelNodeParams.extra = nullptr;
+                logTransformationKernelNodeParams.func = reinterpret_cast<void*>(log_transformation_kernel);
+                logTransformationKernelNodeParams.gridDim = blocksPerGrid;
+                logTransformationKernelNodeParams.kernelParams = logTransformationKernelParams;
+                logTransformationKernelNodeParams.sharedMemBytes = 0;
+                auto logTransformationKernelNode = hipGraphNode_t{};
+                hip_check(hipGraphAddKernelNode(
+                    &logTransformationKernelNode,
+                    graph,
+                    &normalizationKernelNode,
+                    1,
+                    &logTransformationKernelNodeParams
+                ));
+
+                // [sphinx-weighting-node-start]
+                void* weightingKernelParams[] =
+                {
+                    static_cast<void*>(&proj),
+                    static_cast<void*>(&projPitch),
+                    static_cast<void*>(&projGeom.dim),
+                    static_cast<void*>(&projGeom.d_sd),
+                    static_cast<void*>(&projGeom.d_so),
+                    static_cast<void*>(&projGeom.minCoord),
+                    static_cast<void*>(&projGeom.pixelDim)
+                };
+                auto weightingKernelNodeParams = hipKernelNodeParams{};
+                weightingKernelNodeParams.blockDim = threadsPerBlock;
+                weightingKernelNodeParams.extra = nullptr;
+                weightingKernelNodeParams.func = reinterpret_cast<void*>(weighting_kernel);
+                weightingKernelNodeParams.gridDim = blocksPerGrid;
+                weightingKernelNodeParams.kernelParams = weightingKernelParams;
+                weightingKernelNodeParams.sharedMemBytes = 0;
+                auto weightingKernelNode = hipGraphNode_t{};
+                hip_check(hipGraphAddKernelNode(
+                    &weightingKernelNode, graph, &logTransformationKernelNode, 1, &weightingKernelNodeParams
+                ));
+                // [sphinx-weighting-node-end]
+
+                // Expand projection to filter length
+                auto expanded = expandedProjections.at(branchIdx);
+                auto expandedPitch = expandedPitches.at(branchIdx);
+                // [sphinx-memset-node-start]
+                auto expandedMemsetNodeParams = hipMemsetParams{};
+                expandedMemsetNodeParams.dst = static_cast<void*>(expanded);
+                expandedMemsetNodeParams.elementSize = sizeof(float);
+                expandedMemsetNodeParams.height = projGeom.dimFFT.y;
+                expandedMemsetNodeParams.pitch = expandedPitch;
+                expandedMemsetNodeParams.value = 0;
+                expandedMemsetNodeParams.width = projGeom.dimFFT.x;
+                auto expandedMemsetNode = hipGraphNode_t{};
+                hip_check(hipGraphAddMemsetNode(
+                    &expandedMemsetNode, graph, &weightingKernelNode, 1, &expandedMemsetNodeParams
+                ));
+                // [sphinx-memset-node-end]
+
+                auto copyProjToExpandedNodeParams = hipMemcpy3DParms{};
+                std::memset(&copyProjToExpandedNodeParams, 0, sizeof(hipMemcpy3DParms));
+                copyProjToExpandedNodeParams.srcPos = make_hipPos(0, 0, 0);
+                copyProjToExpandedNodeParams.srcPtr = make_hipPitchedPtr(
+                    static_cast<void*>(proj), projPitch, projGeom.dim.x, projGeom.dim.y);
+                copyProjToExpandedNodeParams.dstPos = make_hipPos(0, 0, 0);
+                copyProjToExpandedNodeParams.dstPtr = make_hipPitchedPtr(
+                    static_cast<void*>(expanded), expandedPitch, projGeom.dimFFT.x, projGeom.dimFFT.y);
+                copyProjToExpandedNodeParams.extent = make_hipExtent(
+                    projGeom.dim.x * sizeof(float), projGeom.dim.y, 1);
+                copyProjToExpandedNodeParams.kind = hipMemcpyDeviceToDevice;
+                auto copyProjToExpandedNode = hipGraphNode_t{};
+                hip_check(hipGraphAddMemcpyNode(
+                    &copyProjToExpandedNode,
+                    graph,
+                    &expandedMemsetNode,
+                    1,
+                    &copyProjToExpandedNodeParams
+                ));
+
+                // R2C Fourier-transform projection
+                auto transformed = transformedProjections.at(branchIdx);
+                auto transformedPitch = transformedPitches.at(branchIdx);
+                auto transformedMemsetNodeParams = hipMemsetParams{};
+                transformedMemsetNodeParams.dst = static_cast<void*>(transformed);
+                transformedMemsetNodeParams.elementSize = sizeof(float); // elementSize maximum is 4 bytes
+                transformedMemsetNodeParams.height = projGeom.dimTrans.y;
+                transformedMemsetNodeParams.pitch = transformedPitch;
+                transformedMemsetNodeParams.value = 0;
+                transformedMemsetNodeParams.width = projGeom.dimTrans.x * 2; // hipfftComplex = 2 floats
+                auto transformedMemsetNode = hipGraphNode_t{};
+                hip_check(hipGraphAddMemsetNode(
+                    &transformedMemsetNode, graph, &copyProjToExpandedNode, 1, &transformedMemsetNodeParams
+                ));
+
+                // [sphinx-before-forward-start]
+                // Before capturing the FFT operations, obtain the set of nodes already present in the graph
+                auto nodesBeforeForward = std::vector<hipGraphNode_t>{};
+                auto numNodesBeforeForward = std::size_t{};
+                hip_check(hipGraphGetNodes(graph, nullptr, &numNodesBeforeForward));
+                nodesBeforeForward.resize(numNodesBeforeForward);
+                hip_check(hipGraphGetNodes(graph, nodesBeforeForward.data(), &numNodesBeforeForward));
+                auto nodesBeforeForwardSorted = std::set<hipGraphNode_t>{
+                    std::begin(nodesBeforeForward), std::end(nodesBeforeForward)
+                };
+                // [sphinx-before-forward-end]
+
+                // [sphinx-hipfft-start]
+                hip_check(hipStreamBeginCaptureToGraph(
+                    stream, graph, &transformedMemsetNode, nullptr, 1, hipStreamCaptureModeGlobal));
+
+                auto& forward = forwardPlans.at(branchIdx);
+                hipfft_check(hipfftExecR2C(forward, expanded, transformed));
+
+                hip_check(hipStreamEndCapture(stream, &graph));
+                // [sphinx-hipfft-end]
+
+                // [sphinx-is-leaf-start]
+                auto is_leaf = [](hipGraphNode_t node)
+                {
+                    auto numDependentNodes = std::size_t{};
+                    hip_check(hipGraphNodeGetDependentNodes(node, nullptr, &numDependentNodes));
+                    return numDependentNodes == 0;
+                };
+                // [sphinx-is-leaf-end]
+
+                // [sphinx-after-forward-start]
+                // Obtain the nodes in the graph again; the new nodes are the dependencies for the next step
+                auto nodesAfterForward = std::vector<hipGraphNode_t>{};
+                auto numNodesAfterForward = std::size_t{};
+                hip_check(hipGraphGetNodes(graph, nullptr, &numNodesAfterForward));
+                nodesAfterForward.resize(numNodesAfterForward);
+                hip_check(hipGraphGetNodes(graph, nodesAfterForward.data(), &numNodesAfterForward));
+                auto nodesAfterForwardSorted = std::set<hipGraphNode_t>{
+                    std::begin(nodesAfterForward), std::end(nodesAfterForward)
+                };
+                // [sphinx-after-forward-end]
+
+                // [sphinx-node-difference-start]
+                // Compute difference between both sets
+                auto forwardFFTNodes = std::vector<hipGraphNode_t>{};
+                std::set_difference(std::begin(nodesAfterForwardSorted), std::end(nodesAfterForwardSorted),
+                                    std::begin(nodesBeforeForwardSorted), std::end(nodesBeforeForwardSorted),
+                                    std::back_inserter(forwardFFTNodes));
+                // [sphinx-node-difference-end]
+
+                // [sphinx-find-leaf-start]
+                // Find leaf node in difference set
+                auto forwardLeafNode = *(std::find_if(std::begin(forwardFFTNodes), std::end(forwardFFTNodes), is_leaf));
+                // [sphinx-find-leaf-end]
+
+                // Apply filter
+                auto filterBlocksPerGrid = dim3{
+                    (projGeom.dimTrans.x / threadsPerBlock.x) + 1,
+                    (projGeom.dimTrans.y / threadsPerBlock.y) + 1,
+                    1
+                };
+                void* filterApplicationKernelParams[] =
+                {
+                    static_cast<void*>(&transformed),
+                    static_cast<void*>(&transformedPitch),
+                    static_cast<void*>(&R),
+                    static_cast<void*>(&projGeom.dimTrans)
+                };
+                auto filterApplicationKernelNodeParams = hipKernelNodeParams{};
+                filterApplicationKernelNodeParams.blockDim = threadsPerBlock;
+                filterApplicationKernelNodeParams.extra = nullptr;
+                filterApplicationKernelNodeParams.func = reinterpret_cast<void*>(filter_application_kernel);
+                filterApplicationKernelNodeParams.gridDim = filterBlocksPerGrid;
+                filterApplicationKernelNodeParams.kernelParams = filterApplicationKernelParams;
+                filterApplicationKernelNodeParams.sharedMemBytes = 0;
+                auto filterApplicationKernelNode = hipGraphNode_t{};
+                hip_check(hipGraphAddKernelNode(
+                    &filterApplicationKernelNode, graph, &forwardLeafNode, 1, &filterApplicationKernelNodeParams
+                ));
+
+                // C2R Fourier-transform projection - same node counting procedure as above
+                auto nodesBeforeBackward = std::vector<hipGraphNode_t>{};
+                auto numNodesBeforeBackward = std::size_t{};
+                hip_check(hipGraphGetNodes(graph, nullptr, &numNodesBeforeBackward));
+                nodesBeforeBackward.resize(numNodesBeforeBackward);
+                hip_check(hipGraphGetNodes(graph, nodesBeforeBackward.data(), &numNodesBeforeBackward));
+                auto nodesBeforeBackwardSorted = std::set<hipGraphNode_t>{
+                    std::begin(nodesBeforeBackward), std::end(nodesBeforeBackward)
+                };
+
+                hip_check(hipStreamBeginCaptureToGraph(
+                    stream, graph, &filterApplicationKernelNode, nullptr, 1, hipStreamCaptureModeGlobal
+                ));
+
+                auto& backward = backwardPlans.at(branchIdx);
+                hipfft_check(hipfftExecC2R(backward, transformed, expanded));
+
+                hip_check(hipStreamEndCapture(stream, &graph));
+
+                auto nodesAfterBackward = std::vector<hipGraphNode_t>{};
+                auto numNodesAfterBackward = std::size_t{};
+                hip_check(hipGraphGetNodes(graph, nullptr, &numNodesAfterBackward));
+                nodesAfterBackward.resize(numNodesAfterBackward);
+                hip_check(hipGraphGetNodes(graph, nodesAfterBackward.data(), &numNodesAfterBackward));
+                auto nodesAfterBackwardSorted = std::set<hipGraphNode_t>{
+                    std::begin(nodesAfterBackward), std::end(nodesAfterBackward)
+                };
+
+                auto backwardFFTNodes = std::vector<hipGraphNode_t>{};
+                std::set_difference(std::begin(nodesAfterBackwardSorted), std::end(nodesAfterBackwardSorted),
+                                    std::begin(nodesBeforeBackwardSorted), std::end(nodesBeforeBackwardSorted),
+                                    std::back_inserter(backwardFFTNodes));
+
+                auto backwardLeafNode = *(
+                    std::find_if(std::begin(backwardFFTNodes), std::end(backwardFFTNodes), is_leaf)
+                );
+
+                // Shrink projection to original size and normalize
+                auto copyExpandedToProjNodeParams = hipMemcpy3DParms{};
+                std::memset(&copyExpandedToProjNodeParams, 0, sizeof(hipMemcpy3DParms));
+                copyExpandedToProjNodeParams.srcPos = make_hipPos(0, 0, 0);
+                copyExpandedToProjNodeParams.srcPtr = make_hipPitchedPtr(
+                    static_cast<void*>(expanded), expandedPitch, projGeom.dimFFT.x, projGeom.dimFFT.y
+                );
+                copyExpandedToProjNodeParams.dstPos = make_hipPos(0, 0, 0);
+                copyExpandedToProjNodeParams.dstPtr = make_hipPitchedPtr(
+                    static_cast<void*>(proj), projPitch, projGeom.dim.x, projGeom.dim.y
+                );
+                copyExpandedToProjNodeParams.extent = make_hipExtent(
+                    projGeom.dim.x * sizeof(float), projGeom.dim.y, 1);
+                copyExpandedToProjNodeParams.kind = hipMemcpyDeviceToDevice;
+                auto copyExpandedToProjNode = hipGraphNode_t{};
+                hip_check(hipGraphAddMemcpyNode(
+                    &copyExpandedToProjNode, graph, &backwardLeafNode, 1, &copyExpandedToProjNodeParams
+                ));
+
+                void* filterNormalizationKernelParams[] =
+                {
+                    static_cast<void*>(&proj),
+                    static_cast<void*>(&projPitch),
+                    static_cast<void*>(&projGeom.dimFFT.x),
+                    static_cast<void*>(&projGeom.dim)
+                };
+                auto filterNormalizationKernelNodeParams = hipKernelNodeParams{};
+                filterNormalizationKernelNodeParams.blockDim = threadsPerBlock;
+                filterNormalizationKernelNodeParams.extra = nullptr;
+                filterNormalizationKernelNodeParams.func = reinterpret_cast<void*>(filter_normalization_kernel);
+                filterNormalizationKernelNodeParams.gridDim = blocksPerGrid;
+                filterNormalizationKernelNodeParams.kernelParams = filterNormalizationKernelParams;
+                filterNormalizationKernelNodeParams.sharedMemBytes = 0;
+                auto filterNormalizationKernelNode = hipGraphNode_t{};
+                hip_check(hipGraphAddKernelNode(
+                    &filterNormalizationKernelNode,
+                    graph,
+                    &copyExpandedToProjNode,
+                    1,
+                    &filterNormalizationKernelNodeParams));
+
+                // Backprojection
+                auto thetaDeg = projGeom.thetaSign * projGeom.thetaStep * projIdx; // Current angle
+                auto thetaRad = thetaDeg * std::numbers::pi_v<float> / 180.f; // Convert to radians
+                auto sinTheta = std::sin(thetaRad);
+                auto cosTheta = std::cos(thetaRad);
+
+                auto bpBlockSize = dim3{32u, 8u, 4u};
+                auto bpBlocks = dim3{
+                    static_cast<std::uint32_t>(volGeom.dim.x / bpBlockSize.x + 1),
+                    static_cast<std::uint32_t>(volGeom.dim.y / bpBlockSize.y + 1),
+                    static_cast<std::uint32_t>(volGeom.dim.z / bpBlockSize.z + 1)
+                };
+
+                if(hasTextures)
+                {
+                    auto& projTex = textureProjections.at(branchIdx);
+                    void* backprojectionKernelParams[] =
+                    {
+                        &vol.ptr,
+                        static_cast<void*>(&vol.pitch),
+                        static_cast<void*>(&volGeom.dim),
+                        static_cast<void*>(&volGeom.voxelDim),
+                        static_cast<void*>(&projTex),
+                        static_cast<void*>(&projGeom.minCoord),
+                        static_cast<void*>(&sinTheta),
+                        static_cast<void*>(&cosTheta),
+                        static_cast<void*>(&projGeom.pixelDim),
+                        static_cast<void*>(&projGeom.d_sd),
+                        static_cast<void*>(&projGeom.d_so)
+                    };
+                    auto backprojectionKernelNodeParams = hipKernelNodeParams{};
+                    backprojectionKernelNodeParams.blockDim = bpBlockSize;
+                    backprojectionKernelNodeParams.extra = nullptr;
+                    backprojectionKernelNodeParams.func = reinterpret_cast<void*>(backprojection_kernel);
+                    backprojectionKernelNodeParams.gridDim = bpBlocks;
+                    backprojectionKernelNodeParams.kernelParams = backprojectionKernelParams;
+                    backprojectionKernelNodeParams.sharedMemBytes = 0;
+                    auto backprojectionKernelNode = hipGraphNode_t{};
+                    hip_check(hipGraphAddKernelNode(
+                        &backprojectionKernelNode,
+                        graph,
+                        &filterNormalizationKernelNode,
+                        1,
+                        &backprojectionKernelNodeParams
+                    ));
+                }
+                else
+                {
+                    // Fallback for devices without support for texture instructions
+                    void* backprojectionKernelParams[] =
+                    {
+                        &vol.ptr,
+                        static_cast<void*>(&vol.pitch),
+                        static_cast<void*>(&volGeom.dim),
+                        static_cast<void*>(&volGeom.voxelDim),
+                        static_cast<void*>(&proj),
+                        static_cast<void*>(&projPitch),
+                        static_cast<void*>(&projGeom.dim),
+                        static_cast<void*>(&projGeom.minCoord),
+                        static_cast<void*>(&sinTheta),
+                        static_cast<void*>(&cosTheta),
+                        static_cast<void*>(&projGeom.pixelDim),
+                        static_cast<void*>(&projGeom.d_sd),
+                        static_cast<void*>(&projGeom.d_so)
+                    };
+                    auto backprojectionKernelNodeParams = hipKernelNodeParams{};
+                    backprojectionKernelNodeParams.blockDim = bpBlockSize;
+                    backprojectionKernelNodeParams.extra = nullptr;
+                    backprojectionKernelNodeParams.func = reinterpret_cast<void*>(backprojection_kernel_no_tex);
+                    backprojectionKernelNodeParams.gridDim = bpBlocks;
+                    backprojectionKernelNodeParams.kernelParams = backprojectionKernelParams;
+                    backprojectionKernelNodeParams.sharedMemBytes = 0;
+                    auto backprojectionKernelNode = hipGraphNode_t{};
+                    hip_check(hipGraphAddKernelNode(
+                        &backprojectionKernelNode,
+                        graph,
+                        &filterNormalizationKernelNode,
+                        1,
+                        &backprojectionKernelNodeParams
+                    ));
+                }
+            }
+
+            // Instantiate and launch the graph
+            if(batchSize == numBranches)
+            {
+                if(!graphCreated)
+                {
+                    hip_check(hipGraphDebugDotPrint(graph, "graph_creation.dot", hipGraphDebugDotFlagsVerbose));
+
+                    hip_check(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+                    hip_check(hipGraphDestroy(graph));
+                    hip_check(hipGraphLaunch(graphExec, graphStream));
+                    graphCreated = true;
+                }
+                else
+                {
+                    // Update existing executable graph after each iteration with new input data
+                    auto result = hipGraphExecUpdateResult{};
+                    auto errorNode = hipGraphNode_t{};
+                    hip_check(hipGraphExecUpdate(graphExec, graph, &errorNode, &result));
+                    if(result != hipGraphExecUpdateSuccess)
+                    {
+                        auto msg = std::string{"Failed to update graph: "};
+                        switch(result)
+                        {
+                        case hipGraphExecUpdateError:
+                            msg += "Invalid value.";
+                            break;
+                        case hipGraphExecUpdateErrorFunctionChanged:
+                            msg += "Function of kernel node changed.";
+                            break;
+                        case hipGraphExecUpdateErrorNodeTypeChanged:
+                            msg += "Type of node changed.";
+                            break;
+                        case hipGraphExecUpdateErrorNotSupported:
+                            msg += "The update is not supported for this node.";
+                            break;
+                        case hipGraphExecUpdateErrorParametersChanged:
+                            msg += "Unsupported parameter change.";
+                            break;
+                        case hipGraphExecUpdateErrorTopologyChanged:
+                            msg += "Graph topology changed.";
+                            break;
+                        case hipGraphExecUpdateErrorUnsupportedFunctionChange:
+                            msg += "Unsupported change of kernel node function.";
+                            break;
+                        default:
+                            msg += "Unknown error.";
+                            break;
+                        }
+                        throw std::runtime_error{msg};
+                    }
+                    hip_check(hipGraphDestroy(graph));
+                    hip_check(hipGraphLaunch(graphExec, graphStream));
+                }
+            }
+            else
+            {
+                hip_check(hipGraphDebugDotPrint(graph, "graph_creation_final.dot", hipGraphDebugDotFlagsVerbose));
+
+                // Incomplete batch: topology changed, must instantiate new executable graph
+                hip_check(hipGraphInstantiate(&graphExecFinal, graph, nullptr, nullptr, 0));
+                hip_check(hipGraphDestroy(graph));
+                hip_check(hipGraphLaunch(graphExecFinal, graphStream));
+
+                graphFinalCreated = true;
+            }
+        }
+
+        // Obtain reconstruction time before copying back the result
+        auto stop = std::chrono::steady_clock::time_point{};
+        hip_check(hipLaunchHostFunc(graphStream, [](void* data)
+        {
+            auto& stop = *(static_cast<std::chrono::steady_clock::time_point*>(data));
+            stop = std::chrono::steady_clock::now();
+        }, static_cast<void*>(&stop)));
+
+        // Copy volume back to host and save
+        auto memcpyParams = hipMemcpy3DParms{};
+        std::memset(&memcpyParams, 0, sizeof(hipMemcpy3DParms));
+        memcpyParams.dstPos = make_hipPos(0, 0, 0);
+        memcpyParams.dstPtr = hostVol;
+        memcpyParams.srcPos = make_hipPos(0, 0, 0);
+        memcpyParams.srcPtr = vol;
+        memcpyParams.extent = volExt;
+        memcpyParams.kind = hipMemcpyDeviceToHost;
+        hip_check(hipMemcpy3DAsync(&memcpyParams, graphStream));
+        
+        auto saveVolArgs = new save_volume_args
+        {
+            "volume.tif",
+            hostVolPtr,
+            volGeom.dim.x, volGeom.dim.y, volGeom.dim.z,
+            volGeom.voxelDim.x, volGeom.voxelDim.y
+        };
+        hip_check(hipLaunchHostFunc(graphStream, save_volume, saveVolArgs));
+
+        std::cout << "All work items enqueued, waiting for completion... " << std::flush;
+        hip_check(hipStreamSynchronize(graphStream));
+        std::cout << "Done!" << std::endl;
+
+        auto const elapsed = std::chrono::duration<double>{stop - start};
+        std::cout << "Reconstruction time: " << elapsed.count() << 's' << std::endl;
+
+        // Cleanup
+        if(graphFinalCreated)
+            hip_check(hipGraphExecDestroy(graphExecFinal));
+        hip_check(hipGraphExecDestroy(graphExec));
+        hip_check(hipStreamDestroy(graphStream));
+
+        hip_check(hipFree(vol.ptr));
+        hip_check(hipFreeHost(hostVolPtr));
+
+        if(hasTextures)
+        {
+            for(auto&& tex : textureProjections)
+                hip_check(hipDestroyTextureObject(tex));
+        }
+
+        for(auto&& plan : backwardPlans)
+            hipfft_check(hipfftDestroy(plan));
+
+        for(auto&& plan : forwardPlans)
+            hipfft_check(hipfftDestroy(plan));
+
+        for(auto&& p : transformedProjections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : expandedProjections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : projections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : phantomProjections)
+            hip_check(hipFree(p.ptr));
+
+        hip_check(hipFree(R));
+
+        for(auto&& stream : streams)
+            hip_check(hipStreamDestroy(stream));
+
+        hip_check(hipDeviceSynchronize());
+
+        return EXIT_SUCCESS;
+    }
+    catch(std::runtime_error const& e)
+    {
+        std::cerr << "Caught runtime error: " << e.what() << std::endl;
+        return EXIT_FAILURE;
+    }
+}
@@ -0,0 +1,521 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "backprojection.hpp"
+#include "filtering.hpp"
+#include "log_transform.hpp"
+#include "normalization.hpp"
+#include "phantom.hpp"
+#include "projection.hpp"
+#include "utility.hpp"
+#include "weighting.hpp"
+#include "volume.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <hipfft/hipfft.h>
+
+#include <algorithm>
+#include <chrono>
+#include <cmath>
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <iostream>
+#include <numbers>
+#include <ostream>
+#include <set>
+#include <stdexcept>
+#include <vector>
+
+auto main() -> int
+{
+    try
+    {
+        auto hasTextures = int{0};
+        hip_check(hipDeviceGetAttribute(&hasTextures, hipDeviceAttributeImageSupport, 0));
+
+        // [sphinx-async-engine-start]
+        // Fetch device properties
+        auto devProps = hipDeviceProp_t{};
+        hip_check(hipGetDeviceProperties(&devProps, 0));
+
+        auto const numStreams = devProps.asyncEngineCount;
+
+        std::cout << "Device has " << numStreams << " asynchronous engines; preprocessing will use "
+                  << numStreams << " parallel streams." << std::endl;
+
+        auto streams = std::vector<hipStream_t>{};
+        streams.resize(numStreams);
+        for(auto&& stream : streams)
+            hip_check(hipStreamCreate(&stream));
+        // [sphinx-async-engine-end]
+
+        auto r = static_cast<float*>(nullptr);
+        auto R = static_cast<hipfftComplex*>(nullptr);
+
+        auto forwardPlans = std::vector<hipfftHandle>{};
+        auto forwardSizes = std::vector<std::size_t>{};
+        auto backwardPlans = std::vector<hipfftHandle>{};
+        auto backwardSizes = std::vector<std::size_t>{};
+        forwardPlans.resize(numStreams);
+        forwardSizes.resize(numStreams);
+        backwardPlans.resize(numStreams);
+        backwardSizes.resize(numStreams);
+
+        auto projections = std::vector<float*>{};
+        auto projectionPitches = std::vector<std::size_t>{};
+        auto expandedProjections = std::vector<float*>{};
+        auto expandedPitches = std::vector<std::size_t>{};
+        auto transformedProjections = std::vector<hipfftComplex*>{};
+        auto transformedPitches = std::vector<std::size_t>{};
+        auto textureProjections = std::vector<hipTextureObject_t>{};
+
+        auto projGeom = phantom::make_projectionGeometry();
+        auto volGeom = phantom::make_volumeGeometry();
+        auto phantomProjections = phantom::make_projections(projGeom, volGeom, streams);
+
+        std::cout << "Initializing... " << std::flush;
+
+        auto stream = streams.at(0);
+
+        // Create filter kernel
+        hip_check(hipMalloc(reinterpret_cast<void**>(&r), projGeom.dimFFT.x * sizeof(float)));
+        auto const creationBlocks = std::max((projGeom.dimFFT.x / 1024u), 1u);
+        filter_creation_kernel<<<creationBlocks, 1024, 0, stream>>>(r, projGeom.s_dimFFT.x, projGeom.pixelDim.x);
+
+        hip_check(hipMalloc(reinterpret_cast<void**>(&R), projGeom.dimTrans.x * sizeof(hipfftComplex)));
+        auto filterPlan = hipfftHandle{};
+        hipfft_check(hipfftPlan1d(&filterPlan, projGeom.dimFFT.x, HIPFFT_R2C, 1));
+        hipfft_check(hipfftSetStream(filterPlan, stream));
+        hipfft_check(hipfftExecR2C(filterPlan, r, R));
+
+        auto absoluteBlocks = (projGeom.dimTrans.x / 1024u) + 1u;
+        filter_absolute_kernel<<<absoluteBlocks, 1024, 0, stream>>>(R, projGeom.dimTrans.x, projGeom.pixelDim.x);
+
+        hip_check(hipStreamSynchronize(stream));
+                
+        hipfft_check(hipfftDestroy(filterPlan));
+        hip_check(hipFree(r));
+
+        auto const inputProjSingle = projGeom.dim.x * projGeom.dim.y * sizeof(std::uint16_t);
+        auto const inputProjTotal = inputProjSingle * numStreams;
+        auto const projSingle = projGeom.dim.x * projGeom.dim.y * sizeof(float);
+        auto const projTotal = projSingle * numStreams;
+        auto const expandedSingle = projGeom.dimFFT.x * projGeom.dimFFT.y * sizeof(float);
+        auto const expandedTotal = expandedSingle * numStreams;
+        auto const transformedSingle = projGeom.dimTrans.x * projGeom.dimTrans.y * sizeof(hipfftComplex);
+        auto const transformedTotal = transformedSingle * numStreams;
+        auto const volumeTotal = volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float);
+        auto const memTotal = inputProjTotal + projTotal + expandedTotal + transformedTotal + volumeTotal;
+        auto devMemFree = std::size_t{};
+        auto devMemTotal = std::size_t{};
+        hip_check(hipMemGetInfo(&devMemFree, &devMemTotal));
+
+        auto memRequired = static_cast<std::size_t>(memTotal);
+        if(memRequired > devMemFree)
+        {
+            std::cerr << "Not enough device memory. Required: " << memRequired
+                      << " bytes, available: " << devMemFree << " bytes" << std::endl;
+            return EXIT_FAILURE;
+        }
+
+        std::cout << "Done!" << std::endl;
+        std::cout << "Volume dimensions: " << volGeom.dim.x << " x "
+                                           << volGeom.dim.y << " x "
+                                           << volGeom.dim.z << std::endl;
+
+        // Initialize per-stream data
+        for(auto streamIdx = 0u; streamIdx < streams.size(); ++streamIdx)
+        {
+            std::cout << "Initializing stream " << streamIdx << "... " << std::flush;
+            auto stream = streams.at(streamIdx);
+
+            auto proj = static_cast<float*>(nullptr);
+            auto projPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&proj), &projPitch, projGeom.dim.x * sizeof(float), projGeom.dim.y
+            ));
+            projections.push_back(proj);
+            projectionPitches.push_back(projPitch);
+
+            auto expanded = static_cast<float*>(nullptr);
+            auto expandedPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&expanded),
+                &expandedPitch,
+                projGeom.dimFFT.x * sizeof(float),
+                projGeom.dimFFT.y
+            ));
+            expandedProjections.push_back(expanded);
+            expandedPitches.push_back(expandedPitch);
+            
+            auto transformed = static_cast<hipfftComplex*>(nullptr);
+            auto transformedPitch = std::size_t{};
+            hip_check(hipMallocPitch(
+                reinterpret_cast<void**>(&transformed),
+                &transformedPitch,
+                projGeom.dimTrans.x * sizeof(hipfftComplex),
+                projGeom.dimTrans.y
+            ));
+            transformedProjections.push_back(transformed);
+            transformedPitches.push_back(transformedPitch);
+
+            auto& forward = forwardPlans.at(streamIdx);
+            auto& forwardSize = forwardSizes.at(streamIdx);
+            auto fw_inembed = static_cast<int>(expandedPitch / sizeof(float));
+            auto fw_istride = 1;
+            auto fw_idist = fw_inembed;
+            auto fw_onembed = static_cast<int>(transformedPitch / sizeof(hipfftComplex));
+            auto fw_ostride = 1;
+            auto fw_odist = fw_onembed;
+            hipfft_check(hipfftCreate(&forward));
+            hipfft_check(hipfftMakePlanMany(forward, 1, &projGeom.s_dimFFT.x,
+                                            &fw_inembed, fw_istride, fw_idist,
+                                            &fw_onembed, fw_ostride, fw_odist,
+                                            HIPFFT_R2C, projGeom.s_dimFFT.y, &forwardSize));
+            hipfft_check(hipfftSetStream(forward, stream));
+
+            auto& backward = backwardPlans.at(streamIdx);
+            auto& backwardSize = backwardSizes.at(streamIdx);
+            auto bw_inembed = fw_onembed;
+            auto bw_istride = fw_ostride;
+            auto bw_idist = fw_odist;
+            auto bw_onembed = fw_inembed;
+            auto bw_ostride = fw_istride;
+            auto bw_odist = fw_idist;
+            hipfft_check(hipfftCreate(&backward));
+            hipfft_check(hipfftMakePlanMany(backward, 1, &projGeom.s_dimFFT.x,
+                                            &bw_inembed, bw_istride, bw_idist,
+                                            &bw_onembed, bw_ostride, bw_odist,
+                                            HIPFFT_C2R, projGeom.s_dimFFT.y, &backwardSize));
+            hipfft_check(hipfftSetStream(backward, stream));
+
+            if(hasTextures)
+            {
+                // Create a HIP texture object from the projection
+                auto resDesc = hipResourceDesc{};
+                resDesc.resType = hipResourceTypePitch2D;
+                resDesc.res.pitch2D.desc = hipCreateChannelDesc<float>();
+                resDesc.res.pitch2D.devPtr = static_cast<void*>(proj);
+                resDesc.res.pitch2D.width = projGeom.dim.x;
+                resDesc.res.pitch2D.height = projGeom.dim.y;
+                resDesc.res.pitch2D.pitchInBytes = projPitch;
+
+                auto texDesc = hipTextureDesc{};
+                texDesc.addressMode[0] = hipAddressModeBorder;
+                texDesc.addressMode[1] = hipAddressModeBorder;
+                texDesc.readMode = hipReadModeElementType;
+                texDesc.borderColor[0] = 0.f;
+                texDesc.borderColor[1] = 0.f;
+                texDesc.filterMode = hipFilterModeLinear;
+                texDesc.normalizedCoords = 0;
+
+                auto& projTex = textureProjections.emplace_back();
+                hip_check(hipCreateTextureObject(&projTex, &resDesc, &texDesc, nullptr));
+            }
+
+            std::cout << "Done!" << std::endl;
+        }
+
+        create_volume("volume.tif");
+        auto hostVolPtr = static_cast<float*>(nullptr);
+        hip_check(hipHostMalloc(
+            reinterpret_cast<void**>(&hostVolPtr),
+            volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float),
+            hipHostMallocDefault
+        ));
+        auto hostVol = make_hipPitchedPtr(
+            hostVolPtr, volGeom.dim.x * sizeof(float), volGeom.dim.x, volGeom.dim.y
+        );
+        auto vol = hipPitchedPtr{};
+        auto volExt = make_hipExtent(volGeom.dim.x * sizeof(float), volGeom.dim.y, volGeom.dim.z);
+        hip_check(hipMalloc3D(&vol, volExt));
+        hip_check(hipMemset3D(vol, 0, volExt));
+
+        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+        // MAIN LOOP
+        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+        auto start = std::chrono::steady_clock::now();
+
+        // [sphinx-batch-start]
+        auto projIdx = 0u;
+        while(projIdx < projGeom.numProj)
+        {
+            auto batchSize = std::min(numStreams, static_cast<int>(projGeom.numProj - projIdx));
+
+            // Launch batch in parallel streams
+            for(auto streamIdx = 0; streamIdx < batchSize; ++streamIdx, ++projIdx)
+            {
+                auto stream = streams.at(streamIdx);
+                // [sphinx-batch-end]
+
+                auto threadsPerBlock = dim3{32, 32, 1};
+                auto blocksPerGrid = dim3{
+                    (projGeom.dim.x / threadsPerBlock.x) + 1, (projGeom.dim.y / threadsPerBlock.y) + 1, 1
+                };
+
+                auto inputPitchedPtr = phantomProjections.at(projIdx);
+                auto input = static_cast<std::uint16_t*>(inputPitchedPtr.ptr);
+                auto inputPitch = inputPitchedPtr.pitch;
+
+                // [sphinx-preprocessing-start]
+                ////////////////////////////////////////////////////////////////////////////////////////////////////
+                // START HERE
+                ////////////////////////////////////////////////////////////////////////////////////////////////////
+                auto proj = projections.at(streamIdx);
+                auto projPitch = projectionPitches.at(streamIdx);
+                normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    input, inputPitch, proj, projPitch, projGeom.dim, projGeom.bps
+                );
+
+                log_transformation_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(proj, projPitch, projGeom.dim);
+
+                weighting_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    proj,
+                    projPitch,
+                    projGeom.dim,
+                    projGeom.d_sd,
+                    projGeom.d_so,
+                    projGeom.minCoord,
+                    projGeom.pixelDim
+                );
+                // [sphinx-preprocessing-end]
+
+                // [sphinx-proj-to-expanded-start]
+                // Expand projection to filter length
+                auto expanded = expandedProjections.at(streamIdx);
+                auto expandedPitch = expandedPitches.at(streamIdx);
+                hip_check(hipMemset2DAsync(
+                    expanded, expandedPitch, 0, projGeom.dimFFT.x * sizeof(float), projGeom.dimFFT.y, stream
+                ));
+                hip_check(hipMemcpy2DAsync(
+                    expanded,
+                    expandedPitch,
+                    proj,
+                    projPitch,
+                    projGeom.dim.x * sizeof(float),
+                    projGeom.dim.y,
+                    hipMemcpyDeviceToDevice,
+                    stream
+                ));
+                // [sphinx-proj-to-expanded-end]
+
+                // [sphinx-forward-start]
+                // R2C Fourier-transform projection
+                auto transformed = transformedProjections.at(streamIdx);
+                auto transformedPitch = transformedPitches.at(streamIdx);
+                hip_check(hipMemset2DAsync(
+                    transformed,
+                    transformedPitch,
+                    0,
+                    projGeom.dimTrans.x * sizeof(hipfftComplex),
+                    projGeom.dimTrans.y,
+                    stream
+                ));
+                auto& forward = forwardPlans.at(streamIdx);
+                hipfft_check(hipfftExecR2C(forward, expanded, transformed));
+                // [sphinx-forward-end]
+
+                // [sphinx-filter-start]
+                // Apply filter
+                auto filterBlocksPerGrid = dim3{
+                    (projGeom.dimTrans.x / threadsPerBlock.x) + 1,
+                    (projGeom.dimTrans.y / threadsPerBlock.y) + 1,
+                    1
+                };
+                filter_application_kernel<<<filterBlocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    transformed, transformedPitch, R, projGeom.dimTrans
+                );
+
+                auto& backward = backwardPlans.at(streamIdx);
+                hipfft_check(hipfftExecC2R(backward, transformed, expanded));
+                // [sphinx-filter-end]
+
+                // [sphinx-expanded-to-proj-start]
+                // Shrink projection to original size and normalize
+                hip_check(hipMemcpy2DAsync(
+                    proj,
+                    projPitch,
+                    expanded,
+                    expandedPitch,
+                    projGeom.dim.x * sizeof(float),
+                    projGeom.dim.y,
+                    hipMemcpyDeviceToDevice,
+                    stream
+                ));
+
+                filter_normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
+                    proj, projPitch, projGeom.dimFFT.x, projGeom.dim
+                );
+                // [sphinx-expanded-to-proj-end]
+
+                // [sphinx-bp-start]
+                // Backprojection
+                auto thetaDeg = projGeom.thetaSign * projGeom.thetaStep * projIdx; // Current angle
+                auto thetaRad = thetaDeg * std::numbers::pi_v<float> / 180.f; // Convert to radians
+                auto sinTheta = std::sin(thetaRad);
+                auto cosTheta = std::cos(thetaRad);
+
+                auto bpBlockSize = dim3{32u, 8u, 4u};
+                auto bpBlocks = dim3{
+                    static_cast<std::uint32_t>(volGeom.dim.x / bpBlockSize.x + 1),
+                    static_cast<std::uint32_t>(volGeom.dim.y / bpBlockSize.y + 1),
+                    static_cast<std::uint32_t>(volGeom.dim.z / bpBlockSize.z + 1)
+                };
+
+                if(hasTextures)
+                {
+                    auto& projTex = textureProjections.at(streamIdx);
+                    backprojection_kernel<<<bpBlocks, bpBlockSize, 0, stream>>>(
+                        static_cast<float*>(vol.ptr),
+                        vol.pitch,
+                        volGeom.dim,
+                        volGeom.voxelDim,
+                        projTex,
+                        projGeom.minCoord,
+                        sinTheta,
+                        cosTheta,
+                        projGeom.pixelDim,
+                        projGeom.d_sd,
+                        projGeom.d_so
+                    );
+                }
+                else
+                {
+                    // Fallback for devices without support for texture instructions
+                    backprojection_kernel_no_tex<<<bpBlocks, bpBlockSize, 0, stream>>>(
+                        static_cast<float*>(vol.ptr),
+                        vol.pitch,
+                        volGeom.dim,
+                        volGeom.voxelDim,
+                        proj,
+                        projPitch,
+                        projGeom.dim,
+                        projGeom.minCoord,
+                        sinTheta,
+                        cosTheta,
+                        projGeom.pixelDim,
+                        projGeom.d_sd,
+                        projGeom.d_so
+                    );
+                }
+                // [sphinx-bp-end]
+            }
+        }
+
+        // [sphinx-sync-start]
+        // First stream waits for other streams to complete
+        auto completionEvents = std::vector<hipEvent_t>{};
+        for(auto streamIdx = 1u; streamIdx < streams.size(); ++streamIdx)
+        {
+            auto event = hipEvent_t{};
+            hip_check(hipEventCreate(&event));
+            hip_check(hipEventRecord(event, streams.at(streamIdx)));
+            completionEvents.push_back(event);
+        }
+
+        for(auto&& event : completionEvents)
+            hip_check(hipStreamWaitEvent(streams.at(0), event, 0));
+        // [sphinx-sync-end]
+
+        // Obtain reconstruction time before copying back the result
+        auto stop = std::chrono::steady_clock::time_point{};
+        hip_check(hipLaunchHostFunc(streams.at(0), [](void* data)
+        {
+            auto& stop = *(static_cast<std::chrono::steady_clock::time_point*>(data));
+            stop = std::chrono::steady_clock::now();
+        }, static_cast<void*>(&stop)));
+
+        // Copy volume back to host and save
+        auto memcpyParams = hipMemcpy3DParms{};
+        std::memset(&memcpyParams, 0, sizeof(hipMemcpy3DParms));
+        memcpyParams.dstPos = make_hipPos(0, 0, 0);
+        memcpyParams.dstPtr = hostVol;
+        memcpyParams.srcPos = make_hipPos(0, 0, 0);
+        memcpyParams.srcPtr = vol;
+        memcpyParams.extent = volExt;
+        memcpyParams.kind = hipMemcpyDeviceToHost;
+        hip_check(hipMemcpy3DAsync(&memcpyParams, streams.at(0)));
+
+        auto saveVolArgs = new save_volume_args
+        {
+            "volume.tif",
+            hostVolPtr,
+            volGeom.dim.x, volGeom.dim.y, volGeom.dim.z,
+            volGeom.voxelDim.x, volGeom.voxelDim.y
+        };
+        hip_check(hipLaunchHostFunc(streams.at(0), save_volume, saveVolArgs));
+
+        std::cout << "All work items enqueued, waiting for completion... " << std::flush;
+        hip_check(hipStreamSynchronize(streams.at(0)));
+        std::cout << "Done!" << std::endl;
+
+        auto const elapsed = std::chrono::duration<double>{stop - start};
+        std::cout << "Reconstruction time: " << elapsed.count() << 's' << std::endl;
+
+        for(auto&& event : completionEvents)
+            hip_check(hipEventDestroy(event));
+
+        hip_check(hipFree(vol.ptr));
+        hip_check(hipFreeHost(hostVolPtr));
+
+        if(hasTextures)
+        {
+            for(auto&& tex : textureProjections)
+                hip_check(hipDestroyTextureObject(tex));
+        }
+
+        for(auto&& plan : backwardPlans)
+            hipfft_check(hipfftDestroy(plan));
+
+        for(auto&& plan : forwardPlans)
+            hipfft_check(hipfftDestroy(plan));
+
+        for(auto&& p : transformedProjections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : expandedProjections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : projections)
+            hip_check(hipFree(p));
+
+        for(auto&& p : phantomProjections)
+            hip_check(hipFree(p.ptr));
+
+        hip_check(hipFree(R));
+
+        for(auto&& stream : streams)
+            hip_check(hipStreamDestroy(stream));
+
+        hip_check(hipDeviceSynchronize());
+
+        return EXIT_SUCCESS;
+    }
+    catch(std::runtime_error const& e)
+    {
+        std::cerr << "Caught runtime error: " << e.what() << std::endl;
+        return EXIT_FAILURE;
+    }
+}
@@ -0,0 +1,168 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+        std::cerr << "HIP error "            \
+            << status << ": "                \
+            << hipGetErrorString(status)     \
+            << " at " << __FILE__ << ":"     \
+            << __LINE__ << std::endl;        \
+    }                                        \
+}
+
+__global__ void kernelA(double* arrayA, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] *= 2.0;
+    }
+}
+
+__global__ void kernelB(int* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayB[x] = 3;
+    }
+}
+
+__global__ void kernelC(double* arrayA, const int* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] += arrayB[x];
+    }
+}
+
+struct set_vector_args
+{
+    std::vector<double>& h_array;
+    double value;
+};
+
+void set_vector(void* args)
+{
+    set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
+
+    std::vector<double>& vec{h_args.h_array};
+    vec.assign(vec.size(), h_args.value);
+}
+
+int main()
+{
+    constexpr int numOfBlocks = 1024;
+    constexpr int threadsPerBlock = 1024;
+    constexpr std::size_t arraySize = 1U << 20;
+
+    // This example assumes that kernelA operates on data that needs to be initialized on
+    // and copied from the host, while kernelB initializes the array that is passed to it.
+    // Both arrays are then used as input to kernelC, where arrayA is also used as
+    // output, which is copied back to the host, while arrayB is only read from and not modified.
+
+    double* d_arrayA;
+    int* d_arrayB;
+    std::vector<double> h_array(arraySize);
+    constexpr double initValue = 2.0;
+
+    hipStream_t captureStream;
+    HIP_CHECK(hipStreamCreate(&captureStream));
+
+    // Start capturing the operations assigned to the stream
+    HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
+
+    // hipMallocAsync and hipMemcpyAsync are needed so that the operations can be assigned to a stream
+    HIP_CHECK(hipMallocAsync(reinterpret_cast<void**>(&d_arrayA), arraySize*sizeof(double), captureStream));
+    HIP_CHECK(hipMallocAsync(reinterpret_cast<void**>(&d_arrayB), arraySize*sizeof(int), captureStream));
+
+    // Assign host function to the stream
+    // Needs a custom struct to pass the arguments
+    set_vector_args args{h_array, initValue};
+    HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));
+
+    HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));
+
+    kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
+    kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
+    kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);
+
+    HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));
+
+    HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
+    HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));
+
+    // Stop capturing
+    hipGraph_t graph;
+    HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
+
+    // Create an executable graph from the captured graph
+    hipGraphExec_t graphExec;
+    HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+
+    // The graph template can be deleted after the instantiation if it's not needed for later use
+    HIP_CHECK(hipGraphDestroy(graph));
+
+    // Actually launch the graph. The stream does not have
+    // to be the same as the one used for capturing.
+    HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
+
+    HIP_CHECK(hipStreamSynchronize(captureStream));
+
+    // Verify results
+    constexpr double expected = initValue * 2.0 + 3;
+    bool passed = true;
+    for(std::size_t i = 0; i < arraySize; ++i)
+    {
+        if(h_array[i] != expected)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
+            break;
+        }
+    }
+
+    if(passed)
+    {
+        std::cout << "Validation passed." << std::endl;
+    }
+
+    // Free graph and stream resources after usage
+    HIP_CHECK(hipGraphExecDestroy(graphExec));
+    HIP_CHECK(hipStreamDestroy(captureStream));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,226 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+        std::cerr << "HIP error "            \
+            << status << ": "                \
+            << hipGetErrorString(status)     \
+            << " at " << __FILE__ << ":"     \
+            << __LINE__ << std::endl;        \
+    }                                        \
+}
+
+__global__ void kernelA(double* arrayA, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] *= 2.0;
+    }
+}
+
+__global__ void kernelB(int* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayB[x] = 3;
+    }
+}
+
+__global__ void kernelC(double* arrayA, const int* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] += arrayB[x];
+    }
+}
+
+struct set_vector_args
+{
+    std::vector<double>& h_array;
+    double value;
+};
+
+void set_vector(void* args)
+{
+    set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
+
+    std::vector<double>& vec{h_args.h_array};
+    vec.assign(vec.size(), h_args.value);
+}
+
+int main()
+{
+    constexpr int numOfBlocks = 1024;
+    constexpr int threadsPerBlock = 1024;
+    std::size_t arraySize = 1U << 20;
+
+    // The pointers to the device memory don't need to be declared here,
+    // they are contained within the hipMemAllocNodeParams as the dptr member
+    std::vector<double> h_array(arraySize);
+    constexpr double initValue = 2.0;
+
+    // Create an empty graph
+    hipGraph_t graph;
+    HIP_CHECK(hipGraphCreate(&graph, 0));
+
+    // Parameters to allocate arrays
+    hipMemAllocNodeParams allocArrayAParams{};
+    allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
+    allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
+    allocArrayAParams.poolProps.location.id = 0; // GPU on which memory resides
+    allocArrayAParams.bytesize = arraySize * sizeof(double);
+
+    hipMemAllocNodeParams allocArrayBParams{};
+    allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
+    allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
+    allocArrayBParams.poolProps.location.id = 0; // GPU on which memory resides
+    allocArrayBParams.bytesize = arraySize * sizeof(int);
+
+    // Add the allocation nodes to the graph. They don't have any dependencies
+    hipGraphNode_t allocNodeA, allocNodeB;
+    HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
+    HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));
+
+    // Parameters for the host function
+    // Needs a custom struct to pass the arguments
+    set_vector_args args{h_array, initValue};
+    hipHostNodeParams hostParams{};
+    hostParams.fn = set_vector;
+    hostParams.userData = static_cast<void*>(&args);
+
+    // Add the host node that initializes the host array. It also doesn't have any dependencies
+    hipGraphNode_t hostNode;
+    HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));
+
+    // Add a memory copy node that copies the initialized host array to the device.
+    // It has to wait for the host array to be initialized and the device memory to be allocated
+    hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
+    hipGraphNode_t cpyToDevNode;
+    HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));
+
+    // Parameters for kernelA
+    hipKernelNodeParams kernelAParams{};
+    void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
+    kernelAParams.func = reinterpret_cast<void*>(kernelA);
+    kernelAParams.gridDim = numOfBlocks;
+    kernelAParams.blockDim = threadsPerBlock;
+    kernelAParams.sharedMemBytes = 0;
+    kernelAParams.kernelParams = kernelAArgs;
+    kernelAParams.extra = nullptr;
+
+    // Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
+    hipGraphNode_t kernelANode;
+    HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));
+
+    // Parameters for kernelB
+    hipKernelNodeParams kernelBParams{};
+    void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
+    kernelBParams.func = reinterpret_cast<void*>(kernelB);
+    kernelBParams.gridDim = numOfBlocks;
+    kernelBParams.blockDim = threadsPerBlock;
+    kernelBParams.sharedMemBytes = 0;
+    kernelBParams.kernelParams = kernelBArgs;
+    kernelBParams.extra = nullptr;
+
+    // Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
+    hipGraphNode_t kernelBNode;
+    HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));
+
+    // Parameters for kernelC
+    hipKernelNodeParams kernelCParams{};
+    void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
+    kernelCParams.func = reinterpret_cast<void*>(kernelC);
+    kernelCParams.gridDim = numOfBlocks;
+    kernelCParams.blockDim = threadsPerBlock;
+    kernelCParams.sharedMemBytes = 0;
+    kernelCParams.kernelParams = kernelCArgs;
+    kernelCParams.extra = nullptr;
+
+    // Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
+    hipGraphNode_t kernelCNode;
+    hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
+    HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));
+
+    // Copy the results back to the host. Has to wait for kernelC to finish.
+    hipGraphNode_t cpyToHostNode;
+    HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost));
+
+    // Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it.
+    hipGraphNode_t freeNodeA;
+    HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr));
+    // Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host.
+    hipGraphNode_t freeNodeB;
+    HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr));
+
+    // Instantiate the graph in order to execute it
+    hipGraphExec_t graphExec;
+    HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+
+    // The graph can be freed after the instantiation if it's not needed for other purposes
+    HIP_CHECK(hipGraphDestroy(graph));
+
+    // Actually launch the graph
+    hipStream_t graphStream;
+    HIP_CHECK(hipStreamCreate(&graphStream));
+    HIP_CHECK(hipGraphLaunch(graphExec, graphStream));
+
+    HIP_CHECK(hipStreamSynchronize(graphStream));
+
+    // Verify results
+    constexpr double expected = initValue * 2.0 + 3;
+    bool passed = true;
+    for(std::size_t i = 0; i < arraySize; ++i)
+    {
+        if(h_array[i] != expected)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
+            break;
+        }
+    }
+
+    if(passed)
+    {
+        std::cout << "Validation passed." << std::endl;
+    }
+
+    HIP_CHECK(hipGraphExecDestroy(graphExec));
+    HIP_CHECK(hipStreamDestroy(graphStream));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,59 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if (err != hipSuccess)                                   \
+    {                                                        \
+        std::cout << "HIP Error: " << hipGetErrorString(err) \
+              << " at line " << __LINE__ << std::endl;       \
+        std::exit(EXIT_FAILURE);                             \
+    }                                                        \
+}
+
+int main()
+{
+    int deviceCount;
+    HIP_CHECK(hipGetDeviceCount(&deviceCount));
+
+    int device = 0; // Query first available GPU. Can be replaced with any
+                    // integer up to, not including, deviceCount
+    hipDeviceProp_t deviceProp;
+    HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));
+
+    std::cout << "The queried device ";
+    if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
+        std::cout << "supports";
+    else
+        std::cout << "does not support";
+    std::cout << " shared int32 atomic operations" << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,59 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if (err != hipSuccess)                                   \
+    {                                                        \
+        std::cout << "HIP Error: " << hipGetErrorString(err) \
+              << " at line " << __LINE__ << std::endl;       \
+        std::exit(EXIT_FAILURE);                             \
+    }                                                        \
+}
+
+int main()
+{
+    int deviceCount;
+    HIP_CHECK(hipGetDeviceCount(&deviceCount));
+
+    int device = 0; // Query first available GPU. Can be replaced with any
+                    // integer up to, not including, deviceCount
+    hipDeviceProp_t deviceProp;
+    HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));
+
+    std::cout << "The queried device ";
+    if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
+        std::cout << "supports";
+    else
+        std::cout << "does not support";
+    std::cout << " shared int32 atomic operations" << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,48 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+
+int main()
+{
+    // [sphinx-amd-start]
+#ifdef __HIP_PLATFORM_AMD__
+    // This code path is compiled when amdclang++ is used for compilation
+#endif
+    // [sphinx-amd-end]
+
+    // [sphinx-nvidia-start]
+#ifdef __HIP_PLATFORM_NVIDIA__
+    // This code path is compiled when nvcc is used for compilation
+    // Could be compiling with CUDA language extensions enabled (for example, a .cu file)
+    // Could be in pass-through mode to an underlying host compiler (for example, a .cpp file)
+#endif
+    // [sphinx-nvidia-end]
+
+#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_PLATFORM_NVIDIA__)
+#   error "No compatible HIP platform defined!"
+#endif
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,52 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+__host__ __device__ void call_func()
+{
+    #ifdef __HIP_DEVICE_COMPILE__
+        printf("device\n");
+    #else
+        std::cout << "host" << std::endl;
+    #endif
+}
+
+__global__ void test_kernel()
+{
+    call_func();
+}
+
+int main()
+{
+    test_kernel<<<1, 1, 0, 0>>>();
+    if(auto err = hipDeviceSynchronize(); err != hipSuccess)
+        std::cerr << "HIP error " << err << ": " << hipGetErrorString(err) << std::endl;
+
+    call_func();
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,75 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "example_utils.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+// [sphinx-kernel-start]
+__global__ void kernel_memory_allocation()
+{
+  // The pointer is stored in shared memory, so that all
+  // threads of the block can access the pointer
+  __shared__ int *memory;
+
+  std::size_t blockSize = blockDim.x;
+  constexpr std::size_t elementsPerThread = 1024;
+  if(threadIdx.x == 0)
+  {
+    // allocate memory in one contiguous block
+    memory = new int[blockDim.x * elementsPerThread];
+  }
+  __syncthreads();
+
+  // load pointer into thread-local variable to avoid
+  // unnecessary accesses to shared memory
+  int *localPtr = memory;
+
+  // work with allocated memory, e.g. initialization
+  for(int i = 0; i < elementsPerThread; ++i)
+  {
+    // access in a contiguous way
+    localPtr[i * blockSize + threadIdx.x] = i;
+  }
+
+  // synchronize to make sure no thread is accessing the memory before freeing
+  __syncthreads();
+  if(threadIdx.x == 0)
+  {
+    delete[] memory;
+  }
+}
+// [sphinx-kernel-end]
+
+int main()
+{
+    kernel_memory_allocation<<<64, 1024>>>();
+    HIP_CHECK(hipGetLastError());
+    HIP_CHECK(hipDeviceSynchronize());
+    std::cout << "Success!" << std::endl;
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,91 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if(err != hipSuccess)                                    \
+    {                                                        \
+        std::cerr << "HIP error: " << hipGetErrorString(err) \
+            << " at " << __LINE__ << "\n";                   \
+    }                                                        \
+}
+
+// Performs a simple initialization of an array with the thread's index variables.
+// This function is only available in device code.
+__device__ void init_array(float * const a, const unsigned int arraySize)
+{
+    // globalIdx uniquely identifies a thread in a 1D launch configuration.
+    const int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
+    // Each thread initializes a single element of the array.
+    if(globalIdx < arraySize)
+    {
+        a[globalIdx] = globalIdx;
+    }
+}
+
+// Computes ceil(number / multiple), i.e. how many chunks of size `multiple` are needed to cover `number`.
+// This function is available in host and device code.
+__host__ __device__ constexpr int round_up_to_nearest_multiple(int number, int multiple)
+{
+    return (number + multiple - 1)/multiple;
+}
+
+__global__
+__launch_bounds__(512, 4) // This kernel requires at most 512 threads per block and at least 4 warps per execution unit.
+void example_kernel(float * const a, const unsigned int N)
+{
+    // Initialize array.
+    init_array(a, N);
+    // Perform additional work:
+    // - work with the array
+    // - use the array in a different kernel
+    // - ...
+}
+
+int main()
+{
+    constexpr int N = 100000000; // problem size
+    constexpr int blockSize = 256; // configurable block size
+
+    // number of blocks needed for the given problem size
+    constexpr int gridSize = round_up_to_nearest_multiple(N, blockSize);
+
+    float *a;
+    // allocate memory on the GPU
+    HIP_CHECK(hipMalloc(&a, sizeof(*a) * N));
+
+    std::cout << "Launching kernel." << std::endl;
+    example_kernel<<<dim3(gridSize), dim3(blockSize), 0/*example doesn't use shared memory*/, 0/*default stream*/>>>(a, N);
+    // Wait for the kernel to finish. Until this call, the CPU is free to
+    // do other work while the kernel runs.
+    HIP_CHECK(hipDeviceSynchronize());
+    std::cout << "Kernel execution finished." << std::endl;
+
+    HIP_CHECK(hipFree(a));
+}
+// [sphinx-end]
@@ -0,0 +1,200 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <hip/hiprtc.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#define CHECK_RET_CODE(call, ret_code)                                                             \
+{                                                                                                  \
+    if ((call) != ret_code)                                                                        \
+    {                                                                                              \
+        std::cout << "Failed in call: " << #call << std::endl;                                     \
+        std::abort();                                                                              \
+    }                                                                                              \
+}
+#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+// source code for hiprtc
+static constexpr auto kernel_source{
+    R"(
+    extern "C"
+    __global__ void vector_add(float* output, float* input1, float* input2, size_t size)
+    {
+        int i = threadIdx.x;
+        if (i < size)
+        {
+            output[i] = input1[i] + input2[i];
+        }
+    }
+)"};
+
+int main()
+{
+    hiprtcProgram prog;
+    auto rtc_ret_code = hiprtcCreateProgram(&prog,            // HIPRTC program handle
+                                            kernel_source,    // kernel source string
+                                            "vector_add.cpp", // Name of the file
+                                            0,                // Number of headers
+                                            nullptr,          // Header sources
+                                            nullptr);         // Name of header file
+
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to create program" << std::endl;
+        std::abort();
+    }
+
+    // [sphinx-options-start]
+    auto sarg = std::string{"-fgpu-rdc"};
+    const char* compile_options[] = {sarg.c_str()};
+
+    rtc_ret_code = hiprtcCompileProgram(prog,      // hiprtcProgram
+                                        1,         // Number of options
+                                        compile_options);
+    // [sphinx-options-end]
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to compile program" << std::endl;
+        std::abort();
+    }
+
+    std::size_t logSize;
+    HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
+
+    if (logSize)
+    {
+        std::string log(logSize, '\0');
+        HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
+        std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
+        std::abort();
+    }
+
+    // [sphinx-bitcode-start]
+    std::size_t bitCodeSize;
+    HIPRTC_CHECK(hiprtcGetBitcodeSize(prog, &bitCodeSize));
+
+    std::vector<char> kernel_bitcode(bitCodeSize);
+    HIPRTC_CHECK(hiprtcGetBitcode(prog, kernel_bitcode.data()));
+    // [sphinx-bitcode-end]
+
+    HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+    auto num_options = 0u;
+    hiprtcJIT_option* options = nullptr;
+    void* option_vals[] = {nullptr};
+    auto rtc_link_state = hiprtcLinkState{};
+    // [sphinx-link-create-start]
+    HIPRTC_CHECK(hiprtcLinkCreate(num_options,           // number of options
+                                  options,               // Array of options
+                                  option_vals,           // Array of option values cast to void*
+                                  &rtc_link_state));     // HIPRTC link state created upon success
+    // [sphinx-link-create-end]
+
+    auto input_type = HIPRTC_JIT_INPUT_LLVM_BITCODE;
+    auto bit_code_ptr = kernel_bitcode.data();
+    auto bit_code_size = bitCodeSize;
+    // [sphinx-link-add-start]
+    HIPRTC_CHECK(hiprtcLinkAddData(rtc_link_state,        // HIPRTC link state
+                                   input_type,            // type of the input data or bitcode
+                                   bit_code_ptr,          // input data which is null terminated
+                                   bit_code_size,         // size of the input data
+                                   "a",                   // optional name for this input
+                                   0,                     // size of the options
+                                   nullptr,               // Array of options applied to this input
+                                   nullptr));             // Array of option values cast to void*
+    // [sphinx-link-add-end]
+
+    void* binary = nullptr;
+    auto binarySize = std::size_t{};
+    // [sphinx-link-complete-start]
+    HIPRTC_CHECK(hiprtcLinkComplete(rtc_link_state,       // HIPRTC link state
+                                    &binary,              // upon success, points to the output binary
+                                    &binarySize));        // size of the binary is stored (optional)
+    // [sphinx-link-complete-end]
+
+    hipModule_t module;
+    hipFunction_t kernel;
+
+    HIP_CHECK(hipModuleLoadData(&module, binary));
+    HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
+
+    HIPRTC_CHECK(hiprtcLinkDestroy(rtc_link_state));
+
+    constexpr std::size_t ele_size = 256;  // total number of items to add
+    std::vector<float> hinput, output;
+    hinput.reserve(ele_size);
+    output.reserve(ele_size);
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        hinput.push_back(static_cast<float>(i + 1));
+        output.push_back(0.0f);
+    }
+
+    float *dinput1, *dinput2, *doutput;
+    HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
+
+    HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+
+    struct
+    {
+        float* output;
+        float* input1;
+        float* input2;
+        std::size_t size;
+    } args{doutput, dinput1, dinput2, ele_size};
+
+    auto size = sizeof(args);
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
+                      HIP_LAUNCH_PARAM_END};
+
+    HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
+
+    HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
+
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        if ((hinput[i] + hinput[i]) != output[i])
+        {
+            std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
+            std::abort();
+        }
+    }
+    std::cout << "Passed" << std::endl;
+
+    HIP_CHECK(hipFree(dinput1));
+    HIP_CHECK(hipFree(dinput2));
+    HIP_CHECK(hipFree(doutput));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-stop]
@@ -0,0 +1,219 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <hip/hiprtc.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <fstream>
+#include <ios>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#if __has_include(<filesystem>)
+    #include <filesystem>
+    namespace fs = std::filesystem;
+#elif __has_include(<experimental/filesystem>)
+    #include <experimental/filesystem>
+    namespace fs = std::experimental::filesystem;
+#else
+    static_assert(false, "filesystem not available");
+#endif
+
+
+#define CHECK_RET_CODE(call, ret_code)                                                             \
+{                                                                                                  \
+    if ((call) != ret_code)                                                                        \
+    {                                                                                              \
+        std::cout << "Failed in call: " << #call << std::endl;                                     \
+        std::abort();                                                                              \
+    }                                                                                              \
+}
+#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+// source code for hiprtc
+static constexpr auto kernel_source{
+    R"(
+    extern "C"
+    __global__ void vector_add(float* output, float* input1, float* input2, size_t size)
+    {
+        int i = threadIdx.x;
+        if (i < size)
+        {
+            output[i] = input1[i] + input2[i];
+        }
+    }
+)"};
+
+int main()
+{
+    hiprtcProgram prog;
+    auto rtc_ret_code = hiprtcCreateProgram(&prog,            // HIPRTC program handle
+                                            kernel_source,    // kernel source string
+                                            "vector_add.cpp", // Name of the file
+                                            0,                // Number of headers
+                                            nullptr,          // Header sources
+                                            nullptr);         // Name of header file
+
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to create program" << std::endl;
+        std::abort();
+    }
+
+    // [sphinx-options-start]
+    auto sarg = std::string{"-fgpu-rdc"};
+    const char* compile_options[] = {sarg.c_str()};
+
+    rtc_ret_code = hiprtcCompileProgram(prog,      // hiprtcProgram
+                                        1,         // Number of options
+                                        compile_options);
+    // [sphinx-options-end]
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to compile program" << std::endl;
+        std::abort();
+    }
+
+    std::size_t logSize;
+    HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
+
+    if (logSize)
+    {
+        std::string log(logSize, '\0');
+        HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
+        std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
+        std::abort();
+    }
+
+    // [sphinx-bitcode-start]
+    std::size_t bitCodeSize;
+    HIPRTC_CHECK(hiprtcGetBitcodeSize(prog, &bitCodeSize));
+
+    std::vector<char> kernel_bitcode(bitCodeSize);
+    HIPRTC_CHECK(hiprtcGetBitcode(prog, kernel_bitcode.data()));
+    // [sphinx-bitcode-end]
+
+    HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+    auto num_options = 0u;
+    hiprtcJIT_option* options = nullptr;
+    void* option_vals[] = {nullptr};
+    auto rtc_link_state = hiprtcLinkState{};
+    // [sphinx-link-create-start]
+    HIPRTC_CHECK(hiprtcLinkCreate(num_options,           // number of options
+                                  options,               // Array of options
+                                  option_vals,           // Array of option values cast to void*
+                                  &rtc_link_state));     // HIPRTC link state created upon success
+    // [sphinx-link-create-end]
+
+    auto input_type = HIPRTC_JIT_INPUT_LLVM_BITCODE;
+    auto bc_file_path = std::string{"bitcode.bc"};
+    auto bc_file = std::fstream{bc_file_path.c_str(), std::ios::binary | std::ios::out};
+    if(!bc_file.is_open())
+    {
+        std::cerr << "Could not open bitcode file for writing!" << std::endl;
+        std::abort();
+    }
+    bc_file.write(kernel_bitcode.data(), bitCodeSize);
+    bc_file.close();
+    // [sphinx-link-add-start]
+    HIPRTC_CHECK(hiprtcLinkAddFile(rtc_link_state,        // HIPRTC link state
+                                   input_type,            // type of the input data or bitcode
+                                   bc_file_path.c_str(),  // path to the bitcode file (null-terminated string)
+                                   0,                     // size of the options
+                                   nullptr,               // Array of options applied to this input
+                                   nullptr));             // Array of option values cast to void*
+    // [sphinx-link-add-end]
+    fs::remove(bc_file_path);
+
+    void* binary = nullptr;
+    auto binarySize = std::size_t{};
+    // [sphinx-link-complete-start]
+    HIPRTC_CHECK(hiprtcLinkComplete(rtc_link_state,       // HIPRTC link state
+                                    &binary,              // upon success, points to the output binary
+                                    &binarySize));        // size of the binary is stored (optional)
+    // [sphinx-link-complete-end]
+
+    hipModule_t module;
+    hipFunction_t kernel;
+
+    HIP_CHECK(hipModuleLoadData(&module, binary));
+    HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
+
+    HIPRTC_CHECK(hiprtcLinkDestroy(rtc_link_state));
+
+    constexpr std::size_t ele_size = 256;  // total number of items to add
+    std::vector<float> hinput, output;
+    hinput.reserve(ele_size);
+    output.reserve(ele_size);
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        hinput.push_back(static_cast<float>(i + 1));
+        output.push_back(0.0f);
+    }
+
+    float *dinput1, *dinput2, *doutput;
+    HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
+
+    HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+
+    struct
+    {
+        float* output;
+        float* input1;
+        float* input2;
+        std::size_t size;
+    } args{doutput, dinput1, dinput2, ele_size};
+
+    auto size = sizeof(args);
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
+                      HIP_LAUNCH_PARAM_END};
+
+    HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
+
+    HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
+
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        if ((hinput[i] + hinput[i]) != output[i])
+        {
+            std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
+            std::abort();
+        }
+    }
+    std::cout << "Passed" << std::endl;
+
+    HIP_CHECK(hipFree(dinput1));
+    HIP_CHECK(hipFree(dinput2));
+    HIP_CHECK(hipFree(doutput));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-stop]
@@ -0,0 +1,200 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <hip/hiprtc.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#define CHECK_RET_CODE(call, ret_code)                                                             \
+{                                                                                                  \
+    if ((call) != ret_code)                                                                        \
+    {                                                                                              \
+        std::cout << "Failed in call: " << #call << std::endl;                                     \
+        std::abort();                                                                              \
+    }                                                                                              \
+}
+#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+// source code for hiprtc
+static constexpr auto kernel_source{
+    R"(
+    extern "C"
+    __global__ void vector_add(float* output, float* input1, float* input2, size_t size)
+    {
+        int i = threadIdx.x;
+        if (i < size)
+        {
+            output[i] = input1[i] + input2[i];
+        }
+    }
+)"};
+
+int main()
+{
+    hiprtcProgram prog;
+    auto rtc_ret_code = hiprtcCreateProgram(&prog,            // HIPRTC program handle
+                                            kernel_source,    // kernel source string
+                                            "vector_add.cpp", // Name of the file
+                                            0,                // Number of headers
+                                            nullptr,          // Header sources
+                                            nullptr);         // Name of header file
+
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to create program" << std::endl;
+        std::abort();
+    }
+
+    // [sphinx-options-start]
+    auto sarg = std::string{"-fgpu-rdc"};
+    const char* compile_options[] = {sarg.c_str()};
+
+    rtc_ret_code = hiprtcCompileProgram(prog,      // hiprtcProgram
+                                        1,         // Number of options
+                                        compile_options);
+    // [sphinx-options-end]
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to compile program" << std::endl;
+        std::abort();
+    }
+
+    std::size_t logSize;
+    HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
+
+    if (logSize)
+    {
+        std::string log(logSize, '\0');
+        HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
+        std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
+        std::abort();
+    }
+
+    // [sphinx-bitcode-start]
+    std::size_t bitCodeSize;
+    HIPRTC_CHECK(hiprtcGetBitcodeSize(prog, &bitCodeSize));
+
+    std::vector<char> kernel_bitcode(bitCodeSize);
+    HIPRTC_CHECK(hiprtcGetBitcode(prog, kernel_bitcode.data()));
+    // [sphinx-bitcode-end]
+
+    HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+    // [sphinx-link-create-start]
+    const char* isaopts[] = {"-mllvm", "-inline-threshold=1", "-mllvm", "-inlinehint-threshold=1"};
+    std::vector<hiprtcJIT_option> jit_options = {HIPRTC_JIT_IR_TO_ISA_OPT_EXT,
+                                                 HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT};
+    std::size_t isaoptssize = 4;
+    void* lopts[] = {reinterpret_cast<void*>(isaopts),
+                     reinterpret_cast<void*>(isaoptssize)};
+    hiprtcLinkState linkstate;
+    HIPRTC_CHECK(hiprtcLinkCreate(2u, jit_options.data(), reinterpret_cast<void**>(lopts), &linkstate));
+    // [sphinx-link-create-end]
+
+    auto input_type = HIPRTC_JIT_INPUT_LLVM_BITCODE;
+    auto bit_code_ptr = kernel_bitcode.data();
+    auto bit_code_size = bitCodeSize;
+    // [sphinx-link-add-start]
+    HIPRTC_CHECK(hiprtcLinkAddData(linkstate,             // HIPRTC link state
+                                   input_type,            // type of the input data or bitcode
+                                   bit_code_ptr,          // input data which is null terminated
+                                   bit_code_size,         // size of the input data
+                                   "a",                   // optional name for this input
+                                   0,                     // size of the options
+                                   nullptr,               // Array of options applied to this input
+                                   nullptr));             // Array of option values cast to void*
+    // [sphinx-link-add-end]
+
+    void* binary = nullptr;
+    auto binarySize = std::size_t{};
+    // [sphinx-link-complete-start]
+    HIPRTC_CHECK(hiprtcLinkComplete(linkstate,            // HIPRTC link state
+                                    &binary,              // upon success, points to the output binary
+                                    &binarySize));        // size of the binary is stored (optional)
+    // [sphinx-link-complete-end]
+
+    hipModule_t module;
+    hipFunction_t kernel;
+
+    HIP_CHECK(hipModuleLoadData(&module, binary));
+    HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
+
+    HIPRTC_CHECK(hiprtcLinkDestroy(linkstate));
+
+    constexpr std::size_t ele_size = 256;  // total number of items to add
+    std::vector<float> hinput, output;
+    hinput.reserve(ele_size);
+    output.reserve(ele_size);
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        hinput.push_back(static_cast<float>(i + 1));
+        output.push_back(0.0f);
+    }
+
+    float *dinput1, *dinput2, *doutput;
+    HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
+    HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
+
+    HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+
+    struct
+    {
+        float* output;
+        float* input1;
+        float* input2;
+        std::size_t size;
+    } args{doutput, dinput1, dinput2, ele_size};
+
+    auto size = sizeof(args);
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
+                      HIP_LAUNCH_PARAM_END};
+
+    HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
+
+    HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
+
+    for (std::size_t i = 0; i < ele_size; i++)
+    {
+        if ((hinput[i] + hinput[i]) != output[i])
+        {
+            std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
+            std::abort();
+        }
+    }
+    std::cout << "Passed" << std::endl;
+
+    HIP_CHECK(hipFree(dinput1));
+    HIP_CHECK(hipFree(dinput2));
+    HIP_CHECK(hipFree(doutput));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-stop]
@@ -0,0 +1,107 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if (err != hipSuccess)                                   \
+    {                                                        \
+        std::cout << "HIP Error: " << hipGetErrorString(err) \
+              << " at line " << __LINE__ << std::endl;       \
+        std::exit(EXIT_FAILURE);                             \
+    }                                                        \
+}
+
+int main()
+{
+    std::size_t elements = 64*1024;
+    std::size_t size_bytes = elements * sizeof(float);
+
+    std::vector<float> A(elements), B(elements);
+
+    // On NVIDIA platforms the driver runtime needs to be initialized
+    #ifdef __HIP_PLATFORM_NVIDIA__
+    HIP_CHECK(hipInit(0));
+    hipDevice_t device;
+    hipCtx_t context;
+    HIP_CHECK(hipDeviceGet(&device, 0));
+    HIP_CHECK(hipCtxCreate(&context, 0, device));
+    #endif
+
+    // Allocate device memory
+    hipDeviceptr_t d_A, d_B;
+    HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_A), size_bytes));
+    HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_B), size_bytes));
+
+    // Copy data to device
+    HIP_CHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
+    HIP_CHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
+
+    // Load module
+    hipModule_t Module;
+    // For AMD the module file has to contain architecture specific object code
+    // For NVIDIA the module file has to contain PTX, found in e.g. "vcpy_isa.ptx"
+    #ifdef __HIP_PLATFORM_AMD__
+    HIP_CHECK(hipModuleLoad(&Module, "vcpy_isa.hsaco"));
+    #elif defined(__HIP_PLATFORM_NVIDIA__)
+    HIP_CHECK(hipModuleLoad(&Module, "vcpy_isa.ptx"));
+    #endif
+    // Get kernel function from the module via its name
+    hipFunction_t Function;
+    HIP_CHECK(hipModuleGetFunction(&Function, Module, "hello_world"));
+
+    // Create buffer for kernel arguments
+    std::vector<void*> argBuffer{reinterpret_cast<void*>(d_A), reinterpret_cast<void*>(d_B)};
+    std::size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
+
+    // Create configuration passed to the kernel as arguments
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
+                      HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes,
+                      HIP_LAUNCH_PARAM_END};
+
+    int threads_per_block = 128;
+    int blocks = (elements + threads_per_block - 1) / threads_per_block;
+
+    // Actually launch kernel
+    HIP_CHECK(hipModuleLaunchKernel(Function, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
+
+    HIP_CHECK(hipMemcpyDtoH(A.data(), d_A, size_bytes));
+    HIP_CHECK(hipMemcpyDtoH(B.data(), d_B, size_bytes));
+
+    HIP_CHECK(hipFree(reinterpret_cast<void*>(d_A)));
+    HIP_CHECK(hipFree(reinterpret_cast<void*>(d_B)));
+
+    #ifdef __HIP_PLATFORM_NVIDIA__
+    HIP_CHECK(hipCtxDestroy(context));
+    #endif
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,145 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <fstream>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <vector>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if (err != hipSuccess)                                   \
+    {                                                        \
+        std::cout << "HIP Error: " << hipGetErrorString(err) \
+              << " at line " << __LINE__ << std::endl;       \
+        std::exit(EXIT_FAILURE);                             \
+    }                                                        \
+}
+
+void* populate_data_pointer()
+{
+#ifdef __HIP_PLATFORM_AMD__
+    auto filename = std::string{"myKernel.hsaco"};
+#elif defined(__HIP_PLATFORM_NVIDIA__)
+    auto filename = std::string{"myKernel.ptx"};
+#endif
+    std::fstream file{filename, std::ios::in | std::ios::binary | std::ios::ate};
+    if(!file.is_open())
+    {
+        std::cerr << "Error opening file " << filename << std::endl;
+        std::exit(EXIT_FAILURE);
+    }
+
+    auto filesize = file.tellg();
+    auto storage = new char[filesize];
+
+    file.seekg(0, std::ios::beg);
+    file.read(storage, filesize);
+
+    return storage;
+}
+
+int main()
+{
+    std::size_t elements = 64*1024;
+    std::size_t size_bytes = elements * sizeof(float);
+
+    std::vector<float> A(elements), B(elements);
+
+    // On NVIDIA platforms the driver runtime needs to be initialized
+    #ifdef __HIP_PLATFORM_NVIDIA__
+    HIP_CHECK(hipInit(0));
+    hipDevice_t device;
+    hipCtx_t context;
+    HIP_CHECK(hipDeviceGet(&device, 0));
+    HIP_CHECK(hipCtxCreate(&context, 0, device));
+    #endif
+
+    // Allocate device memory
+    hipDeviceptr_t d_A, d_B;
+    HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_A), size_bytes));
+    HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_B), size_bytes));
+
+    // Copy data to device
+    HIP_CHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
+    HIP_CHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
+
+    // Load module
+
+    // For AMD the module file has to contain architecture specific object code
+    // For NVIDIA the module file has to contain PTX, found in e.g. "myKernel.ptx"
+    // [sphinx-start]
+    hipModule_t module;
+    void* imagePtr = populate_data_pointer();
+
+    const int numOptions = 1;
+    hipJitOption options[numOptions];
+    void *optionValues[numOptions];
+
+    options[0] = hipJitOptionMaxRegisters;
+    unsigned maxRegs = 15;
+    optionValues[0] = static_cast<void*>(&maxRegs);
+
+    // On the HIP-Clang (AMD) path this behaves like hipModuleLoadData(&module, imagePtr) and the JIT options are
+    // ignored; on the NVCC path cuModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues) is called.
+    HIP_CHECK(hipModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues));
+
+    // Get kernel function from the module via its name
+    hipFunction_t k;
+    HIP_CHECK(hipModuleGetFunction(&k, module, "myKernel"));
+    // [sphinx-end]
+
+    // Create buffer for kernel arguments
+    std::vector<void*> argBuffer{reinterpret_cast<void*>(d_A), reinterpret_cast<void*>(d_B)};
+    std::size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
+
+    // Create configuration passed to the kernel as arguments
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
+                      HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes,
+                      HIP_LAUNCH_PARAM_END};
+
+    int threads_per_block = 128;
+    int blocks = (elements + threads_per_block - 1) / threads_per_block;
+
+    // Actually launch kernel
+    HIP_CHECK(hipModuleLaunchKernel(k, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
+
+    HIP_CHECK(hipMemcpyDtoH(A.data(), d_A, size_bytes));
+    HIP_CHECK(hipMemcpyDtoH(B.data(), d_B, size_bytes));
+
+    HIP_CHECK(hipFree(reinterpret_cast<void*>(d_A)));
+    HIP_CHECK(hipFree(reinterpret_cast<void*>(d_B)));
+
+    #ifdef __HIP_PLATFORM_NVIDIA__
+    HIP_CHECK(hipCtxDestroy(context));
+    #endif
+
+    delete[] static_cast<char*>(imagePtr);
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,134 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <cuda.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <fstream>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <vector>
+
+#define CUDA_CHECK(expression)                                          \
+{                                                                       \
+    const CUresult err = expression;                                    \
+    if (err != CUDA_SUCCESS)                                            \
+    {                                                                   \
+        const char* err_str{nullptr};                                   \
+        cuGetErrorString(err, &err_str);                                \
+        std::cerr << "CUDA Error: " << err_str                          \
+                  << " at line " << __LINE__ << std::endl;              \
+        std::exit(EXIT_FAILURE);                                        \
+    }                                                                   \
+}
+
+void* populate_data_pointer()
+{
+    auto filename = std::string{"myKernel.ptx"};
+    std::fstream file{filename, std::ios::in | std::ios::binary | std::ios::ate};
+    if(!file.is_open())
+    {
+        std::cerr << "Error opening file " << filename << std::endl;
+        std::exit(EXIT_FAILURE);
+    }
+
+    auto filesize = file.tellg();
+    auto storage = new char[filesize];
+
+    file.seekg(0, std::ios::beg);
+    file.read(storage, filesize);
+
+    return storage;
+}
+
+int main()
+{
+    std::size_t elements = 64*1024;
+    std::size_t size_bytes = elements * sizeof(float);
+
+    std::vector<float> A(elements), B(elements);
+
+    // The CUDA driver API needs to be initialized before any other driver call
+    CUDA_CHECK(cuInit(0));
+    CUdevice device;
+    CUcontext context;
+    CUDA_CHECK(cuDeviceGet(&device, 0));
+    CUDA_CHECK(cuCtxCreate(&context, 0, device));
+
+    // Allocate device memory
+    CUdeviceptr d_A, d_B;
+    CUDA_CHECK(cuMemAlloc(&d_A, size_bytes));
+    CUDA_CHECK(cuMemAlloc(&d_B, size_bytes));
+
+    // Copy data to device
+    CUDA_CHECK(cuMemcpyHtoD(d_A, A.data(), size_bytes));
+    CUDA_CHECK(cuMemcpyHtoD(d_B, B.data(), size_bytes));
+
+    // Load module
+
+    // For NVIDIA the module file has to contain PTX, found in e.g. "myKernel.ptx"
+    // [sphinx-start]
+    CUmodule module;
+    void* imagePtr = populate_data_pointer();
+
+    const int numOptions = 1;
+    CUjit_option options[numOptions];
+    void *optionValues[numOptions];
+
+    options[0] = CU_JIT_MAX_REGISTERS;
+    unsigned maxRegs = 15;
+    optionValues[0] = (void *)(&maxRegs);
+
+    CUDA_CHECK(cuModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues));
+
+    CUfunction k;
+    CUDA_CHECK(cuModuleGetFunction(&k, module, "myKernel"));
+    // [sphinx-end]
+
+    // Create buffer for kernel arguments
+    std::vector<void*> argBuffer{reinterpret_cast<void*>(d_A), reinterpret_cast<void*>(d_B)};
+    std::size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
+
+    // Create configuration passed to the kernel as arguments
+    void* config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
+                      CU_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes, CU_LAUNCH_PARAM_END};
+
+    int threads_per_block = 128;
+    int blocks = (elements + threads_per_block - 1) / threads_per_block;
+
+    // Actually launch kernel
+    CUDA_CHECK(cuLaunchKernel(k, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
+
+    CUDA_CHECK(cuMemcpyDtoH(A.data(), d_A, size_bytes));
+    CUDA_CHECK(cuMemcpyDtoH(B.data(), d_B, size_bytes));
+
+    CUDA_CHECK(cuMemFree(d_A));
+    CUDA_CHECK(cuMemFree(d_B));
+
+    CUDA_CHECK(cuCtxDestroy(context));
+
+    delete[] static_cast<char*>(imagePtr);
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,111 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_fp16.h>
+#include <hip/hip_runtime.h>
+
+#include <cmath>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define hip_check(hip_call)                                                    \
+{                                                                              \
+    auto hip_res = hip_call;                                                   \
+    if (hip_res != hipSuccess) {                                               \
+        std::cerr << "Failed in HIP call: " << #hip_call                       \
+                << " at " << __FILE__ << ":" << __LINE__                       \
+                << " with error: " << hipGetErrorString(hip_res) << std::endl; \
+        std::abort();                                                          \
+    }                                                                          \
+}
+
+__global__ void add_half_precision(__half* in1, __half* in2, float* out, size_t size)
+{
+    int idx = threadIdx.x;
+    if (idx < size)
+    {
+        // Add in half precision, then convert the sum to float and store it
+        float sum = __half2float(in1[idx] + in2[idx]);
+        out[idx] = sum;
+    }
+}
+
+int main()
+{
+    constexpr size_t size = 32;
+    constexpr float tolerance = 1e-1f;  // Allowable numerical difference
+
+    // Initialize input vectors as floats
+    std::vector<float> in1(size), in2(size);
+    for (size_t i = 0; i < size; i++) {
+        in1[i] = i + 1.1f;
+        in2[i] = i + 2.2f;
+    }
+
+    // Compute expected results in full precision on CPU
+    std::vector<float> cpu_out(size);
+    for (size_t i = 0; i < size; i++) {
+        cpu_out[i] = in1[i] + in2[i];  // Direct float addition
+    }
+
+    // Allocate device memory (store input as half, output as float)
+    __half *d_in1, *d_in2;
+    float *d_out;
+    hip_check(hipMalloc(&d_in1, sizeof(__half) * size));
+    hip_check(hipMalloc(&d_in2, sizeof(__half) * size));
+    hip_check(hipMalloc(&d_out, sizeof(float) * size));
+
+    // Convert input to half and copy to device
+    std::vector<__half> in1_half(size), in2_half(size);
+    for (size_t i = 0; i < size; i++)
+    {
+        in1_half[i] = __float2half(in1[i]);
+        in2_half[i] = __float2half(in2[i]);
+    }
+
+    hip_check(hipMemcpy(d_in1, in1_half.data(), sizeof(__half) * size, hipMemcpyHostToDevice));
+    hip_check(hipMemcpy(d_in2, in2_half.data(), sizeof(__half) * size, hipMemcpyHostToDevice));
+
+    // Launch kernel
+    add_half_precision<<<1, size>>>(d_in1, d_in2, d_out, size);
+
+    // Copy result back to host
+    std::vector<float> gpu_out(size, 0.0f);
+    hip_check(hipMemcpy(gpu_out.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost));
+
+    // Free device memory
+    hip_check(hipFree(d_in1));
+    hip_check(hipFree(d_in2));
+    hip_check(hipFree(d_out));
+
+    // Validation with tolerance
+    for (size_t i = 0; i < size; i++)
+    {
+        if (std::fabs(cpu_out[i] - gpu_out[i]) > tolerance)
+        {
+            std::cerr << "Mismatch at index " << i
+                      << ": CPU result = " << cpu_out[i]
+                      << ", GPU result = " << gpu_out[i] << std::endl;
+            std::abort();
+        }
+    }
+
+    std::cout << "Success: CPU and GPU half-precision addition match within tolerance!" << std::endl;
+}
@@ -0,0 +1,130 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_fp8.h>
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#define hip_check(hip_call)                                                      \
+{                                                                                \
+    auto hip_res = hip_call;                                                     \
+    if (hip_res != hipSuccess)                                                   \
+    {                                                                            \
+        std::cerr << "Failed in HIP call: " << #hip_call                         \
+                  << " at " << __FILE__ << ":" << __LINE__                       \
+                  << " with error: " << hipGetErrorString(hip_res) << std::endl; \
+        std::exit(EXIT_FAILURE);                                                 \
+    }                                                                            \
+}
+
+__device__ __hip_fp8_storage_t d_convert_float_to_fp8(float in, __hip_fp8_interpretation_t interpret, __hip_saturation_t sat)
+{
+    return __hip_cvt_float_to_fp8(in, sat, interpret);
+}
+
+__global__ void float_to_fp8_to_float(float *in, __hip_fp8_interpretation_t interpret, __hip_saturation_t sat, float *out, size_t size)
+{
+    int i = threadIdx.x;
+    if (i < size)
+    {
+        auto fp8 = d_convert_float_to_fp8(in[i], interpret, sat);
+        // __hip_fp8_storage_t is a raw 8-bit storage type, so this stores the encoded byte value as a float
+        out[i] = fp8;
+    }
+}
+
+__hip_fp8_storage_t convert_float_to_fp8(float in,                             /* Input val */
+                                         __hip_fp8_interpretation_t interpret, /* interpretation of number E4M3/E5M2 */
+                                         __hip_saturation_t sat                /* Saturation behavior */
+                                        )
+{
+    return __hip_cvt_float_to_fp8(in, sat, interpret);
+}
+
+int main()
+{
+    constexpr size_t size = 32;
+    hipDeviceProp_t prop;
+    hip_check(hipGetDeviceProperties(&prop, 0));
+    bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); // gfx94x
+    if(!is_supported)
+    {
+        std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl;
+        std::cerr << "No device conversions are supported, only host conversions are supported." << std::endl;
+        return EXIT_SUCCESS;
+    }
+
+    const __hip_fp8_interpretation_t interpret = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos)
+                                                    ? __HIP_E4M3_FNUZ // gfx94x
+                                                    : __HIP_E4M3;
+    constexpr __hip_saturation_t sat = __HIP_SATFINITE;
+
+    std::vector<float> in;
+    in.reserve(size);
+    for (size_t i = 0; i < size; i++)
+        in.push_back(i + 1.1f);
+
+    std::cout << "Converting float to fp8 and back..." << std::endl;
+    // CPU convert
+    std::vector<float> cpu_out;
+    cpu_out.reserve(size);
+    for (const auto &fval : in)
+    {
+        auto fp8 = convert_float_to_fp8(fval, interpret, sat);
+        // __hip_fp8_storage_t is a raw 8-bit storage type, so this stores the encoded byte value as a float
+        cpu_out.push_back(fp8);
+    }
+
+    // GPU convert
+    float *d_in, *d_out;
+    hip_check(hipMalloc(&d_in, sizeof(float) * size));
+    hip_check(hipMalloc(&d_out, sizeof(float) * size));
+
+    hip_check(hipMemcpy(d_in, in.data(), sizeof(float) * in.size(), hipMemcpyHostToDevice));
+
+    float_to_fp8_to_float<<<1, size>>>(d_in, interpret, sat, d_out, size);
+
+    std::vector<float> gpu_out(size, 0.0f);
+    hip_check(hipMemcpy(gpu_out.data(), d_out, sizeof(float) * gpu_out.size(), hipMemcpyDeviceToHost));
+
+    hip_check(hipFree(d_in));
+    hip_check(hipFree(d_out));
+
+    // Validation
+    for (size_t i = 0; i < size; i++)
+    {
+        if (cpu_out[i] != gpu_out[i])
+        {
+            std::cerr << "cpu round trip result: " << cpu_out[i]
+                      << " - gpu round trip result: " << gpu_out[i] << std::endl;
+            return EXIT_FAILURE;
+        }
+    }
+    std::cout << "...CPU and GPU round trip convert matches." << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,202 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+#include <hip/hiprtc.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#define CHECK_RET_CODE(call, ret_code)                                                             \
+{                                                                                                  \
+    if ((call) != ret_code)                                                                        \
+    {                                                                                              \
+        std::cout << "Failed in call: " << #call << std::endl;                                     \
+        std::abort();                                                                              \
+    }                                                                                              \
+}
+#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+// [sphinx-source-start]
+static constexpr const char gpu_program[] {
+R"(
+    __device__ int V1; // set from host code
+    static __global__ void f1(int *result)
+    {
+        *result = V1 + 10;
+    }
+
+    namespace N1
+    {
+        namespace N2
+        {
+            __constant__ int V2; // set from host code
+            __global__ void f2(int *result)
+            {
+                *result = V2 + 20;
+            }
+        }
+    }
+
+    template<typename T>
+    __global__ void f3(int *result)
+    {
+        *result = sizeof(T);
+    }
+)"};
+// [sphinx-source-end]
+
+int main()
+{
+    using namespace std::string_literals;
+
+    hiprtcProgram prog;
+    HIPRTC_CHECK(hiprtcCreateProgram(&prog, gpu_program, "gpu_source.cpp", 0, nullptr, nullptr));
+
+    std::vector<std::string> kernel_names;
+    std::vector<std::string> variable_names;
+    std::vector<int> initial_values;
+    std::vector<int> expected_results;
+    initial_values.emplace_back(100);
+    initial_values.emplace_back(200);
+    expected_results.emplace_back(110);
+    expected_results.emplace_back(220);
+    expected_results.emplace_back(static_cast<int>(sizeof(int)));
+
+    // [sphinx-add-expression-start]
+    kernel_names.emplace_back("&f1"s);
+    kernel_names.emplace_back("N1::N2::f2"s);
+    kernel_names.emplace_back("f3<int>"s);
+    for(auto&& name : kernel_names)
+        HIPRTC_CHECK(hiprtcAddNameExpression(prog, name.c_str()));
+
+    variable_names.emplace_back("&V1"s);
+    variable_names.emplace_back("&N1::N2::V2"s);
+    for(auto&& name : variable_names)
+        HIPRTC_CHECK(hiprtcAddNameExpression(prog, name.c_str()));
+    // [sphinx-add-expression-end]
+
+    hipDeviceProp_t props;
+    int device = 0;
+    HIP_CHECK(hipGetDeviceProperties(&props, device));
+    auto sarg = std::string{"--gpu-architecture="} + props.gcnArchName;  // device for which binary is to be generated
+
+    const char* options[] = {sarg.c_str()};
+
+    HIPRTC_CHECK(hiprtcCompileProgram(prog, 1, options));
+
+    std::size_t logSize;
+    HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
+    if (logSize)
+    {
+        std::string log(logSize, '\0');
+        HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
+        std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
+        std::abort();
+    }
+
+    std::size_t codeSize;
+    HIPRTC_CHECK(hiprtcGetCodeSize(prog, &codeSize));
+
+    std::vector<char> kernel_binary(codeSize);
+    HIPRTC_CHECK(hiprtcGetCode(prog, kernel_binary.data()));
+
+    std::vector<std::string> lowered_kernel_names;
+    std::vector<std::string> lowered_variable_names;
+    // [sphinx-get-kernel-name-start]
+    for(auto&& name : kernel_names)
+    {
+        const char* lowered_name = nullptr;
+        HIPRTC_CHECK(hiprtcGetLoweredName(prog, name.c_str(), &lowered_name));
+        lowered_kernel_names.emplace_back(lowered_name);
+    }
+    // [sphinx-get-kernel-name-end]
+    // [sphinx-get-variable-name-start]
+    for(auto&& name : variable_names)
+    {
+        const char* lowered_name = nullptr;
+        HIPRTC_CHECK(hiprtcGetLoweredName(prog, name.c_str(), &lowered_name));
+        lowered_variable_names.emplace_back(lowered_name);
+    }
+    // [sphinx-get-variable-name-end]
+
+    HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+    hipModule_t module;
+
+    HIP_CHECK(hipModuleLoadData(&module, kernel_binary.data()));
+
+    for(auto i = std::size_t{0}; i < initial_values.size(); ++i)
+    {
+        auto name = lowered_variable_names.at(i);
+        auto initial_value = initial_values.at(i);
+
+        // [sphinx-update-variable-start]
+        hipDeviceptr_t variable_addr;
+        std::size_t bytes{};
+        HIP_CHECK(hipModuleGetGlobal(&variable_addr, &bytes, module, name.c_str()));
+        HIP_CHECK(hipMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value)));
+        // [sphinx-update-variable-end]
+    }
+
+    hipDeviceptr_t d_result;
+    auto h_result = 0;
+    HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_result), sizeof(h_result)));
+    HIP_CHECK(hipMemcpyHtoD(d_result, &h_result, sizeof(h_result)));
+
+    struct
+    {
+        hipDeviceptr_t ptr;
+    } args{d_result};
+    auto args_size = sizeof(args);
+
+    void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args,
+                      HIP_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
+                      HIP_LAUNCH_PARAM_END};
+
+    for(auto i = std::size_t{0}; i < lowered_kernel_names.size(); ++i)
+    {
+        auto name = lowered_kernel_names.at(i);
+        auto expected = expected_results.at(i);
+        // [sphinx-launch-kernel-start]
+        hipFunction_t kernel;
+        HIP_CHECK(hipModuleGetFunction(&kernel, module, name.c_str()));
+        HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config));
+        // [sphinx-launch-kernel-end]
+        HIP_CHECK(hipMemcpyDtoH(&h_result, d_result, sizeof(h_result)));
+        if(expected != h_result)
+        {
+            std::cerr << "Validation failed. expected = " << expected << ", h_result = " << h_result << std::endl;
+            return EXIT_FAILURE;
+        }
+    }
+
+    std::cout << "Validation passed." << std::endl;
+
+    HIP_CHECK(hipFree(reinterpret_cast<void*>(d_result)));
+    HIP_CHECK(hipModuleUnload(module));
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,118 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cmath>
+#include <cstdint>
+#include <cstdlib>
+#include <iostream>
+#include <limits>
+#include <vector>
+
+#define HIP_CHECK(expression)                        \
+    {                                                \
+        const hipError_t err = expression;           \
+        if (err != hipSuccess)                       \
+        {                                            \
+            std::cerr << "HIP error: "               \
+                      << hipGetErrorString(err)      \
+                      << " at " << __LINE__ << "\n"; \
+            exit(EXIT_FAILURE);                      \
+        }                                            \
+    }
+
+// Simple ULP difference calculator
+int64_t ulp_diff(float a, float b)
+{
+    if (a == b)
+        return 0;
+
+    union
+    {
+        float f;
+        int32_t i;
+    } ua{a}, ub{b};
+
+    // Map negative floats onto a monotonically ordered integer scale.
+    // Subtracting from min() cannot overflow for negative i, unlike max() - i.
+    if (ua.i < 0) ua.i = std::numeric_limits<int32_t>::min() - ua.i;
+    if (ub.i < 0) ub.i = std::numeric_limits<int32_t>::min() - ub.i;
+
+    return std::abs((int64_t)ua.i - (int64_t)ub.i);
+}
+
+// Test kernel
+__global__ void test_sin(float* out, int n)
+{
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n)
+    {
+        float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
+        out[i] = sinf(x);
+    }
+}
+
+int main()
+{
+    const int n = 1000000;
+    const int blocksize = 256;
+    std::vector<float> outputs(n);
+    float* d_out;
+
+    HIP_CHECK(hipMalloc(&d_out, n * sizeof(float)));
+    dim3 threads(blocksize);
+    dim3 blocks((n + blocksize - 1) / blocksize);  // Round up so all n elements are covered
+    test_sin<<<blocks, threads>>>(d_out, n);
+    HIP_CHECK(hipPeekAtLastError());
+    HIP_CHECK(hipMemcpy(outputs.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost));
+
+    // Step 1: Find the maximum absolute error
+    double max_abs_error = 0.0;
+    float max_error_output = 0.0;
+    float max_error_expected = 0.0;
+
+    for (int i = 0; i < n; i++)
+    {
+        float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
+        float expected = std::sin(x);
+        double abs_error = std::abs(outputs[i] - expected);
+
+        if (abs_error > max_abs_error)
+        {
+            max_abs_error = abs_error;
+            max_error_output = outputs[i];
+            max_error_expected = expected;
+        }
+    }
+
+    // Step 2: Compute ULP difference based on the max absolute error pair
+    int64_t max_ulp = ulp_diff(max_error_output, max_error_expected);
+
+    // Output results
+    std::cout << "Max Absolute Error: " << max_abs_error << std::endl;
+    std::cout << "Max ULP Difference: " << max_ulp << std::endl;
+    std::cout << "Max Error Values -> Got: " << max_error_output
+              << ", Expected: " << max_error_expected << std::endl;
+
+    HIP_CHECK(hipFree(d_out));
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,109 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+// Kernel to perform some computation on allocated memory.
+__global__ void myKernel(int* data, std::size_t numElements)
+{
+    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (tid < numElements)
+    {
+        data[tid] = tid * 2;
+    }
+}
+
+int main()
+{
+    // Create a stream.
+    hipStream_t stream;
+    HIP_CHECK(hipStreamCreate(&stream));
+
+    // Create a memory pool on device 0 with explicitly set properties.
+    hipMemPoolProps poolProps = {};
+    poolProps.allocType = hipMemAllocationTypePinned;
+    poolProps.handleTypes = hipMemHandleTypePosixFileDescriptor;
+    poolProps.location.type = hipMemLocationTypeDevice;
+    poolProps.location.id = 0; // Assuming device 0.
+
+    hipMemPool_t memPool;
+    HIP_CHECK(hipMemPoolCreate(&memPool, &poolProps));
+
+    // Allocate memory from the pool asynchronously.
+    constexpr std::size_t numElements = 1024;
+    int* devData = nullptr;
+    HIP_CHECK(hipMallocFromPoolAsync(reinterpret_cast<void**>(&devData),
+                                     numElements * sizeof(*devData),
+                                     memPool,
+                                     stream));
+
+    // Define grid and block sizes.
+    dim3 blockSize(256);
+    dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
+
+    // Launch the kernel to perform computation.
+    myKernel<<<gridSize, blockSize, 0, stream>>>(devData, numElements);
+
+    // Synchronize the stream.
+    HIP_CHECK(hipStreamSynchronize(stream));
+
+    // Copy data back to host.
+    int* hostData = new int[numElements];
+    HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
+
+    // Print the array.
+    for (std::size_t i = 0; i < numElements; ++i)
+        std::cout << "Element " << i << ": " << hostData[i] << std::endl;
+
+    // Free the allocated memory.
+    HIP_CHECK(hipFreeAsync(devData, stream));
+
+    // Synchronize the stream again to ensure all operations are complete.
+    HIP_CHECK(hipStreamSynchronize(stream));
+
+    // Destroy the memory pool and stream.
+    HIP_CHECK(hipMemPoolDestroy(memPool));
+    HIP_CHECK(hipStreamDestroy(stream));
+
+    // Free host memory.
+    delete[] hostData;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,115 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+// Sample helper functions for getting the usage statistics in bulk.
+struct usageStatistics
+{
+    std::uint64_t reservedMemCurrent;
+    std::uint64_t reservedMemHigh;
+    std::uint64_t usedMemCurrent;
+    std::uint64_t usedMemHigh;
+};
+
+void getUsageStatistics(hipMemPool_t memPool, struct usageStatistics *statistics)
+{
+    HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemCurrent, &statistics->reservedMemCurrent));
+    HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &statistics->reservedMemHigh));
+    HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemCurrent, &statistics->usedMemCurrent));
+    HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &statistics->usedMemHigh));
+}
+
+// Writing 0 to a high-watermark attribute resets it to the current value.
+void resetStatistics(hipMemPool_t memPool)
+{
+    std::uint64_t value = 0;
+    HIP_CHECK(hipMemPoolSetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &value));
+    HIP_CHECK(hipMemPoolSetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &value));
+}
+
+int main()
+{
+    hipMemPool_t memPool;
+    hipDevice_t device = 0; // Specify the device index.
+
+    // Initialize the device.
+    HIP_CHECK(hipSetDevice(device));
+
+    // Get the default memory pool for the device.
+    HIP_CHECK(hipDeviceGetDefaultMemPool(&memPool, device));
+
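+    // Allocate memory from the pool (e.g., 1 MB) on the null stream.
+    std::size_t allocSize = 1 * 1024 * 1024;
+    void* ptr;
+    HIP_CHECK(hipMallocFromPoolAsync(&ptr, allocSize, memPool, nullptr));
+
+    // Return the memory to the pool and wait until the free has completed.
+    HIP_CHECK(hipFreeAsync(ptr, nullptr));
+    HIP_CHECK(hipStreamSynchronize(nullptr));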
+
+    // Trim the memory pool to a specific size (e.g., 512 KB).
+    std::size_t newSize = 512 * 1024;
+    HIP_CHECK(hipMemPoolTrimTo(memPool, newSize));
+
+    // Get and print usage statistics before resetting.
+    usageStatistics statsBefore;
+    getUsageStatistics(memPool, &statsBefore);
+    std::cout << "Before resetting statistics:" << std::endl;
+    std::cout << "Reserved Memory Current: " << statsBefore.reservedMemCurrent << " bytes" << std::endl;
+    std::cout << "Reserved Memory High: " << statsBefore.reservedMemHigh << " bytes" << std::endl;
+    std::cout << "Used Memory Current: " << statsBefore.usedMemCurrent << " bytes" << std::endl;
+    std::cout << "Used Memory High: " << statsBefore.usedMemHigh << " bytes" << std::endl;
+
+    // Reset the statistics.
+    resetStatistics(memPool);
+
+    // Get and print usage statistics after resetting.
+    usageStatistics statsAfter;
+    getUsageStatistics(memPool, &statsAfter);
+    std::cout << "After resetting statistics:" << std::endl;
+    std::cout << "Reserved Memory Current: " << statsAfter.reservedMemCurrent << " bytes" << std::endl;
+    std::cout << "Reserved Memory High: " << statsAfter.reservedMemHigh << " bytes" << std::endl;
+    std::cout << "Used Memory Current: " << statsAfter.usedMemCurrent << " bytes" << std::endl;
+    std::cout << "Used Memory High: " << statsAfter.usedMemHigh << " bytes" << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,115 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
+#include <iostream>
+#include <limits>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+// Kernel to perform some computation on allocated memory.
+__global__ void myKernel(int* data, std::size_t numElements)
+{
+    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (tid < numElements)
+    {
+        data[tid] = tid * 2;
+    }
+}
+
+int main()
+{
+    // Create a stream.
+    hipStream_t stream;
+    HIP_CHECK(hipStreamCreate(&stream));
+
+    // Create a memory pool on device 0 with explicitly set properties.
+    hipMemPoolProps poolProps = {};
+    poolProps.allocType = hipMemAllocationTypePinned;
+    poolProps.handleTypes = hipMemHandleTypePosixFileDescriptor;
+    poolProps.location.type = hipMemLocationTypeDevice;
+    poolProps.location.id = 0; // Assuming device 0.
+
+    hipMemPool_t memPool;
+    HIP_CHECK(hipMemPoolCreate(&memPool, &poolProps));
+
+    // [sphinx-start]
+    std::uint64_t threshold = std::numeric_limits<std::uint64_t>::max();
+    HIP_CHECK(hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold));
+    // [sphinx-end]
+
+    // Allocate memory from the pool asynchronously.
+    constexpr std::size_t numElements = 1024;
+    int* devData = nullptr;
+    HIP_CHECK(hipMallocFromPoolAsync(reinterpret_cast<void**>(&devData),
+                                     numElements * sizeof(*devData),
+                                     memPool,
+                                     stream));
+
+    // Define grid and block sizes.
+    dim3 blockSize(256);
+    dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
+
+    // Launch the kernel to perform computation.
+    myKernel<<<gridSize, blockSize, 0, stream>>>(devData, numElements);
+
+    // Synchronize the stream.
+    HIP_CHECK(hipStreamSynchronize(stream));
+
+    // Copy data back to host.
+    int* hostData = new int[numElements];
+    HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
+
+    // Print the array.
+    for (std::size_t i = 0; i < numElements; ++i)
+        std::cout << "Element " << i << ": " << hostData[i] << std::endl;
+
+    // Free the allocated memory.
+    HIP_CHECK(hipFreeAsync(devData, stream));
+
+    // Synchronize the stream again to ensure all operations are complete.
+    HIP_CHECK(hipStreamSynchronize(stream));
+
+    // Destroy the memory pool and stream.
+    HIP_CHECK(hipMemPoolDestroy(memPool));
+    HIP_CHECK(hipStreamDestroy(stream));
+
+    // Free host memory.
+    delete[] hostData;
+
+    return 0;
+}
@@ -0,0 +1,69 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+int main()
+{
+    hipMemPool_t memPool;
+    hipDevice_t device = 0; // Specify the device index.
+
+    // Initialize the device.
+    HIP_CHECK(hipSetDevice(device));
+
+    // Get the default memory pool for the device.
+    HIP_CHECK(hipDeviceGetDefaultMemPool(&memPool, device));
+
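+    // Allocate memory from the pool (e.g., 1 MB) on the null stream.
+    std::size_t allocSize = 1 * 1024 * 1024;
+    void* ptr;
+    HIP_CHECK(hipMallocFromPoolAsync(&ptr, allocSize, memPool, nullptr));
+
+    // Return the memory to the pool and wait until the free has completed.
+    HIP_CHECK(hipFreeAsync(ptr, nullptr));
+    HIP_CHECK(hipStreamSynchronize(nullptr));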
+
+    // Trim the memory pool to a specific size (e.g., 512 KB).
+    std::size_t newSize = 512 * 1024;
+    HIP_CHECK(hipMemPoolTrimTo(memPool, newSize));
+
+    std::cout << "Memory pool trimmed to " << newSize << " bytes." << std::endl;
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,90 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+int main()
+{
+    int *a, *b, *c;
+    unsigned int attributeValue;
+    constexpr std::size_t attributeSize = sizeof(attributeValue);
+
+    int deviceId;
+    HIP_CHECK(hipGetDevice(&deviceId));
+
+    // Allocate memory for a, b and c that is accessible to both device and host code.
+    HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
+    HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
+    HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
+
+    // Setup input values.
+    *a = 1;
+    *b = 2;
+
+    HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(a, b, c);
+
+    // Wait for GPU to finish before accessing on host.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Query an attribute of the memory range.
+    HIP_CHECK(hipMemRangeGetAttribute(&attributeValue,
+                            attributeSize,
+                            hipMemRangeAttributeReadMostly,
+                            a,
+                            sizeof(*a)));
+
+    // Print the result.
+    std::cout << *a << " + " << *b << " = " << *c << std::endl;
+    std::cout << "The value a is" << (attributeValue == 1 ? "" : " NOT") << " marked as read-mostly (hipMemRangeAttributeReadMostly)" << std::endl;
+
+    // Cleanup allocated memory.
+    HIP_CHECK(hipFree(a));
+    HIP_CHECK(hipFree(b));
+    HIP_CHECK(hipFree(c));
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,136 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+__global__ void simpleKernel(double *data, std::size_t elems)
+{
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if(idx < elems)
+        data[idx] = idx * 2.0;
+}
+
+int main()
+{
+    int numDevices;
+    HIP_CHECK(hipGetDeviceCount(&numDevices));
+
+    if (numDevices < 2)
+    {
+        std::cout << "This example requires at least two HIP devices." << std::endl;
+        return EXIT_SUCCESS;
+    }
+
+    double *deviceData0;
+    double *deviceData1;
+    constexpr std::size_t elems = 1024;
+    constexpr std::size_t size = elems * sizeof(double);
+
+    // Create streams and events for each device
+    hipStream_t stream0, stream1;
+    hipEvent_t startEvent0, stopEvent0, startEvent1, stopEvent1;
+
+    // Initialize device 0
+    HIP_CHECK(hipSetDevice(0));
+    HIP_CHECK(hipStreamCreate(&stream0));
+    HIP_CHECK(hipEventCreate(&startEvent0));
+    HIP_CHECK(hipEventCreate(&stopEvent0));
+    HIP_CHECK(hipMalloc(&deviceData0, size));
+
+    // Initialize device 1
+    HIP_CHECK(hipSetDevice(1));
+    HIP_CHECK(hipStreamCreate(&stream1));
+    HIP_CHECK(hipEventCreate(&startEvent1));
+    HIP_CHECK(hipEventCreate(&stopEvent1));
+    HIP_CHECK(hipMalloc(&deviceData1, size));
+
+    // Record the start event on device 0
+    HIP_CHECK(hipSetDevice(0));
+    HIP_CHECK(hipEventRecord(startEvent0, stream0));
+
+    // Launch the kernel asynchronously on device 0
+    simpleKernel<<<8, 128, 0, stream0>>>(deviceData0, elems);
+
+    // Record the stop event on device 0
+    HIP_CHECK(hipEventRecord(stopEvent0, stream0));
+
+    // Wait for the stop event on device 0 to complete
+    HIP_CHECK(hipEventSynchronize(stopEvent0));
+
+    // Record the start event on device 1
+    HIP_CHECK(hipSetDevice(1));
+    HIP_CHECK(hipEventRecord(startEvent1, stream1));
+
+    // Launch the kernel asynchronously on device 1
+    simpleKernel<<<8, 128, 0, stream1>>>(deviceData1, elems);
+
+    // Record the stop event on device 1
+    HIP_CHECK(hipEventRecord(stopEvent1, stream1));
+
+    // Wait for the stop event on device 1 to complete
+    HIP_CHECK(hipEventSynchronize(stopEvent1));
+
+    // Calculate elapsed time between the events for both devices
+    float milliseconds0 = 0, milliseconds1 = 0;
+    HIP_CHECK(hipEventElapsedTime(&milliseconds0, startEvent0, stopEvent0));
+    HIP_CHECK(hipEventElapsedTime(&milliseconds1, startEvent1, stopEvent1));
+
+    std::cout << "Elapsed time on GPU 0: " << milliseconds0 << " ms" << std::endl;
+    std::cout << "Elapsed time on GPU 1: " << milliseconds1 << " ms" << std::endl;
+
+    // Cleanup for device 0
+    HIP_CHECK(hipSetDevice(0));
+    HIP_CHECK(hipEventDestroy(startEvent0));
+    HIP_CHECK(hipEventDestroy(stopEvent0));
+    HIP_CHECK(hipStreamSynchronize(stream0));
+    HIP_CHECK(hipStreamDestroy(stream0));
+    HIP_CHECK(hipFree(deviceData0));
+
+    // Cleanup for device 1
+    HIP_CHECK(hipSetDevice(1));
+    HIP_CHECK(hipEventDestroy(startEvent1));
+    HIP_CHECK(hipEventDestroy(stopEvent1));
+    HIP_CHECK(hipStreamSynchronize(stream1));
+    HIP_CHECK(hipStreamDestroy(stream1));
+    HIP_CHECK(hipFree(deviceData1));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,81 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+// Kernel to perform some computation on allocated memory.
+__global__ void myKernel(int* data, std::size_t numElements)
+{
+    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (tid < numElements)
+    {
+        data[tid] = tid * 2;
+    }
+}
+
+int main()
+{
+    // Allocate memory.
+    constexpr std::size_t numElements = 1024;
+    int* devData;
+    HIP_CHECK(hipMalloc(&devData, numElements * sizeof(*devData)));
+
+    // Launch the kernel to perform computation.
+    dim3 blockSize(256);
+    dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
+    myKernel<<<gridSize, blockSize>>>(devData, numElements);
+
+    // Copy data back to host.
+    int* hostData = new int[numElements];
+    HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
+
+    // Print the array.
+    for (std::size_t i = 0; i < numElements; ++i)
+        std::cout << "Element " << i << ": " << hostData[i] << std::endl;
+
+    // Synchronize to ensure completion.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Free memory.
+    HIP_CHECK(hipFree(devData));
+    delete[] hostData;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,114 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+__global__ void simpleKernel(double *data, std::size_t elems)
+{
+    const std::size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if(idx < elems)
+        data[idx] = idx * 2.0;
+}
+
+int main()
+{
+    int deviceCount;
+    HIP_CHECK(hipGetDeviceCount(&deviceCount));
+    if(deviceCount < 2)
+    {
+        std::cout << "This example requires at least two HIP devices." << std::endl;
+        return EXIT_SUCCESS;
+    }
+
+    double* deviceData0;
+    double* deviceData1;
+    constexpr std::size_t elems = 1024;
+    constexpr std::size_t size = elems * sizeof(double);
+
+    int deviceId0 = 0;
+    int deviceId1 = 1;
+
+    // Enable peer access to the memory (allocated and future) on the peer device.
+    // Ensure the device is active before enabling peer access.
+    HIP_CHECK(hipSetDevice(deviceId0));
+    HIP_CHECK(hipDeviceEnablePeerAccess(deviceId1, 0));
+
+    HIP_CHECK(hipSetDevice(deviceId1));
+    HIP_CHECK(hipDeviceEnablePeerAccess(deviceId0, 0));
+
+    // Set device 0 and perform operations
+    HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
+    HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
+    simpleKernel<<<8, 128>>>(deviceData0, elems); // Launch kernel on device 0
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Set device 1 and perform operations
+    HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
+    HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
+    simpleKernel<<<8, 128>>>(deviceData1, elems); // Launch kernel on device 1
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Use peer-to-peer access
+    HIP_CHECK(hipSetDevice(deviceId0));
+
+    // Now device 0 can access memory allocated on device 1
+    HIP_CHECK(hipMemcpy(deviceData0, deviceData1, size, hipMemcpyDeviceToDevice));
+
+    // Copy result from device 0
+    double hostData0[elems];
+    HIP_CHECK(hipSetDevice(deviceId0));
+    HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
+
+    // Copy result from device 1
+    double hostData1[elems];
+    HIP_CHECK(hipSetDevice(deviceId1));
+    HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
+
+    // Display results from both devices
+    std::cout << "Device 0 data: " << hostData0[0] << std::endl;
+    std::cout << "Device 1 data: " << hostData1[0] << std::endl;
+
+    // Free device memory
+    HIP_CHECK(hipFree(deviceData0));
+    HIP_CHECK(hipFree(deviceData1));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,104 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+__global__ void simpleKernel(double *data, std::size_t elems)
+{
+    const std::size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if(idx < elems)
+        data[idx] = idx * 2.0;
+}
+
+int main()
+{
+    int deviceCount;
+    HIP_CHECK(hipGetDeviceCount(&deviceCount));
+    if(deviceCount < 2)
+    {
+        std::cout << "This example requires at least two HIP devices." << std::endl;
+        return EXIT_SUCCESS;
+    }
+
+    double* deviceData0;
+    double* deviceData1;
+    constexpr std::size_t elems = 1024;
+    constexpr std::size_t size = elems * sizeof(double);
+
+    int deviceId0 = 0;
+    int deviceId1 = 1;
+
+    // Set device 0 and perform operations
+    HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
+    HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
+    simpleKernel<<<8, 128>>>(deviceData0, elems); // Launch kernel on device 0
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Set device 1 and perform operations
+    HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
+    HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
+    simpleKernel<<<8, 128>>>(deviceData1, elems); // Launch kernel on device 1
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Use deviceData0 on device 1. This works but incurs a performance penalty.
+    HIP_CHECK(hipSetDevice(deviceId1));
+    HIP_CHECK(hipMemcpy(deviceData1, deviceData0, size, hipMemcpyDeviceToDevice));
+
+    // Copy result from device 0
+    double hostData0[elems];
+    HIP_CHECK(hipSetDevice(deviceId0));
+    HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
+
+    // Copy result from device 1
+    double hostData1[elems];
+    HIP_CHECK(hipSetDevice(deviceId1));
+    HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
+
+    // Display results from both devices
+    std::cout << "Device 0 data: " << hostData0[0] << std::endl;
+    std::cout << "Device 1 data: " << hostData1[0] << std::endl;
+
+    // Free device memory
+    HIP_CHECK(hipFree(deviceData0));
+    HIP_CHECK(hipFree(deviceData1));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,80 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstring>
+#include <iostream>
+
+#define HIP_CHECK(expression)                  \
+{                                              \
+    const hipError_t status = expression;      \
+    if(status != hipSuccess)                   \
+    {                                          \
+        std::cerr << "HIP error "              \
+                  << status << ": "            \
+                  << hipGetErrorString(status) \
+                  << " at " << __FILE__ << ":" \
+                  << __LINE__ << std::endl;    \
+    }                                          \
+}
+
+int main()
+{
+    const int element_number = 100;
+
+    int *host_input, *host_output;
+    // Host allocation
+    host_input  = new int[element_number];
+    host_output = new int[element_number];
+
+    // Host data preparation
+    for (int i = 0; i < element_number; i++) {
+        host_input[i] = i;
+    }
+    std::memset(host_output, 0, element_number * sizeof(int));
+
+    int *device_input, *device_output;
+
+    // Device allocation
+    HIP_CHECK(hipMalloc(&device_input,  element_number * sizeof(int)));
+    HIP_CHECK(hipMalloc(&device_output, element_number * sizeof(int)));
+
+    // Device data preparation
+    HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
+
+    // Run the kernel
+    // ...
+
+    HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
+
+    // Free host memory
+    delete[] host_input;
+    delete[] host_output;
+
+    // Free device memory
+    HIP_CHECK(hipFree(device_input));
+    HIP_CHECK(hipFree(device_output));
+}
+// [sphinx-end]
@@ -0,0 +1,78 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+int main()
+{
+    // Initialize the HIP runtime
+    if (auto err = hipInit(0); err != hipSuccess)
+    {
+        std::cerr << "Failed to initialize HIP runtime." << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    // Get the per-thread default stream
+    hipStream_t stream = hipStreamPerThread;
+
+    // Use the stream for some operation
+    // For example, allocate memory on the device
+    void* d_ptr;
+    std::size_t size = 1024;
+    if (auto err = hipMalloc(&d_ptr, size); err != hipSuccess)
+    {
+        std::cerr << "Failed to allocate memory." << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    // Perform some operation using the stream
+    // For example, set memory on the device
+    if (auto err = hipMemsetAsync(d_ptr, 0, size, stream); err != hipSuccess)
+    {
+        std::cerr << "Failed to set memory." << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    // Synchronize the stream
+    if (auto err = hipStreamSynchronize(stream); err != hipSuccess)
+    {
+        std::cerr << "Failed to synchronize stream." << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    // Free the allocated memory
+    if (auto err = hipFree(d_ptr); err != hipSuccess)
+    {
+        std::cerr << "Failed to free memory." << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    std::cout << "Operation completed successfully using per-thread default stream." << std::endl;
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,81 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstring>
+#include <iostream>
+
+#define HIP_CHECK(expression)                  \
+{                                              \
+    const hipError_t status = expression;      \
+    if(status != hipSuccess)                   \
+    {                                          \
+        std::cerr << "HIP error "              \
+                  << status << ": "            \
+                  << hipGetErrorString(status) \
+                  << " at " << __FILE__ << ":" \
+                  << __LINE__ << std::endl;    \
+    }                                          \
+}
+
+int main()
+{
+    const int element_number = 100;
+
+    int *host_input, *host_output;
+    // Host allocation
+    HIP_CHECK(hipHostMalloc(&host_input, element_number * sizeof(int)));
+    HIP_CHECK(hipHostMalloc(&host_output, element_number * sizeof(int)));
+
+    // Host data preparation
+    for (int i = 0; i < element_number; i++)
+    {
+        host_input[i] = i;
+    }
+    std::memset(host_output, 0, element_number * sizeof(int));
+
+    int *device_input, *device_output;
+
+    // Device allocation
+    HIP_CHECK(hipMalloc(&device_input,  element_number * sizeof(int)));
+    HIP_CHECK(hipMalloc(&device_output, element_number * sizeof(int)));
+
+    // Device data preparation
+    HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
+    HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
+
+    // Run the kernel
+    // ...
+
+    HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
+
+    // Free host memory
+    HIP_CHECK(hipFreeHost(host_input));
+    HIP_CHECK(hipFreeHost(host_output));
+
+    // Free device memory
+    HIP_CHECK(hipFree(device_input));
+    HIP_CHECK(hipFree(device_output));
+}
+// [sphinx-end]
@@ -0,0 +1,61 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                                \
+{                                                            \
+    const hipError_t err = expression;                       \
+    if (err != hipSuccess)                                   \
+    {                                                        \
+        std::cerr << "HIP Error: " << hipGetErrorString(err) \
+              << " at line " << __LINE__ << std::endl;       \
+        std::exit(EXIT_FAILURE);                             \
+    }                                                        \
+}
+
+int main()
+{
+    // [sphinx-start]
+    double * ptr;
+    HIP_CHECK(hipMalloc(&ptr, sizeof(double)));
+    hipPointerAttribute_t attr;
+    HIP_CHECK(hipPointerGetAttributes(&attr, ptr)); /*attr.type is hipMemoryTypeDevice*/
+    if(attr.type == hipMemoryTypeDevice)
+        std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
+
+    double* ptrHost;
+    HIP_CHECK(hipHostMalloc(&ptrHost, sizeof(double)));
+    hipPointerAttribute_t attrHost;
+    HIP_CHECK(hipPointerGetAttributes(&attrHost, ptrHost)); /*attrHost.type is hipMemoryTypeHost*/
+    if(attrHost.type == hipMemoryTypeHost)
+        std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
+    // [sphinx-end]
+
+    HIP_CHECK(hipFreeHost(ptrHost));
+    HIP_CHECK(hipFree(ptr));
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,79 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+#include <hip/hiprtc.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#define CHECK_RET_CODE(call, ret_code)                                                             \
+{                                                                                                  \
+    if ((call) != ret_code)                                                                        \
+    {                                                                                              \
+        std::cout << "Failed in call: " << #call << std::endl;                                     \
+        std::abort();                                                                              \
+    }                                                                                              \
+}
+#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+int main()
+{
+    const char* kernel_source = "adafsfgadascvsfgsadfbdt"; // Presumably invalid on purpose, so that compilation fails
+    hiprtcProgram prog;
+    auto rtc_ret_code = hiprtcCreateProgram(&prog,            // HIPRTC program handle
+                                            kernel_source,    // kernel source string
+                                            "vector_add.cpp", // Name of the file
+                                            0,                // Number of headers
+                                            nullptr,          // Header sources
+                                            nullptr);         // Name of header file
+
+    if (rtc_ret_code != HIPRTC_SUCCESS)
+    {
+        std::cerr << "Failed to create program" << std::endl;
+        std::abort();
+    }
+
+    hipDeviceProp_t props;
+    int device = 0;
+    HIP_CHECK(hipGetDeviceProperties(&props, device));
+    auto sarg = std::string{"--gpu-architecture="} + props.gcnArchName;  // device for which binary is to be generated
+
+    const char* opts[] = {sarg.c_str()};
+
+    // [sphinx-start]
+    hiprtcResult result;
+    result = hiprtcCompileProgram(prog, 1, opts);
+    if (result != HIPRTC_SUCCESS)
+    {
+        std::cout << "hiprtcCompileProgram failed with error " << hiprtcGetErrorString(result) << std::endl;
+    }
+    // [sphinx-end]
+
+    HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,131 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+        std::cerr << "HIP error "            \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+// GPU Kernels
+__global__ void kernelA(double* arrayA, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayA[x] += 1.0;
+    }
+}
+
+__global__ void kernelB(double* arrayA, double* arrayB, std::size_t size)
+{
+    const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
+    if(x < size)
+    {
+        arrayB[x] += arrayA[x] + 3.0;
+    }
+}
+
+int main()
+{
+    constexpr int numOfBlocks = 1 << 20;
+    constexpr int threadsPerBlock = 1024;
+    constexpr int numberOfIterations = 50;
+    // Keep the array size small so that the memory copies do not dwarf the relatively short kernel launches
+    constexpr std::size_t arraySize = 1U << 25;
+    double *d_dataA;
+    double *d_dataB;
+
+    double initValueA = 0.0;
+    double initValueB = 2.0;
+
+    std::vector<double> vectorA(arraySize, initValueA);
+    std::vector<double> vectorB(arraySize, initValueB);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
+    HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
+    for(int iteration = 0; iteration < numberOfIterations; iteration++)
+    {
+        // Host to Device copies
+        HIP_CHECK(hipMemcpy(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice));
+        HIP_CHECK(hipMemcpy(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice));
+        // Launch the GPU kernels
+        kernelA<<<numOfBlocks, threadsPerBlock>>>(d_dataA, arraySize);
+        kernelB<<<numOfBlocks, threadsPerBlock>>>(d_dataA, d_dataB, arraySize);
+        // Device to Host copies
+        HIP_CHECK(hipMemcpy(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost));
+        HIP_CHECK(hipMemcpy(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost));
+    }
+    // Wait for all operations to complete
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Verify results
+    const double expectedA = (double)numberOfIterations;
+    const double expectedB = initValueB + (3.0 * numberOfIterations) + (expectedA * (expectedA + 1.0)) / 2.0;
+    bool passed = true;
+    for(std::size_t i = 0; i < arraySize; ++i)
+    {
+        if(vectorA[i] != expectedA)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
+            break;
+        }
+        if(vectorB[i] != expectedB)
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << expectedB << " got " <<  vectorB[i] << " at index: " << i << std::endl;
+            break;
+        }
+    }
+
+    if(passed)
+    {
+        std::cout << "Sequential execution completed successfully." << std::endl;
+    }
+    else
+    {
+        std::cerr << "Sequential execution failed." << std::endl;
+    }
+
+    // Cleanup
+    HIP_CHECK(hipFree(d_dataA));
+    HIP_CHECK(hipFree(d_dataB));
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -0,0 +1,47 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+// [sphinx-start]
+__constant__ int const_array[8];
+
+void set_constant_memory()
+{
+    int host_data[8] {1,2,3,4,5,6,7,8};
+
+    if(auto err = hipMemcpyToSymbol(const_array, host_data, sizeof(int) * 8); err != hipSuccess)
+        std::cerr << "HIP error " << err << ": " << hipGetErrorString(err) << std::endl;
+
+    // call kernel that accesses const_array
+}
+// [sphinx-end]
+
+int main()
+{
+    set_constant_memory();
+    std::cout << "Success!" << std::endl;
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,42 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <iostream>
+
+int main()
+{
+    int deviceCount;
+    if (hipGetDeviceCount(&deviceCount) == hipSuccess)
+    {
+        for (int i = 0; i < deviceCount; ++i)
+        {
+            hipDeviceProp_t prop;
+            if (hipGetDeviceProperties(&prop, i) == hipSuccess)
+                std::cout << "Device " << i << ": " << prop.name << std::endl;
+        }
+    }
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,73 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+// This example requires HMM support, and the environment variable HSA_XNACK must be set to 1.
+int main()
+{
+    // Allocate memory for a, b, and c.
+    int *a = new int[1];
+    int *b = new int[1];
+    int *c = new int[1];
+
+    // Setup input values.
+    *a = 1;
+    *b = 2;
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(a, b, c);
+
+    // Wait for GPU to finish before accessing on host.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Print the result.
+    std::cout << *a << " + " << *b << " = " << *c << std::endl;
+
+    // Cleanup allocated memory.
+    delete[] c;
+    delete[] b;
+    delete[] a;
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,46 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "example_utils.hpp"
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+// [sphinx-start]
+__global__ void kernel()
+{
+    __shared__ int array[128];
+    __shared__ double result;
+}
+// [sphinx-end]
+
+int main()
+{
+    kernel<<<64, 512>>>();
+    HIP_CHECK(hipPeekAtLastError());
+    HIP_CHECK(hipDeviceSynchronize());
+
+    std::cout << "Success!" << std::endl;
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,65 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+// Declare a, b and c as managed variables with static storage, accessible from host and device.
+__managed__ int a, b, c;
+
+int main()
+{
+    // Setup input values.
+    a = 1;
+    b = 2;
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(&a, &b, &c);
+
+    // Wait for GPU to finish before accessing on host.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Print the result.
+    std::cout << a << " + " << b << " = " << c << std::endl;
+
+    return 0;
+}
+// [sphinx-end]
@@ -0,0 +1,85 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                        \
+{                                                    \
+    const hipError_t status = expression;            \
+    if (status != hipSuccess)                        \
+    {                                                \
+        std::cerr << "HIP error " << status          \
+                << ": " << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":"         \
+                << __LINE__ << std::endl;            \
+        std::exit(EXIT_FAILURE);                     \
+    }                                                \
+}
+
+// Kernel to perform some computation on allocated memory.
+__global__ void myKernel(int* data, std::size_t numElements)
+{
+    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (static_cast<std::size_t>(tid) < numElements)
+    {
+        data[tid] = tid * 2;
+    }
+}
+
+int main()
+{
+    // Stream 0.
+    constexpr hipStream_t streamId = 0;
+
+    // Allocate memory with stream ordered semantics.
+    constexpr std::size_t numElements = 1024;
+    int* devData;
+    HIP_CHECK(hipMallocAsync(reinterpret_cast<void**>(&devData), numElements * sizeof(*devData), streamId));
+
+    // Launch the kernel to perform computation.
+    dim3 blockSize(256);
+    dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
+    myKernel<<<gridSize, blockSize, 0, streamId>>>(devData, numElements);
+
+    // Copy data back to host.
+    int* hostData = new int[numElements];
+    HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
+
+    // Print the array.
+    for (std::size_t i = 0; i < numElements; ++i)
+        std::cout << "Element " << i << ": " << hostData[i] << std::endl;
+
+    // Free memory with stream ordered semantics.
+    HIP_CHECK(hipFreeAsync(devData, streamId));
+    delete[] hostData;
+
+    // Synchronize to ensure completion.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    return EXIT_SUCCESS;
+}
+// [sphinx-end]
@@ -20,16 +20,23 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.

+#include "popcount.hpp"
+
 #include <hip/hip_runtime.h>
-#include <type_traits>
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
 #include <iostream>
-#include <vector>
 #include <random>
+#include <vector>
+#include <type_traits>

 #define HIP_CHECK(expression)                \
 {                                            \
    const hipError_t status = expression;    \
-    if(status != hipSuccess){                \
+    if(status != hipSuccess)                 \
+    {                                        \
            std::cerr << "HIP error "        \
                << status << ": "            \
                << hipGetErrorString(status) \
@@ -39,169 +46,185 @@
 }

 // [Sphinx template warp size block reduction kernel start]
-template<uint32_t WarpSize>
-using lane_mask_t = typename std::conditional<WarpSize == 32, uint32_t, uint64_t>::type;
+template<std::uint32_t WarpSize>
+using lane_mask_t = typename std::conditional<WarpSize == 32, std::uint32_t, std::uint64_t>::type;

-template<uint32_t WarpSize>
-__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size) {
-  extern __shared__ int shared[];
+template<std::uint32_t WarpSize>
+__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, std::size_t size)
+{
+    extern __shared__ int shared[];

-  // Read of input with bounds check
-  auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
-  {
-    lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
-    return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
-  };
+    // Read of input with bounds check
+    auto read_global_safe = [&](const std::uint32_t i, const std::uint32_t lane_id, const std::uint32_t mask_id)
+    {
+        lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
+        return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
+    };

-  const uint32_t tid = threadIdx.x,
-                 lid = threadIdx.x % WarpSize,
-                 wid = threadIdx.x / WarpSize,
-                 bid = blockIdx.x,
-                 gid = bid * blockDim.x + tid;
+    const std::uint32_t tid = threadIdx.x,
+                        lid = threadIdx.x % WarpSize,
+                        wid = threadIdx.x / WarpSize,
+                        bid = blockIdx.x,
+                        gid = bid * blockDim.x + tid;

-  // Read input buffer to shared
-  shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
-  __syncthreads();
-
-  // Shared reduction
-  for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
-  {
-    if (tid < i)
-      shared[tid] = shared[tid] + shared[tid + i];
+    // Read input buffer to shared
+    shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
    __syncthreads();
-  }

-  // Use local variable in warp reduction  
-  int result =  shared[tid];
-  __syncthreads();
+    // Shared reduction
+    for (std::uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
+    {
+        if (tid < i)
+            shared[tid] = shared[tid] + shared[tid + i];
+        __syncthreads();
+    }

-  // This loop would be unrolled the same with the runtime warpSize.
-  #pragma unroll
-  for (uint32_t i = WarpSize/2; i >= 1; i /= 2) {
-    result = result + __shfl_down(result, i);
-  }
+    // Use local variable in warp reduction
+    int result = shared[tid];
+    __syncthreads();

-  // Write result to output buffer
-  if (tid == 0)
-    output[bid] = result;
-};
+    // This loop would be unrolled the same with the runtime warpSize.
+    #pragma unroll
+    for (std::uint32_t i = WarpSize/2; i >= 1; i /= 2)
+    {
+        result = result + __shfl_down(result, i);
+    }
+
+    // Write result to output buffer
+    if (tid == 0)
+        output[bid] = result;
+}
 // [Sphinx template warp size block reduction kernel end]

 // [Sphinx template warp size mask generation start]
-template<uint32_t WarpSize>
+template<std::uint32_t WarpSize>
 void generate_and_copy_mask(
-  void *d_mask, 
-  std::vector<int>& vectorExpected, 
-  int numOfBlocks,
-  int numberOfWarp,
-  int mask_size,
-  int mask_element_size) {
-  
-  std::random_device rd;
-  std::mt19937_64 eng(rd());
+    void *d_mask, 
+    std::vector<int>& vectorExpected, 
+    int numOfBlocks,
+    int numberOfWarp,
+    int mask_size,
+    int mask_element_size)
+{
+    std::random_device rd;
+    std::mt19937_64 eng(rd());

-  // Host side mask vector
-  std::vector<lane_mask_t<WarpSize>> mask(mask_size);
-  // Define uniform unsigned int distribution
-  std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
-  // Fill up the mask 
-  for(int i=0; i < numOfBlocks; i++) {
-    int count = 0;
-    for(int j=0; j < numberOfWarp; j++) {
-      int mask_index = i * numberOfWarp + j;
-      mask[mask_index] = distr(eng);
-      if constexpr(WarpSize == 32)
-        count += __builtin_popcount(mask[mask_index]);
-      else
-        count += __builtin_popcountll(mask[mask_index]);
+    // Host side mask vector
+    std::vector<lane_mask_t<WarpSize>> mask(mask_size);
+    // Define uniform unsigned int distribution
+    std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
+    // Fill up the mask 
+    for(int i=0; i < numOfBlocks; i++)
+    {
+        int count = 0;
+        for(int j=0; j < numberOfWarp; j++)
+        {
+            int mask_index = i * numberOfWarp + j;
+            mask[mask_index] = distr(eng);
+            if constexpr(WarpSize == 32)
+                count += popcount(static_cast<std::uint32_t>(mask[mask_index]));
+            else
+                count += popcount(mask[mask_index]);
+        }
+        vectorExpected[i]= count;
    }
-    vectorExpected[i]= count;
-  }

-  // Copy the mask array
-  HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
+    // Copy the mask array
+    HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
 }
 // [Sphinx template warp size mask generation end]

-int main() {
+int main()
+{
+    int deviceId = 0;
+    int warpSizeHost;
+    HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
+    std::cout << "Warp size: " << warpSizeHost << std::endl;

-  int deviceId = 0;
-  int warpSizeHost;
-  HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
-  std::cout << "Warp size: " << warpSizeHost << std::endl;
+    constexpr int numOfBlocks = 16;
+    constexpr int threadsPerBlock = 1024;
+    const int numberOfWarp = threadsPerBlock / warpSizeHost;
+    const int mask_element_size = warpSizeHost == 32 ? sizeof(std::uint32_t) : sizeof(std::uint64_t);
+    const int mask_size = numOfBlocks * numberOfWarp;
+    constexpr std::size_t arraySize = numOfBlocks * threadsPerBlock;

-  constexpr int numOfBlocks = 16;
-  constexpr int threadsPerBlock = 1024;
-  const int numberOfWarp = threadsPerBlock / warpSizeHost;
-  const int mask_element_size = warpSizeHost == 32 ? sizeof(uint32_t) : sizeof(uint64_t);
-  const int mask_size = numOfBlocks * numberOfWarp;
-  constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
-
-  int *d_data, *d_results;
-  void *d_mask;
-  int initValue = 1;
-  std::vector<int> vectorInput(arraySize, initValue);
-  std::vector<int> vectorOutput(numOfBlocks);
-  std::vector<int> vectorExpected(numOfBlocks);
-  // Allocate device memory
-  HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
-  HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
-  HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
-  // Host to Device copy of the input array
-  HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
+    int *d_data, *d_results;
+    void *d_mask;
+    int initValue = 1;
+    std::vector<int> vectorInput(arraySize, initValue);
+    std::vector<int> vectorOutput(numOfBlocks);
+    std::vector<int> vectorExpected(numOfBlocks);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
+    HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
+    HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
+    // Host to Device copy of the input array
+    HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
  
-  // [Sphinx template warp size select kernel start]
-  // Fill up the mask variable, copy to device and select the right kernel.
-  if(warpSizeHost == 32) {
-    // Generate and copy mask arrays
-    generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
+    // [Sphinx template warp size select kernel start]
+    // Fill up the mask variable, copy to device and select the right kernel.
+    if(warpSizeHost == 32)
+    {
+        // Generate and copy mask arrays
+        generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);

-    // Start the kernel
-    block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
-      d_data,
-      static_cast<uint32_t*>(d_mask),
-      d_results,
-      arraySize);
-  } else if(warpSizeHost == 64) {
-    // Generate and copy mask arrays
-    generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
-
-    // Start the kernel
-    block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
-      d_data,
-      static_cast<uint64_t*>(d_mask),
-      d_results,
-      arraySize);
-  } else {
-    std::cerr << "Unsupported warp size." << std::endl;
-    return 0;
-  }
-  // [Sphinx template warp size select kernel end]
-
-  // Check the kernel launch
-  HIP_CHECK(hipGetLastError());
-  // Check for kernel execution error
-  HIP_CHECK(hipDeviceSynchronize());
-  // Device to Host copy of the result
-  HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
-
-  // Verify results
-  bool passed = true;
-  for(size_t i = 0; i < numOfBlocks; ++i) {
-    if(vectorOutput[i] != vectorExpected[i]) {
-      passed = false;
-      std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+        // Start the kernel
+        block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
+            d_data,
+            static_cast<std::uint32_t*>(d_mask),
+            d_results,
+            arraySize);
    }
-  }
-  if(passed){
-    std::cout << "Execution completed successfully." << std::endl;
-  }else{
-    std::cerr << "Execution failed." << std::endl;
-  }
+    else if(warpSizeHost == 64)
+    {
+        // Generate and copy mask arrays
+        generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);

-  // Cleanup
-  HIP_CHECK(hipFree(d_data));
-  HIP_CHECK(hipFree(d_mask));
-  HIP_CHECK(hipFree(d_results));
-  return 0;
-}
+        // Start the kernel
+        block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
+            d_data,
+            static_cast<std::uint64_t*>(d_mask),
+            d_results,
+            arraySize);
+    }
+    else
+    {
+        std::cerr << "Unsupported warp size." << std::endl;
+        return EXIT_FAILURE;
+    }
+    // [Sphinx template warp size select kernel end]
+
+    // Check the kernel launch
+    HIP_CHECK(hipGetLastError());
+    // Check for kernel execution error
+    HIP_CHECK(hipDeviceSynchronize());
+    // Device to Host copy of the result
+    HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
+
+    // Verify results
+    bool passed = true;
+    for(std::size_t i = 0; i < numOfBlocks; ++i)
+    {
+        if(vectorOutput[i] != vectorExpected[i])
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << vectorExpected[i]
+                                             << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+        }
+    }
+
+    if(passed)
+    {
+        std::cout << "Execution completed successfully." << std::endl;
+    }
+    else
+    {
+        std::cerr << "Execution failed." << std::endl;
+    }
+
+    // Cleanup
+    HIP_CHECK(hipFree(d_data));
+    HIP_CHECK(hipFree(d_mask));
+    HIP_CHECK(hipFree(d_results));
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,66 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+
+#include <cstdlib>
+#include <iostream>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess)                 \
+    {                                        \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+// [sphinx-kernel-start]
+__global__ void kernel()
+{
+    long long int start = clock64();
+    // kernel code
+    long long int stop = clock64();
+    long long int cycles = stop - start;
+}
+// [sphinx-kernel-end]
+
+int main()
+{
+    int deviceId = 0;
+
+    // [sphinx-query-start]
+    int wallClkRate = 0; // in kilohertz
+    HIP_CHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
+    // [sphinx-query-end]
+
+    kernel<<<dim3{1, 1, 1}, dim3{32,1,1}>>>();
+    HIP_CHECK(hipDeviceSynchronize());
+
+    std::cout << "Device's wall clock rate is " << wallClkRate << " kHz." << std::endl;
+
+    return EXIT_SUCCESS;
+}
@@ -0,0 +1,89 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+// [sphinx-start]
+#include <hip/hip_runtime.h>
+#include <iostream>
+
+#define HIP_CHECK(expression)              \
+{                                          \
+    const hipError_t err = expression;     \
+    if(err != hipSuccess)                  \
+    {                                      \
+        std::cerr << "HIP error: "         \
+            << hipGetErrorString(err)      \
+            << " at " << __LINE__ << "\n"; \
+    }                                      \
+}
+
+// Addition of two values.
+__global__ void add(int *a, int *b, int *c)
+{
+    *c = *a + *b;
+}
+
+int main()
+{
+    int deviceId;
+    HIP_CHECK(hipGetDevice(&deviceId));
+    int *a, *b, *c;
+
+    // Allocate memory for a, b, and c accessible to both device and host codes.
+    HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
+    HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
+    HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
+
+    // Set memory advice for a and b to be read, located on and accessed by the GPU.
+    HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetPreferredLocation, deviceId));
+    HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetAccessedBy, deviceId));
+    HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
+
+    HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetPreferredLocation, deviceId));
+    HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetAccessedBy, deviceId));
+    HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetReadMostly, deviceId));
+
+    // Set memory advice for c to be read, located on and accessed by the CPU.
+    HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetPreferredLocation, hipCpuDeviceId));
+    HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetAccessedBy, hipCpuDeviceId));
+    HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetReadMostly, hipCpuDeviceId));
+
+    // Setup input values.
+    *a = 1;
+    *b = 2;
+
+    // Launch add() kernel on GPU.
+    add<<<1, 1>>>(a, b, c);
+
+    // Wait for GPU to finish before accessing on host.
+    HIP_CHECK(hipDeviceSynchronize());
+
+    // Print the result.
+    std::cout << *a << " + " << *b << " = " << *c << std::endl;
+
+    // Cleanup allocated memory.
+    HIP_CHECK(hipFree(a));
+    HIP_CHECK(hipFree(b));
+    HIP_CHECK(hipFree(c));
+
+    return 0;
+}
+// [sphinx-end]
@@ -20,16 +20,23 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.

+#include "popcount.hpp"
+
 #include <hip/hip_runtime.h>
-#include <type_traits>
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
 #include <iostream>
-#include <vector>
 #include <random>
+#include <vector>
+#include <type_traits>

 #define HIP_CHECK(expression)                \
 {                                            \
    const hipError_t status = expression;    \
-    if(status != hipSuccess){                \
+    if(status != hipSuccess)                 \
+    {                                        \
            std::cerr << "HIP error "        \
                << status << ": "            \
                << hipGetErrorString(status) \
@@ -39,146 +46,164 @@
 }

 // [Sphinx HIP warp size block reduction kernel start]
-__global__ void block_reduce(int* input, uint64_t* mask, int* output, size_t size){
-  extern __shared__ int shared[];
-  // Read of input with bounds check
-  auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
-  {
-    uint64_t warp_mask = 1ull << lane_id;
-    return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
-  };
-  const uint32_t tid = threadIdx.x,
-                 lid = threadIdx.x % warpSize,
-                 wid = threadIdx.x / warpSize,
-                 bid = blockIdx.x,
-                 gid = bid * blockDim.x + tid;
-  // Read input buffer to shared
-  shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
-  __syncthreads();
-  // Shared reduction
-  for (uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
-  {
-    if (tid < i)
-      shared[tid] = shared[tid] + shared[tid + i];
+__global__ void block_reduce(int* input, std::uint64_t* mask, int* output, std::size_t size)
+{
+    extern __shared__ int shared[];
+    // Read of input with bounds check
+    auto read_global_safe = [&](const std::uint32_t i, const std::uint32_t lane_id, const std::uint32_t mask_id)
+    {
+        std::uint64_t warp_mask = 1ull << lane_id;
+        return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
+    };
+
+    const std::uint32_t tid = threadIdx.x,
+                    lid = threadIdx.x % warpSize,
+                    wid = threadIdx.x / warpSize,
+                    bid = blockIdx.x,
+                    gid = bid * blockDim.x + tid;
+
+    // Read input buffer to shared
+    shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
    __syncthreads();
-  }

-  // Use local variable in warp reduction  
-  int result =  shared[tid];
-  __syncthreads();
+    // Shared reduction
+    for (std::uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
+    {
+        if (tid < i)
+            shared[tid] = shared[tid] + shared[tid + i];
+        __syncthreads();
+    }

-  // This loop would be unrolled the same with the compile-time WarpSize.
-  #pragma unroll
-  for (uint32_t i = warpSize/2; i >= 1; i /= 2) {
-    result = result + __shfl_down(result, i);
-  }
+    // Use local variable in warp reduction
+    int result = shared[tid];
+    __syncthreads();

-  // Write result to output buffer
-  if (tid == 0)
-    output[bid] = result;
-};
+    // This loop would be unrolled the same with the compile-time WarpSize.
+    #pragma unroll
+    for (std::uint32_t i = warpSize/2; i >= 1; i /= 2) {
+        result = result + __shfl_down(result, i);
+    }
+
+    // Write result to output buffer
+    if (tid == 0)
+        output[bid] = result;
+}
 // [Sphinx HIP warp size block reduction kernel end]

 // [Sphinx HIP warp size mask generation start]
 void generate_and_copy_mask(
-  uint64_t *d_mask, 
-  std::vector<int>& vectorExpected,
-  int warpSizeHost,
-  int numOfBlocks,
-  int numberOfWarp,
-  int mask_size,
-  int mask_element_size) {
-  
-  std::random_device rd;
-  std::mt19937_64 eng(rd());
+    std::uint64_t *d_mask, 
+    std::vector<int>& vectorExpected,
+    int warpSizeHost,
+    int numOfBlocks,
+    int numberOfWarp,
+    int mask_size,
+    int mask_element_size)
+{ 
+    std::random_device rd;
+    std::mt19937_64 eng(rd());

-  // Host side mask vector
-  std::vector<uint64_t> mask(mask_size);
-  // Define uniform unsigned int distribution
-  std::uniform_int_distribution<uint64_t> distr;
-  // Fill up the mask 
-  for(int i=0; i < numOfBlocks; i++) {
-    int count = 0;
-    for(int j=0; j < numberOfWarp; j++) {
-      int mask_index = i * numberOfWarp + j;
-      mask[mask_index] = distr(eng);
-      if(warpSizeHost == 32)
-        count += __builtin_popcount(mask[mask_index]);
-      else
-        count += __builtin_popcountll(mask[mask_index]);
+    // Host side mask vector
+    std::vector<std::uint64_t> mask(mask_size);
+    // Define uniform unsigned int distribution
+    std::uniform_int_distribution<std::uint64_t> distr;
+    // Fill up the mask 
+    for(int i=0; i < numOfBlocks; i++)
+    {
+        int count = 0;
+        for(int j=0; j < numberOfWarp; j++)
+        {
+            int mask_index = i * numberOfWarp + j;
+            mask[mask_index] = distr(eng);
+            if(warpSizeHost == 32)
+                count += popcount(static_cast<std::uint32_t>(mask[mask_index]));
+            else
+                count += popcount(mask[mask_index]);
+        }
+        vectorExpected[i]= count;
    }
-    vectorExpected[i]= count;
-  }
-  // Copy the mask array
-  HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
+    // Copy the mask array
+    HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
 }
 // [Sphinx HIP warp size mask generation end]

-int main() {
-  int deviceId = 0;
-  int warpSizeHost;
-  HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
-  std::cout << "Warp size: " << warpSizeHost << std::endl;
-  constexpr int numOfBlocks = 16;
-  constexpr int threadsPerBlock = 1024;
-  const int numberOfWarp = threadsPerBlock / warpSizeHost;
-  const int mask_element_size = sizeof(uint64_t);
-  const int mask_size = numOfBlocks * numberOfWarp;
-  constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
-  int *d_data, *d_results;
-  uint64_t *d_mask;
-  int initValue = 1;
-  std::vector<int> vectorInput(arraySize, initValue);
-  std::vector<int> vectorOutput(numOfBlocks);
-  std::vector<int> vectorExpected(numOfBlocks);
-  // Allocate device memory
-  HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
-  HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
-  HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
-  // Host to Device copy of the input array
-  HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
+int main()
+{
+    int deviceId = 0;
+    int warpSizeHost;
+    HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
+    std::cout << "Warp size: " << warpSizeHost << std::endl;
+
+    constexpr int numOfBlocks = 16;
+    constexpr int threadsPerBlock = 1024;
+    const int numberOfWarp = threadsPerBlock / warpSizeHost;
+    const int mask_element_size = sizeof(std::uint64_t);
+    const int mask_size = numOfBlocks * numberOfWarp;
+    constexpr std::size_t arraySize = numOfBlocks * threadsPerBlock;
+
+    int *d_data, *d_results;
+    std::uint64_t *d_mask;
+    int initValue = 1;
+    std::vector<int> vectorInput(arraySize, initValue);
+    std::vector<int> vectorOutput(numOfBlocks);
+    std::vector<int> vectorExpected(numOfBlocks);
+    // Allocate device memory
+    HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
+    HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
+    HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
+    // Host to Device copy of the input array
+    HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
  
-  // [Sphinx HIP warp size select kernel start]
-  // Generate and copy mask arrays
-  generate_and_copy_mask(
-    d_mask,
-    vectorExpected,
-    warpSizeHost,
-    numOfBlocks,
-    numberOfWarp,
-    mask_size,
-    mask_element_size);
+    // [Sphinx HIP warp size select kernel start]
+    // Generate and copy mask arrays
+    generate_and_copy_mask(
+        d_mask,
+        vectorExpected,
+        warpSizeHost,
+        numOfBlocks,
+        numberOfWarp,
+        mask_size,
+        mask_element_size);

-  // Start the kernel
-  block_reduce<<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
-    d_data,
-    d_mask,
-    d_results,
-    arraySize);
-  // [Sphinx HIP warp size select kernel end]
+    // Start the kernel
+    block_reduce<<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
+        d_data,
+        d_mask,
+        d_results,
+        arraySize);
+    // [Sphinx HIP warp size select kernel end]

-  // Check the kernel launch
-  HIP_CHECK(hipGetLastError());
-  // Check for kernel execution error
-  HIP_CHECK(hipDeviceSynchronize());
-  // Device to Host copy of the result
-  HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
-  // Verify results
-  bool passed = true;
-  for(size_t i = 0; i < numOfBlocks; ++i) {
-    if(vectorOutput[i] != vectorExpected[i]) {
-      passed = false;
-      std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+    // Check the kernel launch
+    HIP_CHECK(hipGetLastError());
+    // Check for kernel execution error
+    HIP_CHECK(hipDeviceSynchronize());
+    // Device to Host copy of the result
+    HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
+
+    // Verify results
+    bool passed = true;
+    for(std::size_t i = 0; i < numOfBlocks; ++i)
+    {
+        if(vectorOutput[i] != vectorExpected[i])
+        {
+            passed = false;
+            std::cerr << "Validation failed! Expected " << vectorExpected[i]
+                      << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+        }
    }
-  }
-  if(passed){
-    std::cout << "Execution completed successfully." << std::endl;
-  }else{
-    std::cerr << "Execution failed." << std::endl;
-  }
-  // Cleanup
-  HIP_CHECK(hipFree(d_data));
-  HIP_CHECK(hipFree(d_mask));
-  HIP_CHECK(hipFree(d_results));
-  return 0;
-}
+
+    if(passed)
+    {
+        std::cout << "Execution completed successfully." << std::endl;
+    }
+    else
+    {
+        std::cerr << "Execution failed." << std::endl;
+    }
+
+    // Cleanup
+    HIP_CHECK(hipFree(d_data));
+    HIP_CHECK(hipFree(d_mask));
+    HIP_CHECK(hipFree(d_results));
+    // Report validation failures through the exit status as well
+    return passed ? EXIT_SUCCESS : EXIT_FAILURE;
+}
@@ -21,5 +21,316 @@

 import urllib.request

-urllib.request.urlretrieve("https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/develop/HIP-Basic/opengl_interop/main.hip", "docs/tools/example_codes/opengl_interop.hip")
-urllib.request.urlretrieve("https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/develop/HIP-Basic/vulkan_interop/main.hip", "docs/tools/example_codes/external_interop.hip")
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Basic/opengl_interop/main.hip",
+    "docs/tools/example_codes/opengl_interop.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Basic/vulkan_interop/main.hip",
+    "docs/tools/example_codes/external_interop.hip"
+)
+
+# HIP-C++-Language-Extensions ("+" is percent-encoded as %2B in the URLs below)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/calling_global_functions/main.hip",
+    "docs/tools/example_codes/calling_global_functions.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/extern_shared_memory/main.hip",
+    "docs/tools/example_codes/extern_shared_memory.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/launch_bounds/main.hip",
+    "docs/tools/example_codes/launch_bounds.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/set_constant_memory/main.hip",
+    "docs/tools/example_codes/set_constant_memory.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/template_warp_size_reduction/main.hip",
+    "docs/tools/example_codes/template_warp_size_reduction.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/timer/main.hip",
+    "docs/tools/example_codes/timer.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/warp_size_reduction/main.hip",
+    "docs/tools/example_codes/warp_size_reduction.hip"
+)
+
+# HIP-Porting-Guide
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/device_code_feature_identification/main.hip",
+    "docs/tools/example_codes/device_code_feature_identification.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/host_code_feature_identification/main.cpp",
+    "docs/tools/example_codes/host_code_feature_identification.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/identifying_compilation_target_platform/main.cpp",
+    "docs/tools/example_codes/identifying_compilation_target_platform.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/identifying_host_device_compilation_pass/main.hip",
+    "docs/tools/example_codes/identifying_host_device_compilation_pass.hip"
+)
+
+# Introduction-to-the-HIP-Programming-Model
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Introduction-to-the-HIP-Programming-Model/add_kernel/main.hip",
+    "docs/tools/example_codes/add_kernel.hip"
+)
+
+# Porting-CUDA-Driver-API
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/load_module/main.cpp",
+    "docs/tools/example_codes/load_module.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/load_module_ex/main.cpp",
+    "docs/tools/example_codes/load_module_ex.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/load_module_ex_cuda/main.cpp",
+    "docs/tools/example_codes/load_module_ex_cuda.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/per_thread_default_stream/main.cpp",
+    "docs/tools/example_codes/per_thread_default_stream.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/pointer_memory_type/main.cpp",
+    "docs/tools/example_codes/pointer_memory_type.cpp"
+)
+
+# Programming-for-HIP-Runtime-Compiler
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/compilation_apis/main.cpp",
+    "docs/tools/example_codes/compilation_apis.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/linker_apis/main.cpp",
+    "docs/tools/example_codes/linker_apis.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/linker_apis_file/main.cpp",
+    "docs/tools/example_codes/linker_apis_file.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/linker_apis_options/main.cpp",
+    "docs/tools/example_codes/linker_apis_options.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/lowered_names/main.cpp",
+    "docs/tools/example_codes/lowered_names.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/rtc_error_handling/main.cpp",
+    "docs/tools/example_codes/rtc_error_handling.cpp"
+)
+
+# Using-HIP-Runtime-API
+# Using-HIP-Runtime-API / Asynchronous-Concurrent-Execution
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution/async_kernel_execution/main.hip",
+    "docs/tools/example_codes/async_kernel_execution.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution/event_based_synchronization/main.hip",
+    "docs/tools/example_codes/event_based_synchronization.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution/sequential_kernel_execution/main.hip",
+    "docs/tools/example_codes/sequential_kernel_execution.hip"
+)
+
+# Using-HIP-Runtime-API / Call-Stack
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Call-Stack/call_stack_management/main.cpp",
+    "docs/tools/example_codes/call_stack_management.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Call-Stack/device_recursion/main.hip",
+    "docs/tools/example_codes/device_recursion.hip"
+)
+
+# Using-HIP-Runtime-API / Error-Handling
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Error-Handling/error_handling/main.hip",
+    "docs/tools/example_codes/error_handling.hip"
+)
+
+# Using-HIP-Runtime-API / HIP-Graphs
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/HIP-Graphs/graph_capture/main.hip",
+    "docs/tools/example_codes/graph_capture.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/HIP-Graphs/graph_creation/main.hip",
+    "docs/tools/example_codes/graph_creation.hip"
+)
+
+# Using-HIP-Runtime-API / Initialization
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Initialization/simple_device_query/main.cpp",
+    "docs/tools/example_codes/simple_device_query.cpp"
+)
+
+# Using-HIP-Runtime-API / Memory-Management / Device-Memory
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/constant_memory/main.hip",
+    "docs/tools/example_codes/constant_memory_device.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/dynamic_shared_memory/main.hip",
+    "docs/tools/example_codes/dynamic_shared_memory_device.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/explicit_copy/main.cpp",
+    "docs/tools/example_codes/explicit_copy.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/kernel_memory_allocation/main.hip",
+    "docs/tools/example_codes/kernel_memory_allocation.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/static_shared_memory/main.hip",
+    "docs/tools/example_codes/static_shared_memory_device.hip"
+)
+
+# Using-HIP-Runtime-API / Memory-Management / Host-Memory
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Host-Memory/pageable_host_memory/main.cpp",
+    "docs/tools/example_codes/pageable_host_memory.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Host-Memory/pinned_host_memory/main.cpp",
+    "docs/tools/example_codes/pinned_host_memory.cpp"
+)
+
+# Using-HIP-Runtime-API / Memory-Management / SOMA
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/stream_ordered_memory_allocation/main.hip",
+    "docs/tools/example_codes/stream_ordered_memory_allocation.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/ordinary_memory_allocation/main.hip",
+    "docs/tools/example_codes/ordinary_memory_allocation.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool/main.hip",
+    "docs/tools/example_codes/memory_pool.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool_resource_usage_statistics/main.cpp",
+    "docs/tools/example_codes/memory_pool_resource_usage_statistics.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool_threshold/main.hip",
+    "docs/tools/example_codes/memory_pool_threshold.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool_trim/main.cpp",
+    "docs/tools/example_codes/memory_pool_trim.cpp"
+)
+
+# Using-HIP-Runtime-API / Memory-Management / Unified-Memory-Management
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/data_prefetching/main.hip",
+    "docs/tools/example_codes/data_prefetching.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/dynamic_unified_memory/main.hip",
+    "docs/tools/example_codes/dynamic_unified_memory.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/explicit_memory/main.hip",
+    "docs/tools/example_codes/explicit_memory.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/memory_range_attributes/main.hip",
+    "docs/tools/example_codes/memory_range_attributes.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/standard_unified_memory/main.hip",
+    "docs/tools/example_codes/standard_unified_memory.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/static_unified_memory/main.hip",
+    "docs/tools/example_codes/static_unified_memory.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/unified_memory_advice/main.hip",
+    "docs/tools/example_codes/unified_memory_advice.hip"
+)
+
+# Using-HIP-Runtime-API / Multi-Device-Management
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/device_enumeration/main.cpp",
+    "docs/tools/example_codes/device_enumeration.cpp"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/device_selection/main.hip",
+    "docs/tools/example_codes/device_selection.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/multi_device_synchronization/main.hip",
+    "docs/tools/example_codes/multi_device_synchronization.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/p2p_memory_access/main.hip",
+    "docs/tools/example_codes/p2p_memory_access.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/p2p_memory_access_host_staging/main.hip",
+    "docs/tools/example_codes/p2p_memory_access_host_staging.hip"
+)
+
+# Reference examples from HIP-Doc / Reference
+
+# CUDA-to-HIP-API-Function-Comparison
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/CUDA-to-HIP-API-Function-Comparison/block_reduction/main.cu",
+    "docs/tools/example_codes/block_reduction.cu"
+)
+
+# HIP-Complex-Math-API
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/HIP-Complex-Math-API/complex_math/main.hip",
+    "docs/tools/example_codes/complex_math.hip"
+)
+
+# HIP-Math-API
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/HIP-Math-API/math/main.hip",
+    "docs/tools/example_codes/math.hip"
+)
+
+# Low-Precision-Floating-Point-Types
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/Low-Precision-Floating-Point-Types/low_precision_float_fp8/main.hip",
+    "docs/tools/example_codes/low_precision_float_fp8.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/Low-Precision-Floating-Point-Types/low_precision_float_fp16/main.hip",
+    "docs/tools/example_codes/low_precision_float_fp16.hip"
+)
+
+# Tutorial codes from HIP-Doc / Tutorials
+
+# graph_api
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Tutorials/graph_api/src/main_streams.hip",
+    "docs/tools/example_codes/graph_api_tutorial_main_streams.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Tutorials/graph_api/src/main_graph_capture.hip",
+    "docs/tools/example_codes/graph_api_tutorial_main_graph_capture.hip"
+)
+urllib.request.urlretrieve(
+    "https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Tutorials/graph_api/src/main_graph_creation.hip",
+    "docs/tools/example_codes/graph_api_tutorial_main_graph_creation.hip"
+)
@@ -0,0 +1,667 @@
+.. meta::
+  :description: HIP graph API tutorial
+  :keywords: AMD, ROCm, HIP, graph API, tutorial
+
+.. _hip_graph_api_tutorial:
+
+*******************************************************************************
+HIP Graph API Tutorial
+*******************************************************************************
+
+**Time to complete**: 60 minutes | **Difficulty**: Intermediate | **Domain**: Medical Imaging
+
+Introduction
+============
+
+Imagine you are directing a movie. In traditional GPU programming with streams, you are like a director who must call
+"action!" for every single shot, waiting between each take. With HIP graphs, you pre-plan the entire scene sequence and
+then call "action!" just once to film everything in one go. This tutorial will show you how to transform your GPU
+applications from repeated direction to choreographed performance.
+
+Modeling dependencies between GPU operations
+--------------------------------------------
+
+Most movies in the world follow a plot where certain scenes must happen before the following scenes; otherwise the
+movie might not make much sense. If a scene *A* must happen before scenes *B* and *C*, *B* and *C* depend on *A*. If
+*B* and *C* contain different stories that (at this point) are unrelated to each other, *B* and *C* are independent and
+can be shown to the audience in any order. However, both scenes might be a prerequisite for the final scene *D*, so *D*
+depends on both of them. When you represent scenes as *nodes* and dependencies as *edges*, you can create a graph, and
+the graph representing your imaginary movie script will have a diamond-like shape:
+
+.. figure:: ../data/tutorial/graph_api/diamond.svg
+  :alt: Diagram showing a graph with diamond-like shape. Nodes represent movie scenes and edges represent dependencies
+        between scenes.
+  :align: center
+
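The diamond can also be written down as data. The following host-only C++ sketch is not part of the tutorial sources and does not use the HIP API; the function name ``execution_order`` and the numbering of the scenes are assumptions made for illustration. It stores the dependencies as edges and uses Kahn's algorithm to derive one order in which every scene comes after the scenes it depends on.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// An edge {from, to} means "to depends on from".
using Edge = std::pair<std::size_t, std::size_t>;

// Returns one valid execution order for the dependency graph (Kahn's algorithm).
// If the edges contain a cycle, the result holds fewer than node_count entries.
std::vector<std::size_t> execution_order(std::size_t node_count, const std::vector<Edge>& edges)
{
    std::vector<std::vector<std::size_t>> successors(node_count);
    std::vector<std::size_t> pending(node_count, 0); // unmet dependencies per node
    for (const Edge& edge : edges)
    {
        successors[edge.first].push_back(edge.second);
        ++pending[edge.second];
    }

    // Nodes without dependencies can run immediately.
    std::queue<std::size_t> ready;
    for (std::size_t node = 0; node < node_count; ++node)
    {
        if (pending[node] == 0)
            ready.push(node);
    }

    std::vector<std::size_t> order;
    while (!ready.empty())
    {
        const std::size_t node = ready.front();
        ready.pop();
        order.push_back(node);
        // Finishing a node may unblock the nodes that depend on it.
        for (const std::size_t next : successors[node])
        {
            if (--pending[next] == 0)
                ready.push(next);
        }
    }
    return order;
}
```

With scenes *A* to *D* numbered 0 to 3 and the edges ``{0, 1}``, ``{0, 2}``, ``{1, 3}``, ``{2, 3}``, the function returns the order 0, 1, 2, 3. Swapping the two middle entries would be equally valid because *B* and *C* are independent.
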
+You can think about GPU operations in a similar way. For example, most kernels require at least one data buffer to work
+with, so they will depend on a preceding copy or ``memset`` operation. Others might process the results of preceding
+kernels. Real-world applications typically involve multiple GPU operations with dependencies between them. HIP offers
+two ways to think about and model these dependencies: streams and graphs.
+
+Streams
+^^^^^^^
+
+Streams are HIP's default model for organizing and launching GPU operations on the device. They are sequential sets of
+operations, similar to CPU threads. Adding operation *A* before operation *B* to a stream ensures *A* happens before
+*B*, regardless of any interdependencies (or lack thereof) between them. A stream can be thought of as a first-in,
+first-out (FIFO) queue of operations.
+
+Multiple streams operate independently, and manual synchronization is required when dependencies cross stream
+boundaries. Additionally, each operation in a stream is scheduled independently, which, depending on the complexity of
+the enqueued operation, might lead to noticeable CPU launch overhead and kernel dispatch latency, especially for
+workloads with many small kernels. However, streams are well suited to workloads that are dynamic and
+unpredictable.
+
+For more information about HIP streams, see :ref:`asynchronous_how-to`.
+
+Graphs
+^^^^^^
+
+HIP graphs model operations and the dependencies between them as a directed acyclic graph. Each node in the graph
+represents an operation, and each edge represents a dependency between two nodes. If no path exists between two nodes,
+they are independent and can execute in any order.
+
+Because dependency information is built into the graph, the HIP runtime automatically inserts the necessary
+synchronization points. Launching all operations in a graph requires only a single API call, reducing launch overhead
+and dispatch latency to near-zero. This is especially beneficial for workloads with many small kernels, where launch
+overhead can dominate overall execution time.
+
+Graphs must be defined once before use, making them ideal for fixed workflows that run repeatedly. While node
+parameters can be updated between executions, the graph structure itself cannot change after instantiation. This
+structural immutability is the primary trade-off compared to the flexibility of streams.
+
+For more information about HIP graphs, see :ref:`how_to_HIP_graph`.
+
+When to use graphs
+^^^^^^^^^^^^^^^^^^
+
+This table shows when to use graphs in your application.
+
+.. list-table::
+  :header-rows: 1
+  :class: decision-matrix
+
+  * - ✅ **Use Graphs When**
+    - ❌ **Avoid Graphs When**
+  * - Workflow is fixed and repetitive
+    - Workflow changes dynamically
+  * - Same kernels execute many times
+    - One-shot operations
+  * - Launch overhead is significant (many small kernels)
+    - Kernels are long-running
+
+Transitioning a CT reconstruction pipeline
+------------------------------------------
+
+In this tutorial, you will modify an existing GPU-accelerated, stream-based image processing pipeline that reconstructs
+computed tomography (CT) data of the classic Shepp-Logan phantom [ShLo74]_. The pipeline transforms raw X-ray
+projections into clear cross-sectional images used in medical diagnosis.
+
+.. figure:: ../data/tutorial/graph_api/ct_reconstruction_overview.png
+  :alt: Diagram showing raw projection data being transformed into a reconstructed CT slice
+  :align: center
+
+.. note::
+  The tutorial application first generates a phantom volume and its forward projections. This dataset generation is
+  GPU-accelerated, uses multiple streams, and appears in the traces. You can ignore it because it is not relevant to
+  this tutorial.
+
+The reconstruction pipeline consists of:
+
+1. **Load** projection data into GPU memory
+2. **Preprocess** the projection through six stages:
+
+  a. Logarithmic transformation (convert X-ray intensities)
+  b. Pixel weighting (correct for cone-beam geometry)
+  c. Forward FFT (transform to frequency domain)
+  d. Shepp-Logan filtering (enhance edges and improve contrast)
+  e. Inverse FFT (return to spatial domain)
+  f. Normalization (account for unnormalized FFT)
+
+3. **Reconstruct** the 3D volume using the Feldkamp-Davis-Kress (FDK) algorithm [FeDK84]_
+
+**Why HIP graphs?** CT scanners process hundreds of projections per scan. By capturing this fixed workflow as a graph,
+you reduce the number of API calls required to launch the workflow on a GPU to one per projection, which greatly
+reduces launch overhead and dispatch latency.
+
+What you will learn
+-------------------
+
+After completing this tutorial, you will be able to:
+
+* Convert a stream-based HIP application to a graph-based application via stream capturing
+* Create graphs manually for fine-grained control
+* Integrate graph-safe libraries like hipFFT into your graphs
+* Understand when graphs provide performance benefits
+* Apply graph concepts to your own workflows
+
+Before you begin
+----------------
+
+Required knowledge
+^^^^^^^^^^^^^^^^^^
+
+You should be comfortable writing and debugging HIP kernels, understand basic GPU memory management concepts like
+device allocation and host-to-device transfers, be familiar with HIP streams and events, and have experience using
+CMake to build C++ projects. This tutorial assumes you have written at least a few HIP programs before and understand
+concepts like grid dimensions and thread blocks.
+
+Hardware and software requirements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Your system needs ROCm 6.2 or later with the hipFFT library installed. The tutorial works on
+all :doc:`supported AMD GPUs <rocm-install-on-linux:reference/system-requirements>`, though at least 4 GiB of GPU
+memory is recommended for comfortable performance with the reconstruction workload. You will also need
+`git <https://git-scm.com/>`__ to check out the code repository, `CMake <https://www.cmake.org>`__ 3.21 or later to
+build the code, along with a CMake generator that supports the HIP language such as GNU Make or Ninja.
+
+.. note::
+  Visual Studio generators currently do not support HIP. The (optional) ``rocprofv3`` tool is currently supported on
+  Linux only.
+
+To save the output volume, you need a recent version of `libTIFF <https://libtiff.gitlab.io/libtiff/>`__. If CMake
+cannot find libTIFF on your system, it automatically downloads and builds it.
+
+To view both the input projections and the output volume produced by this tutorial, install a scientific image viewer
+that can display 16-bit and 32-bit grayscale data, such as `Fiji <https://imagej.net/software/fiji/downloads>`__.
+Standard image viewers may be unable to correctly display the output.
+
+Optional knowledge
+^^^^^^^^^^^^^^^^^^
+
+While not required, familiarity with Fast Fourier Transform (FFT) operations will help you understand the filtering
+steps. Similarly, knowledge of medical imaging or CT reconstruction is helpful for understanding the application
+context. If you have worked with signal processing or image filtering before, you will recognize some of the applied
+concepts.
+
+.. note::
+  You can skip the reconstruction algorithm and concentrate on the stream and graph implementations in the files
+  prefixed with ``main_``.
+
+Step 1: Build the tutorial code
+===============================
+
+The full code for this tutorial is part of the `ROCm examples repository <https://github.com/ROCm/rocm-examples>`__.
+Check out the repository:
+
+.. code-block:: bash
+
+  git clone https://github.com/ROCm/rocm-examples.git
+
+Then navigate to ``rocm-examples/HIP-Doc/Tutorials/graph_api/``. The code can be found in the ``src`` subdirectory.
+
+Create a separate ``build`` directory inside ``rocm-examples/HIP-Doc/Tutorials/graph_api/``. Then
+configure the project (adjust ``CMAKE_HIP_ARCHITECTURES`` to match your GPU):
+
+.. code-block:: bash
+
+  cd build
+  cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_BUILD_TYPE=Release -DCMAKE_HIP_ARCHITECTURES=gfx1100 -DCMAKE_HIP_PLATFORM=amd -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_C_COMPILER=amdclang -DCMAKE_HIP_COMPILER=amdclang++ ..
+
+Now you can build the three variants of the tutorial code:
+
+.. code-block:: bash
+
+  cmake --build . --target hip_graph_api_tutorial_streams hip_graph_api_tutorial_graph_capture hip_graph_api_tutorial_graph_creation
+
+.. note::
+  The ``graph_capture`` variant is currently not supported on Windows and the build target is therefore unavailable.
+
+Step 2: Examining the stream-based baseline application
+=======================================================
+
+Open ``src/main_streams.hip`` in your editor. You will explore how this application processes data.
+
+Understanding batched processing
+--------------------------------
+
+The application processes multiple projections simultaneously to maximize GPU utilization.
+
+Determining parallel capacity
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+At the beginning of ``main()``, the program queries the GPU's number of asynchronous engines, which indicates how many
+data transfer or compute operations can run in parallel, and uses this number to decide how many streams to create.
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-async-engine-start]
+  :end-before: // [sphinx-async-engine-end]
+  :language: cuda
+  :dedent:
+
+.. tip::
+  Each asynchronous engine executes operations independently. More engines mean more parallelism.
+
+
+Processing projections in batches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Find the ``MAIN LOOP`` comment. Here the application groups projections into parallel batches:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-batch-start]
+  :end-before: // [sphinx-batch-end]
+  :language: cuda
+  :dedent:
+
+Notice how each batch size equals the stream count — this ensures every stream stays busy.
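+
+The batching arithmetic can be sketched in plain host-side C++ as follows. The ``Batch`` struct and the
+``make_batches`` helper are hypothetical names used only for illustration and do not appear in the tutorial sources.
+The sketch assumes that projections are processed in index order and that one projection is assigned to each stream.
+
+.. code-block:: cpp
+
+  #include <algorithm>
+  #include <vector>
+
+  // Hypothetical helper type, not part of the tutorial sources.
+  struct Batch
+  {
+      int first; // index of the first projection in the batch
+      int size;  // number of projections (and busy streams) in the batch
+  };
+
+  // Splits projectionCount projections into batches of at most streamCount.
+  std::vector<Batch> make_batches(int projectionCount, int streamCount)
+  {
+      std::vector<Batch> batches;
+      if(streamCount <= 0)
+          return batches; // nothing can be scheduled without streams
+      for(int first = 0; first < projectionCount; first += streamCount)
+      {
+          // The final batch is smaller if streamCount does not evenly
+          // divide projectionCount.
+          const int size = std::min(streamCount, projectionCount - first);
+          batches.push_back({first, size});
+      }
+      return batches;
+  }
+
+With 10 projections and 4 streams, this yields two full batches and a final batch of 2 projections. Such a smaller
+final batch keeps fewer streams busy than the batches before it.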
+
+Synchronization
+^^^^^^^^^^^^^^^
+
+Each projection processes independently, so you only need to synchronize once at the end.
+The :cpp:func:`hipStreamWaitEvent` function makes the first stream wait for all other streams to complete.
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-sync-start]
+  :end-before: // [sphinx-sync-end]
+  :language: cuda
+  :dedent:
+
+Exploring the processing pipeline
+---------------------------------
+
+Next, examine what happens to each projection. Find the ``START HERE`` comment to see the reconstruction pipeline's
+first steps:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-preprocessing-start]
+  :end-before: // [sphinx-preprocessing-end]
+  :language: cuda
+  :dedent:
+
+This is a typical pattern found across many HIP applications: multiple kernels executing in sequence with data
+dependencies. In the next step, the weighted projections need to be transformed into Fourier space and filtered. A 1D
+FFT generally performs best when the buffer length is a power of two. Therefore, copy the weighted projection to
+another buffer whose row length is a power of two equal to or larger than the projection's row length:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-proj-to-expanded-start]
+  :end-before: // [sphinx-proj-to-expanded-end]
+  :language: cuda
+  :dedent:
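+
+As a small illustration of the padding step, the following host-side sketch computes the smallest power of two that
+is not smaller than a given row length. The helper name ``next_power_of_two`` is an assumption made for this example,
+and the tutorial sources may determine the padded row length differently.
+
+.. code-block:: cpp
+
+  #include <cstddef>
+
+  // Smallest power of two that is greater than or equal to n.
+  // Hypothetical helper; assumes n is small enough that the result fits
+  // into std::size_t (always true for realistic row lengths).
+  constexpr std::size_t next_power_of_two(std::size_t n)
+  {
+      std::size_t result = 1;
+      while(result < n)
+          result *= 2;
+      return result;
+  }
+
+A row length of 512 is already a power of two and stays unchanged, whereas a row length of 513 would be padded to
+1024.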
+
+Next, transform the expanded projection into Fourier space for filtering:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-forward-start]
+  :end-before: // [sphinx-forward-end]
+  :language: cuda
+  :dedent:
+
+.. tip::
+  Some hipFFT operations are graph-safe: As long as these operations are operating on the capturing stream, they will
+  be captured into the graph as well. Refer to :ref:`hipFFT's documentation <hipfft:hipfft-api-usage>` for more
+  information on its graph-safe operations.
+
+In Fourier space, apply the Shepp-Logan filter, then transform back:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-filter-start]
+  :end-before: // [sphinx-filter-end]
+  :language: cuda
+  :dedent:
+
+Shrink to original size and normalize the FFT output:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-expanded-to-proj-start]
+  :end-before: // [sphinx-expanded-to-proj-end]
+  :language: cuda
+  :dedent:
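+
+The normalization is needed because an unnormalized forward transform followed by an unnormalized inverse transform
+scales every sample by the transform length. The following CPU reference sketch shows the idea. It is not the
+tutorial's kernel and rests on assumptions that are not taken from the sources: row-major storage, valid samples at
+the start of each padded row, and a 1D transform length equal to the padded row length. The name
+``shrink_and_normalize`` is hypothetical.
+
+.. code-block:: cpp
+
+  #include <cassert>
+  #include <cstddef>
+  #include <vector>
+
+  // Crops each padded row back to `width` samples and divides every sample
+  // by the FFT length. Assumes row-major storage with the valid samples at
+  // the start of each padded row.
+  std::vector<float> shrink_and_normalize(const std::vector<float>& padded,
+                                          std::size_t paddedWidth,
+                                          std::size_t width,
+                                          std::size_t height)
+  {
+      assert(paddedWidth > 0 && width <= paddedWidth);
+      assert(padded.size() == paddedWidth * height);
+
+      // Forward + inverse unnormalized FFT scales the data by paddedWidth.
+      const float scale = 1.0f / static_cast<float>(paddedWidth);
+      std::vector<float> result(width * height);
+      for(std::size_t row = 0; row < height; ++row)
+          for(std::size_t col = 0; col < width; ++col)
+              result[row * width + col] = padded[row * paddedWidth + col] * scale;
+      return result;
+  }
+
+On the GPU, each thread would typically handle one output pixel instead of looping over rows and columns.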
+
+Finally, back-project the filtered projection into the 3D volume using ``atomicAdd`` operations to accumulate voxel
+values from multiple kernels:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
+  :start-after: // [sphinx-bp-start]
+  :end-before: // [sphinx-bp-end]
+  :language: cuda
+  :dedent:
+
+.. note::
+  The preprocessing kernels process 512 × 512 pixels (:math:`\mathcal{O}(n^2)`), while the back-projection kernel
+  processes 512 × 512 × 512 voxels (:math:`\mathcal{O}(n^3)`). This cubic complexity makes back-projection the
+  computational bottleneck.
+
+Creating a trace file
+^^^^^^^^^^^^^^^^^^^^^
+
+Inside the ``build`` directory you will now generate a trace:
+
+.. code-block:: bash
+
+  rocprofv3 -o streams -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_streams
+
+.. note::
+  For more information on the ``rocprofv3`` tool, please refer to its
+  :ref:`documentation <rocprofiler-sdk:using-rocprofv3>`.
+
+Analyzing the trace
+^^^^^^^^^^^^^^^^^^^
+
+Open the trace file to see what is really happening:
+
+1. Navigate to your ``build/outDir`` directory
+2. Open ``streams_results.pftrace`` in `Perfetto <https://ui.perfetto.dev>`__
+3. Click the arrow next to your executable name under ``System``
+4. Focus on the kernel execution pattern on the right
+
+.. figure:: ../data/tutorial/graph_api/streams_trace.png
+  :alt: Stream execution showing gaps between kernel launches
+  :align: center
+
+While projections process in parallel, there are visible gaps between operations. These gaps represent overhead caused
+by scheduling and launching the operations. In the next section, you will eliminate these gaps by capturing streams into
+a graph.
+
+Step 3: Converting to graphs via stream capture
+===============================================
+
+Stream capture lets you record a sequence of GPU operations (kernel launches, memory copies, and so on) into a HIP
+graph, which can later be executed as a single, optimized unit. Open the file ``src/main_graph_capture.hip``, which
+contains the code from the previous step with a few changes that allow you to capture the streams into a single
+graph.
+
+Before the main loop, declare graph-specific variables:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-graph-vars-start]
+  :end-before: // [sphinx-graph-vars-end]
+  :language: cuda
+  :dedent:
+
+``graphExec`` and ``graphExecFinal`` will be instances of the graph template that you will create in the following
+steps. You will typically instantiate a graph template once and update its parameters for repeated launches. If the
+graph topology changes, you will need a new instance. The ``graphStream`` will launch the final graph instances.
+
+Inside the main loop, activate capture mode on the first stream:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-begin-capture-start]
+  :end-before: // [sphinx-begin-capture-end]
+  :language: cuda
+  :dedent:
+
+.. admonition:: What happens during capture?
+
+  After :cpp:func:`hipStreamBeginCapture` is called, the stream no longer executes the operations submitted to it.
+  Instead, it records them into a graph template (``graph`` in the code shown here).
+
+To capture multiple streams, use events to implement the fork-join pattern:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-fork-start]
+  :end-before: // [sphinx-fork-end]
+  :language: cuda
+  :dedent:
+
+This creates dependencies between streams, activating capture mode on the additional streams and ensuring they are all
+part of the same graph.
+
+**The processing pipeline itself remains unchanged.**
+
+After recording all operations of the current batch, join the streams:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-join-start]
+  :end-before: // [sphinx-join-end]
+  :language: cuda
+  :dedent:
+
+Then stop capturing:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-stop-capture-start]
+  :end-before: // [sphinx-stop-capture-end]
+  :language: cuda
+  :dedent:
+
+The graph template is now complete. In order to execute the recorded operations, you need to instantiate the graph
+and execute it on the ``graphStream``. The graph template can be safely destroyed after instantiating:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-graph-instantiate-start]
+  :end-before: // [sphinx-graph-instantiate-end]
+  :language: cuda
+  :dedent:
+
+.. tip::
+  Use :cpp:func:`hipGraphDebugDotPrint` to save a graph's topology into a ``*.dot`` file. The resulting file
+  contains a `DOT <https://graphviz.org/doc/info/lang.html>`__ description which can be processed with
+  `Graphviz <https://graphviz.org/>`__ or visualized with several tools. For example:
+
+  .. code-block:: bash
+
+    dot -Tpng graph_capture.dot -o graph_capture.png
+
+Instantiating a graph is a relatively costly operation. However, you need to update the parameters whenever a new batch
+is processed. Since the graph templates are the same for all batches (i.e., the topology of the resulting graph does
+not change), it is sufficient to update the existing graph instance's parameters instead of creating a new instance:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-graph-update-start]
+  :end-before: // [sphinx-graph-update-end]
+  :language: cuda
+  :dedent:
+
+Should the graph's topology change between iterations, it is necessary to create a new graph instance. In your
+application's case, this can happen when the number of projections is not evenly divisible by the number of
+asynchronous engines:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
+  :start-after: // [sphinx-graph-final-start]
+  :end-before: // [sphinx-graph-final-end]
+  :language: cuda
+  :dedent:
+
+Creating a trace
+----------------
+
+Now you have successfully converted the processing pipeline into an executable graph. You can examine the effects of
+this change and generate another trace:
+
+.. code-block:: bash
+
+  rocprofv3 -o graph_capture -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_graph_capture
+
+Analyzing the trace
+-------------------
+
+Opening the resulting trace file ``outDir/graph_capture_results.pftrace`` with Perfetto shows a significant change:
+
+.. figure:: ../data/tutorial/graph_api/capture_trace.png
+  :alt: Diagram showing a trace of the capturing variant.
+  :align: center
+
+The gaps have disappeared! By capturing all operations of a batch into a single graph, you have successfully
+eliminated the launching and scheduling overhead previously observed in the stream-based variant.
+
+A limitation of stream capture is that it preserves stream ordering even when unnecessary. Operations that could run in
+parallel still execute sequentially. Another approach to graphs is manual construction. This is quite verbose but also
+offers much more control over dependencies and parallelism.
+
+Step 4: Manual graph creation (advanced)
+========================================
+
+Open ``src/main_graph_creation.hip`` and find the main loop. The code here differs from the other variants: rather than
+capturing streams into graphs, you will build the graph manually. Consider how the weighting kernel is invoked through
+a kernel node:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-weighting-node-start]
+  :end-before: // [sphinx-weighting-node-end]
+  :language: cuda
+  :dedent:
+
+You create an array of ``void*`` pointers containing the kernel parameters. Next, configure the kernel launch
+parameters: grid and block dimensions, the kernel function pointer, and the dynamic shared memory size. Finally, add
+the kernel node to the graph template. Note the ``&logTransformationKernelNode, 1`` part: this is how you specify a
+dependency from the preceding log transformation kernel node to the weighting kernel node.
+
+.. note::
+  For specifying multiple dependencies, you would pass an array of :cpp:type:`hipGraphNode_t` objects and the number of
+  nodes inside the array to :cpp:func:`hipGraphAddKernelNode`.
+
+The HIP graph API supports multiple different node types. For example, this is how a ``memset`` node is set up:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-memset-node-start]
+  :end-before: // [sphinx-memset-node-end]
+  :language: cuda
+  :dedent:
+
+.. note::
+  Despite the different construction method, graph instantiation and updates
+  work exactly as before. You can find the same patterns at the loop's end.
+
+Adding hipFFT nodes
+-------------------
+
+While hipFFT provides graph-safe functionality, it does not support manual node creation. Integrating hipFFT into the
+graph requires a workaround using stream capture with additional bookkeeping.
+
+You capture the graph state before and after hipFFT operations, then identify the nodes hipFFT added:
+
+Step 1: Save existing nodes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Record all current graph nodes in a sorted ``std::set``:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-before-forward-start]
+  :end-before: // [sphinx-before-forward-end]
+  :language: cuda
+  :dedent:
+
+Step 2: Capture hipFFT operations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-hipfft-start]
+  :end-before: // [sphinx-hipfft-end]
+  :language: cuda
+  :dedent:
+
+Step 3: Get updated node list
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-after-forward-start]
+  :end-before: // [sphinx-after-forward-end]
+  :language: cuda
+  :dedent:
+
+Step 4: Find new nodes
+^^^^^^^^^^^^^^^^^^^^^^
+
+Compute the difference between both node sets:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-node-difference-start]
+  :end-before: // [sphinx-node-difference-end]
+  :language: cuda
+  :dedent:
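+
+The set difference itself can be illustrated independently of HIP. In the following sketch, the template parameter
+``Handle`` stands in for an opaque node handle, and the function name ``new_nodes`` is hypothetical. The tutorial's
+own code may be organized differently.
+
+.. code-block:: cpp
+
+  #include <algorithm>
+  #include <iterator>
+  #include <set>
+  #include <vector>
+
+  // Returns the handles that are in `after` but not in `before`.
+  // `Handle` stands in for an opaque node handle type.
+  template <typename Handle>
+  std::vector<Handle> new_nodes(const std::set<Handle>& before,
+                                const std::set<Handle>& after)
+  {
+      std::vector<Handle> added;
+      // std::set iterates in sorted order, which std::set_difference requires.
+      std::set_difference(after.begin(), after.end(),
+                          before.begin(), before.end(),
+                          std::back_inserter(added));
+      return added;
+  }
+
+Because both inputs are already sorted, the difference is computed in a single linear pass over the two sets.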
+
+Step 5: Identify the leaf node
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Find hipFFT's final node for dependency tracking:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-find-leaf-start]
+  :end-before: // [sphinx-find-leaf-end]
+  :language: cuda
+  :dedent:
+
+The leaf detection logic checks if a node has no outgoing edges:
+
+.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
+  :start-after: // [sphinx-is-leaf-start]
+  :end-before: // [sphinx-is-leaf-end]
+  :language: cuda
+  :dedent:
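+
+The same check can be expressed on a plain edge list. The sketch below assumes that the source node of every edge in
+the graph has been collected into a vector beforehand, for example from an edge query such as ``hipGraphGetEdges``.
+The helper name ``is_leaf`` and its signature are chosen for this illustration and need not match the tutorial
+sources.
+
+.. code-block:: cpp
+
+  #include <algorithm>
+  #include <vector>
+
+  // A node is a leaf if it never appears as the source of an edge.
+  // `edgeSources` holds the source node of every edge in the graph.
+  template <typename Handle>
+  bool is_leaf(const Handle& node, const std::vector<Handle>& edgeSources)
+  {
+      return std::find(edgeSources.begin(), edgeSources.end(), node)
+             == edgeSources.end();
+  }
+
+For a chain with the edges 0 → 1 and 1 → 2, the edge sources are 0 and 1, so only node 2 is reported as a leaf.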
+
+With hipFFT integrated and its leaf node identified, subsequent nodes can establish proper dependencies.
+
+.. note::
+  You can also capture hipFFT operations into a separate graph template, then add it to the main graph as a child graph
+  using :cpp:func:`hipGraphAddChildGraphNode`. The approach above adds hipFFT nodes directly to the main graph as
+  first-class nodes. A child graph acts as a single node that expands recursively into its components. The scheduler
+  may handle these approaches differently, potentially affecting performance.
+
+Creating a trace
+----------------
+
+Now you have manually implemented the processing pipeline with the graph API. You can examine the result by generating
+another trace:
+
+.. code-block:: bash
+
+  rocprofv3 -o graph_creation -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_graph_creation
+
+Analyzing the trace
+-------------------
+
+Opening the resulting trace file ``outDir/graph_creation_results.pftrace`` with Perfetto shows a similar trace to what
+you achieved with the capture variant:
+
+.. figure:: ../data/tutorial/graph_api/creation_trace.png
+  :alt: Diagram showing a trace of the creation variant.
+  :align: center
+
+Like before, the kernels are executed *en bloc*. By creating nodes for all operations in the processing pipeline, you
+avoided the launching and scheduling overhead you previously observed in the stream-based variant.
+
+Updating individual nodes
+-------------------------
+
+The code presented in this tutorial updates the entire graph instance for each new batch. For applications that only
+need to update a small subset of nodes, this can add unnecessary overhead. For these cases, the HIP graph API provides
+the following functions for updating individual nodes:
+
+* :cpp:func:`hipGraphExecChildGraphNodeSetParams`
+* :cpp:func:`hipGraphExecEventRecordNodeSetEvent`
+* :cpp:func:`hipGraphExecEventWaitNodeSetEvent`
+* :cpp:func:`hipGraphExecExternalSemaphoresSignalNodeSetParams`
+* :cpp:func:`hipGraphExecExternalSemaphoresWaitNodeSetParams`
+* :cpp:func:`hipGraphExecHostNodeSetParams`
+* :cpp:func:`hipGraphExecKernelNodeSetParams`
+* :cpp:func:`hipGraphExecMemcpyNodeSetParams`
+* :cpp:func:`hipGraphExecMemcpyNodeSetParams1D`
+* :cpp:func:`hipGraphExecMemcpyNodeSetParamsFromSymbol`
+* :cpp:func:`hipGraphExecMemcpyNodeSetParamsToSymbol`
+* :cpp:func:`hipGraphExecMemsetNodeSetParams`
+* :cpp:func:`hipGraphExecNodeSetParams`
+
+Conclusion
+==========
+
+When an application has predictable, repetitive workflows, transitioning from streams to graphs can significantly
+reduce launch overhead and improve performance. HIP provides two approaches for creating graphs: stream capture and
+explicit graph construction.
+
+**Stream capture** converts existing stream-based code into a graph by recording the operations between start and stop
+capture calls. This approach minimizes code changes and works well when your application already has a graph-like
+structure with clear dependencies.
+
+**Explicit graph construction** involves manually creating nodes and defining edges between them using the graph API.
+While this approach requires more code changes and is more verbose, it provides fine-grained control over dependencies
+and allows for optimizations that might not be possible with stream capture. This method is ideal when you need precise
+control over the graph topology or when working with complex dependency patterns.
+
+.. tip::
+  Choose stream capture for quick conversions of existing code with minimal changes. Choose explicit construction when
+  you need maximum control and optimization opportunities.
+
+Resources
+=========
+
+* :ref:`HIP Programming Guide's section on HIP graphs <how_to_HIP_graph>`
+* :ref:`HIP graph API reference <graph_management_reference>`
+
+.. rubric:: References
+
+.. [FeDK84] L.A. Feldkamp, L.C. Davis and J.W. Kress: "Practical cone-beam algorithm". In *Journal of the Optical Society of America A*, vol. 1, no. 6, pp. 612-619, June 1984, DOI `10.1364/JOSAA.1.000612 <https://dx.doi.org/10.1364/JOSAA.1.000612>`__.
+.. [ShLo74] L.A. Shepp and B.F. Logan: "The Fourier reconstruction of a head section". In *IEEE Transactions on Nuclear Science*, vol. 21, no. 3, pp. 21-43, June 1974, DOI `10.1109/TNS.1974.6499235 <https://dx.doi.org/10.1109/TNS.1974.6499235>`__.