SWDEV-541514 - Docs update 2025-09-15 (#993)

Co-authored-by: Julia Jiang <56359287+jujiang-del@users.noreply.github.com>
2025-09-24 18:57:00 +02:00
@@ -54,12 +54,12 @@ suppress_warnings = ["etoc.toctree"]
 numfig = False

 exclude_patterns = [
-    "doxygen/mainpage.md",
-    "understand/glossary.md",
-    'how-to/debugging_env.rst',
-    "data/env_variables_hip.rst"
+    "./doxygen/mainpage.md",
+    "./understand/glossary.md",
+    './how-to/debugging_env.rst',
+    "./reference/env_variables"
 ]

 git_url = subprocess.check_output(['git', 'config', '--get', 'remote.origin.url']).strip().decode('ascii')
 if git_url.find("git:") != -1:
-    html_theme_options = {"repository_url": "https://github.com/ROCm/hip"}
+    html_theme_options = {"repository_url": "https://github.com/ROCm/hip"}
@@ -40,10 +40,10 @@ for the full list.
 What NVIDIA CUDA features does HIP support?
 -------------------------------------------

-The :doc:`NVIDIA CUDA runtime API supported by HIP<hipify:tables/CUDA_Runtime_API_functions_supported_by_HIP>`
-and :doc:`NVIDIA CUDA driver API supported by HIP<hipify:tables/CUDA_Driver_API_functions_supported_by_HIP>`
+The :doc:`NVIDIA CUDA runtime API supported by HIP<hipify:reference/tables/CUDA_Runtime_API_functions_supported_by_HIP>`
+and :doc:`NVIDIA CUDA driver API supported by HIP<hipify:reference/tables/CUDA_Driver_API_functions_supported_by_HIP>`
 pages describe which NVIDIA CUDA APIs are supported and what the equivalents are.
-The :doc:`HIP API documentation <doxygen/html/index>` describes each API and
+The :ref:`HIP runtime API reference<runtime_api_reference>` describes each API and
 its limitations, if any, compared with the equivalent CUDA API.

 The kernel language features are documented in the
@@ -0,0 +1,344 @@
+.. meta::
+  :description: This topic discusses the changes introduced in HIP 7.0
+  :keywords: AMD, ROCm, HIP, HIP changes, CUDA, C++ language extensions
+
+.. _compatibility-changes:
+
+*******************************************************************************
+HIP API 7.0 changes
+*******************************************************************************
+
+To improve code portability between AMD and NVIDIA GPU programming models, changes were made to the HIP API in ROCm 7.0 to simplify cross-platform programming. These changes align HIP C++ even more closely with NVIDIA CUDA. These changes are incompatible with prior releases, and might require recompiling existing HIP applications for use with ROCm 7.0, or editing and recompiling code in some cases. In the best case, the change requires no modification of existing applications. These changes were made available in a preview release based on ROCm 6.4.1 to help you prepare.
+
+Behavior changes in HIP Runtime API
+===================================
+
+Update ``hipGetLastError``
+--------------------------
+
+Prior to the 7.0 release of the HIP API, :cpp:func:`hipGetLastError` was not fully compliant with CUDA's behavior. The purpose of this change is to have ``hipGetLastError`` return the last actual error caught in the current thread during the application execution. Neither ``hipSuccess`` nor ``hipErrorNotReady`` is considered an error. Take the following code as an example:
+
+.. code:: cpp
+
+  1: hipError_t err = hipMalloc(...); // returns hipOutOfMemory
+  2: err = hipSetDevice(0); // returns hipSuccess
+  3: err = hipGetLastError();
+
+The prior behavior was for ``hipGetLastError`` at line 3 to return ``hipSuccess`` from line 2. In the 7.0 release, the value of ``err`` at line 3 is ``hipOutOfMemory`` which is the error returned in Line 1, rather than simply the result returned in line 2. This matches CUDA behavior.
+
+You can still use the prior functionality by using the ``hipExtGetLastError`` function. Notice that the function begins with ``hipExt`` which denotes a function call that is unique to HIP, without correlation to CUDA. This function was introduced with the 6.0 release.
+
+Cooperative groups changes
+--------------------------
+
+For :cpp:func:`hipLaunchCooperativeKernelMultiDevice` function, HIP now includes additional input parameter validation checks.
+
+* If the input launch stream is a NULLPTR or it is ``hipStreamLegacy``, the function now returns ``hipErrorInvalidResourceHandle``.
+* If the stream capturing is active, the function returns the error code ``hipErrorStreamCaptureUnsupported``.
+* If the stream capture status is invalidated, the function returns the error ``hipErrorStreamCaptureInvalidated``.
+
+The :cpp:func:`hipLaunchCooperativeKernel` function now checks the input stream handle. If it's invalid, the returned error is changed to ``hipErrorInvalidHandle`` from ``hipErrorContextIsDestroyed``.
+
+Update ``hipPointerGetAttributes``
+----------------------------------
+
+:cpp:func:`hipPointerGetAttributes` now matches the functionality of ``cudaPointerGetAttributes`` which changed in CUDA 11. If a NULL host or attribute pointer is passed as input parameter, ``hipPointerGetAttributes`` now returns ``hipSuccess`` instead of the error code ``hipErrorInvalidValue``.
+
+Any application which is expecting the API to return an error instead of success could be impacted and a code change may need to handle the error properly.
+
+Update ``hipFree``
+------------------
+
+:cpp:func:`hipFree` previously had an implicit wait for synchronization purpose which is applicable for all memory allocations. This wait has been disabled in the HIP 7.0 runtime for allocations made with ``hipMallocAsync`` and ``hipMallocFromPoolAsync`` to match the behavior of CUDA API ``cudaFree``
+
+Update ``hipFreeAsync``
+-----------------------
+
+The API returns ``hipSuccess`` when the input pointer is NULL, instead of ``hipErrorInvalidValue``, to be consistent with :cpp:func:`hipFree`.
+
+Exceptions effect during kernel execution changes
+-------------------------------------------------
+
+Exceptions that occur during kernel execution will no longer abort the process, but will instead return an error, unless core dumping is enabled.
+
+HIP runtime compiler (hipRTC) changes
+=====================================
+
+Runtime compilation for HIP is available through the ``hipRTC`` library as described in :ref:`hip_runtime_compiler_how-to`. The library grew organically within the main HIP runtime code. However, segregation of the ``hipRTC`` code is now needed to ensure better compatibility and easier code portability.
+
+Removal of ``hipRTC`` symbols from HIP Runtime Library
+------------------------------------------------------
+
+``hipRTC`` has been an independent library since the 6.0 release, but the ``hipRTC`` symbols were still available in the HIP runtime library. Starting with the 7.0 release ``hipRTC`` is no longer included in the HIP runtime, and any application using ``hipRTC`` APIs should link explicitly with the ``hipRTC`` library.
+
+This change makes the usage of ``hipRTC`` library on Linux the same as on Windows and matches the behavior of CUDA ``nvRTC``.
+
+``hipRTC`` compilation
+----------------------
+
+The device code compilation via ``hipRTC`` now uses namespace ``__hip_internal``, instead of the standard headers ``std``, to avoid namespace collision. These changes are made in the HIP header files.
+
+No code change is required in any application, but rebuilding is necessary.
+
+Removal of datatypes from ``hipRTC``
+------------------------------------
+
+In ``hipRTC``, datatype definitions such as ``int64_t``, ``uint64_t``, ``int32_t``, and ``uint32_t`` could result in conflicts in some applications, as they use their own definitions for these types. ``nvRTC`` doesn't define these datatypes either.
+These datatypes are removed and replaced by HIP internal datatypes prefixed with ``__hip``, for example, ``__hip_int64_t``.
+
+Any application relying on HIP internal datatypes during ``hipRTC`` compilation might be affected.
+These changes have no impact on any application if it compiles as expected using ``nvRTC``.
+
+HIP header clean up
+===================
+
+HIP header files previously included unnecessary Standard Template Libraries (STL) headers.
+With the 7.0 release, unnecessary STL headers are no longer included, and only the required STL headers
+are included. 
+
+Applications relying on HIP runtime header files might need to be updated to include STL header
+files that have been removed in 7.0.
+
+API signature and struct changes
+================================
+
+API signature changes
+---------------------
+
+Signatures in some APIs have been modified to match corresponding CUDA APIs, as described below.
+
+The RTC method definition is changed in the following ``hipRTC`` APIs:
+
+* :cpp:func:`hiprtcCreateProgram`
+* :cpp:func:`hiprtcCompileProgram`
+
+In these APIs, the input parameter type changes from ``const char**`` to ``const char* const*``.
+
+In addition, the following APIs have signature changes:
+
+* :cpp:func:`hipMemcpyHtoD`, the type of the second argument pointer changes from ``const void*`` to ``void*``.
+* :cpp:func:`hipCtxGetApiVersion`, the type of second argument is changed from ``int*`` to ``unsigned int*``.
+
+These signature changes do not require code modifications but do require rebuilding the application.
+
+Deprecated struct ``HIP_MEMSET_NODE_PARAMS`` 
+--------------------------------------------
+
+The deprecated structure ``HIP_MEMSET_NODE_PARAMS`` is removed.
+You can use the definition :cpp:struct:`hipMemsetParams` instead, as input parameter, while using these two APIs:
+
+* :cpp:func:`hipDrvGraphAddMemsetNode`
+* :cpp:func:`hipDrvGraphExecMemsetNodeSetParams`
+
+``hipMemsetParams`` struct change
+---------------------------------
+
+The struct :cpp:struct:`hipMemsetParams` is updated to be compatible with CUDA.
+The change is from the old struct definition shown below:
+
+.. code:: cpp
+
+  typedef struct hipMemsetParams {
+    void* dst;
+    unsigned int elementSize;
+    size_t height;
+    size_t pitch;
+    unsigned int value;
+    size_t width;
+  } hipMemsetParams;
+
+To the new struct definition as follows:
+
+.. code:: cpp
+
+  typedef struct hipMemsetParams {
+    void* dst;
+    size_t pitch;
+    unsigned int value;
+    unsigned int elementSize;
+    size_t width;
+    size_t height;
+  } hipMemsetParams;
+
+No code change is required in any application using this structure, but rebuilding is necessary.
+
+HIP vector constructor change
+-----------------------------
+
+Changes have been made to HIP vector constructors for ``hipComplex`` initialization to generate values in alignment with CUDA. The affected constructors are small vector types such as ``float2`` and ``int4`` for example. If your code previously relied on a single value to initialize all components within a vector or complex type, you might need to update your code. Otherwise, rebuilding the application is necessary but no code change is required in any application using these constructors.
+
+Stream capture updates
+======================
+
+Restrict stream capture modes
+-----------------------------
+
+Stream capture mode has been restricted in the following APIs to relaxed (``hipStreamCaptureModeRelaxed``) mode:
+
+* :cpp:func:`hipMallocManaged`
+* :cpp:func:`hipMemAdvise`
+
+These APIs are allowed only in relaxed stream capture mode. If the functions are used with stream capture, the HIP runtime the will return ``hipErrorStreamCaptureUnsupported`` on unsupported stream capture modes.
+
+Check stream capture mode
+-------------------------
+
+The following APIs will check the stream capture mode and return error codes to match the behavior of CUDA. No impact if stream capture is working correctly on CUDA. Otherwise, the application would need to modify the graph being captured.
+
+* :cpp:func:`hipLaunchCooperativeKernelMultiDevice` - Returns error code while stream capture status is active. The usage is restricted during stream capture
+* :cpp:func:`hipEventQuery` - Returns an error ``hipErrorStreamCaptureUnsupported`` in global capture mode
+* :cpp:func:`hipStreamAddCallback` - The stream capture behavior is updated. The function now checks if any of the blocking streams are capturing. If so, it returns an error and invalidates all capturing streams. The usage of this API is restricted during stream capture to match CUDA.
+
+Stream capture error return 
+---------------------------
+
+During stream capture, the following HIP APIs return the ``hipErrorStreamCaptureUnsupported`` error on the HIP runtime, but not always ``hipSuccess``, to match behavior with CUDA.
+
+* :cpp:func:`hipDeviceSetMemPool`
+* :cpp:func:`hipMemPoolCreate`
+* :cpp:func:`hipMemPoolDestroy`
+* :cpp:func:`hipDeviceSetSharedMemConfig`
+* :cpp:func:`hipDeviceSetCacheConfig`
+* :cpp:func:`hipMemcpyWithStream`
+
+The usage of these APIs is restricted during stream capture. No impact if stream capture is working fine on CUDA.
+
+Error code changes
+==================
+
+The following HIP APIs have been updated to return new or additional error codes to match the corresponding
+CUDA APIs. Most existing applications just check if ``hipSuccess`` is returned and no change is needed.
+However, if an application checks for a specific error code, the application code may need to be updated
+to match/handle the new error code accordingly.
+
+Module management related APIs
+------------------------------
+
+Kernel launch APIs
+^^^^^^^^^^^^^^^^^^
+
+The following APIs have updated implementations:
+
+* :cpp:func:`hipModuleLaunchKernel`
+* :cpp:func:`hipExtModuleLaunchKernel`
+* :cpp:func:`hipExtLaunchKernel`
+* :cpp:func:`hipDrvLaunchKernelEx`
+* :cpp:func:`hipLaunchKernel`
+* :cpp:func:`hipLaunchKernelExC`
+
+More conditional checks are added in the API implementation, and the return errors are added or changed in the following scenarios:
+
+* If the input stream handle is invalid, the returned error is changed to ``hipErrorContextIsDestroyed`` from ``hipErrorInvalidValue``
+* Adds a grid dimension check, if any input global work size dimension is zero, returns ``hipErrorInvalidValue``
+* Adds extra shared memory size check, if exceeds the size limit, returns ``hipErrorInvalidValue``
+* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``
+
+``hipModuleLaunchCooperativeKernel``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Conditions are added in the API implementation of :cpp:func:`hipModuleLaunchCooperativeKernel`, and the returned errors are added in the following scenarios:
+
+* If the input stream is invalid, returns ``hipErrorContextIsDestroyed``, instead of ``hipErrorInvalidValue``
+* If any grid dimension or block dimension is zero, returns ``hipErrorInvalidValue``
+* If any grid dimension exceeds the maximum dimension limit, or work group size exceeds the maximum size, returns ``hipErrorInvalidConfiguration`` , instead of ``hipErrorInvalidValue`` 
+* If shared memory size in bytes exceeds the device local memory size per CU, returns ``hipErrorCooperativeLaunchTooLarge``
+
+``hipModuleLoad``
+^^^^^^^^^^^^^^^^^^
+
+The API updates the negative return of :cpp:func:`hipModuleLoad` to match the CUDA behavior. In cases where the file name exists but the file size is 0, the function returns ``hipErrorInvalidImage`` instead of ``hipErrorInvalidValue``.
+
+Texture management related APIs
+-------------------------------
+
+The following APIs have updated the return codes to match the CUDA behavior:
+
+* :cpp:func:`hipTexObjectCreate`, supports zero width and height for 2D image. If either width or height are zero the function will not return ``false``.
+* :cpp:func:`hipBindTexture2D`, adds extra check, if pointer for texture reference or device is NULL, returns ``hipErrorNotFound``.
+* :cpp:func:`hipBindTextureToArray`, if any NULL pointer is input for texture object, resource descriptor, or texture descriptor, returns error ``hipErrorInvalidChannelDescriptor``, instead of ``hipErrorInvalidValue``.
+* :cpp:func:`hipGetTextureAlignmentOffset`, adds a return code ``hipErrorInvalidTexture`` when the texture reference pointer is NULL.
+
+Cooperative group related APIs
+-------------------------------
+
+``hipLaunchCooperativeKernelMultiDevice``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Validations are added to the API implementation of :cpp:func:`hipLaunchCooperativeKernelMultiDevice`, as follows:
+
+* If input launch stream is NULLPTR or it is ``hipStreamLegacy``, returns ``hipErrorInvalidResourceHandle``.
+* If the stream capturing is active, returns the error ``hipErrorStreamCaptureUnsupported``.
+* If the stream capture status is invalidated, returns the error ``hipErrorStreamCaptureInvalidated``
+* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration``  from ``hipErrorInvalidValue``.
+
+``hipLaunchCooperativeKernel``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Validation are added to the API implementation of :cpp:func:`hipLaunchCooperativeKernel`, as follows:
+
+* If the input stream handle is invalid, the returned error is changed to ``hipErrorInvalidHandle`` from ``hipErrorContextIsDestroyed``.
+* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue`` .
+
+Invalid stream input parameter handling matches CUDA
+====================================================
+
+In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Prior to the 7.0 release, the HIP runtime returns an error code ``hipErrorContextIsDestroyed`` if the stream is invalid. In CUDA 12 and later, the equivalent behavior is to raise a segmentation fault. With HIP 7.0, the HIP runtime matches CUDA by causing a segmentation fault. The list of APIs impacted by this change are as follows:
+
+* Stream management related APIs
+
+  * :cpp:func:`hipStreamGetCaptureInfo`
+  * :cpp:func:`hipStreamGetPriority`
+  * :cpp:func:`hipStreamGetFlags`
+  * :cpp:func:`hipStreamDestroy`
+  * :cpp:func:`hipStreamAddCallback`
+  * :cpp:func:`hipStreamQuery`
+  * :cpp:func:`hipLaunchHostFunc`
+
+* Graph management related APIs
+
+  * :cpp:func:`hipGraphUpload`
+  * :cpp:func:`hipGraphLaunch`
+  * :cpp:func:`hipStreamBeginCaptureToGraph`
+  * :cpp:func:`hipStreamBeginCapture`
+  * :cpp:func:`hipStreamIsCapturing`
+  * :cpp:func:`hipStreamGetCaptureInfo`
+  * :cpp:func:`hipGraphInstantiateWithParams`
+
+* Memory management related APIs
+
+  * :cpp:func:`hipMemcpyPeerAsync`
+  * :cpp:func:`hipMallocFromPoolAsync`
+  * :cpp:func:`hipFreeAsync`
+  * :cpp:func:`hipMallocAsync`
+  * :cpp:func:`hipMemcpyAsync`
+  * :cpp:func:`hipMemcpyToSymbolAsync`
+  * :cpp:func:`hipStreamAttachMemAsync`
+  * :cpp:func:`hipMemPrefetchAsync`
+  * :cpp:func:`hipDrvMemcpy3D`
+  * :cpp:func:`hipDrvMemcpy3DAsync`
+  * :cpp:func:`hipDrvMemcpy2DUnaligned`
+  * :cpp:func:`hipMemcpyParam2D`
+  * :cpp:func:`hipMemcpyParam2DAsync`
+  * :cpp:func:`hipMemcpy2DArrayToArray`
+  * :cpp:func:`hipMemcpy2D`
+  * :cpp:func:`hipMemcpy2DAsync`
+  * :cpp:func:`hipDrvMemcpy2DUnaligned`
+  * :cpp:func:`hipMemcpy3D`
+
+* Event management related APIs
+
+  * :cpp:func:`hipEventRecord`
+  * :cpp:func:`hipEventRecordWithFlags`
+
+Developers porting CUDA code to HIP no longer need to modify their error handling code. However,
+if you have come to expect the HIP runtime to return the error code ``hipErrorContextIsDestroyed``,
+you might need to adjust your code.
+
+warpSize Change
+===============
+
+To match the CUDA specification, ``warpSize`` is no longer a ``constexpr``.
+In general, this should be a transparent change. However, if an application was using ``warpSize``
+as a compile-time constant, it will have to be updated to handle the new definition.
+For more information, see `warpSize <./how-to/hip_cpp_language_extensions.html#warpsize>`_
+in :doc:`./how-to/hip_cpp_language_extensions`.
@@ -259,42 +259,50 @@ HSA provides environment variables that help analyze issues in drivers or hardwa

 * To isolate issues with hardware copy engines, you can use ``HSA_ENABLE_SDMA``.

-    ``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader
-    blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency
-    (typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine.
+  ``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader
+  blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency
+  (typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine.

 * To diagnose interrupt storm issues in the driver, you can use ``HSA_ENABLE_INTERRUPT``.

-    ``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based
-    polling, rather than interrupts.
+  ``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based
+  polling, rather than interrupts.

 HIP environment variable summary
 --------------------------------

 Here are some of the more commonly used environment variables:

-.. include-table:: data/env_variables_hip.rst
+.. include-table:: ./reference/env_variables/debug_hip_env.rst
    :table: hip-env-debug

 General debugging tips
 ======================================================

 * ``gdb --args`` can be used to pass the executable and arguments to ``gdb``.
+
 * You can set environment variables (``set env``) from within GDB on Linux:

-    .. code-block:: bash
+  .. code-block:: bash

-        (gdb) set env AMD_SERIALIZE_KERNEL 3
+      (gdb) set env AMD_SERIALIZE_KERNEL 3

-    .. note::
-        This ``gdb`` command does not use an equal (=) sign.
+  .. note::
+
+      This ``gdb`` command does not use an equal (=) sign.

 * The GDB backtrace shows a path in the runtime. This is because a fault is caught by the runtime, but it is generated by an asynchronous command running on the GPU.
+
 * To determine the true location of a fault, you can force the kernels to run synchronously by setting the environment variables ``AMD_SERIALIZE_KERNEL=3`` and ``AMD_SERIALIZE_COPY=3``. This forces HIP runtime to wait for the kernel to finish running before returning. If the fault occurs when a kernel is running, you can see the code that launched the kernel inside the backtrace. The thread that's causing the issue is typically the one inside ``libhsa-runtime64.so``.
+
 * VM faults inside kernels can be caused by:

-    * Incorrect code (e.g., a for loop that extends past array boundaries)
-    * Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad pointers)
-    * Synchronization issues
-    * Compiler issues (incorrect code generation from the compiler)
-    * Runtime issues
+  * Incorrect code (e.g., a for loop that extends past array boundaries)
+
+  * Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad pointers)
+
+  * Synchronization issues
+
+  * Compiler issues (incorrect code generation from the compiler)
+
+  * Runtime issues
@@ -411,11 +411,9 @@ warpSize
 ================================================================================

 The ``warpSize`` constant contains the number of threads per warp for the given
-target device. It can differ between different architectures, and on RDNA
-architectures it can even differ between kernel launches, depending on whether
-they run in CU or WGP mode. See the
-:doc:`hardware features <../reference/hardware_features>` for more
-information.
+target device. On AMD hardware, this is referred to as ``wavefront size``, which 
+may vary depending on the architecture. For more details, see the
+:doc:`hardware features <../reference/hardware_features>`.

 Since ``warpSize`` can differ between devices, it can not be assumed to be a
 compile-time constant on the host. It has to be queried using
@@ -423,8 +421,8 @@ compile-time constant on the host. It has to be queried using

 .. code-block:: cpp

-    int val;
-    hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId);
+    int warpSizeHost;
+    hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId);

 .. note::

@@ -435,6 +433,130 @@ compile-time constant on the host. It has to be queried using
  of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of
  the compute resources.

+Prior to ROCm 7.0, the warpSize parameter was a compile-time constant. Starting
+with ROCm 7.0, it is early folded by the compiler, allowing it to be used in
+loop bounds and enabling loop unrolling in a manner similar to a compile-time
+constant warp size.
+
+If compile time warp size is required, for example to select the correct mask
+type or code path at compile time, the recommended approach is to determine the
+warp size of the GPU on host side and setup the kernel accordingly, as shown in
+the following block reduce example.
+
+The ``block_reduce`` kernel has a template parameter for warp size and performs
+a reduction operation in two main phases:
+
+- Shared memory reduction: Reduction is performed iteratively, halving the
+  number of active threads each step until only a warp remains
+  (32 or 64 threads, depending on the device).
+
+- Warp-level reduction: Once the shared memory reduction completes, the
+  remaining threads use warp-level shuffling to sum the remaining values. This
+  is done efficiently with the ``__shfl_down`` intrinsic, which allows threads within
+  the warp to exchange values without explicit synchronization.
+
+.. tab-set::
+
+    .. tab-item:: WarpSize template parameter
+       :sync: template-warpsize
+
+       .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
+          :start-after: // [Sphinx template warp size block reduction kernel start]
+          :end-before: // [Sphinx template warp size block reduction kernel end]
+          :language: cpp
+
+
+    .. tab-item:: HIP warpSize
+       :sync: hip-warpsize
+
+       .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
+          :start-after: // [Sphinx HIP warp size block reduction kernel start]
+          :end-before: // [Sphinx HIP warp size block reduction kernel end]
+          :language: cpp
+
+The host code with the main function:
+
+- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the optimal
+  kernel configuration.
+
+- Allocates device memory (``d_data`` for input, ``d_results`` for block-wise
+  output) and initializes the input vector to 1.
+
+- Generates the mask variables for every warp and copies them to the device.
+
+  .. tab-set::
+
+      .. tab-item:: WarpSize template parameter
+         :sync: template-warpsize
+
+         .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
+            :start-after: // [Sphinx template warp size mask generation start]
+            :end-before: // [Sphinx template warp size mask generation end]
+            :language: cpp
+
+
+      .. tab-item:: HIP warpSize
+         :sync: hip-warpsize
+
+         .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
+            :start-after:  // [Sphinx HIP warp size mask generation start]
+            :end-before:  // [Sphinx HIP warp size mask generation end]
+            :language: cpp
+
+- Selects the appropriate kernel specialization based on the warp
+  size (either 32 or 64) and launches the kernel.
+
+  .. tab-set::
+
+      .. tab-item:: WarpSize template parameter
+         :sync: template-warpsize
+
+         .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
+            :start-after: // [Sphinx template warp size select kernel start]
+            :end-before: // [Sphinx template warp size select kernel end]
+            :language: cpp
+
+
+      .. tab-item:: HIP warpSize
+         :sync: hip-warpsize
+
+         .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
+            :start-after: // [Sphinx HIP warp size select kernel start]
+            :end-before: // [Sphinx HIP warp size select kernel end]
+            :language: cpp
+
+- Synchronizes the device and copies the results back to the host.
+
+- Checks that each block's sum is equal with the expected mask bit count, 
+  verifying the reduction's correctness.
+
+- Frees the device memory to prevent memory leaks.
+
+.. note::
+
+  The ``warpSize`` runtime example code is also provided for comparison purposes
+  and the full example codes are located in the `tools folder <https://github.com/ROCm/hip/tree/docs/develop/docs/tools/example_codes>`_.
+
+  The variable ``warpSize`` can be used for loop bounds and supports 
+  loop unrolling similarly to the template parameter ``WarpSize``.
+
+For users who still require a compile-time constant warp size as a macro on the
+device side, it can be defined manually based on the target device architecture,
+as shown in the following example.
+
+.. code-block:: cpp
+
+  #if defined(__GFX8__) || defined(__GFX9__)
+    #define WarpSize 64
+  #else
+    #define WarpSize 32
+  #endif
+
+.. note:: 
+
+  ``mwavefrontsize64`` compiler option is not supported by HIP runtime, that's
+  why the architecture based compile time selector is an acceptable approach.
+
 ********************************************************************************
 Vector types
 ********************************************************************************
@@ -855,7 +977,7 @@ The different shuffle functions behave as following:
  of range, the thread returns its own ``var``.

 ``__shfl_down``
-  The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling"
+  The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling"
  the values of the lanes of the warp "down". If the resulting source lane is
  out of range, the thread returns its own ``var``.

@@ -29,6 +29,8 @@ error code spaces:
 General Tips
 --------------------------------------------------------------------------------

+* ``hipDeviceptr_t`` is a ``void*`` and treated like a raw pointer, while ``CUdevicptr``
+  is an ``unsigned int`` and treated as a device memory handle. 
 * Starting to port on an NVIDIA machine is often the easiest approach, as the
  code can be tested for functionality and performance even if not fully ported
  to HIP.
@@ -46,16 +48,16 @@ HIPIFY
 translate CUDA to HIP code. There are two flavours available, ``hipfiy-clang``
 and ``hipify-perl``.

-:doc:`hipify-clang <hipify:hipify-clang>` is, as the name implies, a Clang-based
+:doc:`hipify-clang <hipify:how-to/hipify-clang>` is, as the name implies, a Clang-based
 tool, and actually parses the code, translates it into an Abstract Syntax Tree,
 from which it then generates the HIP source. For this, ``hipify-clang`` needs to
 be able to actually compile the code, so the CUDA code needs to be correct, and
 a CUDA install with all necessary headers must be provided.

-:doc:`hipify-perl <hipify:hipify-perl>` uses pattern matching, to translate the
+:doc:`hipify-perl <hipify:how-to/hipify-perl>` uses pattern matching, to translate the
 CUDA code to HIP. It does not require a working CUDA installation, and can also
 convert CUDA code, that is not syntactically correct. It is therefore easier to
-set up and use, but is not as powerful as ``hipfiy-clang``.
+set up and use, but is not as powerful as ``hipify-clang``.

 Scanning existing CUDA code to scope the porting effort
 --------------------------------------------------------------------------------
@@ -611,6 +613,70 @@ code, while the host can query it during runtime via the device properties. See
 the :ref:`HIP language extension for warpSize <warp_size>` for information on
 how to write portable wave-aware code.

+Lane masks bit-shift
+================================================================================
+
+A thread in a warp is also called a lane, and a lane mask is a bitmask where
+each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
+0 if it's inactive. Bit-shift operations are typically used to create lane masks
+and on AMD GPUs the ``warpSize`` can differ between different architectures,
+that's why it's essential to use correct bitmask type, when porting code.
+
+Example:
+
+.. code-block:: cpp
+
+  // Get the thread's position in the warp
+  unsigned int laneId = threadIdx.x % warpSize;
+
+  // Use lane ID for bit-shift
+  val & ((1 << (threadIdx.x % warpSize) )-1 );
+
+  // Shift 32 bit integer with val variable
+  WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
+
+Lane masks are 32-bit integer types as this is the integer precision that C 
+assigns to such constants by default. GCN/CDNA architectures have a warp size of
+64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain 
+values greater than 31. Consequently, shifting by such values would clear the 
+32-bit register to which the shift operation is applied. For AMD
+architectures, a straightforward fix could look as follows:
+
+.. code-block:: cpp
+  
+  // Get the thread's position in the warp
+  unsigned int laneId = threadIdx.x % warpSize;
+
+  // Use lane ID for bit-shift
+  val & ((1ull << (threadIdx.x % warpSize) )-1 );
+
+  // Shift 64 bit integer with val variable
+  WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
+
+For portability reasons, it is better to introduce appropriately
+typed placeholders as shown below:
+
+.. code-block:: cpp
+
+  #if defined(__GFX8__) || defined(__GFX9__)
+  typedef uint64_t lane_mask_t;
+  #else
+  typedef uint32_t lane_mask_t;
+  #endif
+
+The use of :code:`lane_mask_t` with the previous example:
+
+.. code-block:: cpp
+
+  // Get the thread's position in the warp
+  unsigned int laneId = threadIdx.x % warpSize;
+
+  // Use lane ID for bit-shift
+  val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
+
+  // Shift 32 or 64 bit integer with val variable
+  WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
+
 Porting from CUDA __launch_bounds__
 ================================================================================

@@ -14,12 +14,19 @@ alongside options to guide the compilation.

 .. note::

+  * Device code compilation via HIPRTC uses the ``__hip_internal`` namespace instead
+    of the ``std`` namespace to avoid namespace collision. 
  * This library can be used for compilation on systems without AMD GPU drivers
    installed (offline compilation). However, running the compiled code still
    requires both the HIP runtime library and GPU drivers on the target system.
-  * This library depends on Code Object Manager (comgr). You can try to
-    statically link comgr into HIPRTC to avoid ambiguity.
  * Developers can bundle this library with their application.
+  * HIPRTC leverages AMD's Code Object Manager API (``Comgr``) internally, which
+    is designed to simplify linking, compiling, and inspecting code objects. For
+    more information, see the `llvm-project/amd/comgr/README <https://github.com/ROCm/llvm-project/blob/amd-staging/amd/comgr/README.md>`_.
+  * Comgr may cache HIPRTC compilations. To force full recompilation for each HIPRTC API invocation, set AMD_COMGR_CACHE=0.
+
+    - When viewing the *README* in the Comgr GitHub repository you should look at a
+      specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.

 Compilation APIs
 ===============================================================================
@@ -30,6 +37,11 @@ To use HIPRTC functionality the header needs to be included:

  #include <hip/hiprtc.h>

+.. note::
+
+  Prior to the 7.0 release, the HIP runtime included the hipRTC library. With the 7.0
+  release, the library is separate and must be specifically included as shown above. 
+  
 Kernels can be stored in a string:

 .. code-block:: cpp
@@ -250,44 +262,12 @@ The full example is below:
    HIP_CHECK(hipFree(doutput));
  }

-
-Kernel Compilation Cache
-===============================================================================
-
-HIPRTC incorporates a cache to avoid recompiling kernels between program
-executions. The contents of the cache include the kernel source code (including
-the contents of any ``#include`` headers), the compilation flags, and the
-compiler version. After a ROCm version update, the kernels are progressively
-recompiled, and the new results are cached. When the cache is disabled, each
-kernel is recompiled every time it is requested.
-
-Use the following environment variables to manage the cache status as enabled or
-disabled, the location for storing the cache contents, and the cache eviction
-policy:
-
-* ``AMD_COMGR_CACHE`` By default this variable is unset and the
-  compilation cache feature is enabled. To disable the feature set the
-  environment variable to a value of ``0``.
-
-* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is
-  defined as ``$XDG_CACHE_HOME/comgr``, which defaults to
-  ``$USER/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr``
-  on Windows. You can specify a different directory for the environment variable
-  to change the path for cache storage. If the runtime fails to access the
-  specified cache directory the cache is disabled. If the environment variable
-  is set to an empty string (``""``), the default directory is used.
-
-* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and
-  applied to the cache pruning policy. The string format is consistent with
-  `Clang's ThinLTO cache pruning policy <https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/clang/html/ThinLTO.html#cache-pruning>`_.
-  The default policy is defined as:
-  ``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``.
-  If the runtime fails to parse the defined string, or the environment variable
-  is set to an empty string (""), the cache is disabled.
-
 .. note::

-  This cache is also shared with the OpenCL runtime shipped with ROCm.
+  Some applications define datatypes such as ``int64_t``, ``uint64_t``, ``int32_t``, and ``uint32_t``
+  that could lead to conflicts when integrating with ``hipRTC``. To resolve these conflicts, these
+  datatypes are replaced with HIP-specific internal datatypes prefixed with ``__hip``. For example,
+  ``int64_t`` is replaced by ``__hip_int64_t``.

 HIPRTC specific options
 ===============================================================================
@@ -484,7 +464,7 @@ application requires the ingestion of bitcode/IR not derived from the currently
 installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that
 are compatible with the version of the bitcode/IR.

-`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr>`_ is a
+`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr/README.md>`_ is a
 shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on.
 To identify the bitcode/IR version that comgr is compatible with, one can
 execute "clang -v" using the clang binary from the same ROCm or HIP SDK package.
@@ -492,6 +472,10 @@ For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries
 released by AMD around mid 2022 would be the best choice, assuming the
 LLVM/Clang version included in the package is also version 14.

+.. note:: 
+  When viewing the *README* in the Comgr GitHub repository you should look at a
+  specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.
+
 To ensure smooth operation and compatibility, an application may choose to ship
 the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to
 clearly specify the version requirements and dependencies. This approach
@@ -88,7 +88,7 @@ developer may have to reduce the block size of the kernels. The kernel runtimes
 can be misleading for concurrent kernel runs, that is why during optimization
 it is a good practice to check the trace files, to see if one kernel is blocking
 another kernel, while they are running in parallel. For more information about
-the application tracing, check::doc:`rocprofiler:/how-to/using-rocprof`.
+application tracing, see :doc:`rocprofiler:how-to/using-rocprof`.

 When running kernels in parallel, the execution time can increase due to
 contention for shared resources. This is because multiple kernels may attempt
@@ -21,7 +21,7 @@ and AMD GPUs use different approaches. NVIDIA GPUs have the independent thread
 scheduling feature where each thread has its own call stack and effective
 program counter. On AMD GPUs threads are grouped; each warp has its own call
 stack and program counter. Warps are described and explained in the
-:ref:`inherent_thread_hierarchy`
+:ref:`inherent_thread_model`

 If a thread or warp exceeds its stack size, a stack overflow occurs, causing
 kernel failure. This can be detected using debuggers.
@@ -21,11 +21,10 @@ without changing it. To get a human readable version of the errors,

 .. note::

-    :cpp:func:`hipGetLastError` returns the returned error code of the last HIP
-    runtime API call even if it's ``hipSuccess``, while ``cudaGetLastError``
-    returns the error returned by any of the preceding CUDA APIs in the same
-    host thread. :cpp:func:`hipGetLastError` behavior will be matched with
-    ``cudaGetLastError`` in ROCm release 7.0.
+    :cpp:func:`hipGetLastError` returns the last actual HIP API error caught in the current thread
+    during the application execution. Prior to ROCm 7.0, ``hipGetLastError`` might also return
+    ``hipSuccess`` or ``hipErrorNotReady`` from the last HIP runtime API call, which are not errors.
+

 Best practices of HIP error handling:

@@ -8,12 +8,6 @@
 HIP graphs
 ********************************************************************************

-.. note::
-    The HIP graph API is currently in Beta. Some features can change and might
-    have outstanding issues. Not all features supported by CUDA graphs are yet
-    supported. For a list of all currently supported functions see the
-    :ref:`HIP graph API documentation<graph_management_reference>`.
-
 HIP graphs are an alternative way of executing tasks on a GPU that can provide
 performance benefits over launching kernels using the standard
 method via streams. A HIP graph is made up of nodes and edges. The nodes of a
@@ -64,6 +64,8 @@ To check the availability of fine- and coarse-grained memory pools, use
  Segment:                 GLOBAL; FLAGS: COARSE GRAINED
  ...

+.. _hip-memory-coherence-table:
+
 The APIs, flags and respective memory coherence control are listed in the
 following table:

@@ -108,6 +108,8 @@ C++ application.
  :cpp:func:`hipMalloc` and :cpp:func:`hipFree` are blocking calls. However, HIP
  also provides non-blocking versions :cpp:func:`hipMallocAsync` and
  :cpp:func:`hipFreeAsync`, which require a stream as an additional argument.
+  For asynchronous memory allocations made with ``hipMallocAsync`` and ``hipMallocFromPoolAsync``
+  ``hipFree`` does not implicitly wait for synchronization, to match the behavior of ``cudaFree``.

 .. _pinned_host_memory:

@@ -9,6 +9,14 @@
 Unified memory management
 *******************************************************************************

+This document covers unified memory management in HIP, which encompasses several
+approaches that provide a single address space accessible from both CPU and GPU.
+**Unified memory** refers to the overall architectural concept of this shared
+address space, while **managed memory** is one specific implementation that
+provides automatic page migration between devices. Other unified memory allocators
+like :cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` provide different
+access patterns within the same unified address space concept.
+
 In conventional architectures CPUs and attached devices have their own memory
 space and dedicated physical memory backing it up, e.g. normal RAM for CPUs and
 VRAM on GPUs. This way each device can have physical memory optimized for its
@@ -39,57 +47,79 @@ model is shown in the following figure.

 Unified memory enables the access to memory located on other devices via
 several methods, depending on whether hardware support is available or has to be
-managed by the driver.
+managed by the driver. CPUs can access memory allocated via :cpp:func:`hipMalloc()`,
+providing bidirectional memory accessibility within the unified address space.

-Hardware supported on-demand page migration
--------------------------------------------------------------------------------
+Managed memory
+================================================================================

-When a kernel on the device tries to access a memory address that is not in its
-memory, a page-fault is triggered. The GPU then in turn requests the page from
-the host or an other device, on which the memory is located. The page is then
-unmapped from the source, sent to the device and mapped to the device's memory.
-The requested memory is then available to the processes running on the device.
+Managed Memory is an extension of the unified memory architecture in which HIP
+monitors memory access and intelligently migrates data between device and
+system memories, thereby improving performance and resource efficiency.

-In case the device's memory is at capacity, a page is unmapped from the device's
-memory first and sent and mapped to host memory. This enables more memory to be
-allocated and used for a GPU, than the GPU itself has physically available.
+When a kernel on the device tries to access a managed memory address that is
+not in its local device memory, a page-fault is triggered.  The GPU then in
+turn requests the page from the host or other device on which the memory is
+located. The page is unmapped from the source, sent to the device and
+mapped to the device's memory.  The requested memory is then available locally
+to the processes running on the device, which improves performance as local
+memory access outperforms remote memory access.

-This level of unified memory support can be very beneficial for sparse accesses
-to an array, that is not often used on the device.
-
-Driver managed page migration
--------------------------------------------------------------------------------
-
-If the hardware does not support on-demand page migration, then all the pages
-accessed by a kernel have to be resident on the device, so they have to be
-migrated before the kernel is running. Since the driver can not know beforehand,
-what parts of an array are going to be accessed, all pages of all accessed
-arrays have to be migrated. This can lead to significant delays on the first run
-of a kernel, on top of possibly copying more memory than is actually accessed by
-the kernel.
+Managed memory also expands the memory capacity available to a GPU kernel. When
+migrating memory into the device on page-fault, if the device's memory is
+already at capacity, a page is unmapped from the device's memory first and sent
+and mapped to host memory.  This enables more memory to be allocated and used
+for a GPU than the GPU itself has physically available. This level of support
+can be very beneficial, for example, for sparse accesses to an array that is
+not often used on the device.

 .. _unified memory system requirements:

-System requirements
-================================================================================
+System requirements for managed memory
+--------------------------------------------------------------------------------

-Unified memory is supported on Linux by all modern AMD GPUs from the Vega
-series onward, as shown in the following table. Unified memory management can
-be achieved by explicitly allocating managed memory using
-:cpp:func:`hipMallocManaged` or marking variables with the ``__managed__``
-attribute. For the latest GPUs, with a Linux kernel that supports
-`Heterogeneous Memory Management (HMM)
+Some AMD GPUs do not support page-faults, and thus do not support on-demand
+page-fault driven migration. On these architectures, if the programmer prefers
+all GPU memory accesses to be local, all pages have to migrated before the
+kernel is dispatched, as the driver cannot know beforehand which parts of a
+dataset are going to be accessed. This can lead to significant delays on the
+first run of a kernel, and, in the example of a sparsely accessed array, can
+also lead to copying more memory than is actually accessed by the kernel.
+
+Note that on systems which do not support page-faults, managed memory APIs are
+still accessible to the programmer, but managed memory operates in a degraded
+fashion due to the lack of demand-driven migration. Furthermore, on these
+systems it is still possible to use unified memory allocators that do not
+provide managed memory features; see
+:ref:`memory allocation approaches in unified memory` for more details.
+
+Managed memory is supported on Linux by all modern AMD GPUs from the Vega
+series onward, as shown in the following table. Managed memory can be
+explicitly allocated using :cpp:func:`hipMallocManaged()` or marking variables
+with the ``__managed__`` attribute. For the latest GPUs, with a Linux kernel
+that supports `Heterogeneous Memory Management (HMM)
 <https://www.kernel.org/doc/html/latest/mm/hmm.html>`_, the normal system
-allocator can be used.
+allocators (e.g., ``new``, ``malloc()``) can be used.

-.. list-table:: Supported Unified Memory Allocators by GPU architecture
+.. note::
+    To ensure the proper functioning of managed memory on supported GPUs, it
+    is **essential** to set the environment variable ``HSA_XNACK=1`` and use a
+    GPU kernel mode driver that supports `HMM
+    <https://www.kernel.org/doc/html/latest/mm/hmm.html>`_. Without this
+    configuration, access-driven memory migration will be disabled, and the
+    behavior will be similar to that of systems without HMM support.
+
+.. list-table:: Managed Memory Support by GPU Architecture
    :widths: 40, 25, 25
    :header-rows: 1
    :align: center

    * - Architecture
      - :cpp:func:`hipMallocManaged()`, ``__managed__``
-      - ``new``, ``malloc()``
+      - ``new``, ``malloc()``, ``allocate()``
+    * - CDNA4
+      - ✅
+      - ✅ :sup:`1`
    * - CDNA3
      - ✅
      - ✅ :sup:`1`
@@ -98,7 +128,7 @@ allocator can be used.
      - ✅ :sup:`1`
    * - CDNA1
      - ✅
-      - ✅ :sup:`1`
+      - ❌
    * - RDNA1
      - ✅
      - ❌
@@ -113,11 +143,16 @@ allocator can be used.
 :sup:`1` Works only with ``HSA_XNACK=1`` and kernels with HMM support. First GPU
 access causes recoverable page-fault.

-.. _unified memory allocators:
+.. _memory allocation approaches in unified memory:

-Unified memory allocators
+Memory allocation approaches in unified memory
 ================================================================================

+While managed memory provides automatic migration, unified memory encompasses
+several allocation methods, each with different access patterns and migration
+behaviors. The following section covers all available unified memory allocation
+approaches, including but not limited to managed memory APIs.
+
 Support for the different unified memory allocators depends on the GPU
 architecture and on the system. For more information, see :ref:`unified memory
 system requirements` and :ref:`checking unified memory support`.
@@ -133,17 +168,17 @@ system requirements` and :ref:`checking unified memory support`.

 - **System allocated unified memory**

-  Starting with CDNA2, the ``new`` and ``malloc()`` system allocators allow
+  Starting with CDNA2, the ``new``, ``malloc()``, and ``allocate()`` (Fortran) system allocators allow
  you to reserve unified memory. The system allocator is more versatile and
  offers an easy transition for code written for CPUs to HIP code as the
-  same system allocation API is used.
+  same system allocation API is used. Memory allocated by these allocators can
+  be registered to be accessible on device using :cpp:func:`hipHostRegister()`.

-To ensure the proper functioning of system allocated unified memory on supported
-GPUs, it is essential to set the environment variable ``HSA_XNACK=1`` and use
-a GPU kernel mode driver that supports HMM
-<https://www.kernel.org/doc/html/latest/mm/hmm.html>`_. Without this
-configuration, the behavior will be similar to that of systems without HMM
-support.
+- **HIP allocated non-managed memory**
+
+  :cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` are dynamic memory
+  allocators available on all GPUs with unified memory support. Memory
+  allocated by these allocators is not migrated between device and host memory.

 The table below illustrates the expected behavior of managed and unified memory
 functions on ROCm and CUDA, both with and without HMM support.
@@ -161,10 +196,10 @@ functions on ROCm and CUDA, both with and without HMM support.
        - Access outside the origin without HMM or ``HSA_XNACK=0``
        - Allocation origin with HMM and ``HSA_XNACK=1``
        - Access outside the origin with HMM and ``HSA_XNACK=1``
-      * - ``new``, ``malloc()``
+      * - ``new``, ``malloc()``, ``allocate()``
        - host
        - not accessible on device
-        - host
+        - first touch
        - page-fault migration
      * - :cpp:func:`hipMalloc()`
        - device
@@ -174,13 +209,13 @@ functions on ROCm and CUDA, both with and without HMM support.
      * - :cpp:func:`hipMallocManaged()`, ``__managed__``
        - pinned host
        - zero copy [zc]_
-        - host
+        - first touch
        - page-fault migration
      * - :cpp:func:`hipHostRegister()`
-        - undefined behavior
-        - undefined behavior
-        - host
-        - page-fault migration
+        - pinned host
+        - zero copy [zc]_
+        - pinned host
+        - zero copy [zc]_
      * - :cpp:func:`hipHostMalloc()`
        - pinned host
        - zero copy [zc]_
@@ -236,9 +271,9 @@ functions on ROCm and CUDA, both with and without HMM support.
 Checking unified memory support
 --------------------------------------------------------------------------------

-The following device attributes can offer information about which :ref:`unified
-memory allocators` are supported. The attribute value is 1 if the functionality
-is supported, and 0 if it is not supported.
+The following device attributes can offer information about which :ref:`memory
+allocation approaches in unified memory` are supported. The attribute value is
+1 if the functionality is supported, and 0 if it is not supported.

 .. list-table:: Device attributes for unified memory management
    :widths: 40, 60
@@ -364,10 +399,11 @@ explicit memory management example is presented in the last tab.
    .. tab-item:: new

        .. code-block:: cpp
-            :emphasize-lines: 20-23
+            :emphasize-lines: 21-24

            #include <hip/hip_runtime.h>
            #include <iostream>
+            #include <new>

            #define HIP_CHECK(expression)              \
            {                                          \
@@ -386,10 +422,10 @@ explicit memory management example is presented in the last tab.

            // This example requires HMM support and the environment variable HSA_XNACK needs to be set to 1
            int main() {
-                // Allocate memory for a, b, and c.
-                int *a = new int[1];
-                int *b = new int[1];
-                int *c = new int[1];
+                // Allocate memory with proper alignment for performance
+                int *a = new(std::align_val_t(128)) int[1];
+                int *b = new(std::align_val_t(128)) int[1];
+                int *c = new(std::align_val_t(128)) int[1];

                // Setup input values.
                *a = 1;
@@ -404,10 +440,10 @@ explicit memory management example is presented in the last tab.
                // Prints the result.
                std::cout << *a << " + " << *b << " = " << *c << std::endl;

-                // Cleanup allocated memory.
-                delete[] a;
-                delete[] b;
-                delete[] c;
+                // Cleanup allocated memory with matching aligned delete.
+                ::operator delete[](a, std::align_val_t(128));
+                ::operator delete[](b, std::align_val_t(128));
+                ::operator delete[](c, std::align_val_t(128));

                return 0;
            }
@@ -501,9 +537,24 @@ Performance optimizations for unified memory
 There are several ways, in which the developer can guide the runtime to reduce
 copies between devices, in order to improve performance.

+With ``numactl --membind`` bindings, developers can control where physical
+allocation occurs by restricting memory allocation to specific NUMA nodes.
+This approach can reduce or eliminate the need for explicit data prefetching
+since memory is allocated in the desired location from the start.
+
 Data prefetching
 --------------------------------------------------------------------------------

+.. warning::
+    Data prefetching is not always an optimization and can slow down execution,
+    as the API takes time to execute. If the memory is already in the right
+    place, prefetching will waste time. Users should profile their code to
+    verify whether prefetching is beneficial for their specific use case.
+
+When prefetching is beneficial, developers can consider setting different default
+locations for different devices and using prefetch between them, which can help
+eliminate IPC communication overhead when memory moves between devices.
+
 Data prefetching is a technique used to improve the performance of your
 application by moving data to the desired device before it's actually
 needed. ``hipCpuDeviceId`` is a special constant to specify the CPU as target.
@@ -262,7 +262,9 @@ For example, when the control condition depends on ``threadIdx`` or ``warpSize``
 warp doesn't diverge. The compiler might optimize loops, short ifs, or switch
 blocks using branch predication, which prevents warp divergence. With branch
 predication, instructions associated with a false predicate are scheduled but
-not executed, which avoids unnecessary operations.
+not executed, which avoids unnecessary operations. For control conditions where
+one outcome is significantly more likely than the other, use `__builtin_expect <https://clang.llvm.org/docs/LanguageExtensions.html#builtin-expect>`_
+or ``[[likely]]`` to indicate the likely condition result.

 Avoiding divergent warps
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10,6 +10,13 @@ The Heterogeneous-computing Interface for Portability (HIP) is a C++ runtime API
 and kernel language that lets you create portable applications for AMD and
 NVIDIA GPUs from a single source code. For more information, see [What is HIP?](./what_is_hip)

+```{note}
+HIP API 7.0 introduces changes to make it align more closely with NVIDIA CUDA.
+These changes are incompatible with prior releases, and might require recompiling
+existing HIP applications for use with the ROCm 7.0 release. For more information,
+see [HIP API 7.0 changes](./hip-7-changes).
+```
+
 Installation instructions are available from:

 * [Installing HIP](./install/install)
@@ -9,27 +9,28 @@ Build HIP from source
 Prerequisites
 =================================================

-HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with ``nvcc`` installed.
-Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform.
+HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler,
+or a CUDA platform with ``nvcc`` installed. Before building and running HIP,
+make sure drivers and prebuilt packages are installed properly on the platform.

 You also need to install Python 3, which includes the ``CppHeaderParser`` package.
 Install Python 3 using the following command:

 .. code-block:: shell

-   apt-get install python3
+  apt-get install python3

 Check and install ``CppHeaderParser`` package using the command:

 .. code-block:: shell

-   pip3 install CppHeaderParser
+  pip3 install CppHeaderParser

 Install ``ROCm LLVM`` package using the command:

 .. code-block:: shell

-   apt-get install rocm-llvm-dev
+  apt-get install rocm-llvm-dev


 .. _Building the HIP runtime:
@@ -52,169 +53,174 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for

 .. tab-set::

-   .. tab-item:: AMD
-      :sync: amd
+  .. tab-item:: AMD
+     :sync: amd

-      #. Get HIP source code.
+     #. Get HIP source code.

-         .. code-block:: shell
+        .. code-block:: shell

-            git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git   
+           git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git   

-      #. Set the environment variables.
+     #. Set the environment variables.

-         .. code-block:: shell
+        .. code-block:: shell

-            export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
-            export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
+           export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
+           export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"

-      #. Build HIP.
+     #. Build HIP.

-         .. code-block:: shell
+        .. code-block:: shell

-            cd "$CLR_DIR"
-            mkdir -p build; cd build
-            cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
-            make -j$(nproc)
-            sudo make install
+           cd "$CLR_DIR"
+           mkdir -p build; cd build
+           cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
+           make -j$(nproc)
+           sudo make install

-         .. note::
+        .. note::

-            If ``CMAKE_INSTALL_PREFIX`` is not explicitly specified, the HIP runtime will be installed at
-            ``<ROCM_PATH>``, which is by default at the path ``/opt/rocm``.
+           If ``CMAKE_INSTALL_PREFIX`` is not explicitly specified, the HIP runtime will be installed at
+           ``<ROCM_PATH>``, which is by default at the path ``/opt/rocm``.

-            By default, the release version of HIP is built. If you need a debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line.
+           By default, the release version of HIP is built. If you need a debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line.

-         Default paths and environment variables:
-            * HIP is installed into ``<ROCM_PATH>``. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option.
-               environment variable.
-            * HSA is in ``<ROCM_PATH>``. This can be overridden by setting the ``HSA_PATH``
-               environment variable.
-            * Clang is in ``<ROCM_PATH>/llvm/bin``. This can be overridden by setting the
-               ``HIP_CLANG_PATH`` environment variable.
-            * The device library is in ``<ROCM_PATH>/lib``. This can be overridden by setting the
-               ``DEVICE_LIB_PATH`` environment variable.
-            * Optionally, you can add ``<ROCM_PATH>/bin`` to your ``PATH``, which can make it easier to
-               use the tools.
-            * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.
+        Default paths and environment variables:
+         
+        * HIP is installed into ``<ROCM_PATH>``. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option.

-         After you run the ``make install`` command, HIP is installed to ``<ROCM_PATH>`` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined.
+        * HSA is in ``<ROCM_PATH>``. This can be overridden by setting the ``HSA_PATH``
+          environment variable.

-         #. Generate a profiling header after adding/changing a HIP API.
+        * Clang is in ``<ROCM_PATH>/llvm/bin``. This can be overridden by setting the
+          ``HIP_CLANG_PATH`` environment variable.

-            When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
-            This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``.
+        * The device library is in ``<ROCM_PATH>/lib``. This can be overridden by setting the
+          ``DEVICE_LIB_PATH`` environment variable.

-            To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
-            ``hipamd/src``.
+        * Optionally, you can add ``<ROCM_PATH>/bin`` to your ``PATH``, which can make it easier to
+          use the tools.

-            Usage:
+        * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.

-            .. code-block:: shell
+        After you run the ``make install`` command, HIP is installed to ``<ROCM_PATH>`` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined.

-               `hip_prof_gen.py [-v] <input HIP API .h file> <patched srcs path> <previous output> [<output>]`
+     #. Generate a profiling header after adding/changing a HIP API.

-            Flags:
+        When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
+        This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``.

-               * ``-v``: Verbose messages
-               * ``-r``: Process source directory recursively
-               * ``-t``: API types matching check
-               * ``--priv``: Private API check
-               * ``-e``: On error exit mode
-               * ``-p``: ``HIP_INIT_API`` macro patching mode
+        To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
+        ``hipamd/src``.

-            Example usage:
+        Usage:

-            .. code-block:: shell
+        .. code-block:: shell

-               hip_prof_gen.py -v -p -t --priv <hip>/include/hip/hip_runtime_api.h \
-               <hipamd>/src <hipamd>/include/hip/amd_detail/hip_prof_str.h \
-               <hipamd>/include/hip/amd_detail/hip_prof_str.h.new
+           `hip_prof_gen.py [-v] <input HIP API .h file> <patched srcs path> <previous output> [<output>]`

-   .. tab-item:: NVIDIA
-      :sync: nvidia
+           Flags:

-      #. Get the HIP source code.
+              * ``-v``: Verbose messages
+              * ``-r``: Process source directory recursively
+              * ``-t``: API types matching check
+              * ``--priv``: Private API check
+              * ``-e``: On error exit mode
+              * ``-p``: ``HIP_INIT_API`` macro patching mode

-         .. code-block:: shell
+        Example usage:

-            git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
+        .. code-block:: shell

-      #. Set the environment variables.
+           hip_prof_gen.py -v -p -t --priv <hip>/include/hip/hip_runtime_api.h \
+           <hipamd>/src <hipamd>/include/hip/amd_detail/hip_prof_str.h \
+           <hipamd>/include/hip/amd_detail/hip_prof_str.h.new

-         .. code-block:: shell
+  .. tab-item:: NVIDIA
+     :sync: nvidia

-            export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
-            export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
-            export HIP_OTHER="$(readlink -f rocm-systems/projects/hipother)"
+     #. Get the HIP source code.

-      #. Build HIP.
+        .. code-block:: shell

-         .. code-block:: shell
+           git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git

-            cd "$CLR_DIR"
-            mkdir -p build; cd build
-            cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv ..
-            make -j$(nproc)
-            sudo make install
+     #. Set the environment variables.
+
+        .. code-block:: shell
+
+           export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
+           export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
+           export HIP_OTHER="$(readlink -f rocm-systems/projects/hipother)"
+
+     #. Build HIP.
+
+        .. code-block:: shell
+
+           cd "$CLR_DIR"
+           mkdir -p build; cd build
+           cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv ..
+           make -j$(nproc)
+           sudo make install

 Build HIP tests
 =================================================

 .. tab-set::

-   .. tab-item:: AMD
-      :sync: amd
+  .. tab-item:: AMD
+     :sync: amd

-      * Build HIP catch tests.
+     **Build HIP catch tests.**

-         HIP catch tests utilize the Catch2 testing framework.
+     HIP catch tests utilize the Catch2 testing framework.

-         * Get HIP tests source code.
+     #. Get HIP tests source code.

-            .. code-block:: shell
+        .. code-block:: shell

-               git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
-               export HIPTESTS_DIR="$(readlink -f rocm-systems/projects/hip-tests)"
+           git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
+           export HIPTESTS_DIR="$(readlink -f rocm-systems/projects/hip-tests)"

-         * Build HIP tests from source.
+     #. Build HIP tests from source.

-            .. code-block:: shell
+        .. code-block:: shell

-               cd "$HIPTESTS_DIR"
-               mkdir -p build; cd build
-               cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install  # or any path where HIP is installed; for example: ``/opt/rocm``
-               export ROCM_PATH=/opt/rocm
-               make build_tests
-               ctest # run tests
+           cd "$HIPTESTS_DIR"
+           mkdir -p build; cd build
+           cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install  # or any path where HIP is installed; for example: ``/opt/rocm``
+           export ROCM_PATH=/opt/rocm
+           make build_tests
+           ctest # run tests

-            HIP catch tests are built in ``$HIPTESTS_DIR/build``.
+        HIP catch tests are built in ``$HIPTESTS_DIR/build``.

-            To run any single catch test, use this example:
+        To run any single catch test, use this example:

-            .. code-block:: shell
+        .. code-block:: shell

-               cd $HIPTESTS_DIR/build/catch_tests/unit/texture
-               ./TextureTest
+           cd $HIPTESTS_DIR/build/catch_tests/unit/texture
+           ./TextureTest

-         * Build a HIP Catch2 standalone test.
+     #. Build a HIP Catch2 standalone test.

-            .. code-block:: shell
+        .. code-block:: shell

-               cd "$HIPTESTS_DIR"
-               hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
-               -I ./catch/include ./catch/hipTestMain/standalone_main.cc \
-               -I ./catch/external/Catch2 -o hipPointerGetAttributes
-               ./hipPointerGetAttributes
-               ...
+           cd "$HIPTESTS_DIR"
+           hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
+           -I ./catch/include ./catch/hipTestMain/standalone_main.cc \
+           -I ./catch/external/Catch2 -o hipPointerGetAttributes
+           ./hipPointerGetAttributes
+           ...

-               All tests passed
+           All tests passed

-   .. tab-item:: NVIDIA
-      :sync: nvidia
+  .. tab-item:: NVIDIA
+     :sync: nvidia

-      The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform.
-      However, you must first set ``-DHIP_PLATFORM=nvidia``.
+     The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform.
+     However, you must first set ``-DHIP_PLATFORM=nvidia``.


 Run HIP
@@ -10,10 +10,10 @@ HIP can be installed on AMD (ROCm with HIP-Clang) and NVIDIA (CUDA with NVCC) pl

 .. note::

-   The version definition for the HIP runtime is different from CUDA. On AMD
-   platforms, the :cpp:func:`hipRuntimeGetVersion` function returns the HIP
-   runtime version. On NVIDIA platforms, this function returns the CUDA runtime
-   version.
+  The version definition for the HIP runtime is different from CUDA. On AMD
+  platforms, the :cpp:func:`hipRuntimeGetVersion` function returns the HIP
+  runtime version. On NVIDIA platforms, this function returns the CUDA runtime
+  version.

 .. _install_prerequisites:

@@ -22,84 +22,88 @@ Prerequisites

 .. tab-set::

-   .. tab-item:: AMD
-      :sync: amd
+  .. tab-item:: AMD
+     :sync: amd

-      Refer to the Prerequisites section in the ROCm install guides:
+     Refer to the Prerequisites section in the ROCm install guides:

-         * :doc:`rocm-install-on-linux:reference/system-requirements`
-         * :doc:`rocm-install-on-windows:reference/system-requirements`
+     * `System requirements (Linux) <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html>`_
+     * `System requirements (Windows) <https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html>`_

-   .. tab-item:: NVIDIA
-      :sync: nvidia
+  .. tab-item:: NVIDIA
+     :sync: nvidia

-      With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA
-      GPUs with compute capability 5.0 or later should be supported. For more
-      information, see `NVIDIA's list of CUDA enabled GPUs <https://developer.nvidia.com/cuda-gpus>`_.
+     With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA
+     GPUs with compute capability 5.0 or later should be supported. For more
+     information, see `NVIDIA's list of CUDA enabled GPUs <https://developer.nvidia.com/cuda-gpus>`_.

 Installation
 =======================================

 .. tab-set::

-   .. tab-item:: AMD
-      :sync: amd
+  .. tab-item:: AMD
+     :sync: amd

-      HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you
-      can find installation instructions here:
+     HIP is automatically installed during the ROCm installation. If you haven't
+     yet installed ROCm, you can find installation instructions here:

-         * :doc:`rocm-install-on-linux:index`
-         * :doc:`rocm-install-on-windows:index`
+     * `ROCm installation for Linux <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/index.html>`_
+     * `HIP SDK installation for Windows <https://rocm.docs.amd.com/projects/install-on-windows/en/latest/index.html>`_

-      By default, HIP is installed into ``/opt/rocm``.
+     By default, HIP is installed into ``/opt/rocm``.

-      .. note::
-         There is no autodetection for the HIP installation. If you choose to install it somewhere other than the default location, you must set the ``HIP_PATH`` environment variable as explained in `Build HIP from source <./build.html>`_.
+     .. note::
+     
+        There is no autodetection for the HIP installation. If you choose to 
+        install it somewhere other than the default location, you must set the
+        ``HIP_PATH`` environment variable as explained in
+        `Build HIP from source <./build.html>`_.

-   .. tab-item:: NVIDIA
-      :sync: nvidia
+  .. tab-item:: NVIDIA
+     :sync: nvidia

-      #. Install the NVIDIA toolkit.
+     #. Install the NVIDIA toolkit.

-         The latest release can be found here:
-         `CUDA Toolkit <https://developer.nvidia.com/cuda-downloads>`_.
+        The latest release can be found here:
+        `CUDA Toolkit <https://developer.nvidia.com/cuda-downloads>`_.

-      #. Setup the radeon repo.
+     #. Setup the radeon repo.

-         .. code-block::shell
+        .. code-block::shell

-            # Replace url with appropriate link in the table below
-            wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb
-            sudo apt install ./amdgpu-install_6.2.60200-1_all.deb
-            sudo apt update
+          # Replace url with appropriate link in the table below
+          wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb
+          sudo apt install ./amdgpu-install_6.2.60200-1_all.deb
+          sudo apt update

-         .. list-table:: amdgpu-install links
-            :widths: 25 100
-            :header-rows: 1
+        .. list-table:: amdgpu-install links
+           :widths: 25 100
+           :header-rows: 1

-            * - Ubuntu version
-              - URL
-            * - 24.04
-              - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb
-            * - 22.04
-              - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb
+           * - Ubuntu version
+             - URL
+           * - 24.04
+             - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb
+           * - 22.04
+             - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb

-      #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP
-         porting layer.
+     #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP
+        porting layer.

-         .. code-block:: shell
+        .. code-block:: shell

-            apt-get install hip-runtime-nvidia hip-dev
+          apt-get install hip-runtime-nvidia hip-dev

-         The default paths are:
-            * CUDA SDK: ``/usr/local/cuda``
-            * HIP: ``/opt/rocm``
+        The default paths are:
+          * CUDA SDK: ``/usr/local/cuda``
+          * HIP: ``/opt/rocm``

-      #. Set the HIP_PLATFORM to nvidia.
+     #. Set the HIP_PLATFORM to nvidia.

-         .. code-block:: shell
+        .. code-block:: shell

-            export HIP_PLATFORM="nvidia"
+          export HIP_PLATFORM="nvidia"

 Verify your installation
 ==========================================================
@@ -108,4 +112,4 @@ Run ``hipconfig`` in your installation path.

 .. code-block:: shell

-   /opt/rocm/bin/hipconfig --full
+  /opt/rocm/bin/hipconfig --full
@@ -43,6 +43,13 @@ in both single and double precision formats.
 Complex Number Functions
 ========================

+.. note::
+
+  Changes have been made to small vector constructors for ``hipComplex`` and ``hipFloatComplex``
+  initialization, such as ``float2`` and ``int4``. If your code previously relied
+  on a single value to initialize all components within a vector or complex type, you might need
+  to update your code.
+
 A comprehensive collection of functions for creating and manipulating complex numbers, organized by
 functional categories for easy reference.

@@ -12,44 +12,24 @@ on AMD platform, which are grouped by functionality.
 GPU isolation variables
 ================================================================================

-The GPU isolation environment variables in HIP are collected in the following table.
-For more information, check :doc:`GPU isolation page <rocm:conceptual/gpu-isolation>`.
-
-.. include-table:: data/env_variables_hip.rst
-    :table: hip-env-isolation
+.. include:: ./env_variables/gpu_isolation_hip_env.rst

 Profiling variables
 ================================================================================

-The profiling environment variables in HIP are collected in the following table. For
-more information, check :doc:`setting the number of CUs page <rocm:how-to/setting-cus>`.
-
-.. include-table:: data/env_variables_hip.rst
-    :table: hip-env-prof
+.. include:: ./env_variables/profiling_hip_env.rst

 Debug variables
 ================================================================================

-The debugging environment variables in HIP are collected in the following table. For
-more information, check :ref:`debugging_with_hip`.
-
-.. include-table:: data/env_variables_hip.rst
-    :table: hip-env-debug
+.. include:: ./env_variables/debug_hip_env.rst

 Memory management related variables
 ================================================================================

-The memory management related environment variables in HIP are collected in the
-following table.
-
-.. include-table:: data/env_variables_hip.rst
-    :table: hip-env-memory
+.. include:: ./env_variables/memory_management_hip_env.rst

 Other useful variables
 ================================================================================

-The following table lists environment variables that are useful but relate to
-different features.
-
-.. include-table:: data/env_variables_hip.rst
-    :table: hip-env-other
+.. include:: ./env_variables/miscellaneous_hip_env.rst
@@ -0,0 +1,100 @@
+The debugging environment variables in HIP are collected in the following table. For
+more information, check :doc:`hip:how-to/logging`, :doc:`hip:how-to/debugging`
+and :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`.
+
+.. _hip-env-debug:
+.. list-table::
+    :header-rows: 1
+    :widths: 35,14,51
+
+    * - **Environment variable**
+      - **Default value**
+      - **Value**
+
+    * - | ``AMD_LOG_LEVEL``
+        | Enables HIP log on various level.
+      - ``0``
+      - | 0: Disable log.
+        | 1: Enables error logs.
+        | 2: Enables warning logs next to lower-level logs.
+        | 3: Enables information logs next to lower-level logs.
+        | 4: Enables debug logs next to lower-level logs.
+        | 5: Enables debug extra logs next to lower-level logs.
+
+    * - | ``AMD_LOG_LEVEL_FILE``
+        | Sets output file for ``AMD_LOG_LEVEL``.
+      - stderr output
+      -
+
+    * - | ``AMD_LOG_MASK``
+        | Specifies HIP log filters. Here is the ` complete list of log masks <https://github.com/ROCm/clr/blob/develop/rocclr/utils/debug.hpp#L40>`_.
+      - ``0x7FFFFFFF``
+      - | 0x1: Log API calls.
+        | 0x2: Kernel and copy commands and barriers.
+        | 0x4: Synchronization and waiting for commands to finish.
+        | 0x8: Decode and display AQL packets.
+        | 0x10: Queue commands and queue contents.
+        | 0x20: Signal creation, allocation, pool.
+        | 0x40: Locks and thread-safety code.
+        | 0x80: Kernel creations and arguments, etc.
+        | 0x100: Copy debug.
+        | 0x200: Detailed copy debug.
+        | 0x400: Resource allocation, performance-impacting events.
+        | 0x800: Initialization and shutdown.
+        | 0x1000: Misc debug, not yet classified.
+        | 0x2000: Show raw bytes of AQL packet.
+        | 0x4000: Show code creation debug.
+        | 0x8000: More detailed command info, including barrier commands.
+        | 0x10000: Log message location.
+        | 0x20000: Memory allocation.
+        | 0x40000: Memory pool allocation, including memory in graphs.
+        | 0x80000: Timestamp details.
+        | 0xFFFFFFFF: Log always even mask flag is zero.
+
+    * - | ``HIP_LAUNCH_BLOCKING``
+        | Used for serialization on kernel execution.
+      - ``0``
+      - | 0: Disable. Kernel executes normally.
+        | 1: Enable. Serializes kernel enqueue, behaves the same as ``AMD_SERIALIZE_KERNEL``.
+
+    * - | ``HIP_VISIBLE_DEVICES`` (or ``CUDA_VISIBLE_DEVICES``)
+        | Only devices whose index is present in the sequence are visible to HIP
+      - Unset by default.
+      - 0,1,2: Depending on the number of devices on the system.
+
+    * - | ``GPU_DUMP_CODE_OBJECT``
+        | Dump code object.
+      - ``0``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``AMD_SERIALIZE_KERNEL``
+        | Serialize kernel enqueue.
+      - ``0``
+      - | 0: Disable
+        | 1: Wait for completion before enqueue.
+        | 2: Wait for completion after enqueue.
+        | 3: Both
+
+    * - | ``AMD_SERIALIZE_COPY``
+        | Serialize copies
+      - ``0``
+      - | 0: Disable
+        | 1: Wait for completion before enqueue.
+        | 2: Wait for completion after enqueue.
+        | 3: Both
+
+    * - | ``AMD_DIRECT_DISPATCH``
+        | Enable direct kernel dispatch (Currently for Linux; under development for Windows).
+      - ``1``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``GPU_MAX_HW_QUEUES``
+        | The maximum number of hardware queues allocated per device.
+      - ``4``
+      - The variable controls how many independent hardware queues HIP runtime can create per process,
+        per device. If an application allocates more HIP streams than this number, then HIP runtime reuses
+        the same hardware queues for the new streams in a round-robin manner. Note that this maximum
+        number does not apply to hardware queues that are created for CU-masked HIP streams, or
+        cooperative queues for HIP Cooperative Groups (single queue per device).
@@ -0,0 +1,27 @@
+Restricting the access of applications to a subset of GPUs, also known as GPU
+isolation, allows users to hide GPU resources from programs. The GPU isolation
+environment variables in HIP are collected in the following table.
+
+.. _hip-env-isolation:
+.. list-table::
+    :header-rows: 1
+    :widths: 50,30,20
+
+    * - **Environment variable**
+      - **Links**
+      - **Value**
+
+    * - | ``ROCR_VISIBLE_DEVICES``
+        | A list of device indices or UUIDs that will be exposed to applications.
+      - :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`Setting the number of compute units <rocm:how-to/setting-cus>`
+      - Example: ``0,GPU-DEADBEEFDEADBEEF``
+
+    * - | ``GPU_DEVICE_ORDINAL``
+        | Devices indices exposed to OpenCL and HIP applications.
+      - :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`
+      - Example: ``0,2``
+
+    * - | ``HIP_VISIBLE_DEVICES`` or ``CUDA_VISIBLE_DEVICES``
+        | Device indices exposed to HIP applications.
+      - :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`HIP debugging <hip:how-to/debugging>`
+      - Example: ``0,2``
@@ -0,0 +1,100 @@
+The memory management related environment variables in HIP are collected in the
+following table. The ``HIP_HOST_COHERENT`` variable linked at the following
+pages:
+
+- :ref:`Coherence control <hip:coherence_control>`
+
+- :ref:`Memory allocation flags <hip:memory_allocation_flags>`
+
+.. _hip-env-memory:
+.. list-table::
+    :header-rows: 1
+    :widths: 35,14,51
+
+    * - **Environment variable**
+      - **Default value**
+      - **Value**
+
+    * - | ``HIP_HIDDEN_FREE_MEM``
+        | Amount of memory to hide from the free memory reported by hipMemGetInfo.
+      - ``0``
+      - | 0: Disable
+        | Unit: megabyte (MB)
+
+    * - | ``HIP_HOST_COHERENT``
+        | Specifies if the memory is coherent between the host and GPU in ``hipHostMalloc``.
+      - ``0``
+      - | 0: Memory is not coherent.
+        | 1: Memory is coherent.
+        | Environment variable has effect, if the following conditions are statisfied:
+        | - One of the ``hipHostMallocDefault``, ``hipHostMallocPortable``,  ``hipHostMallocWriteCombined`` or ``hipHostMallocNumaUser`` flag set to 1.
+        | - ``hipHostMallocCoherent``, ``hipHostMallocNonCoherent`` and ``hipHostMallocMapped`` flags set to 0.
+
+    * - | ``HIP_INITIAL_DM_SIZE``
+        | Set initial heap size for device malloc.
+      - ``8388608``
+      - | Unit: Byte
+        | The default value corresponds to 8 MB.
+
+    * - | ``HIP_MEM_POOL_SUPPORT``
+        | Enables memory pool support in HIP.
+      - ``0``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``HIP_MEM_POOL_USE_VM``
+        | Enables memory pool support in HIP.
+      - | ``0``: other OS
+        | ``1``: Windows
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``HIP_VMEM_MANAGE_SUPPORT``
+        | Virtual Memory Management Support.
+      - ``1``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``GPU_MAX_HEAP_SIZE``
+        | Set maximum size of the GPU heap to % of board memory.
+      - ``100``
+      - | Unit: Percentage
+
+    * - | ``GPU_MAX_REMOTE_MEM_SIZE``
+        | Maximum size that allows device memory substitution with system.
+      - ``2``
+      - | Unit: kilobyte (KB)
+
+    * - | ``GPU_NUM_MEM_DEPENDENCY``
+        | Number of memory objects for dependency tracking.
+      - ``256``
+      -
+
+    * - | ``GPU_STREAMOPS_CP_WAIT``
+        | Force the stream memory operation to wait on CP.
+      - ``0``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``HSA_LOCAL_MEMORY_ENABLE``
+        | Enable HSA device local memory usage.
+      - ``1``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``PAL_ALWAYS_RESIDENT``
+        | Force memory resources to become resident at allocation time.
+      - ``0``
+      - | 0: Disable
+        | 1: Enable
+
+    * - | ``PAL_PREPINNED_MEMORY_SIZE``
+        | Size of prepinned memory.
+      - ``64``
+      - | Unit: kilobyte (KB)
+
+    * - | ``REMOTE_ALLOC``
+        | Use remote memory for the global heap allocation.
+      - ``0``
+      - | 0: Disable
+        | 1: Enable
@@ -0,0 +1,34 @@
+The following table lists environment variables that are useful but relate to
+different features in HIP.
+
+.. _hip-env-other:
+.. list-table::
+    :header-rows: 1
+    :widths: 35,14,51
+
+    * - **Environment variable**
+      - **Default value**
+      - **Value**
+
+    * - | ``HIPRTC_COMPILE_OPTIONS_APPEND``
+        | Sets compile options needed for ``hiprtc`` compilation.
+      - Unset by default.
+      - ``--gpu-architecture=gfx906:sramecc+:xnack``, ``-fgpu-rdc``
+
+    * - | ``AMD_COMGR_SAVE_TEMPS``
+        | Controls the deletion of temporary files generated during the compilation of COMGR. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory.
+      - Unset by default.
+      - | 0: Temporary files are deleted automatically.
+        | Non zero integer: Turn off the temporary files deletion.
+
+    * - | ``AMD_COMGR_EMIT_VERBOSE_LOGS``
+        | Sets logging of COMGR to include additional Comgr-specific informational messages.
+      - Unset by default.
+      - | 0: Verbose log disabled.
+        | Non zero integer: Verbose log enabled.
+
+    * - | ``AMD_COMGR_REDIRECT_LOGS``
+        | Controls redirect logs of COMGR.
+      - Unset by default.
+      - | `stdout` / `-`: Redirected to the standard output.
+        | `stderr`: Redirected to the error stream.
@@ -0,0 +1,23 @@
+The profiling environment variables in HIP are collected in the following table. For
+more information, check :doc:`setting the number of CUs page <rocm:how-to/setting-cus>`.
+
+.. _hip-env-prof:
+.. list-table::
+    :header-rows: 1
+    :widths: 70,30
+
+    * - **Environment variable**
+      - **Value**
+
+    * - | ``HSA_CU_MASK``
+        | Sets the mask on a lower level of queue creation in the driver, this mask will also be set for queues being profiled.
+      - Example: ``1:0-8``
+
+    * - | ``ROC_GLOBAL_CU_MASK``
+        | Sets the mask on queues created by the HIP or the OpenCL runtimes, this mask will also be set for queues being profiled.
+      - Example: ``0xf``, enables only 4 CUs
+
+    * - | ``HIP_FORCE_QUEUE_PROFILING``
+        | Used to run the app as if it were run in rocprof. Forces command queue profiling on by default.
+      - | 0: Disable
+        | 1: Enable
@@ -13,6 +13,8 @@ returned by HIP API functions to indicate various runtime conditions and errors.

 For more details, see :ref:`Error handling functions <error_handling_reference>`.

+.. _basic_runtime_errors:
+
 Basic Runtime Errors
 ====================

@@ -100,6 +102,8 @@ Basic Runtime Errors
    If this error is encountered, it generally means the API or feature is not fully supported in the
    current version.

+.. _memory_management_errors:
+
 Memory Management Errors
 ========================

@@ -139,6 +143,14 @@ Memory Management Errors
      - ``1052``
      - Runtime memory call returned error

+    * - :term:`hipErrorInvalidChannelDescriptor`
+      - ``911``
+      - Input for texture object, resource descriptor, or texture descriptor is a NULL pointer or invalid
+
+    * - :term:`hipErrorInvalidTexture`
+      - ``912``
+      - Texture reference pointer is NULL or invalid
+
 .. glossary::

  hipErrorOutOfMemory
@@ -233,6 +245,21 @@ Memory Management Errors
    This error differs from ``hipErrorOutOfMemory`` in that it relates to memory operations internal to the HIP
    runtime rather than explicit application requests for memory allocation.

+  hipErrorInvalidChannelDescriptor
+
+    This error indicates that an invalid channel descriptor is used to define the format and layout of data
+    in memory, particularly when working with textures or arrays. This could happen if the descriptor is
+    incorrectly set up or if it does not match the expected format for the operation being performed.
+
+  hipErrorInvalidTexture
+
+    The error code is returned when an invalid texture object is used in a function call. This typically
+    occurs when a texture object is not properly initialized or configured before being used in operations
+    that require valid texture data. If you encounter this error, it suggests that the texture object
+    might be missing necessary configuration details or has been corrupted.
+
+.. _device_context_errors:
+
 Device and Context Errors
 =========================

@@ -385,6 +412,8 @@ Device and Context Errors
    * Custom build environments with mismatched components
    * Partial upgrades of the ROCm stack

+.. _kernel_launch_errors:
+
 Kernel and Launch Errors
 ========================

@@ -396,10 +425,18 @@ Kernel and Launch Errors
     - Value
     - Description

+   * - :term:`hipErrorInvalidValue``
+     - ``1``
+     - Invalid input value
+
   * - :term:`hipErrorInvalidDeviceFunction`
     - ``98``
     - Invalid device function

+   * - :term:`hipErrorContextIsDestroyed`
+     - ``709``
+     - Invalid stream handle
+
   * - :term:`hipErrorInvalidConfiguration`
     - ``9``
     - Invalid configuration argument
@@ -446,6 +483,11 @@ Kernel and Launch Errors

 .. glossary::

+  hipErrorInvalidValue
+
+    Error returned when a grid dimension check finds any input global work size
+    dimension is zero, or a shared memory size check finds the size exceeds the size limit. 
+
  hipErrorInvalidDeviceFunction

    Invalid device function. This error occurs when attempting to use a function that is not a valid device
@@ -453,6 +495,10 @@ Kernel and Launch Errors

    * Code compiled for a specific GPU architecture (using ``--offload-arch``) but executed on an different/incompatible GPU

+  hipErrorContextIsDestroyed
+
+    This error is returned when the input stream or input stream handle is invalid. 
+
  hipErrorInvalidConfiguration

    Invalid configuration argument. This error occurs when the configuration specified for a kernel launch
@@ -507,7 +553,7 @@ Kernel and Launch Errors
  hipErrorInvalidKernelFile

    Invalid kernel file. This error occurs when the kernel file or module being loaded is corrupted or in
-    an invalid format.
+    an invalid format, for example the file name exists but the file size is 0.

  hipErrorInvalidImage

@@ -556,6 +602,7 @@ Kernel and Launch Errors

    * Launching a cooperative kernel with grid dimensions that exceed hardware limits
    * Requesting more resources than available for synchronization across thread blocks
+    * The shared memory size in bytes exceeds the device local memory size per CU
    * Using cooperative groups on hardware with limited support
    * Not accounting for cooperative launch limitations in kernel configuration

@@ -577,6 +624,8 @@ Kernel and Launch Errors
    normal operation. Additional debugging of the previous failed launch may be required to identify
    the root cause.

+.. _stream_capture_errors:
+
 Stream Capture Errors
 =====================

@@ -624,6 +673,10 @@ Stream Capture Errors
     - ``907``
     - Operation not permitted on an event last recorded in a capturing stream

+   * - :term:`hipErrorInvalidResourceHandle`
+     - ``400``
+     - Input launch stream is ``NULL`` or is ``hipStreamLegacy``
+
 .. glossary::

  hipErrorStreamCaptureUnsupported
@@ -754,6 +807,14 @@ Stream Capture Errors
    and cannot be used for host-side synchronization until the capture is complete and the graph
    is executed.

+  hipErrorInvalidResourceHandle
+
+    This error is returned when the input launch stream is a NULL pointer, is invalid, or is ``hipStreamLegacy``.
+    If you encounter this error, you should check the validity of the resource handle being used in your HIP
+    API calls. Ensure that the handle was correctly obtained and has not been freed or invalidated before use.
+
+.. _profiler_errors:
+
 Profiler Errors
 ===============

@@ -845,6 +906,8 @@ Profiler Errors
    The HIP profiler must be in an active state before it can be stopped. This error is informational
    and indicates that the profiler is already in the desired inactive state.

+.. _resource_mapping_errors:
+
 Resource Mapping Errors
 =======================

@@ -992,6 +1055,8 @@ Resource Mapping Errors
    operation was attempted on a resource that was not mapped as a pointer. Resources must be mapped
    with the appropriate mapping type for the operations that will be performed on them.

+.. _peer_access_errors:
+
 Peer Access Errors
 ==================

@@ -1058,6 +1123,8 @@ Peer Access Errors
    access between peer devices. Not all device combinations support peer access. Compatibility can be
    determined with :cpp:func:`hipDeviceCanAccessPeer()`.

+.. _system_file_errors:
+
 System and File Errors
 ======================

@@ -1183,6 +1250,8 @@ System and File Errors
    This is a catch-all error that may require looking at system logs or using additional
    debugging tools to identify the root cause.

+.. _graphics_content_errors:
+
 Graphics Context Errors
 =======================

@@ -1216,6 +1285,8 @@ Graphics Context Errors
    instantiated graph update. This error occurs when attempting to update an already instantiated
    graph with changes that are not allowed.

+.. _hardware_errors:
+
 Hardware Errors
 ===============

@@ -240,10 +240,12 @@ page.
        - 106
        - 104

-.. [1] RDNA architectures have a configurable wavefront size. The native
-   wavefront size is 32, but they can run in "CU mode", which has an effective
-   wavefront size of 64. This affects the number of resident wavefronts and
-   blocks per compute Unit.
+.. [1] The RDNA architectures feature an experimental compiler option called 
+   ``mwavefrontsize64``, which determines the wavefront size for kernel code
+   generation. When this option is disabled, the native wavefront size of 32 is
+   used, when enabled wavefront size 64 is used. This option is not supported by
+   the HIP runtime.
+
 .. [2] RDNA architectures expand the concept of the traditional compute unit
   with the so-called work group processor, which effectively includes two
   compute units, within which all threads can cooperate.
@@ -8,7 +8,50 @@
 HIP runtime API
 ********************************************************************************

-The HIP Runtime API reference:
+The HIP Runtime API reference includes descriptions of HIP functions, as well as global datatypes, enums, and structs. 

-* :ref:`modules_reference`
-* :ref:`global_defines_enums_structs_files_reference`
+Modules
+=======
+
+The API is organized into modules based on functionality.
+
+* :ref:`initialization_version_reference`
+* :ref:`device_management_reference`
+* :ref:`execution_control_reference`
+* :ref:`error_handling_reference`
+* :ref:`stream_management_reference`
+* :ref:`stream_memory_operations_reference`
+* :ref:`event_management_reference`
+* :ref:`memory_management_reference`
+
+  * :ref:`memory_management_deprecated_reference`
+  * :ref:`external_resource_interoperability_reference`
+  * :ref:`stream_ordered_memory_allocator_reference`
+  * :ref:`unified_memory_reference`
+  * :ref:`virtual_memory_reference`
+  * :ref:`texture_management_reference`
+  * :ref:`texture_management_deprecated_reference`
+  * :ref:`surface_object_reference`
+
+* :ref:`peer_to_peer_device_memory_access_reference`
+* :ref:`context_management_reference`
+* :ref:`module_management_reference`
+* :ref:`occupancy_reference`
+* :ref:`profiler_control_reference`
+* :ref:`launch_api_reference`
+* :ref:`runtime_compilation_reference`
+* :ref:`callback_activity_apis_reference`
+* :ref:`graph_management_reference`
+* :ref:`opengl_interoperability_reference`
+* :ref:`graphics_interoperability_reference`
+* :ref:`cooperative_groups_reference`
+
+Global defines, enums, structs and files
+========================================
+
+The structs, define macros, enums and files in the HIP runtime API.
+
+* :ref:`global_enum_defines_reference`
+* :ref:`driver_types_reference`
+* :doc:`../../doxygen/html/annotated`
+* :doc:`../../doxygen/html/files`
@@ -7,6 +7,7 @@ root: index
 subtrees:
 - entries:
  - file: what_is_hip
+  - file: hip-7-changes
  - file: faq

 - caption: Install
@@ -0,0 +1,207 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+#include <type_traits>
+#include <iostream>
+#include <vector>
+#include <random>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess){                \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+// [Sphinx template warp size block reduction kernel start]
+template<uint32_t WarpSize>
+using lane_mask_t = typename std::conditional<WarpSize == 32, uint32_t, uint64_t>::type;
+
+template<uint32_t WarpSize>
+__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size) {
+  extern __shared__ int shared[];
+
+  // Read of input with bounds check
+  auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
+  {
+    lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
+    return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
+  };
+
+  const uint32_t tid = threadIdx.x,
+                 lid = threadIdx.x % WarpSize,
+                 wid = threadIdx.x / WarpSize,
+                 bid = blockIdx.x,
+                 gid = bid * blockDim.x + tid;
+
+  // Read input buffer to shared
+  shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
+  __syncthreads();
+
+  // Shared reduction
+  for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
+  {
+    if (tid < i)
+      shared[tid] = shared[tid] + shared[tid + i];
+    __syncthreads();
+  }
+
+  // Use local variable in warp reduction  
+  int result =  shared[tid];
+  __syncthreads();
+
+  // This loop would be unrolled the same with the runtime warpSize.
+  #pragma unroll
+  for (uint32_t i = WarpSize/2; i >= 1; i /= 2) {
+    result = result + __shfl_down(result, i);
+  }
+
+  // Write result to output buffer
+  if (tid == 0)
+    output[bid] = result;
+};
+// [Sphinx template warp size block reduction kernel end]
+
+// [Sphinx template warp size mask generation start]
+template<uint32_t WarpSize>
+void generate_and_copy_mask(
+  void *d_mask, 
+  std::vector<int>& vectorExpected, 
+  int numOfBlocks,
+  int numberOfWarp,
+  int mask_size,
+  int mask_element_size) {
+  
+  std::random_device rd;
+  std::mt19937_64 eng(rd());
+
+  // Host side mask vector
+  std::vector<lane_mask_t<WarpSize>> mask(mask_size);
+  // Define uniform unsigned int distribution
+  std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
+  // Fill up the mask 
+  for(int i=0; i < numOfBlocks; i++) {
+    int count = 0;
+    for(int j=0; j < numberOfWarp; j++) {
+      int mask_index = i * numberOfWarp + j;
+      mask[mask_index] = distr(eng);
+      if constexpr(WarpSize == 32)
+        count += __builtin_popcount(mask[mask_index]);
+      else
+        count += __builtin_popcountll(mask[mask_index]);
+    }
+    vectorExpected[i]= count;
+  }
+
+  // Copy the mask array
+  HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
+}
+// [Sphinx template warp size mask generation end]
+
+int main() {
+
+  int deviceId = 0;
+  int warpSizeHost;
+  HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
+  std::cout << "Warp size: " << warpSizeHost << std::endl;
+
+  constexpr int numOfBlocks = 16;
+  constexpr int threadsPerBlock = 1024;
+  const int numberOfWarp = threadsPerBlock / warpSizeHost;
+  const int mask_element_size = warpSizeHost == 32 ? sizeof(uint32_t) : sizeof(uint64_t);
+  const int mask_size = numOfBlocks * numberOfWarp;
+  constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
+
+  int *d_data, *d_results;
+  void *d_mask;
+  int initValue = 1;
+  std::vector<int> vectorInput(arraySize, initValue);
+  std::vector<int> vectorOutput(numOfBlocks);
+  std::vector<int> vectorExpected(numOfBlocks);
+  // Allocate device memory
+  HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
+  HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
+  HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
+  // Host to Device copy of the input array
+  HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
+  
+  // [Sphinx template warp size select kernel start]
+  // Fill up the mask variable, copy to device and select the right kernel.
+  if(warpSizeHost == 32) {
+    // Generate and copy mask arrays
+    generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
+
+    // Start the kernel
+    block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
+      d_data,
+      static_cast<uint32_t*>(d_mask),
+      d_results,
+      arraySize);
+  } else if(warpSizeHost == 64) {
+    // Generate and copy mask arrays
+    generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
+
+    // Start the kernel
+    block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
+      d_data,
+      static_cast<uint64_t*>(d_mask),
+      d_results,
+      arraySize);
+  } else {
+    std::cerr << "Unsupported warp size." << std::endl;
+    return 0;
+  }
+  // [Sphinx template warp size select kernel end]
+
+  // Check the kernel launch
+  HIP_CHECK(hipGetLastError());
+  // Check for kernel execution error
+  HIP_CHECK(hipDeviceSynchronize());
+  // Device to Host copy of the result
+  HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
+
+  // Verify results
+  bool passed = true;
+  for(size_t i = 0; i < numOfBlocks; ++i) {
+    if(vectorOutput[i] != vectorExpected[i]) {
+      passed = false;
+      std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+    }
+  }
+  if(passed){
+    std::cout << "Execution completed successfully." << std::endl;
+  }else{
+    std::cerr << "Execution failed." << std::endl;
+  }
+
+  // Cleanup
+  HIP_CHECK(hipFree(d_data));
+  HIP_CHECK(hipFree(d_mask));
+  HIP_CHECK(hipFree(d_results));
+  return 0;
+}
@@ -0,0 +1,184 @@
+// MIT License
+//
+// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include <hip/hip_runtime.h>
+#include <type_traits>
+#include <iostream>
+#include <vector>
+#include <random>
+
+#define HIP_CHECK(expression)                \
+{                                            \
+    const hipError_t status = expression;    \
+    if(status != hipSuccess){                \
+            std::cerr << "HIP error "        \
+                << status << ": "            \
+                << hipGetErrorString(status) \
+                << " at " << __FILE__ << ":" \
+                << __LINE__ << std::endl;    \
+    }                                        \
+}
+
+// [Sphinx HIP warp size block reduction kernel start]
+__global__ void block_reduce(int* input, uint64_t* mask, int* output, size_t size){
+  extern __shared__ int shared[];
+  // Read of input with bounds check
+  auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
+  {
+    uint64_t warp_mask = 1ull << lane_id;
+    return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
+  };
+  const uint32_t tid = threadIdx.x,
+                 lid = threadIdx.x % warpSize,
+                 wid = threadIdx.x / warpSize,
+                 bid = blockIdx.x,
+                 gid = bid * blockDim.x + tid;
+  // Read input buffer to shared
+  shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
+  __syncthreads();
+  // Shared reduction
+  for (uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
+  {
+    if (tid < i)
+      shared[tid] = shared[tid] + shared[tid + i];
+    __syncthreads();
+  }
+
+  // Use local variable in warp reduction  
+  int result =  shared[tid];
+  __syncthreads();
+
+  // This loop would be unrolled the same with the compile-time WarpSize.
+  #pragma unroll
+  for (uint32_t i = warpSize/2; i >= 1; i /= 2) {
+    result = result + __shfl_down(result, i);
+  }
+
+  // Write result to output buffer
+  if (tid == 0)
+    output[bid] = result;
+};
+// [Sphinx HIP warp size block reduction kernel end]
+
+// [Sphinx HIP warp size mask generation start]
+void generate_and_copy_mask(
+  uint64_t *d_mask, 
+  std::vector<int>& vectorExpected,
+  int warpSizeHost,
+  int numOfBlocks,
+  int numberOfWarp,
+  int mask_size,
+  int mask_element_size) {
+  
+  std::random_device rd;
+  std::mt19937_64 eng(rd());
+
+  // Host side mask vector
+  std::vector<uint64_t> mask(mask_size);
+  // Define uniform unsigned int distribution
+  std::uniform_int_distribution<uint64_t> distr;
+  // Fill up the mask 
+  for(int i=0; i < numOfBlocks; i++) {
+    int count = 0;
+    for(int j=0; j < numberOfWarp; j++) {
+      int mask_index = i * numberOfWarp + j;
+      mask[mask_index] = distr(eng);
+      if(warpSizeHost == 32)
+        count += __builtin_popcount(mask[mask_index]);
+      else
+        count += __builtin_popcountll(mask[mask_index]);
+    }
+    vectorExpected[i]= count;
+  }
+  // Copy the mask array
+  HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
+}
+// [Sphinx HIP warp size mask generation end]
+
+int main() {
+  int deviceId = 0;
+  int warpSizeHost;
+  HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
+  std::cout << "Warp size: " << warpSizeHost << std::endl;
+  constexpr int numOfBlocks = 16;
+  constexpr int threadsPerBlock = 1024;
+  const int numberOfWarp = threadsPerBlock / warpSizeHost;
+  const int mask_element_size = sizeof(uint64_t);
+  const int mask_size = numOfBlocks * numberOfWarp;
+  constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
+  int *d_data, *d_results;
+  uint64_t *d_mask;
+  int initValue = 1;
+  std::vector<int> vectorInput(arraySize, initValue);
+  std::vector<int> vectorOutput(numOfBlocks);
+  std::vector<int> vectorExpected(numOfBlocks);
+  // Allocate device memory
+  HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
+  HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
+  HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
+  // Host to Device copy of the input array
+  HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
+  
+  // [Sphinx HIP warp size select kernel start]
+  // Generate and copy mask arrays
+  generate_and_copy_mask(
+    d_mask,
+    vectorExpected,
+    warpSizeHost,
+    numOfBlocks,
+    numberOfWarp,
+    mask_size,
+    mask_element_size);
+
+  // Start the kernel
+  block_reduce<<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
+    d_data,
+    d_mask,
+    d_results,
+    arraySize);
+  // [Sphinx HIP warp size select kernel end]
+
+  // Check the kernel launch
+  HIP_CHECK(hipGetLastError());
+  // Check for kernel execution error
+  HIP_CHECK(hipDeviceSynchronize());
+  // Device to Host copy of the result
+  HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
+  // Verify results
+  bool passed = true;
+  for(size_t i = 0; i < numOfBlocks; ++i) {
+    if(vectorOutput[i] != vectorExpected[i]) {
+      passed = false;
+      std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
+    }
+  }
+  if(passed){
+    std::cout << "Execution completed successfully." << std::endl;
+  }else{
+    std::cerr << "Execution failed." << std::endl;
+  }
+  // Cleanup
+  HIP_CHECK(hipFree(d_data));
+  HIP_CHECK(hipFree(d_mask));
+  HIP_CHECK(hipFree(d_results));
+  return 0;
+}
@@ -348,89 +348,89 @@ find out what device binary flavors are embedded into the executable?
    artifacts on disk. Add the ROCmCC installation folder to your PATH if you
    want to use these utilities (the utilities expect them to be on the PATH).

-    You can list embedded program binaries using ``roc-obj-ls``.
+    You can list embedded program binaries using ``llvm-objdump`` with
+    ``--offloading`` option.

    .. code-block:: bash

-      roc-obj-ls ./saxpy
+      llvm-objdump --offloading ./saxpy

    It should return something like:

    .. code-block:: shell

-      1       host-x86_64-unknown-linux         file://./saxpy#offset=12288&size=0
-      1       hipv4-amdgcn-amd-amdhsa--gfx803   file://./saxpy#offset=12288&size=9760
+      ./saxpy:        file format elf64-x86-64
+      Extracting offload bundle: ./saxpy.0.host-x86_64-unknown-linux-gnu-
+      Extracting offload bundle: ./saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942

    The compiler embeds a version 4 code object (more on `code
    object versions <https://www.llvm.org/docs/AMDGPUUsage.html#code-object-metadata>`_)
-    and used the LLVM target triple `amdgcn-amd-amdhsa--gfx803` (more on `target triples
+    and used the LLVM target triple ``amdgcn-amd-amdhsa--gfx942`` (more on `target triples
    <https://www.llvm.org/docs/AMDGPUUsage.html#target-triples>`_). You can
    extract that program object in a disassembled fashion for human consumption
-    via ``roc-obj``.
+    via ``llvm-objdump``.

    .. code-block:: bash

-      roc-obj -t gfx803 -d ./saxpy
+      llvm-objdump --disassemble saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942 > saxpy.s

-    This creates two files on disk and ``.s`` extension is of most interest.
-    Opening this file or dumping it to the console using ``cat``
-    lets find the disassembled binary of the SAXPY compute kernel, something
-    similar to:
+    This creates a file on the disk called ``saxpy.s`` Opening this file or
+    dumping it to the console using ``cat`` lets find the disassembled binary of
+    the SAXPY compute kernel, something similar to:

    .. code-block::

+      saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942:        file format elf64-amdgpu
+
      Disassembly of section .text:

-      <_Z12saxpy_kernelfPKfPfj>:
-          s_load_dword s0, s[4:5], 0x2c        // 000000001000: C0020002 0000002C
-          s_load_dword s1, s[4:5], 0x18        // 000000001008: C0020042 00000018
-          s_waitcnt lgkmcnt(0)                 // 000000001010: BF8C007F
-          s_and_b32 s0, s0, 0xffff             // 000000001014: 8600FF00 0000FFFF
-          s_mul_i32 s6, s6, s0                 // 00000000101C: 92060006
-          v_add_u32_e32 v0, vcc, s6, v0        // 000000001020: 32000006
-          v_cmp_gt_u32_e32 vcc, s1, v0         // 000000001024: 7D980001
-          s_and_saveexec_b64 s[0:1], vcc       // 000000001028: BE80206A
-          s_cbranch_execz 22                   // 00000000102C: BF880016 <_Z12saxpy_kernelfPKfPfj+0x88>
-          s_load_dwordx4 s[0:3], s[4:5], 0x8   // 000000001030: C00A0002 00000008
-          v_mov_b32_e32 v1, 0                  // 000000001038: 7E020280
-          v_lshlrev_b64 v[0:1], 2, v[0:1]      // 00000000103C: D28F0000 00020082
-          s_waitcnt lgkmcnt(0)                 // 000000001044: BF8C007F
-          v_mov_b32_e32 v3, s1                 // 000000001048: 7E060201
-          v_add_u32_e32 v2, vcc, s0, v0        // 00000000104C: 32040000
-          v_addc_u32_e32 v3, vcc, v3, v1, vcc  // 000000001050: 38060303
-          flat_load_dword v2, v[2:3]           // 000000001054: DC500000 02000002
-          v_mov_b32_e32 v3, s3                 // 00000000105C: 7E060203
-          v_add_u32_e32 v0, vcc, s2, v0        // 000000001060: 32000002
-          v_addc_u32_e32 v1, vcc, v3, v1, vcc  // 000000001064: 38020303
-          flat_load_dword v3, v[0:1]           // 000000001068: DC500000 03000000
-          s_load_dword s0, s[4:5], 0x0         // 000000001070: C0020002 00000000
-          s_waitcnt vmcnt(0) lgkmcnt(0)        // 000000001078: BF8C0070
-          v_mac_f32_e32 v3, s0, v2             // 00000000107C: 2C060400
-          flat_store_dword v[0:1], v3          // 000000001080: DC700000 00000300
-          s_endpgm                             // 000000001088: BF810000
+      0000000000001900 <_Z12saxpy_kernelfPKfPfj>:
+        s_load_dword s3, s[0:1], 0x2c                              // 000000001900: C00200C0 0000002C
+        s_load_dword s4, s[0:1], 0x18                              // 000000001908: C0020100 00000018
+        s_waitcnt lgkmcnt(0)                                       // 000000001910: BF8CC07F
+        s_and_b32 s3, s3, 0xffff                                   // 000000001914: 8603FF03 0000FFFF
+        s_mul_i32 s2, s2, s3                                       // 00000000191C: 92020302
+        v_add_u32_e32 v0, s2, v0                                   // 000000001920: 68000002
+        v_cmp_gt_u32_e32 vcc, s4, v0                               // 000000001924: 7D980004
+        s_and_saveexec_b64 s[2:3], vcc                             // 000000001928: BE82206A
+        s_cbranch_execz 20                                         // 00000000192C: BF880014 <_Z12saxpy_kernelfPKfPfj+0x80>
+        s_load_dwordx4 s[4:7], s[0:1], 0x8                         // 000000001930: C00A0100 00000008
+        v_mov_b32_e32 v1, 0                                        // 000000001938: 7E020280
+        v_lshlrev_b64 v[0:1], 2, v[0:1]                            // 00000000193C: D28F0000 00020082
+        s_load_dword s0, s[0:1], 0x0                               // 000000001944: C0020000 00000000
+        s_waitcnt lgkmcnt(0)                                       // 00000000194C: BF8CC07F
+        v_lshl_add_u64 v[2:3], s[4:5], 0, v[0:1]                   // 000000001950: D2080002 04010004
+        v_lshl_add_u64 v[0:1], s[6:7], 0, v[0:1]                   // 000000001958: D2080000 04010006
+        global_load_dword v4, v[2:3], off                          // 000000001960: DC508000 047F0002
+        global_load_dword v5, v[0:1], off                          // 000000001968: DC508000 057F0000
+        s_waitcnt vmcnt(0)                                         // 000000001970: BF8C0F70
+        v_fmac_f32_e32 v5, s0, v4                                  // 000000001974: 760A0800
+        global_store_dword v[0:1], v5, off                         // 000000001978: DC708000 007F0500
+        s_endpgm                                                   // 000000001980: BF810000
+        s_nop 0                                                    // 000000001984: BF800000

    Alternatively, call the compiler with ``--save-temps`` to dump all device
    binary to disk in separate files.

    .. code-block:: bash

-      amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps
+      amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps --offload-arch=gfx942

    List all the temporaries created while compiling ``main.hip`` with:

    .. code-block:: bash

      ls main-hip-amdgcn-amd-amdhsa-*
-      main-hip-amdgcn-amd-amdhsa-gfx803.bc
-      main-hip-amdgcn-amd-amdhsa-gfx803.cui
-      main-hip-amdgcn-amd-amdhsa-gfx803.o
-      main-hip-amdgcn-amd-amdhsa-gfx803.out
-      main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt
-      main-hip-amdgcn-amd-amdhsa-gfx803.s
+      main-hip-amdgcn-amd-amdhsa-gfx942.bc
+      main-hip-amdgcn-amd-amdhsa-gfx942.o
+      main-hip-amdgcn-amd-amdhsa-gfx942.out.resolution.txt
+      main-hip-amdgcn-amd-amdhsa-gfx942.hipi 
+      main-hip-amdgcn-amd-amdhsa-gfx942.out
+      main-hip-amdgcn-amd-amdhsa-gfx942.s

    Files with the ``.s`` extension hold the disassembled contents of the binary.
    The filename notes the graphics IPs used by the compiler. The contents of
-    this file are similar to what ``roc-obj`` printed to the console.
+    this file are similar to the `*.s` file created with ``llvm-objdump`` earlier.

  .. tab-item:: Linux and NVIDIA
    :sync: linux-nvidia
@@ -491,7 +491,7 @@ find out what device binary flavors are embedded into the executable?

    We can see that the compiler embedded a version 4 code object (more on code
    `object versions <https://www.llvm.org/docs/AMDGPUUsage.html#code-object-metadata>`_) and
-    used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples
+    used the LLVM target triple ``amdgcn-amd-amdhsa--gfx906`` (more on `target triples
    <https://www.llvm.org/docs/AMDGPUUsage.html#target-triples>`_). Don't be
    alarmed about linux showing up as a binary format, AMDGPU binaries uploaded to
    the GPU for execution are proper linux ELF binaries in their format.
@@ -8,95 +8,12 @@
 HIP compilers
 ********************************************************************************

-ROCm provides the compiler driver ``hipcc``, that can be used on AMD ROCm and
-NVIDIA CUDA platforms.
+ROCm provides the compiler tools used to compile HIP applications for use on AMD GPUs. 
+The compilers set up the default libraries and include paths for the HIP and ROCm
+libraries and some needed environment variables. For more information, see the
+:doc:`ROCm compiler reference <llvm-project:reference/rocmcc>`.

-On ROCm, ``hipcc`` takes care of the following:
-
- Setting the default library and include paths for HIP
- Setting some environment variables
- Invoking the appropriate compiler - ``amdclang++``
-
-On NVIDIA CUDA platform, ``hipcc`` takes care of invoking compiler ``nvcc``.
-``amdclang++`` is based on the ``clang++`` compiler. For more
-details, see the :doc:`llvm project<llvm-project:index>`.
-
-HIPCC
-================================================================================
-
-Common Compiler Options
--------------------------------------------------------------------------------
-
-The following table shows the most common compiler options supported by
-``hipcc``.
-
-.. list-table::
-  :header-rows: 1
-
-  *
-   - Option
-   - Description
-  *
-   - ``--fgpu-rdc``
-   - Generate relocatable device code, which allows kernels or device functions
-     to call device functions in different translation units.
-  *
-   - ``-ggdb``
-   - Equivalent to `-g` plus tuning for GDB. This is recommended when using
-     ROCm's GDB to debug GPU code.
-  *
-   - ``--gpu-max-threads-per-block=<num>``
-   - Generate code to support up to the specified number of threads per block.
-  *
-   - ``-offload-arch=<target>``
-   - Generate code for the given GPU target.
-     For a full list of supported compilation targets see the `processor names in AMDGPU's llvm documentation <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
-     This option can appear multiple times to generate a fat binary for multiple
-     targets.
-     The actual support of the platform's runtime may differ.
-  *
-   - ``-save-temps``
-   - Save the compiler generated intermediate files.
-  *
-   - ``-v``
-   - Show the compilation steps.
-
-Linking
--------------------------------------------------------------------------------
-
-``hipcc`` adds the necessary libraries for HIP as well as for the accelerator
-compiler (``nvcc`` or ``amdclang++``). We recommend linking with ``hipcc`` since
-it automatically links the binary to the necessary HIP runtime libraries.
-
-Linking Code With Other Compilers
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``nvcc`` by default uses ``g++`` to generate the host code.
-
-``amdclang++`` generates both device and host code. The code uses the same API
-as ``gcc``, which allows code generated by different ``gcc``-compatible
-compilers to be linked together. For example, code compiled using ``amdclang++``
-can link with code compiled using compilers such as ``gcc``, ``icc`` and
-``clang``. Take care to ensure all compilers use the same standard C++ header
-and library formats.
-
-libc++ and libstdc++
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``hipcc`` links to ``libstdc++`` by default. This provides better compatibility
-between ``g++`` and HIP.
-
-In order to link to ``libc++``, pass ``--stdlib=libc++`` to ``hipcc``.
-Generally, libc++ provides a broader set of C++ features while ``libstdc++`` is
-the standard for more compilers, notably including ``g++``.
-
-When cross-linking C++ code, any C++ functions that use types from the C++
-standard library, such as ``std::string``, ``std::vector`` and other containers,
-must use the same standard-library implementation. This includes cross-linking
-between ``amdclang++`` and other compilers.
-
-
-HIP compilation workflow
+Compilation workflow
 ================================================================================

 HIP provides a flexible compilation workflow that supports both offline
@@ -115,25 +32,18 @@ performance overhead.
 Offline compilation
 --------------------------------------------------------------------------------

-The HIP code compilation is performed in two stages: host and  device code
-compilation stage.
+Offline compilation is performed in two steps: host and  device code
+compilation. 

- Device-code compilation stage: The compiled device code is embedded into the
+- Host-code compilation: On the host side, ``amdclang++`` or ``hipcc`` can
+  compile the host code in one step without other C++ compilers. 
+
+- Device-code compilation: The compiled device code is embedded into the
  host object file. Depending on the platform, the device code can be compiled
-  into assembly or binary. ``nvcc`` and ``amdclang++`` target different
-  architectures and use different code object formats. ``nvcc`` uses the binary
-  ``cubin`` or the assembly PTX files, while the ``amdclang++`` path is the
-  binary ``hsaco`` format. On CUDA platforms, the driver compiles the PTX files
-  to executable code during runtime.
-
- Host-code compilation stage: On the host side, ``hipcc`` or ``amdclang++`` can
-  compile the host code in one step without other C++ compilers. On the other
-  hand, ``nvcc`` only replaces the ``<<<...>>>`` kernel launch syntax with the
-  appropriate CUDA runtime function call and the modified host code is passed to
-  the default host compiler.
+  into assembly or binary. 

 For an example on how to compile HIP from the command line, see :ref:`SAXPY
-tutorial<compiling_on_the_command_line>` .
+tutorial <compiling_on_the_command_line>` .

 Runtime compilation
 --------------------------------------------------------------------------------
@@ -142,27 +52,26 @@ HIP allows you to compile kernels at runtime using the ``hiprtc*`` API. Kernels
 are stored as a text string, which is passed to HIPRTC alongside options to
 guide the compilation.

-For more details, see
-:doc:`HIP runtime compiler <../how-to/hip_rtc>`.
+For more information, see :doc:`HIP runtime compiler <../how-to/hip_rtc>`.

 Static libraries
 ================================================================================

-``hipcc`` supports generating two types of static libraries.
+Both ``amdclang++`` and ``hipcc`` support generating two types of static libraries.

 - The first type of static library only exports and launches host functions
  within the same library and not the device functions. This library type offers
-  the ability to link with a non-hipcc compiler such as ``gcc``. Additionally,
+  the ability to link with another compiler such as ``gcc``. Additionally,
  this library type contains host objects with device code embedded as fat
  binaries. This library type is generated using the flag ``--emit-static-lib``:

  .. code-block:: shell

-    hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
+    amdclang++ hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
    gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out

 - The second type of static library exports device functions to be linked by
-  other code objects by using ``hipcc`` as the linker. This library type
+  other code objects by using ``amdclang++`` or ``hipcc`` as the linker. This library type
  contains relocatable device objects and is generated using ``ar``:

  .. code-block:: shell
@@ -171,6 +80,6 @@ Static libraries
    ar rcsD libHipDevice.a hipDevice.o
    hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out

-A full example for this can be found in the ROCm-examples, see the examples for
+Examples of this can be found in `rocm-examples <https://github.com/ROCm/rocm-examples>`_ under
 `static host libraries <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library>`_
 or `static device libraries <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library>`_.
@@ -243,7 +243,7 @@ multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, et
 .. _inherent_thread_model:

 Hierarchical thread model
---------------------
+-------------------------

 As previously discussed, all threads of a kernel are uniquely identified by a set
 of integral values called thread IDs. The hierarchy consists of three levels: thread,
@@ -27,7 +27,7 @@ THE SOFTWARE.
 #include <tuple>
 #include <type_traits>
 #endif
-/** @addtogroup Execution Execution Management
+/** @addtogroup Execution Execution Control
 *  @{
 */

@@ -1579,7 +1579,7 @@ typedef enum hipLaunchAttributeID {
 typedef union hipLaunchAttributeValue {
  char pad[64];  ///< 64 byte padding
  hipAccessPolicyWindow
-      accessPolicyWindow;  ///< Value of launch attribute ::hipLaunchAttributePolicyWindow.
+      accessPolicyWindow;  ///< Value of launch attribute ::hipLaunchAttributeAccessPolicyWindow.
  int cooperative;         ///< Value of launch attribute ::hipLaunchAttributeCooperative. Indicates
                           ///< whether the kernel is cooperative.
  int priority;  ///< Value of launch attribute :: hipLaunchAttributePriority. Execution priority of
@@ -6561,7 +6561,7 @@ hipError_t hipLinkComplete(hipLinkState_t state, void** hipBinOut, size_t* sizeO
 /**
 * @brief Creates a linker instance with options.
 * @param [in] numOptions  Number of options
- * @param [in] option  Array of options
+ * @param [in] options  Array of options
 * @param [in] optionValues  Array of option values cast to void*
 * @param [out] stateOut  hip link state created upon success
 *