SWDEV-541514 - Docs update 2025-09-15 (#993)

Co-authored-by: Julia Jiang <56359287+jujiang-del@users.noreply.github.com>
Bu işleme şunda yer alıyor:
Istvan Kiss
2025-09-24 18:57:00 +02:00
işlemeyi yapan: GitHub
ebeveyn 0575606e49
işleme 83fb0c8c47
39 değiştirilmiş dosya ile 2973 ekleme ve 807 silme
+5 -5
Dosyayı Görüntüle
@@ -54,12 +54,12 @@ suppress_warnings = ["etoc.toctree"]
numfig = False
exclude_patterns = [
"doxygen/mainpage.md",
"understand/glossary.md",
'how-to/debugging_env.rst',
"data/env_variables_hip.rst"
"./doxygen/mainpage.md",
"./understand/glossary.md",
'./how-to/debugging_env.rst',
"./reference/env_variables"
]
git_url = subprocess.check_output(['git', 'config', '--get', 'remote.origin.url']).strip().decode('ascii')
if git_url.find("git:") != -1:
html_theme_options = {"repository_url": "https://github.com/ROCm/hip"}
html_theme_options = {"repository_url": "https://github.com/ROCm/hip"}
Dosya farkı çok büyük olduğundan ihmal edildi Fark Yükle
Dosya farkları bir veya daha fazla satır çok uzun olduğundan bastırıldı

Önce

Genişlik:  |  Yükseklik:  |  Boyut: 83 KiB

Sonra

Genişlik:  |  Yükseklik:  |  Boyut: 146 KiB

+3 -3
Dosyayı Görüntüle
@@ -40,10 +40,10 @@ for the full list.
What NVIDIA CUDA features does HIP support?
-------------------------------------------
The :doc:`NVIDIA CUDA runtime API supported by HIP<hipify:tables/CUDA_Runtime_API_functions_supported_by_HIP>`
and :doc:`NVIDIA CUDA driver API supported by HIP<hipify:tables/CUDA_Driver_API_functions_supported_by_HIP>`
The :doc:`NVIDIA CUDA runtime API supported by HIP<hipify:reference/tables/CUDA_Runtime_API_functions_supported_by_HIP>`
and :doc:`NVIDIA CUDA driver API supported by HIP<hipify:reference/tables/CUDA_Driver_API_functions_supported_by_HIP>`
pages describe which NVIDIA CUDA APIs are supported and what the equivalents are.
The :doc:`HIP API documentation <doxygen/html/index>` describes each API and
The :ref:`HIP runtime API reference<runtime_api_reference>` describes each API and
its limitations, if any, compared with the equivalent CUDA API.
The kernel language features are documented in the
+344
Dosyayı Görüntüle
@@ -0,0 +1,344 @@
.. meta::
:description: This topic discusses the changes introduced in HIP 7.0
:keywords: AMD, ROCm, HIP, HIP changes, CUDA, C++ language extensions
.. _compatibility-changes:
*******************************************************************************
HIP API 7.0 changes
*******************************************************************************
To improve code portability between AMD and NVIDIA GPU programming models, changes were made to the HIP API in ROCm 7.0 to simplify cross-platform programming. These changes align HIP C++ even more closely with NVIDIA CUDA. These changes are incompatible with prior releases, and might require recompiling existing HIP applications for use with ROCm 7.0, or editing and recompiling code in some cases. In the best case, the change requires no modification of existing applications. These changes were made available in a preview release based on ROCm 6.4.1 to help you prepare.
Behavior changes in HIP Runtime API
===================================
Update ``hipGetLastError``
--------------------------
Prior to the 7.0 release of the HIP API, :cpp:func:`hipGetLastError` was not fully compliant with CUDA's behavior. The purpose of this change is to have ``hipGetLastError`` return the last actual error caught in the current thread during the application execution. Neither ``hipSuccess`` nor ``hipErrorNotReady`` is considered an error. Take the following code as an example:
.. code:: cpp
1: hipError_t err = hipMalloc(...); // returns hipOutOfMemory
2: err = hipSetDevice(0); // returns hipSuccess
3: err = hipGetLastError();
The prior behavior was for ``hipGetLastError`` at line 3 to return ``hipSuccess`` from line 2. In the 7.0 release, the value of ``err`` at line 3 is ``hipOutOfMemory`` which is the error returned in Line 1, rather than simply the result returned in line 2. This matches CUDA behavior.
You can still use the prior functionality by using the ``hipExtGetLastError`` function. Notice that the function begins with ``hipExt`` which denotes a function call that is unique to HIP, without correlation to CUDA. This function was introduced with the 6.0 release.
Cooperative groups changes
--------------------------
For :cpp:func:`hipLaunchCooperativeKernelMultiDevice` function, HIP now includes additional input parameter validation checks.
* If the input launch stream is a NULLPTR or it is ``hipStreamLegacy``, the function now returns ``hipErrorInvalidResourceHandle``.
* If the stream capturing is active, the function returns the error code ``hipErrorStreamCaptureUnsupported``.
* If the stream capture status is invalidated, the function returns the error ``hipErrorStreamCaptureInvalidated``.
The :cpp:func:`hipLaunchCooperativeKernel` function now checks the input stream handle. If it's invalid, the returned error is changed to ``hipErrorInvalidHandle`` from ``hipErrorContextIsDestroyed``.
Update ``hipPointerGetAttributes``
----------------------------------
:cpp:func:`hipPointerGetAttributes` now matches the functionality of ``cudaPointerGetAttributes`` which changed in CUDA 11. If a NULL host or attribute pointer is passed as input parameter, ``hipPointerGetAttributes`` now returns ``hipSuccess`` instead of the error code ``hipErrorInvalidValue``.
Any application which is expecting the API to return an error instead of success could be impacted and a code change may need to handle the error properly.
Update ``hipFree``
------------------
:cpp:func:`hipFree` previously had an implicit wait for synchronization purpose which is applicable for all memory allocations. This wait has been disabled in the HIP 7.0 runtime for allocations made with ``hipMallocAsync`` and ``hipMallocFromPoolAsync`` to match the behavior of CUDA API ``cudaFree``
Update ``hipFreeAsync``
-----------------------
The API returns ``hipSuccess`` when the input pointer is NULL, instead of ``hipErrorInvalidValue``, to be consistent with :cpp:func:`hipFree`.
Exceptions effect during kernel execution changes
-------------------------------------------------
Exceptions that occur during kernel execution will no longer abort the process, but will instead return an error, unless core dumping is enabled.
HIP runtime compiler (hipRTC) changes
=====================================
Runtime compilation for HIP is available through the ``hipRTC`` library as described in :ref:`hip_runtime_compiler_how-to`. The library grew organically within the main HIP runtime code. However, segregation of the ``hipRTC`` code is now needed to ensure better compatibility and easier code portability.
Removal of ``hipRTC`` symbols from HIP Runtime Library
------------------------------------------------------
``hipRTC`` has been an independent library since the 6.0 release, but the ``hipRTC`` symbols were still available in the HIP runtime library. Starting with the 7.0 release ``hipRTC`` is no longer included in the HIP runtime, and any application using ``hipRTC`` APIs should link explicitly with the ``hipRTC`` library.
This change makes the usage of ``hipRTC`` library on Linux the same as on Windows and matches the behavior of CUDA ``nvRTC``.
``hipRTC`` compilation
----------------------
The device code compilation via ``hipRTC`` now uses namespace ``__hip_internal``, instead of the standard headers ``std``, to avoid namespace collision. These changes are made in the HIP header files.
No code change is required in any application, but rebuilding is necessary.
Removal of datatypes from ``hipRTC``
------------------------------------
In ``hipRTC``, datatype definitions such as ``int64_t``, ``uint64_t``, ``int32_t``, and ``uint32_t`` could result in conflicts in some applications, as they use their own definitions for these types. ``nvRTC`` doesn't define these datatypes either.
These datatypes are removed and replaced by HIP internal datatypes prefixed with ``__hip``, for example, ``__hip_int64_t``.
Any application relying on HIP internal datatypes during ``hipRTC`` compilation might be affected.
These changes have no impact on any application if it compiles as expected using ``nvRTC``.
HIP header clean up
===================
HIP header files previously included unnecessary Standard Template Libraries (STL) headers.
With the 7.0 release, unnecessary STL headers are no longer included, and only the required STL headers
are included.
Applications relying on HIP runtime header files might need to be updated to include STL header
files that have been removed in 7.0.
API signature and struct changes
================================
API signature changes
---------------------
Signatures in some APIs have been modified to match corresponding CUDA APIs, as described below.
The RTC method definition is changed in the following ``hipRTC`` APIs:
* :cpp:func:`hiprtcCreateProgram`
* :cpp:func:`hiprtcCompileProgram`
In these APIs, the input parameter type changes from ``const char**`` to ``const char* const*``.
In addition, the following APIs have signature changes:
* :cpp:func:`hipMemcpyHtoD`, the type of the second argument pointer changes from ``const void*`` to ``void*``.
* :cpp:func:`hipCtxGetApiVersion`, the type of second argument is changed from ``int*`` to ``unsigned int*``.
These signature changes do not require code modifications but do require rebuilding the application.
Deprecated struct ``HIP_MEMSET_NODE_PARAMS``
--------------------------------------------
The deprecated structure ``HIP_MEMSET_NODE_PARAMS`` is removed.
You can use the definition :cpp:struct:`hipMemsetParams` instead, as input parameter, while using these two APIs:
* :cpp:func:`hipDrvGraphAddMemsetNode`
* :cpp:func:`hipDrvGraphExecMemsetNodeSetParams`
``hipMemsetParams`` struct change
---------------------------------
The struct :cpp:struct:`hipMemsetParams` is updated to be compatible with CUDA.
The change is from the old struct definition shown below:
.. code:: cpp
typedef struct hipMemsetParams {
void* dst;
unsigned int elementSize;
size_t height;
size_t pitch;
unsigned int value;
size_t width;
} hipMemsetParams;
To the new struct definition as follows:
.. code:: cpp
typedef struct hipMemsetParams {
void* dst;
size_t pitch;
unsigned int value;
unsigned int elementSize;
size_t width;
size_t height;
} hipMemsetParams;
No code change is required in any application using this structure, but rebuilding is necessary.
HIP vector constructor change
-----------------------------
Changes have been made to HIP vector constructors for ``hipComplex`` initialization to generate values in alignment with CUDA. The affected constructors are small vector types such as ``float2`` and ``int4`` for example. If your code previously relied on a single value to initialize all components within a vector or complex type, you might need to update your code. Otherwise, rebuilding the application is necessary but no code change is required in any application using these constructors.
Stream capture updates
======================
Restrict stream capture modes
-----------------------------
Stream capture mode has been restricted in the following APIs to relaxed (``hipStreamCaptureModeRelaxed``) mode:
* :cpp:func:`hipMallocManaged`
* :cpp:func:`hipMemAdvise`
These APIs are allowed only in relaxed stream capture mode. If the functions are used with stream capture, the HIP runtime the will return ``hipErrorStreamCaptureUnsupported`` on unsupported stream capture modes.
Check stream capture mode
-------------------------
The following APIs will check the stream capture mode and return error codes to match the behavior of CUDA. No impact if stream capture is working correctly on CUDA. Otherwise, the application would need to modify the graph being captured.
* :cpp:func:`hipLaunchCooperativeKernelMultiDevice` - Returns error code while stream capture status is active. The usage is restricted during stream capture
* :cpp:func:`hipEventQuery` - Returns an error ``hipErrorStreamCaptureUnsupported`` in global capture mode
* :cpp:func:`hipStreamAddCallback` - The stream capture behavior is updated. The function now checks if any of the blocking streams are capturing. If so, it returns an error and invalidates all capturing streams. The usage of this API is restricted during stream capture to match CUDA.
Stream capture error return
---------------------------
During stream capture, the following HIP APIs return the ``hipErrorStreamCaptureUnsupported`` error on the HIP runtime, but not always ``hipSuccess``, to match behavior with CUDA.
* :cpp:func:`hipDeviceSetMemPool`
* :cpp:func:`hipMemPoolCreate`
* :cpp:func:`hipMemPoolDestroy`
* :cpp:func:`hipDeviceSetSharedMemConfig`
* :cpp:func:`hipDeviceSetCacheConfig`
* :cpp:func:`hipMemcpyWithStream`
The usage of these APIs is restricted during stream capture. No impact if stream capture is working fine on CUDA.
Error code changes
==================
The following HIP APIs have been updated to return new or additional error codes to match the corresponding
CUDA APIs. Most existing applications just check if ``hipSuccess`` is returned and no change is needed.
However, if an application checks for a specific error code, the application code may need to be updated
to match/handle the new error code accordingly.
Module management related APIs
------------------------------
Kernel launch APIs
^^^^^^^^^^^^^^^^^^
The following APIs have updated implementations:
* :cpp:func:`hipModuleLaunchKernel`
* :cpp:func:`hipExtModuleLaunchKernel`
* :cpp:func:`hipExtLaunchKernel`
* :cpp:func:`hipDrvLaunchKernelEx`
* :cpp:func:`hipLaunchKernel`
* :cpp:func:`hipLaunchKernelExC`
More conditional checks are added in the API implementation, and the return errors are added or changed in the following scenarios:
* If the input stream handle is invalid, the returned error is changed to ``hipErrorContextIsDestroyed`` from ``hipErrorInvalidValue``
* Adds a grid dimension check, if any input global work size dimension is zero, returns ``hipErrorInvalidValue``
* Adds extra shared memory size check, if exceeds the size limit, returns ``hipErrorInvalidValue``
* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``
``hipModuleLaunchCooperativeKernel``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Conditions are added in the API implementation of :cpp:func:`hipModuleLaunchCooperativeKernel`, and the returned errors are added in the following scenarios:
* If the input stream is invalid, returns ``hipErrorContextIsDestroyed``, instead of ``hipErrorInvalidValue``
* If any grid dimension or block dimension is zero, returns ``hipErrorInvalidValue``
* If any grid dimension exceeds the maximum dimension limit, or work group size exceeds the maximum size, returns ``hipErrorInvalidConfiguration`` , instead of ``hipErrorInvalidValue``
* If shared memory size in bytes exceeds the device local memory size per CU, returns ``hipErrorCooperativeLaunchTooLarge``
``hipModuleLoad``
^^^^^^^^^^^^^^^^^^
The API updates the negative return of :cpp:func:`hipModuleLoad` to match the CUDA behavior. In cases where the file name exists but the file size is 0, the function returns ``hipErrorInvalidImage`` instead of ``hipErrorInvalidValue``.
Texture management related APIs
-------------------------------
The following APIs have updated the return codes to match the CUDA behavior:
* :cpp:func:`hipTexObjectCreate`, supports zero width and height for 2D image. If either width or height are zero the function will not return ``false``.
* :cpp:func:`hipBindTexture2D`, adds extra check, if pointer for texture reference or device is NULL, returns ``hipErrorNotFound``.
* :cpp:func:`hipBindTextureToArray`, if any NULL pointer is input for texture object, resource descriptor, or texture descriptor, returns error ``hipErrorInvalidChannelDescriptor``, instead of ``hipErrorInvalidValue``.
* :cpp:func:`hipGetTextureAlignmentOffset`, adds a return code ``hipErrorInvalidTexture`` when the texture reference pointer is NULL.
Cooperative group related APIs
-------------------------------
``hipLaunchCooperativeKernelMultiDevice``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Validations are added to the API implementation of :cpp:func:`hipLaunchCooperativeKernelMultiDevice`, as follows:
* If input launch stream is NULLPTR or it is ``hipStreamLegacy``, returns ``hipErrorInvalidResourceHandle``.
* If the stream capturing is active, returns the error ``hipErrorStreamCaptureUnsupported``.
* If the stream capture status is invalidated, returns the error ``hipErrorStreamCaptureInvalidated``
* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``.
``hipLaunchCooperativeKernel``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Validation are added to the API implementation of :cpp:func:`hipLaunchCooperativeKernel`, as follows:
* If the input stream handle is invalid, the returned error is changed to ``hipErrorInvalidHandle`` from ``hipErrorContextIsDestroyed``.
* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue`` .
Invalid stream input parameter handling matches CUDA
====================================================
In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Prior to the 7.0 release, the HIP runtime returns an error code ``hipErrorContextIsDestroyed`` if the stream is invalid. In CUDA 12 and later, the equivalent behavior is to raise a segmentation fault. With HIP 7.0, the HIP runtime matches CUDA by causing a segmentation fault. The list of APIs impacted by this change are as follows:
* Stream management related APIs
* :cpp:func:`hipStreamGetCaptureInfo`
* :cpp:func:`hipStreamGetPriority`
* :cpp:func:`hipStreamGetFlags`
* :cpp:func:`hipStreamDestroy`
* :cpp:func:`hipStreamAddCallback`
* :cpp:func:`hipStreamQuery`
* :cpp:func:`hipLaunchHostFunc`
* Graph management related APIs
* :cpp:func:`hipGraphUpload`
* :cpp:func:`hipGraphLaunch`
* :cpp:func:`hipStreamBeginCaptureToGraph`
* :cpp:func:`hipStreamBeginCapture`
* :cpp:func:`hipStreamIsCapturing`
* :cpp:func:`hipStreamGetCaptureInfo`
* :cpp:func:`hipGraphInstantiateWithParams`
* Memory management related APIs
* :cpp:func:`hipMemcpyPeerAsync`
* :cpp:func:`hipMallocFromPoolAsync`
* :cpp:func:`hipFreeAsync`
* :cpp:func:`hipMallocAsync`
* :cpp:func:`hipMemcpyAsync`
* :cpp:func:`hipMemcpyToSymbolAsync`
* :cpp:func:`hipStreamAttachMemAsync`
* :cpp:func:`hipMemPrefetchAsync`
* :cpp:func:`hipDrvMemcpy3D`
* :cpp:func:`hipDrvMemcpy3DAsync`
* :cpp:func:`hipDrvMemcpy2DUnaligned`
* :cpp:func:`hipMemcpyParam2D`
* :cpp:func:`hipMemcpyParam2DAsync`
* :cpp:func:`hipMemcpy2DArrayToArray`
* :cpp:func:`hipMemcpy2D`
* :cpp:func:`hipMemcpy2DAsync`
* :cpp:func:`hipDrvMemcpy2DUnaligned`
* :cpp:func:`hipMemcpy3D`
* Event management related APIs
* :cpp:func:`hipEventRecord`
* :cpp:func:`hipEventRecordWithFlags`
Developers porting CUDA code to HIP no longer need to modify their error handling code. However,
if you have come to expect the HIP runtime to return the error code ``hipErrorContextIsDestroyed``,
you might need to adjust your code.
warpSize Change
===============
To match the CUDA specification, ``warpSize`` is no longer a ``constexpr``.
In general, this should be a transparent change. However, if an application was using ``warpSize``
as a compile-time constant, it will have to be updated to handle the new definition.
For more information, see `warpSize <./how-to/hip_cpp_language_extensions.html#warpsize>`_
in :doc:`./how-to/hip_cpp_language_extensions`.
+23 -15
Dosyayı Görüntüle
@@ -259,42 +259,50 @@ HSA provides environment variables that help analyze issues in drivers or hardwa
* To isolate issues with hardware copy engines, you can use ``HSA_ENABLE_SDMA``.
``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader
blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency
(typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine.
``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader
blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency
(typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine.
* To diagnose interrupt storm issues in the driver, you can use ``HSA_ENABLE_INTERRUPT``.
``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based
polling, rather than interrupts.
``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based
polling, rather than interrupts.
HIP environment variable summary
--------------------------------
Here are some of the more commonly used environment variables:
.. include-table:: data/env_variables_hip.rst
.. include-table:: ./reference/env_variables/debug_hip_env.rst
:table: hip-env-debug
General debugging tips
======================================================
* ``gdb --args`` can be used to pass the executable and arguments to ``gdb``.
* You can set environment variables (``set env``) from within GDB on Linux:
.. code-block:: bash
.. code-block:: bash
(gdb) set env AMD_SERIALIZE_KERNEL 3
(gdb) set env AMD_SERIALIZE_KERNEL 3
.. note::
This ``gdb`` command does not use an equal (=) sign.
.. note::
This ``gdb`` command does not use an equal (=) sign.
* The GDB backtrace shows a path in the runtime. This is because a fault is caught by the runtime, but it is generated by an asynchronous command running on the GPU.
* To determine the true location of a fault, you can force the kernels to run synchronously by setting the environment variables ``AMD_SERIALIZE_KERNEL=3`` and ``AMD_SERIALIZE_COPY=3``. This forces HIP runtime to wait for the kernel to finish running before returning. If the fault occurs when a kernel is running, you can see the code that launched the kernel inside the backtrace. The thread that's causing the issue is typically the one inside ``libhsa-runtime64.so``.
* VM faults inside kernels can be caused by:
* Incorrect code (e.g., a for loop that extends past array boundaries)
* Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad pointers)
* Synchronization issues
* Compiler issues (incorrect code generation from the compiler)
* Runtime issues
* Incorrect code (e.g., a for loop that extends past array boundaries)
* Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad pointers)
* Synchronization issues
* Compiler issues (incorrect code generation from the compiler)
* Runtime issues
+130 -8
Dosyayı Görüntüle
@@ -411,11 +411,9 @@ warpSize
================================================================================
The ``warpSize`` constant contains the number of threads per warp for the given
target device. It can differ between different architectures, and on RDNA
architectures it can even differ between kernel launches, depending on whether
they run in CU or WGP mode. See the
:doc:`hardware features <../reference/hardware_features>` for more
information.
target device. On AMD hardware, this is referred to as ``wavefront size``, which
may vary depending on the architecture. For more details, see the
:doc:`hardware features <../reference/hardware_features>`.
Since ``warpSize`` can differ between devices, it can not be assumed to be a
compile-time constant on the host. It has to be queried using
@@ -423,8 +421,8 @@ compile-time constant on the host. It has to be queried using
.. code-block:: cpp
int val;
hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId);
int warpSizeHost;
hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId);
.. note::
@@ -435,6 +433,130 @@ compile-time constant on the host. It has to be queried using
of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of
the compute resources.
Prior to ROCm 7.0, the warpSize parameter was a compile-time constant. Starting
with ROCm 7.0, it is early folded by the compiler, allowing it to be used in
loop bounds and enabling loop unrolling in a manner similar to a compile-time
constant warp size.
If compile time warp size is required, for example to select the correct mask
type or code path at compile time, the recommended approach is to determine the
warp size of the GPU on host side and setup the kernel accordingly, as shown in
the following block reduce example.
The ``block_reduce`` kernel has a template parameter for warp size and performs
a reduction operation in two main phases:
- Shared memory reduction: Reduction is performed iteratively, halving the
number of active threads each step until only a warp remains
(32 or 64 threads, depending on the device).
- Warp-level reduction: Once the shared memory reduction completes, the
remaining threads use warp-level shuffling to sum the remaining values. This
is done efficiently with the ``__shfl_down`` intrinsic, which allows threads within
the warp to exchange values without explicit synchronization.
.. tab-set::
.. tab-item:: WarpSize template parameter
:sync: template-warpsize
.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size block reduction kernel start]
:end-before: // [Sphinx template warp size block reduction kernel end]
:language: cpp
.. tab-item:: HIP warpSize
:sync: hip-warpsize
.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size block reduction kernel start]
:end-before: // [Sphinx HIP warp size block reduction kernel end]
:language: cpp
The host code with the main function:
- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the optimal
kernel configuration.
- Allocates device memory (``d_data`` for input, ``d_results`` for block-wise
output) and initializes the input vector to 1.
- Generates the mask variables for every warp and copies them to the device.
.. tab-set::
.. tab-item:: WarpSize template parameter
:sync: template-warpsize
.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size mask generation start]
:end-before: // [Sphinx template warp size mask generation end]
:language: cpp
.. tab-item:: HIP warpSize
:sync: hip-warpsize
.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size mask generation start]
:end-before: // [Sphinx HIP warp size mask generation end]
:language: cpp
- Selects the appropriate kernel specialization based on the warp
size (either 32 or 64) and launches the kernel.
.. tab-set::
.. tab-item:: WarpSize template parameter
:sync: template-warpsize
.. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
:start-after: // [Sphinx template warp size select kernel start]
:end-before: // [Sphinx template warp size select kernel end]
:language: cpp
.. tab-item:: HIP warpSize
:sync: hip-warpsize
.. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
:start-after: // [Sphinx HIP warp size select kernel start]
:end-before: // [Sphinx HIP warp size select kernel end]
:language: cpp
- Synchronizes the device and copies the results back to the host.
- Checks that each block's sum is equal with the expected mask bit count,
verifying the reduction's correctness.
- Frees the device memory to prevent memory leaks.
.. note::
The ``warpSize`` runtime example code is also provided for comparison purposes
and the full example codes are located in the `tools folder <https://github.com/ROCm/hip/tree/docs/develop/docs/tools/example_codes>`_.
The variable ``warpSize`` can be used for loop bounds and supports
loop unrolling similarly to the template parameter ``WarpSize``.
For users who still require a compile-time constant warp size as a macro on the
device side, it can be defined manually based on the target device architecture,
as shown in the following example.
.. code-block:: cpp
#if defined(__GFX8__) || defined(__GFX9__)
#define WarpSize 64
#else
#define WarpSize 32
#endif
.. note::
``mwavefrontsize64`` compiler option is not supported by HIP runtime, that's
why the architecture based compile time selector is an acceptable approach.
********************************************************************************
Vector types
********************************************************************************
@@ -855,7 +977,7 @@ The different shuffle functions behave as following:
of range, the thread returns its own ``var``.
``__shfl_down``
The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling"
The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling"
the values of the lanes of the warp "down". If the resulting source lane is
out of range, the thread returns its own ``var``.
+69 -3
Dosyayı Görüntüle
@@ -29,6 +29,8 @@ error code spaces:
General Tips
--------------------------------------------------------------------------------
* ``hipDeviceptr_t`` is a ``void*`` and treated like a raw pointer, while ``CUdevicptr``
is an ``unsigned int`` and treated as a device memory handle.
* Starting to port on an NVIDIA machine is often the easiest approach, as the
code can be tested for functionality and performance even if not fully ported
to HIP.
@@ -46,16 +48,16 @@ HIPIFY
translate CUDA to HIP code. There are two flavours available, ``hipfiy-clang``
and ``hipify-perl``.
:doc:`hipify-clang <hipify:hipify-clang>` is, as the name implies, a Clang-based
:doc:`hipify-clang <hipify:how-to/hipify-clang>` is, as the name implies, a Clang-based
tool, and actually parses the code, translates it into an Abstract Syntax Tree,
from which it then generates the HIP source. For this, ``hipify-clang`` needs to
be able to actually compile the code, so the CUDA code needs to be correct, and
a CUDA install with all necessary headers must be provided.
:doc:`hipify-perl <hipify:hipify-perl>` uses pattern matching, to translate the
:doc:`hipify-perl <hipify:how-to/hipify-perl>` uses pattern matching, to translate the
CUDA code to HIP. It does not require a working CUDA installation, and can also
convert CUDA code, that is not syntactically correct. It is therefore easier to
set up and use, but is not as powerful as ``hipfiy-clang``.
set up and use, but is not as powerful as ``hipify-clang``.
Scanning existing CUDA code to scope the porting effort
--------------------------------------------------------------------------------
@@ -611,6 +613,70 @@ code, while the host can query it during runtime via the device properties. See
the :ref:`HIP language extension for warpSize <warp_size>` for information on
how to write portable wave-aware code.
Lane masks bit-shift
================================================================================
A thread in a warp is also called a lane, and a lane mask is a bitmask where
each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
0 if it's inactive. Bit-shift operations are typically used to create lane masks
and on AMD GPUs the ``warpSize`` can differ between different architectures,
that's why it's essential to use correct bitmask type, when porting code.
Example:
.. code-block:: cpp
// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;
// Use lane ID for bit-shift
val & ((1 << (threadIdx.x % warpSize) )-1 );
// Shift 32 bit integer with val variable
WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
Lane masks are 32-bit integer types as this is the integer precision that C
assigns to such constants by default. GCN/CDNA architectures have a warp size of
64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain
values greater than 31. Consequently, shifting by such values would clear the
32-bit register to which the shift operation is applied. For AMD
architectures, a straightforward fix could look as follows:
.. code-block:: cpp
// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;
// Use lane ID for bit-shift
val & ((1ull << (threadIdx.x % warpSize) )-1 );
// Shift 64 bit integer with val variable
WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
For portability reasons, it is better to introduce appropriately
typed placeholders as shown below:
.. code-block:: cpp
#if defined(__GFX8__) || defined(__GFX9__)
typedef uint64_t lane_mask_t;
#else
typedef uint32_t lane_mask_t;
#endif
The use of :code:`lane_mask_t` with the previous example:
.. code-block:: cpp
// Get the thread's position in the warp
unsigned int laneId = threadIdx.x % warpSize;
// Use lane ID for bit-shift
val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
// Shift 32 or 64 bit integer with val variable
WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
Porting from CUDA __launch_bounds__
================================================================================
+23 -39
Dosyayı Görüntüle
@@ -14,12 +14,19 @@ alongside options to guide the compilation.
.. note::
* Device code compilation via HIPRTC uses the ``__hip_internal`` namespace instead
of the ``std`` namespace to avoid namespace collision.
* This library can be used for compilation on systems without AMD GPU drivers
installed (offline compilation). However, running the compiled code still
requires both the HIP runtime library and GPU drivers on the target system.
* This library depends on Code Object Manager (comgr). You can try to
statically link comgr into HIPRTC to avoid ambiguity.
* Developers can bundle this library with their application.
* HIPRTC leverages AMD's Code Object Manager API (``Comgr``) internally, which
is designed to simplify linking, compiling, and inspecting code objects. For
more information, see the `llvm-project/amd/comgr/README <https://github.com/ROCm/llvm-project/blob/amd-staging/amd/comgr/README.md>`_.
* Comgr may cache HIPRTC compilations. To force full recompilation for each HIPRTC API invocation, set AMD_COMGR_CACHE=0.
- When viewing the *README* in the Comgr GitHub repository you should look at a
specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.
Compilation APIs
===============================================================================
@@ -30,6 +37,11 @@ To use HIPRTC functionality the header needs to be included:
#include <hip/hiprtc.h>
.. note::
Prior to the 7.0 release, the HIP runtime included the hipRTC library. With the 7.0
release, the library is separate and must be specifically included as shown above.
Kernels can be stored in a string:
.. code-block:: cpp
@@ -250,44 +262,12 @@ The full example is below:
HIP_CHECK(hipFree(doutput));
}
Kernel Compilation Cache
===============================================================================
HIPRTC incorporates a cache to avoid recompiling kernels between program
executions. The contents of the cache include the kernel source code (including
the contents of any ``#include`` headers), the compilation flags, and the
compiler version. After a ROCm version update, the kernels are progressively
recompiled, and the new results are cached. When the cache is disabled, each
kernel is recompiled every time it is requested.
Use the following environment variables to manage the cache status as enabled or
disabled, the location for storing the cache contents, and the cache eviction
policy:
* ``AMD_COMGR_CACHE`` By default this variable is unset and the
compilation cache feature is enabled. To disable the feature set the
environment variable to a value of ``0``.
* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is
defined as ``$XDG_CACHE_HOME/comgr``, which defaults to
``$USER/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr``
on Windows. You can specify a different directory for the environment variable
to change the path for cache storage. If the runtime fails to access the
specified cache directory the cache is disabled. If the environment variable
is set to an empty string (``""``), the default directory is used.
* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and
applied to the cache pruning policy. The string format is consistent with
`Clang's ThinLTO cache pruning policy <https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/clang/html/ThinLTO.html#cache-pruning>`_.
The default policy is defined as:
``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``.
If the runtime fails to parse the defined string, or the environment variable
is set to an empty string (""), the cache is disabled.
.. note::
This cache is also shared with the OpenCL runtime shipped with ROCm.
Some applications define datatypes such as ``int64_t``, ``uint64_t``, ``int32_t``, and ``uint32_t``
that could lead to conflicts when integrating with ``hipRTC``. To resolve these conflicts, these
datatypes are replaced with HIP-specific internal datatypes prefixed with ``__hip``. For example,
``int64_t`` is replaced by ``__hip_int64_t``.
HIPRTC specific options
===============================================================================
@@ -484,7 +464,7 @@ application requires the ingestion of bitcode/IR not derived from the currently
installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that
are compatible with the version of the bitcode/IR.
`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr>`_ is a
`Comgr <https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr/README.md>`_ is a
shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on.
To identify the bitcode/IR version that comgr is compatible with, one can
execute "clang -v" using the clang binary from the same ROCm or HIP SDK package.
@@ -492,6 +472,10 @@ For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries
released by AMD around mid 2022 would be the best choice, assuming the
LLVM/Clang version included in the package is also version 14.
.. note::
When viewing the *README* in the Comgr GitHub repository you should look at a
specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch.
To ensure smooth operation and compatibility, an application may choose to ship
the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to
clearly specify the version requirements and dependencies. This approach
+1 -1
Dosyayı Görüntüle
@@ -88,7 +88,7 @@ developer may have to reduce the block size of the kernels. The kernel runtimes
can be misleading for concurrent kernel runs, that is why during optimization
it is a good practice to check the trace files, to see if one kernel is blocking
another kernel, while they are running in parallel. For more information about
the application tracing, check::doc:`rocprofiler:/how-to/using-rocprof`.
application tracing, see :doc:`rocprofiler:how-to/using-rocprof`.
When running kernels in parallel, the execution time can increase due to
contention for shared resources. This is because multiple kernels may attempt
+1 -1
Dosyayı Görüntüle
@@ -21,7 +21,7 @@ and AMD GPUs use different approaches. NVIDIA GPUs have the independent thread
scheduling feature where each thread has its own call stack and effective
program counter. On AMD GPUs threads are grouped; each warp has its own call
stack and program counter. Warps are described and explained in the
:ref:`inherent_thread_hierarchy`
:ref:`inherent_thread_model`
If a thread or warp exceeds its stack size, a stack overflow occurs, causing
kernel failure. This can be detected using debuggers.
+4 -5
Dosyayı Görüntüle
@@ -21,11 +21,10 @@ without changing it. To get a human readable version of the errors,
.. note::
:cpp:func:`hipGetLastError` returns the returned error code of the last HIP
runtime API call even if it's ``hipSuccess``, while ``cudaGetLastError``
returns the error returned by any of the preceding CUDA APIs in the same
host thread. :cpp:func:`hipGetLastError` behavior will be matched with
``cudaGetLastError`` in ROCm release 7.0.
:cpp:func:`hipGetLastError` returns the last actual HIP API error caught in the current thread
during the application execution. Prior to ROCm 7.0, ``hipGetLastError`` might also return
``hipSuccess`` or ``hipErrorNotReady`` from the last HIP runtime API call, which are not errors.
Best practices of HIP error handling:
-6
Dosyayı Görüntüle
@@ -8,12 +8,6 @@
HIP graphs
********************************************************************************
.. note::
The HIP graph API is currently in Beta. Some features can change and might
have outstanding issues. Not all features supported by CUDA graphs are yet
supported. For a list of all currently supported functions see the
:ref:`HIP graph API documentation<graph_management_reference>`.
HIP graphs are an alternative way of executing tasks on a GPU that can provide
performance benefits over launching kernels using the standard
method via streams. A HIP graph is made up of nodes and edges. The nodes of a
@@ -64,6 +64,8 @@ To check the availability of fine- and coarse-grained memory pools, use
Segment: GLOBAL; FLAGS: COARSE GRAINED
...
.. _hip-memory-coherence-table:
The APIs, flags and respective memory coherence control are listed in the
following table:
+2
Dosyayı Görüntüle
@@ -108,6 +108,8 @@ C++ application.
:cpp:func:`hipMalloc` and :cpp:func:`hipFree` are blocking calls. However, HIP
also provides non-blocking versions :cpp:func:`hipMallocAsync` and
:cpp:func:`hipFreeAsync`, which require a stream as an additional argument.
For asynchronous memory allocations made with ``hipMallocAsync`` and ``hipMallocFromPoolAsync``
``hipFree`` does not implicitly wait for synchronization, to match the behavior of ``cudaFree``.
.. _pinned_host_memory:
+116 -65
Dosyayı Görüntüle
@@ -9,6 +9,14 @@
Unified memory management
*******************************************************************************
This document covers unified memory management in HIP, which encompasses several
approaches that provide a single address space accessible from both CPU and GPU.
**Unified memory** refers to the overall architectural concept of this shared
address space, while **managed memory** is one specific implementation that
provides automatic page migration between devices. Other unified memory allocators
like :cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` provide different
access patterns within the same unified address space concept.
In conventional architectures CPUs and attached devices have their own memory
space and dedicated physical memory backing it up, e.g. normal RAM for CPUs and
VRAM on GPUs. This way each device can have physical memory optimized for its
@@ -39,57 +47,79 @@ model is shown in the following figure.
Unified memory enables the access to memory located on other devices via
several methods, depending on whether hardware support is available or has to be
managed by the driver.
managed by the driver. CPUs can access memory allocated via :cpp:func:`hipMalloc()`,
providing bidirectional memory accessibility within the unified address space.
Hardware supported on-demand page migration
--------------------------------------------------------------------------------
Managed memory
================================================================================
When a kernel on the device tries to access a memory address that is not in its
memory, a page-fault is triggered. The GPU then in turn requests the page from
the host or an other device, on which the memory is located. The page is then
unmapped from the source, sent to the device and mapped to the device's memory.
The requested memory is then available to the processes running on the device.
Managed Memory is an extension of the unified memory architecture in which HIP
monitors memory access and intelligently migrates data between device and
system memories, thereby improving performance and resource efficiency.
In case the device's memory is at capacity, a page is unmapped from the device's
memory first and sent and mapped to host memory. This enables more memory to be
allocated and used for a GPU, than the GPU itself has physically available.
When a kernel on the device tries to access a managed memory address that is
not in its local device memory, a page-fault is triggered. The GPU then in
turn requests the page from the host or other device on which the memory is
located. The page is unmapped from the source, sent to the device and
mapped to the device's memory. The requested memory is then available locally
to the processes running on the device, which improves performance as local
memory access outperforms remote memory access.
This level of unified memory support can be very beneficial for sparse accesses
to an array, that is not often used on the device.
Driver managed page migration
--------------------------------------------------------------------------------
If the hardware does not support on-demand page migration, then all the pages
accessed by a kernel have to be resident on the device, so they have to be
migrated before the kernel is running. Since the driver can not know beforehand,
what parts of an array are going to be accessed, all pages of all accessed
arrays have to be migrated. This can lead to significant delays on the first run
of a kernel, on top of possibly copying more memory than is actually accessed by
the kernel.
Managed memory also expands the memory capacity available to a GPU kernel. When
migrating memory into the device on page-fault, if the device's memory is
already at capacity, a page is unmapped from the device's memory first and sent
and mapped to host memory. This enables more memory to be allocated and used
for a GPU than the GPU itself has physically available. This level of support
can be very beneficial, for example, for sparse accesses to an array that is
not often used on the device.
.. _unified memory system requirements:
System requirements
================================================================================
System requirements for managed memory
--------------------------------------------------------------------------------
Unified memory is supported on Linux by all modern AMD GPUs from the Vega
series onward, as shown in the following table. Unified memory management can
be achieved by explicitly allocating managed memory using
:cpp:func:`hipMallocManaged` or marking variables with the ``__managed__``
attribute. For the latest GPUs, with a Linux kernel that supports
`Heterogeneous Memory Management (HMM)
Some AMD GPUs do not support page-faults, and thus do not support on-demand
page-fault driven migration. On these architectures, if the programmer prefers
all GPU memory accesses to be local, all pages have to migrated before the
kernel is dispatched, as the driver cannot know beforehand which parts of a
dataset are going to be accessed. This can lead to significant delays on the
first run of a kernel, and, in the example of a sparsely accessed array, can
also lead to copying more memory than is actually accessed by the kernel.
Note that on systems which do not support page-faults, managed memory APIs are
still accessible to the programmer, but managed memory operates in a degraded
fashion due to the lack of demand-driven migration. Furthermore, on these
systems it is still possible to use unified memory allocators that do not
provide managed memory features; see
:ref:`memory allocation approaches in unified memory` for more details.
Managed memory is supported on Linux by all modern AMD GPUs from the Vega
series onward, as shown in the following table. Managed memory can be
explicitly allocated using :cpp:func:`hipMallocManaged()` or marking variables
with the ``__managed__`` attribute. For the latest GPUs, with a Linux kernel
that supports `Heterogeneous Memory Management (HMM)
<https://www.kernel.org/doc/html/latest/mm/hmm.html>`_, the normal system
allocator can be used.
allocators (e.g., ``new``, ``malloc()``) can be used.
.. list-table:: Supported Unified Memory Allocators by GPU architecture
.. note::
To ensure the proper functioning of managed memory on supported GPUs, it
is **essential** to set the environment variable ``HSA_XNACK=1`` and use a
GPU kernel mode driver that supports `HMM
<https://www.kernel.org/doc/html/latest/mm/hmm.html>`_. Without this
configuration, access-driven memory migration will be disabled, and the
behavior will be similar to that of systems without HMM support.
.. list-table:: Managed Memory Support by GPU Architecture
:widths: 40, 25, 25
:header-rows: 1
:align: center
* - Architecture
- :cpp:func:`hipMallocManaged()`, ``__managed__``
- ``new``, ``malloc()``
- ``new``, ``malloc()``, ``allocate()``
* - CDNA4
- ✅
- ✅ :sup:`1`
* - CDNA3
- ✅
- ✅ :sup:`1`
@@ -98,7 +128,7 @@ allocator can be used.
- ✅ :sup:`1`
* - CDNA1
- ✅
- :sup:`1`
-
* - RDNA1
- ✅
- ❌
@@ -113,11 +143,16 @@ allocator can be used.
:sup:`1` Works only with ``HSA_XNACK=1`` and kernels with HMM support. First GPU
access causes recoverable page-fault.
.. _unified memory allocators:
.. _memory allocation approaches in unified memory:
Unified memory allocators
Memory allocation approaches in unified memory
================================================================================
While managed memory provides automatic migration, unified memory encompasses
several allocation methods, each with different access patterns and migration
behaviors. The following section covers all available unified memory allocation
approaches, including but not limited to managed memory APIs.
Support for the different unified memory allocators depends on the GPU
architecture and on the system. For more information, see :ref:`unified memory
system requirements` and :ref:`checking unified memory support`.
@@ -133,17 +168,17 @@ system requirements` and :ref:`checking unified memory support`.
- **System allocated unified memory**
Starting with CDNA2, the ``new`` and ``malloc()`` system allocators allow
Starting with CDNA2, the ``new``, ``malloc()``, and ``allocate()`` (Fortran) system allocators allow
you to reserve unified memory. The system allocator is more versatile and
offers an easy transition for code written for CPUs to HIP code as the
same system allocation API is used.
same system allocation API is used. Memory allocated by these allocators can
be registered to be accessible on device using :cpp:func:`hipHostRegister()`.
To ensure the proper functioning of system allocated unified memory on supported
GPUs, it is essential to set the environment variable ``HSA_XNACK=1`` and use
a GPU kernel mode driver that supports HMM
<https://www.kernel.org/doc/html/latest/mm/hmm.html>`_. Without this
configuration, the behavior will be similar to that of systems without HMM
support.
- **HIP allocated non-managed memory**
:cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` are dynamic memory
allocators available on all GPUs with unified memory support. Memory
allocated by these allocators is not migrated between device and host memory.
The table below illustrates the expected behavior of managed and unified memory
functions on ROCm and CUDA, both with and without HMM support.
@@ -161,10 +196,10 @@ functions on ROCm and CUDA, both with and without HMM support.
- Access outside the origin without HMM or ``HSA_XNACK=0``
- Allocation origin with HMM and ``HSA_XNACK=1``
- Access outside the origin with HMM and ``HSA_XNACK=1``
* - ``new``, ``malloc()``
* - ``new``, ``malloc()``, ``allocate()``
- host
- not accessible on device
- host
- first touch
- page-fault migration
* - :cpp:func:`hipMalloc()`
- device
@@ -174,13 +209,13 @@ functions on ROCm and CUDA, both with and without HMM support.
* - :cpp:func:`hipMallocManaged()`, ``__managed__``
- pinned host
- zero copy [zc]_
- host
- first touch
- page-fault migration
* - :cpp:func:`hipHostRegister()`
- undefined behavior
- undefined behavior
- host
- page-fault migration
- pinned host
- zero copy [zc]_
- pinned host
- zero copy [zc]_
* - :cpp:func:`hipHostMalloc()`
- pinned host
- zero copy [zc]_
@@ -236,9 +271,9 @@ functions on ROCm and CUDA, both with and without HMM support.
Checking unified memory support
--------------------------------------------------------------------------------
The following device attributes can offer information about which :ref:`unified
memory allocators` are supported. The attribute value is 1 if the functionality
is supported, and 0 if it is not supported.
The following device attributes can offer information about which :ref:`memory
allocation approaches in unified memory` are supported. The attribute value is
1 if the functionality is supported, and 0 if it is not supported.
.. list-table:: Device attributes for unified memory management
:widths: 40, 60
@@ -364,10 +399,11 @@ explicit memory management example is presented in the last tab.
.. tab-item:: new
.. code-block:: cpp
:emphasize-lines: 20-23
:emphasize-lines: 21-24
#include <hip/hip_runtime.h>
#include <iostream>
#include <new>
#define HIP_CHECK(expression) \
{ \
@@ -386,10 +422,10 @@ explicit memory management example is presented in the last tab.
// This example requires HMM support and the environment variable HSA_XNACK needs to be set to 1
int main() {
// Allocate memory for a, b, and c.
int *a = new int[1];
int *b = new int[1];
int *c = new int[1];
// Allocate memory with proper alignment for performance
int *a = new(std::align_val_t(128)) int[1];
int *b = new(std::align_val_t(128)) int[1];
int *c = new(std::align_val_t(128)) int[1];
// Setup input values.
*a = 1;
@@ -404,10 +440,10 @@ explicit memory management example is presented in the last tab.
// Prints the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
delete[] a;
delete[] b;
delete[] c;
// Cleanup allocated memory with matching aligned delete.
::operator delete[](a, std::align_val_t(128));
::operator delete[](b, std::align_val_t(128));
::operator delete[](c, std::align_val_t(128));
return 0;
}
@@ -501,9 +537,24 @@ Performance optimizations for unified memory
There are several ways, in which the developer can guide the runtime to reduce
copies between devices, in order to improve performance.
With ``numactl --membind`` bindings, developers can control where physical
allocation occurs by restricting memory allocation to specific NUMA nodes.
This approach can reduce or eliminate the need for explicit data prefetching
since memory is allocated in the desired location from the start.
Data prefetching
--------------------------------------------------------------------------------
.. warning::
Data prefetching is not always an optimization and can slow down execution,
as the API takes time to execute. If the memory is already in the right
place, prefetching will waste time. Users should profile their code to
verify whether prefetching is beneficial for their specific use case.
When prefetching is beneficial, developers can consider setting different default
locations for different devices and using prefetch between them, which can help
eliminate IPC communication overhead when memory moves between devices.
Data prefetching is a technique used to improve the performance of your
application by moving data to the desired device before it's actually
needed. ``hipCpuDeviceId`` is a special constant to specify the CPU as target.
+3 -1
Dosyayı Görüntüle
@@ -262,7 +262,9 @@ For example, when the control condition depends on ``threadIdx`` or ``warpSize``
warp doesn't diverge. The compiler might optimize loops, short ifs, or switch
blocks using branch predication, which prevents warp divergence. With branch
predication, instructions associated with a false predicate are scheduled but
not executed, which avoids unnecessary operations.
not executed, which avoids unnecessary operations. For control conditions where
one outcome is significantly more likely than the other, use `__builtin_expect <https://clang.llvm.org/docs/LanguageExtensions.html#builtin-expect>`_
or ``[[likely]]`` to indicate the likely condition result.
Avoiding divergent warps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+7
Dosyayı Görüntüle
@@ -10,6 +10,13 @@ The Heterogeneous-computing Interface for Portability (HIP) is a C++ runtime API
and kernel language that lets you create portable applications for AMD and
NVIDIA GPUs from a single source code. For more information, see [What is HIP?](./what_is_hip)
```{note}
HIP API 7.0 introduces changes to make it align more closely with NVIDIA CUDA.
These changes are incompatible with prior releases, and might require recompiling
existing HIP applications for use with the ROCm 7.0 release. For more information,
see [HIP API 7.0 changes](./hip-7-changes).
```
Installation instructions are available from:
* [Installing HIP](./install/install)
+115 -109
Dosyayı Görüntüle
@@ -9,27 +9,28 @@ Build HIP from source
Prerequisites
=================================================
HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with ``nvcc`` installed.
Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform.
HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler,
or a CUDA platform with ``nvcc`` installed. Before building and running HIP,
make sure drivers and prebuilt packages are installed properly on the platform.
You also need to install Python 3, which includes the ``CppHeaderParser`` package.
Install Python 3 using the following command:
.. code-block:: shell
apt-get install python3
apt-get install python3
Check and install ``CppHeaderParser`` package using the command:
.. code-block:: shell
pip3 install CppHeaderParser
pip3 install CppHeaderParser
Install ``ROCm LLVM`` package using the command:
.. code-block:: shell
apt-get install rocm-llvm-dev
apt-get install rocm-llvm-dev
.. _Building the HIP runtime:
@@ -52,169 +53,174 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for
.. tab-set::
.. tab-item:: AMD
:sync: amd
.. tab-item:: AMD
:sync: amd
#. Get HIP source code.
#. Get HIP source code.
.. code-block:: shell
.. code-block:: shell
git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
#. Set the environment variables.
#. Set the environment variables.
.. code-block:: shell
.. code-block:: shell
export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
#. Build HIP.
#. Build HIP.
.. code-block:: shell
.. code-block:: shell
cd "$CLR_DIR"
mkdir -p build; cd build
cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
make -j$(nproc)
sudo make install
cd "$CLR_DIR"
mkdir -p build; cd build
cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
make -j$(nproc)
sudo make install
.. note::
.. note::
If ``CMAKE_INSTALL_PREFIX`` is not explicitly specified, the HIP runtime will be installed at
``<ROCM_PATH>``, which is by default at the path ``/opt/rocm``.
If ``CMAKE_INSTALL_PREFIX`` is not explicitly specified, the HIP runtime will be installed at
``<ROCM_PATH>``, which is by default at the path ``/opt/rocm``.
By default, the release version of HIP is built. If you need a debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line.
By default, the release version of HIP is built. If you need a debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line.
Default paths and environment variables:
* HIP is installed into ``<ROCM_PATH>``. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option.
environment variable.
* HSA is in ``<ROCM_PATH>``. This can be overridden by setting the ``HSA_PATH``
environment variable.
* Clang is in ``<ROCM_PATH>/llvm/bin``. This can be overridden by setting the
``HIP_CLANG_PATH`` environment variable.
* The device library is in ``<ROCM_PATH>/lib``. This can be overridden by setting the
``DEVICE_LIB_PATH`` environment variable.
* Optionally, you can add ``<ROCM_PATH>/bin`` to your ``PATH``, which can make it easier to
use the tools.
* Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.
Default paths and environment variables:
* HIP is installed into ``<ROCM_PATH>``. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option.
After you run the ``make install`` command, HIP is installed to ``<ROCM_PATH>`` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined.
* HSA is in ``<ROCM_PATH>``. This can be overridden by setting the ``HSA_PATH``
environment variable.
#. Generate a profiling header after adding/changing a HIP API.
* Clang is in ``<ROCM_PATH>/llvm/bin``. This can be overridden by setting the
``HIP_CLANG_PATH`` environment variable.
When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``.
* The device library is in ``<ROCM_PATH>/lib``. This can be overridden by setting the
``DEVICE_LIB_PATH`` environment variable.
To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
``hipamd/src``.
* Optionally, you can add ``<ROCM_PATH>/bin`` to your ``PATH``, which can make it easier to
use the tools.
Usage:
* Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.
.. code-block:: shell
After you run the ``make install`` command, HIP is installed to ``<ROCM_PATH>`` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined.
`hip_prof_gen.py [-v] <input HIP API .h file> <patched srcs path> <previous output> [<output>]`
#. Generate a profiling header after adding/changing a HIP API.
Flags:
When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``.
* ``-v``: Verbose messages
* ``-r``: Process source directory recursively
* ``-t``: API types matching check
* ``--priv``: Private API check
* ``-e``: On error exit mode
* ``-p``: ``HIP_INIT_API`` macro patching mode
To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
``hipamd/src``.
Example usage:
Usage:
.. code-block:: shell
.. code-block:: shell
hip_prof_gen.py -v -p -t --priv <hip>/include/hip/hip_runtime_api.h \
<hipamd>/src <hipamd>/include/hip/amd_detail/hip_prof_str.h \
<hipamd>/include/hip/amd_detail/hip_prof_str.h.new
`hip_prof_gen.py [-v] <input HIP API .h file> <patched srcs path> <previous output> [<output>]`
.. tab-item:: NVIDIA
:sync: nvidia
Flags:
#. Get the HIP source code.
* ``-v``: Verbose messages
* ``-r``: Process source directory recursively
* ``-t``: API types matching check
* ``--priv``: Private API check
* ``-e``: On error exit mode
* ``-p``: ``HIP_INIT_API`` macro patching mode
.. code-block:: shell
Example usage:
git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
.. code-block:: shell
#. Set the environment variables.
hip_prof_gen.py -v -p -t --priv <hip>/include/hip/hip_runtime_api.h \
<hipamd>/src <hipamd>/include/hip/amd_detail/hip_prof_str.h \
<hipamd>/include/hip/amd_detail/hip_prof_str.h.new
.. code-block:: shell
.. tab-item:: NVIDIA
:sync: nvidia
export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
export HIP_OTHER="$(readlink -f rocm-systems/projects/hipother)"
#. Get the HIP source code.
#. Build HIP.
.. code-block:: shell
.. code-block:: shell
git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
cd "$CLR_DIR"
mkdir -p build; cd build
cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv ..
make -j$(nproc)
sudo make install
#. Set the environment variables.
.. code-block:: shell
export CLR_DIR="$(readlink -f rocm-systems/projects/clr)"
export HIP_DIR="$(readlink -f rocm-systems/projects/hip)"
export HIP_OTHER="$(readlink -f rocm-systems/projects/hipother)"
#. Build HIP.
.. code-block:: shell
cd "$CLR_DIR"
mkdir -p build; cd build
cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv ..
make -j$(nproc)
sudo make install
Build HIP tests
=================================================
.. tab-set::
.. tab-item:: AMD
:sync: amd
.. tab-item:: AMD
:sync: amd
* Build HIP catch tests.
**Build HIP catch tests.**
HIP catch tests utilize the Catch2 testing framework.
HIP catch tests utilize the Catch2 testing framework.
* Get HIP tests source code.
#. Get HIP tests source code.
.. code-block:: shell
.. code-block:: shell
git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
export HIPTESTS_DIR="$(readlink -f rocm-systems/projects/hip-tests)"
git clone -b "$ROCM_BRANCH" git@github.com:ROCm/rocm-systems.git
export HIPTESTS_DIR="$(readlink -f rocm-systems/projects/hip-tests)"
* Build HIP tests from source.
#. Build HIP tests from source.
.. code-block:: shell
.. code-block:: shell
cd "$HIPTESTS_DIR"
mkdir -p build; cd build
cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: ``/opt/rocm``
export ROCM_PATH=/opt/rocm
make build_tests
ctest # run tests
cd "$HIPTESTS_DIR"
mkdir -p build; cd build
cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: ``/opt/rocm``
export ROCM_PATH=/opt/rocm
make build_tests
ctest # run tests
HIP catch tests are built in ``$HIPTESTS_DIR/build``.
HIP catch tests are built in ``$HIPTESTS_DIR/build``.
To run any single catch test, use this example:
To run any single catch test, use this example:
.. code-block:: shell
.. code-block:: shell
cd $HIPTESTS_DIR/build/catch_tests/unit/texture
./TextureTest
cd $HIPTESTS_DIR/build/catch_tests/unit/texture
./TextureTest
* Build a HIP Catch2 standalone test.
#. Build a HIP Catch2 standalone test.
.. code-block:: shell
.. code-block:: shell
cd "$HIPTESTS_DIR"
hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
-I ./catch/include ./catch/hipTestMain/standalone_main.cc \
-I ./catch/external/Catch2 -o hipPointerGetAttributes
./hipPointerGetAttributes
...
cd "$HIPTESTS_DIR"
hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
-I ./catch/include ./catch/hipTestMain/standalone_main.cc \
-I ./catch/external/Catch2 -o hipPointerGetAttributes
./hipPointerGetAttributes
...
All tests passed
All tests passed
.. tab-item:: NVIDIA
:sync: nvidia
.. tab-item:: NVIDIA
:sync: nvidia
The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform.
However, you must first set ``-DHIP_PLATFORM=nvidia``.
The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform.
However, you must first set ``-DHIP_PLATFORM=nvidia``.
Run HIP
+58 -54
Dosyayı Görüntüle
@@ -10,10 +10,10 @@ HIP can be installed on AMD (ROCm with HIP-Clang) and NVIDIA (CUDA with NVCC) pl
.. note::
The version definition for the HIP runtime is different from CUDA. On AMD
platforms, the :cpp:func:`hipRuntimeGetVersion` function returns the HIP
runtime version. On NVIDIA platforms, this function returns the CUDA runtime
version.
The version definition for the HIP runtime is different from CUDA. On AMD
platforms, the :cpp:func:`hipRuntimeGetVersion` function returns the HIP
runtime version. On NVIDIA platforms, this function returns the CUDA runtime
version.
.. _install_prerequisites:
@@ -22,84 +22,88 @@ Prerequisites
.. tab-set::
.. tab-item:: AMD
:sync: amd
.. tab-item:: AMD
:sync: amd
Refer to the Prerequisites section in the ROCm install guides:
Refer to the Prerequisites section in the ROCm install guides:
* :doc:`rocm-install-on-linux:reference/system-requirements`
* :doc:`rocm-install-on-windows:reference/system-requirements`
* `System requirements (Linux) <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html>`_
* `System requirements (Windows) <https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html>`_
.. tab-item:: NVIDIA
:sync: nvidia
.. tab-item:: NVIDIA
:sync: nvidia
With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA
GPUs with compute capability 5.0 or later should be supported. For more
information, see `NVIDIA's list of CUDA enabled GPUs <https://developer.nvidia.com/cuda-gpus>`_.
With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA
GPUs with compute capability 5.0 or later should be supported. For more
information, see `NVIDIA's list of CUDA enabled GPUs <https://developer.nvidia.com/cuda-gpus>`_.
Installation
=======================================
.. tab-set::
.. tab-item:: AMD
:sync: amd
.. tab-item:: AMD
:sync: amd
HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you
can find installation instructions here:
HIP is automatically installed during the ROCm installation. If you haven't
yet installed ROCm, you can find installation instructions here:
* :doc:`rocm-install-on-linux:index`
* :doc:`rocm-install-on-windows:index`
* `ROCm installation for Linux <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/index.html>`_
* `HIP SDK installation for Windows <https://rocm.docs.amd.com/projects/install-on-windows/en/latest/index.html>`_
By default, HIP is installed into ``/opt/rocm``.
By default, HIP is installed into ``/opt/rocm``.
.. note::
There is no autodetection for the HIP installation. If you choose to install it somewhere other than the default location, you must set the ``HIP_PATH`` environment variable as explained in `Build HIP from source <./build.html>`_.
.. note::
There is no autodetection for the HIP installation. If you choose to
install it somewhere other than the default location, you must set the
``HIP_PATH`` environment variable as explained in
`Build HIP from source <./build.html>`_.
.. tab-item:: NVIDIA
:sync: nvidia
.. tab-item:: NVIDIA
:sync: nvidia
#. Install the NVIDIA toolkit.
#. Install the NVIDIA toolkit.
The latest release can be found here:
`CUDA Toolkit <https://developer.nvidia.com/cuda-downloads>`_.
The latest release can be found here:
`CUDA Toolkit <https://developer.nvidia.com/cuda-downloads>`_.
#. Setup the radeon repo.
#. Setup the radeon repo.
.. code-block::shell
.. code-block::shell
# Replace url with appropriate link in the table below
wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb
sudo apt install ./amdgpu-install_6.2.60200-1_all.deb
sudo apt update
# Replace url with appropriate link in the table below
wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb
sudo apt install ./amdgpu-install_6.2.60200-1_all.deb
sudo apt update
.. list-table:: amdgpu-install links
:widths: 25 100
:header-rows: 1
.. list-table:: amdgpu-install links
:widths: 25 100
:header-rows: 1
* - Ubuntu version
- URL
* - 24.04
- https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb
* - 22.04
- https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb
* - Ubuntu version
- URL
* - 24.04
- https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb
* - 22.04
- https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb
#. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP
porting layer.
#. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP
porting layer.
.. code-block:: shell
.. code-block:: shell
apt-get install hip-runtime-nvidia hip-dev
apt-get install hip-runtime-nvidia hip-dev
The default paths are:
* CUDA SDK: ``/usr/local/cuda``
* HIP: ``/opt/rocm``
The default paths are:
* CUDA SDK: ``/usr/local/cuda``
* HIP: ``/opt/rocm``
#. Set the HIP_PLATFORM to nvidia.
#. Set the HIP_PLATFORM to nvidia.
.. code-block:: shell
.. code-block:: shell
export HIP_PLATFORM="nvidia"
export HIP_PLATFORM="nvidia"
Verify your installation
==========================================================
@@ -108,4 +112,4 @@ Run ``hipconfig`` in your installation path.
.. code-block:: shell
/opt/rocm/bin/hipconfig --full
/opt/rocm/bin/hipconfig --full
+7
Dosyayı Görüntüle
@@ -43,6 +43,13 @@ in both single and double precision formats.
Complex Number Functions
========================
.. note::
Changes have been made to small vector constructors for ``hipComplex`` and ``hipFloatComplex``
initialization, such as ``float2`` and ``int4``. If your code previously relied
on a single value to initialize all components within a vector or complex type, you might need
to update your code.
A comprehensive collection of functions for creating and manipulating complex numbers, organized by
functional categories for easy reference.
+5 -25
Dosyayı Görüntüle
@@ -12,44 +12,24 @@ on AMD platform, which are grouped by functionality.
GPU isolation variables
================================================================================
The GPU isolation environment variables in HIP are collected in the following table.
For more information, check :doc:`GPU isolation page <rocm:conceptual/gpu-isolation>`.
.. include-table:: data/env_variables_hip.rst
:table: hip-env-isolation
.. include:: ./env_variables/gpu_isolation_hip_env.rst
Profiling variables
================================================================================
The profiling environment variables in HIP are collected in the following table. For
more information, check :doc:`setting the number of CUs page <rocm:how-to/setting-cus>`.
.. include-table:: data/env_variables_hip.rst
:table: hip-env-prof
.. include:: ./env_variables/profiling_hip_env.rst
Debug variables
================================================================================
The debugging environment variables in HIP are collected in the following table. For
more information, check :ref:`debugging_with_hip`.
.. include-table:: data/env_variables_hip.rst
:table: hip-env-debug
.. include:: ./env_variables/debug_hip_env.rst
Memory management related variables
================================================================================
The memory management related environment variables in HIP are collected in the
following table.
.. include-table:: data/env_variables_hip.rst
:table: hip-env-memory
.. include:: ./env_variables/memory_management_hip_env.rst
Other useful variables
================================================================================
The following table lists environment variables that are useful but relate to
different features.
.. include-table:: data/env_variables_hip.rst
:table: hip-env-other
.. include:: ./env_variables/miscellaneous_hip_env.rst
+100
Dosyayı Görüntüle
@@ -0,0 +1,100 @@
The debugging environment variables in HIP are collected in the following table. For
more information, check :doc:`hip:how-to/logging`, :doc:`hip:how-to/debugging`
and :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`.
.. _hip-env-debug:
.. list-table::
:header-rows: 1
:widths: 35,14,51
* - **Environment variable**
- **Default value**
- **Value**
* - | ``AMD_LOG_LEVEL``
| Enables HIP log on various level.
- ``0``
- | 0: Disable log.
| 1: Enables error logs.
| 2: Enables warning logs next to lower-level logs.
| 3: Enables information logs next to lower-level logs.
| 4: Enables debug logs next to lower-level logs.
| 5: Enables debug extra logs next to lower-level logs.
* - | ``AMD_LOG_LEVEL_FILE``
| Sets output file for ``AMD_LOG_LEVEL``.
- stderr output
-
* - | ``AMD_LOG_MASK``
| Specifies HIP log filters. Here is the ` complete list of log masks <https://github.com/ROCm/clr/blob/develop/rocclr/utils/debug.hpp#L40>`_.
- ``0x7FFFFFFF``
- | 0x1: Log API calls.
| 0x2: Kernel and copy commands and barriers.
| 0x4: Synchronization and waiting for commands to finish.
| 0x8: Decode and display AQL packets.
| 0x10: Queue commands and queue contents.
| 0x20: Signal creation, allocation, pool.
| 0x40: Locks and thread-safety code.
| 0x80: Kernel creations and arguments, etc.
| 0x100: Copy debug.
| 0x200: Detailed copy debug.
| 0x400: Resource allocation, performance-impacting events.
| 0x800: Initialization and shutdown.
| 0x1000: Misc debug, not yet classified.
| 0x2000: Show raw bytes of AQL packet.
| 0x4000: Show code creation debug.
| 0x8000: More detailed command info, including barrier commands.
| 0x10000: Log message location.
| 0x20000: Memory allocation.
| 0x40000: Memory pool allocation, including memory in graphs.
| 0x80000: Timestamp details.
| 0xFFFFFFFF: Log always even mask flag is zero.
* - | ``HIP_LAUNCH_BLOCKING``
| Used for serialization on kernel execution.
- ``0``
- | 0: Disable. Kernel executes normally.
| 1: Enable. Serializes kernel enqueue, behaves the same as ``AMD_SERIALIZE_KERNEL``.
* - | ``HIP_VISIBLE_DEVICES`` (or ``CUDA_VISIBLE_DEVICES``)
| Only devices whose index is present in the sequence are visible to HIP
- Unset by default.
- 0,1,2: Depending on the number of devices on the system.
* - | ``GPU_DUMP_CODE_OBJECT``
| Dump code object.
- ``0``
- | 0: Disable
| 1: Enable
* - | ``AMD_SERIALIZE_KERNEL``
| Serialize kernel enqueue.
- ``0``
- | 0: Disable
| 1: Wait for completion before enqueue.
| 2: Wait for completion after enqueue.
| 3: Both
* - | ``AMD_SERIALIZE_COPY``
| Serialize copies
- ``0``
- | 0: Disable
| 1: Wait for completion before enqueue.
| 2: Wait for completion after enqueue.
| 3: Both
* - | ``AMD_DIRECT_DISPATCH``
| Enable direct kernel dispatch (Currently for Linux; under development for Windows).
- ``1``
- | 0: Disable
| 1: Enable
* - | ``GPU_MAX_HW_QUEUES``
| The maximum number of hardware queues allocated per device.
- ``4``
- The variable controls how many independent hardware queues HIP runtime can create per process,
per device. If an application allocates more HIP streams than this number, then HIP runtime reuses
the same hardware queues for the new streams in a round-robin manner. Note that this maximum
number does not apply to hardware queues that are created for CU-masked HIP streams, or
cooperative queues for HIP Cooperative Groups (single queue per device).
+27
Dosyayı Görüntüle
@@ -0,0 +1,27 @@
Restricting the access of applications to a subset of GPUs, also known as GPU
isolation, allows users to hide GPU resources from programs. The GPU isolation
environment variables in HIP are collected in the following table.
.. _hip-env-isolation:
.. list-table::
:header-rows: 1
:widths: 50,30,20
* - **Environment variable**
- **Links**
- **Value**
* - | ``ROCR_VISIBLE_DEVICES``
| A list of device indices or UUIDs that will be exposed to applications.
- :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`Setting the number of compute units <rocm:how-to/setting-cus>`
- Example: ``0,GPU-DEADBEEFDEADBEEF``
* - | ``GPU_DEVICE_ORDINAL``
| Devices indices exposed to OpenCL and HIP applications.
- :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`
- Example: ``0,2``
* - | ``HIP_VISIBLE_DEVICES`` or ``CUDA_VISIBLE_DEVICES``
| Device indices exposed to HIP applications.
- :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`HIP debugging <hip:how-to/debugging>`
- Example: ``0,2``
+100
Dosyayı Görüntüle
@@ -0,0 +1,100 @@
The memory management related environment variables in HIP are collected in the
following table. The ``HIP_HOST_COHERENT`` variable linked at the following
pages:
- :ref:`Coherence control <hip:coherence_control>`
- :ref:`Memory allocation flags <hip:memory_allocation_flags>`
.. _hip-env-memory:
.. list-table::
:header-rows: 1
:widths: 35,14,51
* - **Environment variable**
- **Default value**
- **Value**
* - | ``HIP_HIDDEN_FREE_MEM``
| Amount of memory to hide from the free memory reported by hipMemGetInfo.
- ``0``
- | 0: Disable
| Unit: megabyte (MB)
* - | ``HIP_HOST_COHERENT``
| Specifies if the memory is coherent between the host and GPU in ``hipHostMalloc``.
- ``0``
- | 0: Memory is not coherent.
| 1: Memory is coherent.
| Environment variable has effect, if the following conditions are statisfied:
| - One of the ``hipHostMallocDefault``, ``hipHostMallocPortable``, ``hipHostMallocWriteCombined`` or ``hipHostMallocNumaUser`` flag set to 1.
| - ``hipHostMallocCoherent``, ``hipHostMallocNonCoherent`` and ``hipHostMallocMapped`` flags set to 0.
* - | ``HIP_INITIAL_DM_SIZE``
| Set initial heap size for device malloc.
- ``8388608``
- | Unit: Byte
| The default value corresponds to 8 MB.
* - | ``HIP_MEM_POOL_SUPPORT``
| Enables memory pool support in HIP.
- ``0``
- | 0: Disable
| 1: Enable
* - | ``HIP_MEM_POOL_USE_VM``
| Enables memory pool support in HIP.
- | ``0``: other OS
| ``1``: Windows
- | 0: Disable
| 1: Enable
* - | ``HIP_VMEM_MANAGE_SUPPORT``
| Virtual Memory Management Support.
- ``1``
- | 0: Disable
| 1: Enable
* - | ``GPU_MAX_HEAP_SIZE``
| Set maximum size of the GPU heap to % of board memory.
- ``100``
- | Unit: Percentage
* - | ``GPU_MAX_REMOTE_MEM_SIZE``
| Maximum size that allows device memory substitution with system.
- ``2``
- | Unit: kilobyte (KB)
* - | ``GPU_NUM_MEM_DEPENDENCY``
| Number of memory objects for dependency tracking.
- ``256``
-
* - | ``GPU_STREAMOPS_CP_WAIT``
| Force the stream memory operation to wait on CP.
- ``0``
- | 0: Disable
| 1: Enable
* - | ``HSA_LOCAL_MEMORY_ENABLE``
| Enable HSA device local memory usage.
- ``1``
- | 0: Disable
| 1: Enable
* - | ``PAL_ALWAYS_RESIDENT``
| Force memory resources to become resident at allocation time.
- ``0``
- | 0: Disable
| 1: Enable
* - | ``PAL_PREPINNED_MEMORY_SIZE``
| Size of prepinned memory.
- ``64``
- | Unit: kilobyte (KB)
* - | ``REMOTE_ALLOC``
| Use remote memory for the global heap allocation.
- ``0``
- | 0: Disable
| 1: Enable
+34
Dosyayı Görüntüle
@@ -0,0 +1,34 @@
The following table lists environment variables that are useful but relate to
different features in HIP.
.. _hip-env-other:
.. list-table::
:header-rows: 1
:widths: 35,14,51
* - **Environment variable**
- **Default value**
- **Value**
* - | ``HIPRTC_COMPILE_OPTIONS_APPEND``
| Sets compile options needed for ``hiprtc`` compilation.
- Unset by default.
- ``--gpu-architecture=gfx906:sramecc+:xnack``, ``-fgpu-rdc``
* - | ``AMD_COMGR_SAVE_TEMPS``
| Controls the deletion of temporary files generated during the compilation of COMGR. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory.
- Unset by default.
- | 0: Temporary files are deleted automatically.
| Non zero integer: Turn off the temporary files deletion.
* - | ``AMD_COMGR_EMIT_VERBOSE_LOGS``
| Sets logging of COMGR to include additional Comgr-specific informational messages.
- Unset by default.
- | 0: Verbose log disabled.
| Non zero integer: Verbose log enabled.
* - | ``AMD_COMGR_REDIRECT_LOGS``
| Controls redirect logs of COMGR.
- Unset by default.
- | `stdout` / `-`: Redirected to the standard output.
| `stderr`: Redirected to the error stream.
+23
Dosyayı Görüntüle
@@ -0,0 +1,23 @@
The profiling environment variables in HIP are collected in the following table. For
more information, check :doc:`setting the number of CUs page <rocm:how-to/setting-cus>`.
.. _hip-env-prof:
.. list-table::
:header-rows: 1
:widths: 70,30
* - **Environment variable**
- **Value**
* - | ``HSA_CU_MASK``
| Sets the mask on a lower level of queue creation in the driver, this mask will also be set for queues being profiled.
- Example: ``1:0-8``
* - | ``ROC_GLOBAL_CU_MASK``
| Sets the mask on queues created by the HIP or the OpenCL runtimes, this mask will also be set for queues being profiled.
- Example: ``0xf``, enables only 4 CUs
* - | ``HIP_FORCE_QUEUE_PROFILING``
| Used to run the app as if it were run in rocprof. Forces command queue profiling on by default.
- | 0: Disable
| 1: Enable
+72 -1
Dosyayı Görüntüle
@@ -13,6 +13,8 @@ returned by HIP API functions to indicate various runtime conditions and errors.
For more details, see :ref:`Error handling functions <error_handling_reference>`.
.. _basic_runtime_errors:
Basic Runtime Errors
====================
@@ -100,6 +102,8 @@ Basic Runtime Errors
If this error is encountered, it generally means the API or feature is not fully supported in the
current version.
.. _memory_management_errors:
Memory Management Errors
========================
@@ -139,6 +143,14 @@ Memory Management Errors
- ``1052``
- Runtime memory call returned error
* - :term:`hipErrorInvalidChannelDescriptor`
- ``911``
- Input for texture object, resource descriptor, or texture descriptor is a NULL pointer or invalid
* - :term:`hipErrorInvalidTexture`
- ``912``
- Texture reference pointer is NULL or invalid
.. glossary::
hipErrorOutOfMemory
@@ -233,6 +245,21 @@ Memory Management Errors
This error differs from ``hipErrorOutOfMemory`` in that it relates to memory operations internal to the HIP
runtime rather than explicit application requests for memory allocation.
hipErrorInvalidChannelDescriptor
This error indicates that an invalid channel descriptor is used to define the format and layout of data
in memory, particularly when working with textures or arrays. This could happen if the descriptor is
incorrectly set up or if it does not match the expected format for the operation being performed.
hipErrorInvalidTexture
The error code is returned when an invalid texture object is used in a function call. This typically
occurs when a texture object is not properly initialized or configured before being used in operations
that require valid texture data. If you encounter this error, it suggests that the texture object
might be missing necessary configuration details or has been corrupted.
.. _device_context_errors:
Device and Context Errors
=========================
@@ -385,6 +412,8 @@ Device and Context Errors
* Custom build environments with mismatched components
* Partial upgrades of the ROCm stack
.. _kernel_launch_errors:
Kernel and Launch Errors
========================
@@ -396,10 +425,18 @@ Kernel and Launch Errors
- Value
- Description
* - :term:`hipErrorInvalidValue``
- ``1``
- Invalid input value
* - :term:`hipErrorInvalidDeviceFunction`
- ``98``
- Invalid device function
* - :term:`hipErrorContextIsDestroyed`
- ``709``
- Invalid stream handle
* - :term:`hipErrorInvalidConfiguration`
- ``9``
- Invalid configuration argument
@@ -446,6 +483,11 @@ Kernel and Launch Errors
.. glossary::
hipErrorInvalidValue
Error returned when a grid dimension check finds any input global work size
dimension is zero, or a shared memory size check finds the size exceeds the size limit.
hipErrorInvalidDeviceFunction
Invalid device function. This error occurs when attempting to use a function that is not a valid device
@@ -453,6 +495,10 @@ Kernel and Launch Errors
* Code compiled for a specific GPU architecture (using ``--offload-arch``) but executed on an different/incompatible GPU
hipErrorContextIsDestroyed
This error is returned when the input stream or input stream handle is invalid.
hipErrorInvalidConfiguration
Invalid configuration argument. This error occurs when the configuration specified for a kernel launch
@@ -507,7 +553,7 @@ Kernel and Launch Errors
hipErrorInvalidKernelFile
Invalid kernel file. This error occurs when the kernel file or module being loaded is corrupted or in
an invalid format.
an invalid format, for example the file name exists but the file size is 0.
hipErrorInvalidImage
@@ -556,6 +602,7 @@ Kernel and Launch Errors
* Launching a cooperative kernel with grid dimensions that exceed hardware limits
* Requesting more resources than available for synchronization across thread blocks
* The shared memory size in bytes exceeds the device local memory size per CU
* Using cooperative groups on hardware with limited support
* Not accounting for cooperative launch limitations in kernel configuration
@@ -577,6 +624,8 @@ Kernel and Launch Errors
normal operation. Additional debugging of the previous failed launch may be required to identify
the root cause.
.. _stream_capture_errors:
Stream Capture Errors
=====================
@@ -624,6 +673,10 @@ Stream Capture Errors
- ``907``
- Operation not permitted on an event last recorded in a capturing stream
* - :term:`hipErrorInvalidResourceHandle`
- ``400``
- Input launch stream is ``NULL`` or is ``hipStreamLegacy``
.. glossary::
hipErrorStreamCaptureUnsupported
@@ -754,6 +807,14 @@ Stream Capture Errors
and cannot be used for host-side synchronization until the capture is complete and the graph
is executed.
hipErrorInvalidResourceHandle
This error is returned when the input launch stream is a NULL pointer, is invalid, or is ``hipStreamLegacy``.
If you encounter this error, you should check the validity of the resource handle being used in your HIP
API calls. Ensure that the handle was correctly obtained and has not been freed or invalidated before use.
.. _profiler_errors:
Profiler Errors
===============
@@ -845,6 +906,8 @@ Profiler Errors
The HIP profiler must be in an active state before it can be stopped. This error is informational
and indicates that the profiler is already in the desired inactive state.
.. _resource_mapping_errors:
Resource Mapping Errors
=======================
@@ -992,6 +1055,8 @@ Resource Mapping Errors
operation was attempted on a resource that was not mapped as a pointer. Resources must be mapped
with the appropriate mapping type for the operations that will be performed on them.
.. _peer_access_errors:
Peer Access Errors
==================
@@ -1058,6 +1123,8 @@ Peer Access Errors
access between peer devices. Not all device combinations support peer access. Compatibility can be
determined with :cpp:func:`hipDeviceCanAccessPeer()`.
.. _system_file_errors:
System and File Errors
======================
@@ -1183,6 +1250,8 @@ System and File Errors
This is a catch-all error that may require looking at system logs or using additional
debugging tools to identify the root cause.
.. _graphics_content_errors:
Graphics Context Errors
=======================
@@ -1216,6 +1285,8 @@ Graphics Context Errors
instantiated graph update. This error occurs when attempting to update an already instantiated
graph with changes that are not allowed.
.. _hardware_errors:
Hardware Errors
===============
+6 -4
Dosyayı Görüntüle
@@ -240,10 +240,12 @@ page.
- 106
- 104
.. [1] RDNA architectures have a configurable wavefront size. The native
wavefront size is 32, but they can run in "CU mode", which has an effective
wavefront size of 64. This affects the number of resident wavefronts and
blocks per compute Unit.
.. [1] The RDNA architectures feature an experimental compiler option called
``mwavefrontsize64``, which determines the wavefront size for kernel code
generation. When this option is disabled, the native wavefront size of 32 is
used, when enabled wavefront size 64 is used. This option is not supported by
the HIP runtime.
.. [2] RDNA architectures expand the concept of the traditional compute unit
with the so-called work group processor, which effectively includes two
compute units, within which all threads can cooperate.
+46 -3
Dosyayı Görüntüle
@@ -8,7 +8,50 @@
HIP runtime API
********************************************************************************
The HIP Runtime API reference:
The HIP Runtime API reference includes descriptions of HIP functions, as well as global datatypes, enums, and structs.
* :ref:`modules_reference`
* :ref:`global_defines_enums_structs_files_reference`
Modules
=======
The API is organized into modules based on functionality.
* :ref:`initialization_version_reference`
* :ref:`device_management_reference`
* :ref:`execution_control_reference`
* :ref:`error_handling_reference`
* :ref:`stream_management_reference`
* :ref:`stream_memory_operations_reference`
* :ref:`event_management_reference`
* :ref:`memory_management_reference`
* :ref:`memory_management_deprecated_reference`
* :ref:`external_resource_interoperability_reference`
* :ref:`stream_ordered_memory_allocator_reference`
* :ref:`unified_memory_reference`
* :ref:`virtual_memory_reference`
* :ref:`texture_management_reference`
* :ref:`texture_management_deprecated_reference`
* :ref:`surface_object_reference`
* :ref:`peer_to_peer_device_memory_access_reference`
* :ref:`context_management_reference`
* :ref:`module_management_reference`
* :ref:`occupancy_reference`
* :ref:`profiler_control_reference`
* :ref:`launch_api_reference`
* :ref:`runtime_compilation_reference`
* :ref:`callback_activity_apis_reference`
* :ref:`graph_management_reference`
* :ref:`opengl_interoperability_reference`
* :ref:`graphics_interoperability_reference`
* :ref:`cooperative_groups_reference`
Global defines, enums, structs and files
========================================
The structs, define macros, enums and files in the HIP runtime API.
* :ref:`global_enum_defines_reference`
* :ref:`driver_types_reference`
* :doc:`../../doxygen/html/annotated`
* :doc:`../../doxygen/html/files`
Dosya farkı çok büyük olduğundan ihmal edildi Fark Yükle
+1
Dosyayı Görüntüle
@@ -7,6 +7,7 @@ root: index
subtrees:
- entries:
- file: what_is_hip
- file: hip-7-changes
- file: faq
- caption: Install
+207
Dosyayı Görüntüle
@@ -0,0 +1,207 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <type_traits>
#include <iostream>
#include <vector>
#include <random>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// [Sphinx template warp size block reduction kernel start]
template<uint32_t WarpSize>
using lane_mask_t = typename std::conditional<WarpSize == 32, uint32_t, uint64_t>::type;
template<uint32_t WarpSize>
__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size) {
extern __shared__ int shared[];
// Read of input with bounds check
auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
{
lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
};
const uint32_t tid = threadIdx.x,
lid = threadIdx.x % WarpSize,
wid = threadIdx.x / WarpSize,
bid = blockIdx.x,
gid = bid * blockDim.x + tid;
// Read input buffer to shared
shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
__syncthreads();
// Shared reduction
for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
__syncthreads();
}
// Use local variable in warp reduction
int result = shared[tid];
__syncthreads();
// This loop would be unrolled the same with the runtime warpSize.
#pragma unroll
for (uint32_t i = WarpSize/2; i >= 1; i /= 2) {
result = result + __shfl_down(result, i);
}
// Write result to output buffer
if (tid == 0)
output[bid] = result;
};
// [Sphinx template warp size block reduction kernel end]
// [Sphinx template warp size mask generation start]
template<uint32_t WarpSize>
void generate_and_copy_mask(
void *d_mask,
std::vector<int>& vectorExpected,
int numOfBlocks,
int numberOfWarp,
int mask_size,
int mask_element_size) {
std::random_device rd;
std::mt19937_64 eng(rd());
// Host side mask vector
std::vector<lane_mask_t<WarpSize>> mask(mask_size);
// Define uniform unsigned int distribution
std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
// Fill up the mask
for(int i=0; i < numOfBlocks; i++) {
int count = 0;
for(int j=0; j < numberOfWarp; j++) {
int mask_index = i * numberOfWarp + j;
mask[mask_index] = distr(eng);
if constexpr(WarpSize == 32)
count += __builtin_popcount(mask[mask_index]);
else
count += __builtin_popcountll(mask[mask_index]);
}
vectorExpected[i]= count;
}
// Copy the mask array
HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
}
// [Sphinx template warp size mask generation end]
int main() {
int deviceId = 0;
int warpSizeHost;
HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
std::cout << "Warp size: " << warpSizeHost << std::endl;
constexpr int numOfBlocks = 16;
constexpr int threadsPerBlock = 1024;
const int numberOfWarp = threadsPerBlock / warpSizeHost;
const int mask_element_size = warpSizeHost == 32 ? sizeof(uint32_t) : sizeof(uint64_t);
const int mask_size = numOfBlocks * numberOfWarp;
constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
int *d_data, *d_results;
void *d_mask;
int initValue = 1;
std::vector<int> vectorInput(arraySize, initValue);
std::vector<int> vectorOutput(numOfBlocks);
std::vector<int> vectorExpected(numOfBlocks);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
// Host to Device copy of the input array
HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
// [Sphinx template warp size select kernel start]
// Fill up the mask variable, copy to device and select the right kernel.
if(warpSizeHost == 32) {
// Generate and copy mask arrays
generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
// Start the kernel
block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
static_cast<uint32_t*>(d_mask),
d_results,
arraySize);
} else if(warpSizeHost == 64) {
// Generate and copy mask arrays
generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
// Start the kernel
block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
static_cast<uint64_t*>(d_mask),
d_results,
arraySize);
} else {
std::cerr << "Unsupported warp size." << std::endl;
return 0;
}
// [Sphinx template warp size select kernel end]
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Device to Host copy of the result
HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
for(size_t i = 0; i < numOfBlocks; ++i) {
if(vectorOutput[i] != vectorExpected[i]) {
passed = false;
std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
}
}
if(passed){
std::cout << "Execution completed successfully." << std::endl;
}else{
std::cerr << "Execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_data));
HIP_CHECK(hipFree(d_mask));
HIP_CHECK(hipFree(d_results));
return 0;
}
+184
Dosyayı Görüntüle
@@ -0,0 +1,184 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <type_traits>
#include <iostream>
#include <vector>
#include <random>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// [Sphinx HIP warp size block reduction kernel start]
__global__ void block_reduce(int* input, uint64_t* mask, int* output, size_t size){
extern __shared__ int shared[];
// Read of input with bounds check
auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
{
uint64_t warp_mask = 1ull << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
};
const uint32_t tid = threadIdx.x,
lid = threadIdx.x % warpSize,
wid = threadIdx.x / warpSize,
bid = blockIdx.x,
gid = bid * blockDim.x + tid;
// Read input buffer to shared
shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
__syncthreads();
// Shared reduction
for (uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
__syncthreads();
}
// Use local variable in warp reduction
int result = shared[tid];
__syncthreads();
// This loop would be unrolled the same with the compile-time WarpSize.
#pragma unroll
for (uint32_t i = warpSize/2; i >= 1; i /= 2) {
result = result + __shfl_down(result, i);
}
// Write result to output buffer
if (tid == 0)
output[bid] = result;
};
// [Sphinx HIP warp size block reduction kernel end]
// [Sphinx HIP warp size mask generation start]
void generate_and_copy_mask(
uint64_t *d_mask,
std::vector<int>& vectorExpected,
int warpSizeHost,
int numOfBlocks,
int numberOfWarp,
int mask_size,
int mask_element_size) {
std::random_device rd;
std::mt19937_64 eng(rd());
// Host side mask vector
std::vector<uint64_t> mask(mask_size);
// Define uniform unsigned int distribution
std::uniform_int_distribution<uint64_t> distr;
// Fill up the mask
for(int i=0; i < numOfBlocks; i++) {
int count = 0;
for(int j=0; j < numberOfWarp; j++) {
int mask_index = i * numberOfWarp + j;
mask[mask_index] = distr(eng);
if(warpSizeHost == 32)
count += __builtin_popcount(mask[mask_index]);
else
count += __builtin_popcountll(mask[mask_index]);
}
vectorExpected[i]= count;
}
// Copy the mask array
HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
}
// [Sphinx HIP warp size mask generation end]
int main() {
int deviceId = 0;
int warpSizeHost;
HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
std::cout << "Warp size: " << warpSizeHost << std::endl;
constexpr int numOfBlocks = 16;
constexpr int threadsPerBlock = 1024;
const int numberOfWarp = threadsPerBlock / warpSizeHost;
const int mask_element_size = sizeof(uint64_t);
const int mask_size = numOfBlocks * numberOfWarp;
constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
int *d_data, *d_results;
uint64_t *d_mask;
int initValue = 1;
std::vector<int> vectorInput(arraySize, initValue);
std::vector<int> vectorOutput(numOfBlocks);
std::vector<int> vectorExpected(numOfBlocks);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
// Host to Device copy of the input array
HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
// [Sphinx HIP warp size select kernel start]
// Generate and copy mask arrays
generate_and_copy_mask(
d_mask,
vectorExpected,
warpSizeHost,
numOfBlocks,
numberOfWarp,
mask_size,
mask_element_size);
// Start the kernel
block_reduce<<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
d_mask,
d_results,
arraySize);
// [Sphinx HIP warp size select kernel end]
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Device to Host copy of the result
HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
for(size_t i = 0; i < numOfBlocks; ++i) {
if(vectorOutput[i] != vectorExpected[i]) {
passed = false;
std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
}
}
if(passed){
std::cout << "Execution completed successfully." << std::endl;
}else{
std::cerr << "Execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_data));
HIP_CHECK(hipFree(d_mask));
HIP_CHECK(hipFree(d_results));
return 0;
}
+47 -47
Dosyayı Görüntüle
@@ -348,89 +348,89 @@ find out what device binary flavors are embedded into the executable?
artifacts on disk. Add the ROCmCC installation folder to your PATH if you
want to use these utilities (the utilities expect them to be on the PATH).
You can list embedded program binaries using ``roc-obj-ls``.
You can list embedded program binaries using ``llvm-objdump`` with
``--offloading`` option.
.. code-block:: bash
roc-obj-ls ./saxpy
llvm-objdump --offloading ./saxpy
It should return something like:
.. code-block:: shell
1 host-x86_64-unknown-linux file://./saxpy#offset=12288&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx803 file://./saxpy#offset=12288&size=9760
./saxpy: file format elf64-x86-64
Extracting offload bundle: ./saxpy.0.host-x86_64-unknown-linux-gnu-
Extracting offload bundle: ./saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942
The compiler embeds a version 4 code object (more on `code
object versions <https://www.llvm.org/docs/AMDGPUUsage.html#code-object-metadata>`_)
and used the LLVM target triple `amdgcn-amd-amdhsa--gfx803` (more on `target triples
and used the LLVM target triple ``amdgcn-amd-amdhsa--gfx942`` (more on `target triples
<https://www.llvm.org/docs/AMDGPUUsage.html#target-triples>`_). You can
extract that program object in a disassembled fashion for human consumption
via ``roc-obj``.
via ``llvm-objdump``.
.. code-block:: bash
roc-obj -t gfx803 -d ./saxpy
llvm-objdump --disassemble saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942 > saxpy.s
This creates two files on disk and ``.s`` extension is of most interest.
Opening this file or dumping it to the console using ``cat``
lets find the disassembled binary of the SAXPY compute kernel, something
similar to:
This creates a file on the disk called ``saxpy.s`` Opening this file or
dumping it to the console using ``cat`` lets find the disassembled binary of
the SAXPY compute kernel, something similar to:
.. code-block::
saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942: file format elf64-amdgpu
Disassembly of section .text:
<_Z12saxpy_kernelfPKfPfj>:
s_load_dword s0, s[4:5], 0x2c // 000000001000: C0020002 0000002C
s_load_dword s1, s[4:5], 0x18 // 000000001008: C0020042 00000018
s_waitcnt lgkmcnt(0) // 000000001010: BF8C007F
s_and_b32 s0, s0, 0xffff // 000000001014: 8600FF00 0000FFFF
s_mul_i32 s6, s6, s0 // 00000000101C: 92060006
v_add_u32_e32 v0, vcc, s6, v0 // 000000001020: 32000006
v_cmp_gt_u32_e32 vcc, s1, v0 // 000000001024: 7D980001
s_and_saveexec_b64 s[0:1], vcc // 000000001028: BE80206A
s_cbranch_execz 22 // 00000000102C: BF880016 <_Z12saxpy_kernelfPKfPfj+0x88>
s_load_dwordx4 s[0:3], s[4:5], 0x8 // 000000001030: C00A0002 00000008
v_mov_b32_e32 v1, 0 // 000000001038: 7E020280
v_lshlrev_b64 v[0:1], 2, v[0:1] // 00000000103C: D28F0000 00020082
s_waitcnt lgkmcnt(0) // 000000001044: BF8C007F
v_mov_b32_e32 v3, s1 // 000000001048: 7E060201
v_add_u32_e32 v2, vcc, s0, v0 // 00000000104C: 32040000
v_addc_u32_e32 v3, vcc, v3, v1, vcc // 000000001050: 38060303
flat_load_dword v2, v[2:3] // 000000001054: DC500000 02000002
v_mov_b32_e32 v3, s3 // 00000000105C: 7E060203
v_add_u32_e32 v0, vcc, s2, v0 // 000000001060: 32000002
v_addc_u32_e32 v1, vcc, v3, v1, vcc // 000000001064: 38020303
flat_load_dword v3, v[0:1] // 000000001068: DC500000 03000000
s_load_dword s0, s[4:5], 0x0 // 000000001070: C0020002 00000000
s_waitcnt vmcnt(0) lgkmcnt(0) // 000000001078: BF8C0070
v_mac_f32_e32 v3, s0, v2 // 00000000107C: 2C060400
flat_store_dword v[0:1], v3 // 000000001080: DC700000 00000300
s_endpgm // 000000001088: BF810000
0000000000001900 <_Z12saxpy_kernelfPKfPfj>:
s_load_dword s3, s[0:1], 0x2c // 000000001900: C00200C0 0000002C
s_load_dword s4, s[0:1], 0x18 // 000000001908: C0020100 00000018
s_waitcnt lgkmcnt(0) // 000000001910: BF8CC07F
s_and_b32 s3, s3, 0xffff // 000000001914: 8603FF03 0000FFFF
s_mul_i32 s2, s2, s3 // 00000000191C: 92020302
v_add_u32_e32 v0, s2, v0 // 000000001920: 68000002
v_cmp_gt_u32_e32 vcc, s4, v0 // 000000001924: 7D980004
s_and_saveexec_b64 s[2:3], vcc // 000000001928: BE82206A
s_cbranch_execz 20 // 00000000192C: BF880014 <_Z12saxpy_kernelfPKfPfj+0x80>
s_load_dwordx4 s[4:7], s[0:1], 0x8 // 000000001930: C00A0100 00000008
v_mov_b32_e32 v1, 0 // 000000001938: 7E020280
v_lshlrev_b64 v[0:1], 2, v[0:1] // 00000000193C: D28F0000 00020082
s_load_dword s0, s[0:1], 0x0 // 000000001944: C0020000 00000000
s_waitcnt lgkmcnt(0) // 00000000194C: BF8CC07F
v_lshl_add_u64 v[2:3], s[4:5], 0, v[0:1] // 000000001950: D2080002 04010004
v_lshl_add_u64 v[0:1], s[6:7], 0, v[0:1] // 000000001958: D2080000 04010006
global_load_dword v4, v[2:3], off // 000000001960: DC508000 047F0002
global_load_dword v5, v[0:1], off // 000000001968: DC508000 057F0000
s_waitcnt vmcnt(0) // 000000001970: BF8C0F70
v_fmac_f32_e32 v5, s0, v4 // 000000001974: 760A0800
global_store_dword v[0:1], v5, off // 000000001978: DC708000 007F0500
s_endpgm // 000000001980: BF810000
s_nop 0 // 000000001984: BF800000
Alternatively, call the compiler with ``--save-temps`` to dump all device
binary to disk in separate files.
.. code-block:: bash
amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps
amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps --offload-arch=gfx942
List all the temporaries created while compiling ``main.hip`` with:
.. code-block:: bash
ls main-hip-amdgcn-amd-amdhsa-*
main-hip-amdgcn-amd-amdhsa-gfx803.bc
main-hip-amdgcn-amd-amdhsa-gfx803.cui
main-hip-amdgcn-amd-amdhsa-gfx803.o
main-hip-amdgcn-amd-amdhsa-gfx803.out
main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt
main-hip-amdgcn-amd-amdhsa-gfx803.s
main-hip-amdgcn-amd-amdhsa-gfx942.bc
main-hip-amdgcn-amd-amdhsa-gfx942.o
main-hip-amdgcn-amd-amdhsa-gfx942.out.resolution.txt
main-hip-amdgcn-amd-amdhsa-gfx942.hipi
main-hip-amdgcn-amd-amdhsa-gfx942.out
main-hip-amdgcn-amd-amdhsa-gfx942.s
Files with the ``.s`` extension hold the disassembled contents of the binary.
The filename notes the graphics IPs used by the compiler. The contents of
this file are similar to what ``roc-obj`` printed to the console.
this file are similar to the `*.s` file created with ``llvm-objdump`` earlier.
.. tab-item:: Linux and NVIDIA
:sync: linux-nvidia
@@ -491,7 +491,7 @@ find out what device binary flavors are embedded into the executable?
We can see that the compiler embedded a version 4 code object (more on code
`object versions <https://www.llvm.org/docs/AMDGPUUsage.html#code-object-metadata>`_) and
used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples
used the LLVM target triple ``amdgcn-amd-amdhsa--gfx906`` (more on `target triples
<https://www.llvm.org/docs/AMDGPUUsage.html#target-triples>`_). Don't be
alarmed about linux showing up as a binary format, AMDGPU binaries uploaded to
the GPU for execution are proper linux ELF binaries in their format.
+19 -110
Dosyayı Görüntüle
@@ -8,95 +8,12 @@
HIP compilers
********************************************************************************
ROCm provides the compiler driver ``hipcc``, that can be used on AMD ROCm and
NVIDIA CUDA platforms.
ROCm provides the compiler tools used to compile HIP applications for use on AMD GPUs.
The compilers set up the default libraries and include paths for the HIP and ROCm
libraries and some needed environment variables. For more information, see the
:doc:`ROCm compiler reference <llvm-project:reference/rocmcc>`.
On ROCm, ``hipcc`` takes care of the following:
- Setting the default library and include paths for HIP
- Setting some environment variables
- Invoking the appropriate compiler - ``amdclang++``
On NVIDIA CUDA platform, ``hipcc`` takes care of invoking compiler ``nvcc``.
``amdclang++`` is based on the ``clang++`` compiler. For more
details, see the :doc:`llvm project<llvm-project:index>`.
HIPCC
================================================================================
Common Compiler Options
--------------------------------------------------------------------------------
The following table shows the most common compiler options supported by
``hipcc``.
.. list-table::
:header-rows: 1
*
- Option
- Description
*
- ``--fgpu-rdc``
- Generate relocatable device code, which allows kernels or device functions
to call device functions in different translation units.
*
- ``-ggdb``
- Equivalent to `-g` plus tuning for GDB. This is recommended when using
ROCm's GDB to debug GPU code.
*
- ``--gpu-max-threads-per-block=<num>``
- Generate code to support up to the specified number of threads per block.
*
- ``-offload-arch=<target>``
- Generate code for the given GPU target.
For a full list of supported compilation targets see the `processor names in AMDGPU's llvm documentation <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
This option can appear multiple times to generate a fat binary for multiple
targets.
The actual support of the platform's runtime may differ.
*
- ``-save-temps``
- Save the compiler generated intermediate files.
*
- ``-v``
- Show the compilation steps.
Linking
--------------------------------------------------------------------------------
``hipcc`` adds the necessary libraries for HIP as well as for the accelerator
compiler (``nvcc`` or ``amdclang++``). We recommend linking with ``hipcc`` since
it automatically links the binary to the necessary HIP runtime libraries.
Linking Code With Other Compilers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``nvcc`` by default uses ``g++`` to generate the host code.
``amdclang++`` generates both device and host code. The code uses the same API
as ``gcc``, which allows code generated by different ``gcc``-compatible
compilers to be linked together. For example, code compiled using ``amdclang++``
can link with code compiled using compilers such as ``gcc``, ``icc`` and
``clang``. Take care to ensure all compilers use the same standard C++ header
and library formats.
libc++ and libstdc++
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``hipcc`` links to ``libstdc++`` by default. This provides better compatibility
between ``g++`` and HIP.
In order to link to ``libc++``, pass ``--stdlib=libc++`` to ``hipcc``.
Generally, libc++ provides a broader set of C++ features while ``libstdc++`` is
the standard for more compilers, notably including ``g++``.
When cross-linking C++ code, any C++ functions that use types from the C++
standard library, such as ``std::string``, ``std::vector`` and other containers,
must use the same standard-library implementation. This includes cross-linking
between ``amdclang++`` and other compilers.
HIP compilation workflow
Compilation workflow
================================================================================
HIP provides a flexible compilation workflow that supports both offline
@@ -115,25 +32,18 @@ performance overhead.
Offline compilation
--------------------------------------------------------------------------------
The HIP code compilation is performed in two stages: host and device code
compilation stage.
Offline compilation is performed in two steps: host and device code
compilation.
- Device-code compilation stage: The compiled device code is embedded into the
- Host-code compilation: On the host side, ``amdclang++`` or ``hipcc`` can
compile the host code in one step without other C++ compilers.
- Device-code compilation: The compiled device code is embedded into the
host object file. Depending on the platform, the device code can be compiled
into assembly or binary. ``nvcc`` and ``amdclang++`` target different
architectures and use different code object formats. ``nvcc`` uses the binary
``cubin`` or the assembly PTX files, while the ``amdclang++`` path is the
binary ``hsaco`` format. On CUDA platforms, the driver compiles the PTX files
to executable code during runtime.
- Host-code compilation stage: On the host side, ``hipcc`` or ``amdclang++`` can
compile the host code in one step without other C++ compilers. On the other
hand, ``nvcc`` only replaces the ``<<<...>>>`` kernel launch syntax with the
appropriate CUDA runtime function call and the modified host code is passed to
the default host compiler.
into assembly or binary.
For an example on how to compile HIP from the command line, see :ref:`SAXPY
tutorial<compiling_on_the_command_line>` .
tutorial <compiling_on_the_command_line>` .
Runtime compilation
--------------------------------------------------------------------------------
@@ -142,27 +52,26 @@ HIP allows you to compile kernels at runtime using the ``hiprtc*`` API. Kernels
are stored as a text string, which is passed to HIPRTC alongside options to
guide the compilation.
For more details, see
:doc:`HIP runtime compiler <../how-to/hip_rtc>`.
For more information, see :doc:`HIP runtime compiler <../how-to/hip_rtc>`.
Static libraries
================================================================================
``hipcc`` supports generating two types of static libraries.
Both ``amdclang++`` and ``hipcc`` support generating two types of static libraries.
- The first type of static library only exports and launches host functions
within the same library and not the device functions. This library type offers
the ability to link with a non-hipcc compiler such as ``gcc``. Additionally,
the ability to link with another compiler such as ``gcc``. Additionally,
this library type contains host objects with device code embedded as fat
binaries. This library type is generated using the flag ``--emit-static-lib``:
.. code-block:: shell
hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
amdclang++ hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out
- The second type of static library exports device functions to be linked by
other code objects by using ``hipcc`` as the linker. This library type
other code objects by using ``amdclang++`` or ``hipcc`` as the linker. This library type
contains relocatable device objects and is generated using ``ar``:
.. code-block:: shell
@@ -171,6 +80,6 @@ Static libraries
ar rcsD libHipDevice.a hipDevice.o
hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
A full example for this can be found in the ROCm-examples, see the examples for
Examples of this can be found in `rocm-examples <https://github.com/ROCm/rocm-examples>`_ under
`static host libraries <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library>`_
or `static device libraries <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library>`_.
+1 -1
Dosyayı Görüntüle
@@ -243,7 +243,7 @@ multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, et
.. _inherent_thread_model:
Hierarchical thread model
---------------------
-------------------------
As previously discussed, all threads of a kernel are uniquely identified by a set
of integral values called thread IDs. The hierarchy consists of three levels: thread,
+1 -1
Dosyayı Görüntüle
@@ -27,7 +27,7 @@ THE SOFTWARE.
#include <tuple>
#include <type_traits>
#endif
/** @addtogroup Execution Execution Management
/** @addtogroup Execution Execution Control
* @{
*/
+2 -2
Dosyayı Görüntüle
@@ -1579,7 +1579,7 @@ typedef enum hipLaunchAttributeID {
typedef union hipLaunchAttributeValue {
char pad[64]; ///< 64 byte padding
hipAccessPolicyWindow
accessPolicyWindow; ///< Value of launch attribute ::hipLaunchAttributePolicyWindow.
accessPolicyWindow; ///< Value of launch attribute ::hipLaunchAttributeAccessPolicyWindow.
int cooperative; ///< Value of launch attribute ::hipLaunchAttributeCooperative. Indicates
///< whether the kernel is cooperative.
int priority; ///< Value of launch attribute :: hipLaunchAttributePriority. Execution priority of
@@ -6561,7 +6561,7 @@ hipError_t hipLinkComplete(hipLinkState_t state, void** hipBinOut, size_t* sizeO
/**
* @brief Creates a linker instance with options.
* @param [in] numOptions Number of options
* @param [in] option Array of options
* @param [in] options Array of options
* @param [in] optionValues Array of option values cast to void*
* @param [out] stateOut hip link state created upon success
*