21b6021848
Co-authored-by: Christophe Paquot <35546540+chrispaquot@users.noreply.github.com>
1237 rindas
46 KiB
ReStructuredText
1237 rindas
46 KiB
ReStructuredText
.. meta::
|
|
:description: This chapter presents how to port the CUDA source code to HIP
|
|
:keywords: AMD, ROCm, HIP, CUDA, driver API, porting, port
|
|
|
|
.. _porting_cuda_code:
|
|
|
|
*******************************************************************************
|
|
Porting NVIDIA CUDA code to HIP
|
|
*******************************************************************************
|
|
|
|
HIP eases the porting of existing NVIDIA CUDA code into the HIP
|
|
environment, enabling you to run your application on AMD GPUs. This topic describes
|
|
the available tools and provides practical suggestions for porting your CUDA
|
|
code and working through common issues.
|
|
|
|
CUDA provides separate driver and runtime APIs, while HIP mostly uses a single API.
|
|
The two CUDA APIs generally provide similar functionality and are mostly interchangeable.
|
|
However, the CUDA driver API provides fine-grained control over kernel-level
|
|
initialization, contexts, and module management, while the runtime API automatically
|
|
manages contexts and modules. The driver API is suitable for applications that require
|
|
tight integration with other systems or advanced control over GPU resources.
|
|
|
|
* Driver API calls begin with the prefix ``cu``, while runtime API calls begin
|
|
with the prefix ``cuda``. For example, the driver API contains
|
|
``cuEventCreate``, while the runtime API contains ``cudaEventCreate``, which
|
|
has similar functionality.
|
|
|
|
* The driver API offers two additional low-level functionalities not exposed by
|
|
the runtime API: module management ``cuModule*`` and context management
|
|
``cuCtx*`` APIs.
|
|
|
|
The HIP runtime API includes corresponding functions for both the CUDA driver and
|
|
the CUDA runtime API. The module and context functionality are available with the
|
|
``hipModule`` and ``hipCtx`` prefixes, and driver API functions are usually
|
|
prefixed with ``hipDrv``.
|
|
|
|
Porting a CUDA project
|
|
======================
|
|
|
|
HIP projects can target either AMD or NVIDIA platforms. HIP is a marshalling language
|
|
that provides a thin-layer mapping to functions in the AMD ROCm language, or to CUDA
|
|
functions. To compile the HIP code, you can use ``amdclang++``, also called HIP-Clang,
|
|
or you can use ``hipcc`` to enable compilation by ``nvcc`` to produce CUDA executables,
|
|
as described in :ref:`compilation_platform`.
|
|
|
|
Because HIP is a marshalling language that can be compiled by ``nvcc``, mixing HIP code
|
|
with CUDA code results in valid application code. This enables users to incrementally port
|
|
a CUDA project to HIP, and still compile and test the code during the transition.
|
|
|
|
The only notable exception is ``hipError_t``, which is not just an alias to
|
|
``cudaError_t``. In these cases, HIP provides functions to convert between the
|
|
error code spaces:
|
|
|
|
* :cpp:func:`hipErrorToCudaError`
|
|
* :cpp:func:`hipErrorToCUResult`
|
|
* :cpp:func:`hipCUDAErrorTohipError`
|
|
* :cpp:func:`hipCUResultTohipError`
|
|
|
|
General Tips
|
|
------------
|
|
|
|
* Starting to port on an NVIDIA machine is often the easiest approach, as the
|
|
code can be tested for functionality and performance even if not fully ported
|
|
to HIP.
|
|
* Once the CUDA code is ported to HIP and is running on the CUDA machine,
|
|
compile the HIP code for an AMD machine.
|
|
* You can handle platform-specific features through conditional compilation as described
|
|
in :ref:`compilation_platform`.
|
|
* Use the `HIPIFY <https://github.com/ROCm/HIPIFY>`_ tools to automatically
|
|
convert CUDA code to HIP, as described in the following section.
|
|
|
|
Using HIPIFY
|
|
============
|
|
|
|
:doc:`HIPIFY <hipify:index>` is a collection of tools that automatically
|
|
translate CUDA code to HIP code. For example, ``cuEventCreate`` is translated to
|
|
:cpp:func:`hipEventCreate`. HIPIFY tools also convert error codes from the
|
|
driver namespace and coding conventions to the equivalent HIP error code.
|
|
HIP unifies the APIs for these common functions.
|
|
|
|
There are two types of HIPIFY available:
|
|
|
|
* :doc:`hipify-clang <hipify:how-to/hipify-clang>` is a Clang-based tool that parses code,
|
|
translates it into an Abstract Syntax Tree, and generates the HIP source. For this,
|
|
``hipify-clang`` needs to be able to actually compile the code, so the CUDA code needs
|
|
to be correct, and a CUDA install with all necessary headers must be provided.
|
|
|
|
* :doc:`hipify-perl <hipify:how-to/hipify-perl>` uses pattern matching, to translate the
|
|
CUDA code to HIP. It does not require a working CUDA installation, and can also
|
|
convert CUDA code, that is not syntactically correct. It is therefore easier to
|
|
set up and use, but is not as powerful as ``hipfiy-clang``.
|
|
|
|
Memory copy functions
|
|
---------------------
|
|
|
|
When copying memory, the CUDA driver includes the memory direction in the name of
|
|
the API (``cuMemcpyHtoD``), while the CUDA runtime API provides a single memory
|
|
copy API with a parameter that specifies the direction. It also supports a
|
|
default direction where the runtime determines the direction automatically.
|
|
|
|
HIP provides both versions, for example, :cpp:func:`hipMemcpyHtoD` as well as
|
|
:cpp:func:`hipMemcpy`. The first version might be faster in some cases because
|
|
it avoids any host overhead to detect the direction of the memory copy.
|
|
|
|
Address spaces
|
|
--------------
|
|
|
|
HIP-Clang defines a process-wide address space where
|
|
the CPU and all devices allocate addresses from a single unified pool.
|
|
This means addresses can be shared between contexts. Unlike CUDA, a new context
|
|
does not create a new address space for the device.
|
|
|
|
Context stack behavior differences
|
|
----------------------------------
|
|
|
|
HIP-Clang creates a primary context when the HIP API is called. In CUDA
|
|
driver API code, HIP-Clang creates a primary context while HIP/NVCC has an empty
|
|
context stack. HIP-Clang pushes the primary context to the context stack when it
|
|
is empty. This can lead to subtle differences in applications which mix the
|
|
runtime and driver APIs.
|
|
|
|
Scanning CUDA source to scope the translation
|
|
---------------------------------------------
|
|
|
|
The ``--examine`` option, tells the hipify tools to do a test-run without changing
|
|
the source files, but instead scanning the files to determine which files contain CUDA code and
|
|
how much of that code can automatically be hipified.
|
|
|
|
There also are ``hipexamine-perl.sh`` or ``hipexamine.sh`` (for
|
|
``hipify-clang``) scripts to automatically scan directories.
|
|
|
|
For example, the following is a scan of one of the ``convolutionSeparable`` sample
|
|
from `cuda-samples <https://github.com/NVIDIA/cuda-samples>`_:
|
|
|
|
.. code-block:: shell
|
|
|
|
> cd Samples/2_Concepts_and_Techniques/convolutionSeparable/
|
|
> hipexamine-perl.sh
|
|
[HIPIFY] info: file './convolutionSeparable.cu' statistics:
|
|
CONVERTED refs count: 2
|
|
TOTAL lines of code: 214
|
|
WARNINGS: 0
|
|
[HIPIFY] info: CONVERTED refs by names:
|
|
cooperative_groups.h => hip/hip_cooperative_groups.h: 1
|
|
cudaMemcpyToSymbol => hipMemcpyToSymbol: 1
|
|
|
|
[HIPIFY] info: file './main.cpp' statistics:
|
|
CONVERTED refs count: 13
|
|
TOTAL lines of code: 174
|
|
WARNINGS: 0
|
|
[HIPIFY] info: CONVERTED refs by names:
|
|
cudaDeviceSynchronize => hipDeviceSynchronize: 2
|
|
cudaFree => hipFree: 3
|
|
cudaMalloc => hipMalloc: 3
|
|
cudaMemcpy => hipMemcpy: 2
|
|
cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
|
|
cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
|
|
cuda_runtime.h => hip/hip_runtime.h: 1
|
|
|
|
[HIPIFY] info: file 'GLOBAL' statistics:
|
|
CONVERTED refs count: 15
|
|
TOTAL lines of code: 512
|
|
WARNINGS: 0
|
|
[HIPIFY] info: CONVERTED refs by names:
|
|
cooperative_groups.h => hip/hip_cooperative_groups.h: 1
|
|
cudaDeviceSynchronize => hipDeviceSynchronize: 2
|
|
cudaFree => hipFree: 3
|
|
cudaMalloc => hipMalloc: 3
|
|
cudaMemcpy => hipMemcpy: 2
|
|
cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
|
|
cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
|
|
cudaMemcpyToSymbol => hipMemcpyToSymbol: 1
|
|
cuda_runtime.h => hip/hip_runtime.h: 1
|
|
|
|
``hipexamine-perl.sh`` reports how many CUDA calls are going to be converted to
|
|
HIP (e.g. ``CONVERTED refs count: 2``), and lists them by name together with
|
|
their corresponding HIP-version (see the lines following ``[HIPIFY] info:
|
|
CONVERTED refs by names:``). It also lists the total lines of code for the file
|
|
and potential warnings. In the end it prints a summary for all files.
|
|
|
|
Automatically converting a CUDA project
|
|
---------------------------------------
|
|
|
|
To directly replace the files, the ``--inplace`` option of ``hipify-perl`` or
|
|
``hipify-clang`` can be used. This creates a backup of the original files in a
|
|
``<filename>.prehip`` file and overwrites the existing files, keeping their file
|
|
endings. If the ``--inplace`` option is not given, the scripts print the
|
|
hipified code to ``stdout``.
|
|
|
|
``hipconvertinplace.sh`` or ``hipconvertinplace-perl.sh`` operate on whole
|
|
directories.
|
|
|
|
Library and driver equivalents
|
|
==============================
|
|
|
|
ROCm provides libraries to ease porting of code relying on CUDA libraries or the CUDA driver API.
|
|
Most CUDA libraries have a corresponding HIP library. For more information,
|
|
see either :doc:`ROCm libraries <rocm:reference/api-libraries>` or :doc:`HIPIFY CUDA compatible libraries <hipify:reference/supported_apis>`.
|
|
|
|
There are two flavours of libraries provided by ROCm, libraries prefixed with ``hip``
|
|
and libraries prefixed with ``roc``. While both are written using HIP, in general
|
|
only the ``hip``-libraries are portable. The libraries with the ``roc``-prefix
|
|
might also run on CUDA-capable GPUs, however they have been optimized for AMD
|
|
GPUs and might use assembly code or a different API, to achieve the best
|
|
performance.
|
|
|
|
In the case where a library provides both ``roc`` and ``hip`` versions, such as
|
|
``hipSparse`` and ``rocSparse``, the ``hip`` version is a marshalling library,
|
|
which is just a thin layer that redirects function calls to either the
|
|
``roc`` library or the corresponding CUDA library, depending on the target platform.
|
|
|
|
.. note::
|
|
|
|
If the application is only required to run on AMD GPUs, it is recommended to use
|
|
the ``roc``-libraries. In hipify tools, this can be accomplished using the ``--roc`` option.
|
|
|
|
cuModule and hipModule
|
|
----------------------
|
|
|
|
The ``cuModule`` feature of the driver API provides additional control over how and
|
|
when accelerator code objects are loaded. For example, the driver API enables
|
|
code objects to be loaded from files or memory pointers. Symbols for kernels or
|
|
global data are extracted from the loaded code objects. In contrast, the runtime
|
|
API loads automatically and, if necessary, compiles all the kernels from an
|
|
executable binary when it runs. In this mode, kernel code must be compiled using
|
|
NVCC so that automatic loading can function correctly.
|
|
|
|
The Module features are useful in an environment that generates the code objects
|
|
directly, such as a new accelerator language front end. NVCC is not used here.
|
|
Instead, the environment might have a different kernel language or compilation
|
|
flow. Other environments have many kernels and don't want all of them to be
|
|
loaded automatically. The Module functions load the generated code objects and
|
|
launch kernels.
|
|
|
|
Like the ``cuModule`` API, the ``hipModule`` API provides additional control
|
|
over code object management, including options to load code from files or from
|
|
in-memory pointers.
|
|
|
|
NVCC and HIP-Clang target different architectures and use different code object
|
|
formats. NVCC supports ``cubin`` or ``ptx`` files, while the HIP-Clang path uses
|
|
the ``hsaco`` format. The external compilers that generate these code objects are
|
|
responsible for generating and loading the correct code object for each platform.
|
|
Notably, there is no fat binary format that can contain code for both NVCC and
|
|
HIP-Clang platforms. The following table summarizes the formats used on each
|
|
platform:
|
|
|
|
.. list-table:: Module formats
|
|
:header-rows: 1
|
|
|
|
* - Format
|
|
- APIs
|
|
- NVCC
|
|
- HIP-CLANG
|
|
* - Code object
|
|
- ``hipModuleLoad``, ``hipModuleLoadData``
|
|
- ``.cubin`` or PTX text
|
|
- ``.hsaco``
|
|
* - Fat binary
|
|
- ``hipModuleLoadFatBin``
|
|
- ``.fatbin``
|
|
- ``.hip_fatbin``
|
|
|
|
|
|
``hipcc`` uses HIP-Clang or NVCC to compile host code. Both of these compilers can
|
|
embed code objects into the final executable. These code objects are automatically
|
|
loaded when the application starts. The ``hipModule`` API can be used to load
|
|
additional code objects. When used this way, it extends the capability of the
|
|
automatically loaded code objects. HIP-Clang enables both of these capabilities to
|
|
be used together. Of course, it is possible to create a program with no kernels and
|
|
no automatic loading.
|
|
|
|
For ``hipModule`` API reference content, see :ref:`module_management_reference`.
|
|
|
|
Using hipModuleLaunchKernel
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Both CUDA driver and runtime APIs define a function for launching kernels,
|
|
called ``cuLaunchKernel`` or ``cudaLaunchKernel``. The equivalent API in HIP is
|
|
:cpp:func:`hipModuleLaunchKernel`. The kernel arguments and the execution
|
|
configuration (grid dimensions, group dimensions, dynamic shared memory, and
|
|
stream) are passed as arguments to the launch function.
|
|
|
|
The HIP runtime API additionally supports the triple chevron (``<<< >>>``) syntax for launching
|
|
kernels, which resembles a special function call and is easier to use than the
|
|
explicit launch API, especially when handling kernel arguments.
|
|
|
|
.. _context_driver_api:
|
|
|
|
cuCtx and hipCtx
|
|
----------------
|
|
|
|
The CUDA driver API defines "Context" and "Devices" as separate entities.
|
|
Contexts contain a single device, and a device can theoretically have multiple contexts.
|
|
Each context contains a set of streams and events specific to the context.
|
|
The ``cuCtx`` API also provide a mechanism to switch between devices, which enables a
|
|
single CPU thread to send commands to different GPUs. HIP and recent versions of the
|
|
CUDA Runtime provide other mechanisms to accomplish this, such as using streams or ``cudaSetDevice``.
|
|
|
|
On the other hand, the CUDA runtime API unifies the Context API with the Device API. This simplifies the
|
|
APIs and has little loss of functionality because each context can contain a
|
|
single device, and the benefits of multiple contexts have been replaced with other interfaces.
|
|
|
|
HIP provides a Context API as a thin layer over the existing device functions to facilitate
|
|
easy porting from existing driver API code. The ``hipCtx`` functions largely provide an
|
|
alternate syntax for changing the active device. The ``hipCtx`` API can be used to set the
|
|
current context or to query properties of the device associated with the context. The current
|
|
context is implicitly used by other APIs, such as ``hipStreamCreate``.
|
|
|
|
.. note::
|
|
The ``hipCtx`` API is **deprecated** and its use is discouraged. Most new applications use
|
|
``hipSetDevice`` or the ``hipStream`` APIs. For more details on deprecated APIs, see :doc:`../reference/deprecated_api_list`.
|
|
|
|
.. _compilation_platform:
|
|
|
|
Compilation and platforms
|
|
=========================
|
|
|
|
HIP projects can target either AMD or NVIDIA platforms. The platform affects
|
|
which backend-headers are included and which libraries are used for linking. The
|
|
created binaries are not portable between AMD and NVIDIA platforms, and instead
|
|
must be separately compiled.
|
|
|
|
``hipcc`` is a portable compiler driver that calls ``amdclang++`` (on AMD systems)
|
|
or ``nvcc`` (on NVIDIA systems), passing the necessary options to the target
|
|
compiler. Tools that call ``hipcc`` must ensure the compiler options are appropriate
|
|
for the target compiler.
|
|
|
|
``hipconfig`` is a helpful tool for identifying the current system's platform,
|
|
compiler and runtime. It can also help set options appropriately. As an example,
|
|
``hipconfig`` can provide a path to HIP, in Makefiles:
|
|
|
|
.. code-block:: shell
|
|
|
|
HIP_PATH ?= $(shell hipconfig --path)
|
|
|
|
.. note::
|
|
You can use ``amdclang++`` to target NVIDIA systems, but you must manually specify
|
|
the required compiler options.
|
|
|
|
HIP Headers
|
|
-----------
|
|
|
|
The ``hip_runtime.h`` headers define all the necessary types, functions, macros,
|
|
etc., needed to compile a HIP program, this includes host as well as device
|
|
code. ``hip_runtime_api.h`` is a subset of ``hip_runtime.h``.
|
|
|
|
CUDA has slightly different contents for these two files. In some cases you might
|
|
need to convert hipified code to include the richer ``hip_runtime.h`` instead of
|
|
``hip_runtime_api.h``.
|
|
|
|
Using a Standard C++ Compiler
|
|
-----------------------------
|
|
|
|
A source file that is only calling HIP APIs but neither defines nor launches
|
|
any kernels can be compiled with a standard C or C++ compiler (GCC or MSVC for example )
|
|
even when ``hip_runtime_api.h`` or ``hip_runtime.h`` are included. The HIP include
|
|
paths and platform macros (``__HIP_PLATFORM_AMD__`` or ``__HIP_PLATFORM_NVIDIA__``)
|
|
must be passed to the compiler.
|
|
|
|
``hipconfig`` can help define the necessary options, for example on an AMD
|
|
platform:
|
|
|
|
.. code-block:: bash
|
|
|
|
hipconfig --cpp_config
|
|
-D__HIP_PLATFORM_AMD__= -I/opt/rocm/include
|
|
|
|
``nvcc`` includes some headers by default. ``hipcc`` does not include
|
|
default headers, and instead you must explicitly include all required files.
|
|
|
|
.. note::
|
|
The ``hipify`` tool automatically converts ``cuda_runtime.h`` to ``hip_runtime.h``,
|
|
and it converts ``cuda_runtime_api.h`` to ``hip_runtime_api.h``, but it may
|
|
miss nested headers or macros.
|
|
|
|
Compiler defines for HIP and CUDA
|
|
---------------------------------
|
|
|
|
C++-macros can be used to write code that is specific to a platform. This
|
|
section lists macros defined by compilers and the HIP/CUDA APIs,
|
|
and the compiler/platform combinations that define them.
|
|
|
|
The following table lists the macros that can be used when compiling HIP. Most
|
|
of these macros are not directly defined by the compilers, but in
|
|
``hip_common.h``, which is included by ``hip_runtime.h``.
|
|
|
|
.. list-table:: HIP-related defines
|
|
:header-rows: 1
|
|
|
|
*
|
|
- Macro
|
|
- ``amdclang++``
|
|
- ``nvcc`` when used as backend for ``hipcc``
|
|
- Other (GCC, MSVC, Clang, etc.)
|
|
*
|
|
- ``__HIP_PLATFORM_AMD__``
|
|
- Defined (see :ref:`identifying_compiler_target`)
|
|
- Undefined
|
|
- Undefined, needs to be set explicitly
|
|
*
|
|
- ``__HIP_PLATFORM_NVIDIA__``
|
|
- Undefined
|
|
- Defined (see :ref:`identifying_compiler_target`)
|
|
- Undefined, needs to be set explicitly
|
|
*
|
|
- ``__HIPCC__``
|
|
- Defined when compiling ``.hip`` files or specifying ``-x hip``
|
|
- Defined when compiling ``.hip`` files or specifying ``-x hip``
|
|
- Undefined
|
|
*
|
|
- ``__HIP_DEVICE_COMPILE__``
|
|
- 1 if compiling for device
|
|
undefined if compiling for host
|
|
- 1 if compiling for device
|
|
undefined if compiling for host
|
|
- Undefined
|
|
*
|
|
- ``__HIP_ARCH_<FEATURE>__``
|
|
- 0 or 1 depending on feature support of targeted hardware (see :ref:`identifying_device_architecture_features`)
|
|
- 0 or 1 depending on feature support of targeted hardware
|
|
- 0
|
|
*
|
|
- ``__HIP__``
|
|
- Defined when compiling ``.hip`` files or specifying ``-x hip``
|
|
- Undefined
|
|
- Undefined
|
|
|
|
The following table lists macros related to ``nvcc`` and CUDA as HIP backend.
|
|
|
|
.. list-table:: NVCC-related defines
|
|
:header-rows: 1
|
|
|
|
*
|
|
- Macro
|
|
- ``amdclang++``
|
|
- ``nvcc`` when used as backend for ``hipcc``
|
|
- Other (GCC, MSVC, Clang, etc.)
|
|
*
|
|
- ``__CUDACC__``
|
|
- Undefined
|
|
- Defined
|
|
- Undefined
|
|
(Clang defines this when explicitly compiling CUDA code)
|
|
*
|
|
- ``__NVCC__``
|
|
- Undefined
|
|
- Defined
|
|
- Undefined
|
|
*
|
|
- ``__CUDA_ARCH__`` [#cuda_arch]_
|
|
- Undefined
|
|
- Defined in device code
|
|
Integer representing compute capability
|
|
Must not be used in host code
|
|
- Undefined
|
|
|
|
.. [#cuda_arch] the use of ``__CUDA_ARCH__`` to check for hardware features is
|
|
discouraged, as this is not portable. Use the ``__HIP_ARCH_HAS_<FEATURE>``
|
|
macros instead.
|
|
|
|
.. _identifying_compiler_target:
|
|
|
|
Identifying the compilation target platform
|
|
-------------------------------------------
|
|
|
|
With HIP's portability, you might need to provide platform-specific code, or enable
|
|
platform-specific performance improvements.
|
|
|
|
For this, the ``__HIP_PLATFORM_AMD__`` and ``__HIP_PLATFORM_NVIDIA__`` macros
|
|
can be used, e.g.:
|
|
|
|
.. code-block:: cpp
|
|
|
|
#ifdef __HIP_PLATFORM_AMD__
|
|
// This code path is compiled when amdclang++ is used for compilation
|
|
#endif
|
|
|
|
.. code-block:: cpp
|
|
|
|
#ifdef __HIP_PLATFORM_NVIDIA__
|
|
// This code path is compiled when nvcc is used for compilation
|
|
// Could be compiling with CUDA language extensions enabled (for example, a ".cu file)
|
|
// Could be in pass-through mode to an underlying host compiler (for example, a .cpp file)
|
|
#endif
|
|
|
|
When using ``hipcc``, the environment variable ``HIP_PLATFORM`` specifies the
|
|
runtime to use. When an AMD graphics driver and an AMD GPU is detected,
|
|
``HIP_PLATFORM`` is set to ``amd``. If both runtimes are installed, and a
|
|
specific one should be used, or ``hipcc`` can't detect the runtime, the
|
|
environment variable has to be set manually.
|
|
|
|
To explicitly use the CUDA compilation path, use:
|
|
|
|
.. code-block:: bash
|
|
|
|
export HIP_PLATFORM=nvidia
|
|
hipcc main.cpp
|
|
|
|
Identifying host or device compilation pass
|
|
-------------------------------------------
|
|
|
|
``amdclang++`` makes multiple passes over the code: one pass for the host code, and
|
|
for the device code one pass for each GPU architecture to be compiled for.
|
|
``nvcc`` only makes two passes over the code: one for the host code and one for the
|
|
device code.
|
|
|
|
The ``__HIP_DEVICE_COMPILE__`` macro is defined when the compiler is compiling
|
|
for the device. This macro is a portable check that can replace the
|
|
``__CUDA_ARCH__`` macro.
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include "hip/hip_runtime.h"
|
|
#include <iostream>
|
|
|
|
__host__ __device__ void call_func(){
|
|
#ifdef __HIP_DEVICE_COMPILE__
|
|
printf("device\n");
|
|
#else
|
|
std::cout << "host" << std::endl;
|
|
#endif
|
|
}
|
|
|
|
__global__ void test_kernel(){
|
|
call_func();
|
|
}
|
|
|
|
int main(int argc, char** argv) {
|
|
test_kernel<<<1, 1, 0, 0>>>();
|
|
|
|
call_func();
|
|
}
|
|
|
|
HIP-Clang implementation notes
|
|
==============================
|
|
|
|
HIP-Clang links device code from different translation units together. For each
|
|
device target, it generates a code object. ``clang-offload-bundler`` bundles
|
|
code objects for different device targets into one fat binary, which is embedded
|
|
as the global symbol ``__hip_fatbin`` in the ``.hip_fatbin`` section of the ELF
|
|
file of the executable or shared object.
|
|
|
|
Initialization and termination functions
|
|
----------------------------------------
|
|
|
|
HIP-Clang generates initialization and termination functions for each
|
|
translation unit for host code compilation. The initialization functions call
|
|
``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file.
|
|
They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register
|
|
kernel functions and device-side global variables. The termination functions
|
|
call ``__hipUnregisterFatBinary``.
|
|
|
|
HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**``
|
|
with ``linkonce`` linkage and an initial value of 0 for each host translation
|
|
unit. Each initialization function checks ``__hip_gpubin_handle`` and registers
|
|
the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value
|
|
of ``__hip_gpubin_handle`` to ``__hip_gpubin_handle``. This ensures that the fat
|
|
binary is registered once. A similar check is performed in the termination
|
|
functions.
|
|
|
|
Kernel launching
|
|
----------------
|
|
|
|
HIP-Clang supports kernel launching using either the triple chevron (``<<<>>>``) syntax,
|
|
:cpp:func:`hipLaunchKernel`, or :cpp:func:`hipLaunchKernelGGL`. The last option is a macro that
|
|
expands to the ``<<<>>>`` syntax by default. It can also be turned into a template by
|
|
defining ``HIP_TEMPLATE_KERNEL_LAUNCH``.
|
|
|
|
When the executable or shared library is loaded by the dynamic linker, the
|
|
initialization functions are called. In the initialization functions, the code
|
|
objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is
|
|
called. When ``__hipRegisterFunction`` is called, the stub functions are
|
|
associated with the corresponding kernels in the code objects.
|
|
|
|
HIP-Clang implements two sets of APIs for launching kernels.
|
|
By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code,
|
|
it first calls :cpp:func:`hipConfigureCall` to set up the threads and grids. It then
|
|
calls the stub function with the given arguments. The stub function calls
|
|
:cpp:func:`hipSetupArgument` for each kernel argument, then calls :cpp:func:`hipLaunchByPtr`
|
|
with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual
|
|
kernel associated with the stub function is launched.
|
|
|
|
NVCC implementation notes
|
|
=========================
|
|
|
|
CUDA applications can mix CUDA code with HIP code (see the
|
|
example below). The table shows the equivalent CUDA and HIP types
|
|
required to implement this interaction.
|
|
|
|
.. list-table:: Equivalence table between HIP and CUDA types
|
|
:header-rows: 1
|
|
|
|
* - HIP type
|
|
- CU Driver type
|
|
- CUDA Runtime type
|
|
* - :cpp:type:`hipModule_t`
|
|
- ``CUmodule``
|
|
-
|
|
* - :cpp:type:`hipFunction_t`
|
|
- ``CUfunction``
|
|
-
|
|
* - :cpp:type:`hipCtx_t`
|
|
- ``CUcontext``
|
|
-
|
|
* - :cpp:type:`hipDevice_t`
|
|
- ``CUdevice``
|
|
-
|
|
* - :cpp:type:`hipStream_t`
|
|
- ``CUstream``
|
|
- ``cudaStream_t``
|
|
* - :cpp:type:`hipEvent_t`
|
|
- ``CUevent``
|
|
- ``cudaEvent_t``
|
|
* - :cpp:type:`hipArray_t`
|
|
- ``CUarray``
|
|
- ``cudaArray``
|
|
|
|
Compilation options
|
|
-------------------
|
|
|
|
The :cpp:type:`hipModule_t` interface does not support the ``cuModuleLoadDataEx`` function,
|
|
which is used to control PTX compilation options. HIP-Clang does not use PTX, so
|
|
it does not support these compilation options. In fact, HIP-Clang code objects contain
|
|
fully compiled code for a device-specific instruction set and don't require additional
|
|
compilation as a part of the load step. The corresponding HIP function :cpp:func:`hipModuleLoadDataEx`
|
|
behaves like :cpp:func:`hipModuleLoadData` on the HIP-Clang path (where compilation options
|
|
are not used) and like ``cuModuleLoadDataEx`` on the NVCC path.
|
|
|
|
For example:
|
|
|
|
.. tab-set::
|
|
|
|
.. tab-item:: HIP
|
|
|
|
.. code-block:: cpp
|
|
|
|
hipModule_t module;
|
|
void *imagePtr = ...; // Somehow populate data pointer with code object
|
|
|
|
const int numOptions = 1;
|
|
hipJitOption options[numOptions];
|
|
void *optionValues[numOptions];
|
|
|
|
options[0] = hipJitOptionMaxRegisters;
|
|
unsigned maxRegs = 15;
|
|
optionValues[0] = (void *)(&maxRegs);
|
|
|
|
// hipModuleLoadData(module, imagePtr) will be called on HIP-Clang path, JIT
|
|
// options will not be used, and cuModuleLoadDataEx(module, imagePtr,
|
|
// numOptions, options, optionValues) will be called on NVCC path
|
|
hipModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);
|
|
|
|
hipFunction_t k;
|
|
hipModuleGetFunction(&k, module, "myKernel");
|
|
|
|
.. tab-item:: CUDA
|
|
|
|
.. code-block:: cpp
|
|
|
|
CUmodule module;
|
|
void *imagePtr = ...; // Somehow populate data pointer with code object
|
|
|
|
const int numOptions = 1;
|
|
CUJit_option options[numOptions];
|
|
void *optionValues[numOptions];
|
|
|
|
options[0] = CU_JIT_MAX_REGISTERS;
|
|
unsigned maxRegs = 15;
|
|
optionValues[0] = (void *)(&maxRegs);
|
|
|
|
cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);
|
|
|
|
CUfunction k;
|
|
cuModuleGetFunction(&k, module, "myKernel");
|
|
|
|
The sample below shows how to use :cpp:func:``hipModuleGetFunction``.
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <hip/hip_runtime.h>
|
|
#include <hip/hip_runtime_api.h>
|
|
|
|
#include <vector>
|
|
|
|
int main() {
|
|
|
|
size_t elements = 64*1024;
|
|
size_t size_bytes = elements * sizeof(float);
|
|
|
|
std::vector<float> A(elements), B(elements);
|
|
|
|
// On NVIDIA platforms the driver runtime needs to be initiated
|
|
#ifdef __HIP_PLATFORM_NVIDIA__
|
|
hipInit(0);
|
|
hipDevice_t device;
|
|
hipCtx_t context;
|
|
HIPCHECK(hipDeviceGet(&device, 0));
|
|
HIPCHECK(hipCtxCreate(&context, 0, device));
|
|
#endif
|
|
|
|
// Allocate device memory
|
|
hipDeviceptr_t d_A, d_B;
|
|
HIPCHECK(hipMalloc(&d_A, size_bytes));
|
|
HIPCHECK(hipMalloc(&d_B, size_bytes));
|
|
|
|
// Copy data to device
|
|
HIPCHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
|
|
HIPCHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
|
|
|
|
// Load module
|
|
hipModule_t Module;
|
|
// For AMD the module file has to contain architecture specific object codee
|
|
// For NVIDIA the module file has to contain PTX, found in e.g. "vcpy_isa.ptx"
|
|
HIPCHECK(hipModuleLoad(&Module, "vcpy_isa.co"));
|
|
// Get kernel function from the module via its name
|
|
hipFunction_t Function;
|
|
HIPCHECK(hipModuleGetFunction(&Function, Module, "hello_world"));
|
|
|
|
// Create buffer for kernel arguments
|
|
std::vector<void*> argBuffer{&d_A, &d_B};
|
|
size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
|
|
|
|
// Create configuration passed to the kernel as arguments
|
|
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
|
|
HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes, HIP_LAUNCH_PARAM_END};
|
|
|
|
int threads_per_block = 128;
|
|
int blocks = (elements + threads_per_block - 1) / threads_per_block;
|
|
|
|
// Actually launch kernel
|
|
HIPCHECK(hipModuleLaunchKernel(Function, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
|
|
|
|
HIPCHECK(hipMemcpyDtoH(A.data(), d_A, elements));
|
|
HIPCHECK(hipMemcpyDtoH(B.data(), d_B, elements));
|
|
|
|
#ifdef __HIP_PLATFORM_NVIDIA__
|
|
HIPCHECK(hipCtxDetach(context));
|
|
#endif
|
|
|
|
HIPCHECK(hipFree(d_A));
|
|
HIPCHECK(hipFree(d_B));
|
|
|
|
return 0;
|
|
}
|
|
|
|
.. _identifying_device_architecture_features:
|
|
|
|
Identifying device architecture and features
|
|
============================================
|
|
|
|
GPUs of different generations and architectures do not provide the same
|
|
level of :doc:`hardware feature support <../reference/hardware_features>`. To
|
|
guard device code that uses architecture-dependent features, the
|
|
``__HIP_ARCH_<FEATURE>__`` C++-macros can be used, as described below.
|
|
|
|
Device code feature identification
|
|
----------------------------------
|
|
|
|
Some CUDA code tests ``__CUDA_ARCH__`` for a specific value to determine whether
|
|
the GPU supports a certain architectural feature, depending on its compute
|
|
capability. This requires knowledge about what ``__CUDA_ARCH__`` supports what
|
|
feature set.
|
|
|
|
HIP simplifies this, by replacing these macros with feature-specific macros, not
|
|
architecture specific.
|
|
|
|
For instance,
|
|
|
|
.. code-block:: cpp
|
|
|
|
//#if __CUDA_ARCH__ >= 130 // does not properly specify what feature is required, not portable
|
|
#if __HIP_ARCH_HAS_DOUBLES__ == 1 // explicitly specifies what feature is required, portable between AMD and NVIDIA GPUs
|
|
// device code
|
|
#endif
|
|
|
|
For host code, the ``__HIP_ARCH_<FEATURE>__`` defines are set to 0, if
|
|
``hip_runtime.h`` is included, and undefined otherwise. It should not be relied
|
|
upon in host code.
|
|
|
|
Host code feature identification
|
|
--------------------------------
|
|
|
|
The host code must not rely on the ``__HIP_ARCH_<FEATURE>__`` macros, because the
|
|
GPUs available to a system are not known during compile time, and their
|
|
architectural features differ. Alternatively, the host code can query architecture
|
|
feature flags during runtime by using :cpp:func:`hipGetDeviceProperties`
|
|
or :cpp:func:`hipDeviceGetAttribute`.
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <hip/hip_runtime.h>
|
|
#include <cstdlib>
|
|
#include <iostream>
|
|
|
|
#define HIP_CHECK(expression) { \
|
|
const hipError_t err = expression; \
|
|
if (err != hipSuccess){ \
|
|
std::cout << "HIP Error: " << hipGetErrorString(err)) \
|
|
<< " at line " << __LINE__ << std::endl; \
|
|
std::exit(EXIT_FAILURE); \
|
|
} \
|
|
}
|
|
|
|
int main(){
|
|
int deviceCount;
|
|
HIP_CHECK(hipGetDeviceCount(&deviceCount));
|
|
|
|
int device = 0; // Query first available GPU. Can be replaced with any
|
|
// integer up to, not including, deviceCount
|
|
hipDeviceProp_t deviceProp;
|
|
HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));
|
|
|
|
std::cout << "The queried device ";
|
|
if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
|
|
std::cout << "supports";
|
|
else
|
|
std::cout << "does not support";
|
|
std::cout << " shared int32 atomic operations" << std::endl;
|
|
}
|
|
|
|
Feature macros and properties
|
|
-----------------------------
|
|
|
|
The following table lists the feature macros that HIP supports,
|
|
alongside corresponding device properties that can be queried from the host code.
|
|
|
|
.. list-table::
|
|
:header-rows: 1
|
|
|
|
*
|
|
- Macro (for device code)
|
|
- Device property (for host runtime query)
|
|
- Comment
|
|
*
|
|
- ``__HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__``
|
|
- ``hasGlobalInt32Atomics``
|
|
- 32-bit integer atomics for global memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__``
|
|
- ``hasGlobalFloatAtomicExch``
|
|
- 32-bit float atomic exchange for global memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_SHARED_INT32_ATOMICS__``
|
|
- ``hasSharedInt32Atomics``
|
|
- 32-bit integer atomics for shared memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__``
|
|
- ``hasSharedFloatAtomicExch``
|
|
- 32-bit float atomic exchange for shared memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__``
|
|
- ``hasFloatAtomicAdd``
|
|
- 32-bit float atomic add in global and shared memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__``
|
|
- ``hasGlobalInt64Atomics``
|
|
- 64-bit integer atomics for global memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_SHARED_INT64_ATOMICS__``
|
|
- ``hasSharedInt64Atomics``
|
|
- 64-bit integer atomics for shared memory
|
|
*
|
|
- ``__HIP_ARCH_HAS_DOUBLES__``
|
|
- ``hasDoubles``
|
|
- Double-precision floating-point operations
|
|
*
|
|
- ``__HIP_ARCH_HAS_WARP_VOTE__``
|
|
- ``hasWarpVote``
|
|
- Warp vote instructions (``any``, ``all``)
|
|
*
|
|
- ``__HIP_ARCH_HAS_WARP_BALLOT__``
|
|
- ``hasWarpBallot``
|
|
- Warp ballot instructions
|
|
*
|
|
- ``__HIP_ARCH_HAS_WARP_SHUFFLE__``
|
|
- ``hasWarpShuffle``
|
|
- Warp shuffle operations (``shfl_*``)
|
|
*
|
|
- ``__HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__``
|
|
- ``hasFunnelShift``
|
|
- Funnel shift two input words into one
|
|
*
|
|
- ``__HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__``
|
|
- ``hasThreadFenceSystem``
|
|
- :cpp:func:`threadfence_system`
|
|
*
|
|
- ``__HIP_ARCH_HAS_SYNC_THREAD_EXT__``
|
|
- ``hasSyncThreadsExt``
|
|
- :cpp:func:`syncthreads_count`, :cpp:func:`syncthreads_and`, :cpp:func:`syncthreads_or`
|
|
*
|
|
- ``__HIP_ARCH_HAS_SURFACE_FUNCS__``
|
|
- ``hasSurfaceFuncs``
|
|
- Supports :ref:`surface functions <surface_object_reference>`.
|
|
*
|
|
- ``__HIP_ARCH_HAS_3DGRID__``
|
|
- ``has3dGrid``
|
|
- Grids and groups are 3D
|
|
*
|
|
- ``__HIP_ARCH_HAS_DYNAMIC_PARALLEL__``
|
|
- ``hasDynamicParallelism``
|
|
- Ability to launch a kernel from within a kernel
|
|
|
|
warpSize
|
|
========
|
|
|
|
Code should not assume a warp size of 32 or 64, as that is not portable between
|
|
platforms and architectures. The ``warpSize`` built-in should be used in device
|
|
code, while the host can query it during runtime via the device properties. See
|
|
the :ref:`HIP language extension for warpSize <warp_size>` for information on
|
|
how to write portable warpSize-aware code.
|
|
|
|
Lane masks bit-shift
|
|
====================
|
|
|
|
A thread in a warp is also called a lane, and a lane mask is a bitmask where
|
|
each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
|
|
0 if it's inactive. Bit-shift operations are typically used to create lane masks
|
|
and on AMD GPUs the ``warpSize`` can differ between different architectures,
|
|
that's why it's essential to use correct bitmask type, when porting code.
|
|
|
|
Example:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Get the thread's position in the warp
|
|
unsigned int laneId = threadIdx.x % warpSize;
|
|
|
|
// Use lane ID for bit-shift
|
|
val & ((1 << (threadIdx.x % warpSize) )-1 );
|
|
|
|
// Shift 32 bit integer with val variable
|
|
WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
|
|
|
|
Lane masks are 32-bit integer types as this is the integer precision that C
|
|
assigns to such constants by default. GCN/CDNA architectures have a warp size of
|
|
64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain
|
|
values greater than 31. Consequently, shifting by such values would clear the
|
|
32-bit register to which the shift operation is applied. For AMD
|
|
architectures, a straightforward fix could look as follows:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Get the thread's position in the warp
|
|
unsigned int laneId = threadIdx.x % warpSize;
|
|
|
|
// Use lane ID for bit-shift
|
|
val & ((1ull << (threadIdx.x % warpSize) )-1 );
|
|
|
|
// Shift 64 bit integer with val variable
|
|
WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
|
|
|
|
For portability reasons, it is better to introduce appropriately
|
|
typed placeholders as shown below:
|
|
|
|
.. code-block:: cpp
|
|
|
|
#if defined(__GFX8__) || defined(__GFX9__)
|
|
typedef uint64_t lane_mask_t;
|
|
#else
|
|
typedef uint32_t lane_mask_t;
|
|
#endif
|
|
|
|
The use of :code:`lane_mask_t` with the previous example:
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Get the thread's position in the warp
|
|
unsigned int laneId = threadIdx.x % warpSize;
|
|
|
|
// Use lane ID for bit-shift
|
|
val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
|
|
|
|
// Shift 32 or 64 bit integer with val variable
|
|
WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
|
|
|
|
Porting from CUDA __launch_bounds__
|
|
===================================
|
|
|
|
CUDA defines a ``__launch_bounds__`` qualifier which works similarly to the HIP
|
|
implementation, however, it uses different parameters:
|
|
|
|
.. code-block:: cpp
|
|
|
|
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
|
|
|
|
``MAX_THREADS_PER_BLOCK`` is the same in CUDA and in HIP. However, ``MIN_BLOCKS_PER_MULTIPROCESSOR`` in CUDA
|
|
must be converted to ``MIN_WARPS_PER_EXECUTION_UNIT`` in HIP, which uses warps and execution units
|
|
rather than blocks and multiprocessors. This conversion can be done manually with the equation
|
|
considering the GPU's configuration mode.
|
|
|
|
* In Compute Unit (CU) mode, typical of CDNA:
|
|
|
|
.. code-block:: cpp
|
|
|
|
MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / (warpSize * 2)
|
|
|
|
* In Workgroup Processor (WGP) mode, a feature of RDNA:
|
|
|
|
.. code-block:: cpp
|
|
|
|
MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / (warpSize * 4)
|
|
|
|
Directly controlling the warps per execution unit makes it easier to reason about the occupancy,
|
|
unlike with blocks, where the occupancy depends on the block size.
|
|
|
|
The use of execution units rather than multiprocessors also provides support for
|
|
architectures with multiple execution units per multiprocessor. For example, the
|
|
AMD GCN architecture has 4 execution units per multiprocessor.
|
|
|
|
maxregcount
|
|
-----------
|
|
|
|
The ``nvcc`` compiler will predict the number of registers per thread based on the launch bounds calculation.
|
|
``--maxregcount X`` can be used to override the compiler's decision by enforcing a hard number of registers
|
|
(``X``) that the compiler must not exceed. If the compiler is unable to meet this requirement, it will place
|
|
additional "registers" into memory instead of using hardware registers.
|
|
|
|
Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option. You are encouraged to use
|
|
the ``__launch_bounds__`` directive since the parameters are more intuitive and portable than micro-architecture
|
|
details like registers. The directive allows per-kernel control.
|
|
|
|
Driver entry point access
|
|
=========================
|
|
|
|
The HIP runtime provides support for CUDA driver entry point access when using
|
|
CUDA 12.0 or later. This feature lets developers interact directly with the
|
|
CUDA driver API, providing more control over GPU operations.
|
|
|
|
Driver entry point access provides several features:
|
|
|
|
* Retrieving the address of a runtime function
|
|
* Requesting the default stream version on a per-thread basis
|
|
* Accessing HIP features on older toolkits with a newer driver
|
|
|
|
For more information on driver entry point access, see :cpp:func:`hipGetProcAddress`.
|
|
|
|
Address retrieval
|
|
-----------------
|
|
|
|
The :cpp:func:`hipGetProcAddress` function can be used to obtain the address of
|
|
a runtime function. This is demonstrated in the following example:
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <hip/hip_runtime.h>
|
|
#include <hip/hip_runtime_api.h>
|
|
|
|
#include <iostream>
|
|
|
|
typedef hipError_t (*hipInit_t)(unsigned int);
|
|
|
|
int main() {
|
|
// Initialize the HIP runtime
|
|
hipError_t res = hipInit(0);
|
|
if (res != hipSuccess) {
|
|
std::cerr << "Failed to initialize HIP runtime." << std::endl;
|
|
return 1;
|
|
}
|
|
|
|
// Get the address of the hipInit function
|
|
hipInit_t hipInitFunc;
|
|
int hipVersion = HIP_VERSION; // Use the HIP version defined in hip_runtime_api.h
|
|
uint64_t flags = 0; // No special flags
|
|
hipDriverProcAddressQueryResult symbolStatus;
|
|
|
|
res = hipGetProcAddress("hipInit", (void**)&hipInitFunc, hipVersion, flags, &symbolStatus);
|
|
if (res != hipSuccess) {
|
|
std::cerr << "Failed to get address of hipInit()." << std::endl;
|
|
return 1;
|
|
}
|
|
|
|
// Call the hipInit function using the obtained address
|
|
res = hipInitFunc(0);
|
|
if (res == hipSuccess) {
|
|
std::cout << "HIP runtime initialized successfully using hipGetProcAddress()." << std::endl;
|
|
} else {
|
|
std::cerr << "Failed to initialize HIP runtime using hipGetProcAddress()." << std::endl;
|
|
}
|
|
|
|
return 0;
|
|
}
|
|
|
|
Per-thread default stream version request
|
|
-----------------------------------------
|
|
|
|
HIP offers functionality similar to CUDA for managing streams on a per-thread
|
|
basis. By using ``hipStreamPerThread``, each thread can independently manage its
|
|
default stream, simplifying operations. The following example demonstrates how
|
|
this feature enhances performance by reducing contention and improving
|
|
efficiency.
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <hip/hip_runtime.h>
|
|
|
|
#include <iostream>
|
|
|
|
int main() {
|
|
// Initialize the HIP runtime
|
|
hipError_t res = hipInit(0);
|
|
if (res != hipSuccess) {
|
|
std::cerr << "Failed to initialize HIP runtime." << std::endl;
|
|
return 1;
|
|
}
|
|
|
|
// Get the per-thread default stream
|
|
hipStream_t stream = hipStreamPerThread;
|
|
|
|
// Use the stream for some operation
|
|
// For example, allocate memory on the device
|
|
void* d_ptr;
|
|
size_t size = 1024;
|
|
res = hipMalloc(&d_ptr, size);
|
|
if (res != hipSuccess) {
|
|
std::cerr << "Failed to allocate memory." << std::endl;
|
|
return 1;
|
|
}
|
|
|
|
// Perform some operation using the stream
|
|
// For example, set memory on the device
|
|
res = hipMemsetAsync(d_ptr, 0, size, stream);
|
|
if (res != hipSuccess) {
|
|
std::cerr << "Failed to set memory." << std::endl;
|
|
return 1;
|
|
}
|
|
|
|
// Synchronize the stream
|
|
res = hipStreamSynchronize(stream);
|
|
if (res != hipSuccess) {
|
|
std::cerr << "Failed to synchronize stream." << std::endl;
|
|
return 1;
|
|
}
|
|
|
|
std::cout << "Operation completed successfully using per-thread default stream." << std::endl;
|
|
|
|
// Free the allocated memory
|
|
hipFree(d_ptr);
|
|
|
|
return 0;
|
|
}
|
|
|
|
Accessing HIP features with a newer driver
|
|
------------------------------------------
|
|
|
|
HIP is forward compatible, allowing newer features to be utilized
|
|
with older toolkits, provided a compatible driver is present. Feature support
|
|
can be verified through runtime API functions and version checks. This approach
|
|
ensures that applications can benefit from new features and improvements in the
|
|
HIP runtime without requiring recompilation with a newer toolkit. The function
|
|
:cpp:func:`hipGetProcAddress` enables dynamic querying and the use of newer
|
|
functions offered by the HIP runtime, even if the application was built with an
|
|
older toolkit.
|
|
|
|
.. note::
|
|
:cpp:func:``hipGetProcAddress`` and its CUDA counterpart ``cuGetProcAddress`` are limited
|
|
to HIP/CUDA driver API function calls. For HIP/CUDA runtime API calls,the corresponding
|
|
function is :cpp:func:``hipGetDriverEntryPoint`` / ``cudaGetDriverEntryPoint``.
|
|
|
|
An example is provided for a hypothetical ``foo()`` function.
|
|
|
|
.. code-block:: cpp
|
|
|
|
// Get the address of the foo function
|
|
foo_t fooFunc;
|
|
int hipVersion = 60300000; // HIP version number (e.g. 6.3.0)
|
|
uint64_t flags = 0; // No special flags
|
|
hipDriverProcAddressQueryResult symbolStatus;
|
|
|
|
res = hipGetProcAddress("foo", (void**)&fooFunc, hipVersion, flags, &symbolStatus);
|
|
|
|
The HIP version number is defined as an integer:
|
|
|
|
.. code-block:: cpp
|
|
|
|
HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
|
|
|
|
CU_POINTER_ATTRIBUTE_MEMORY_TYPE
|
|
================================
|
|
|
|
To return the pointer's memory type in HIP, developers should use :cpp:func:`hipPointerGetAttributes`.
|
|
The first parameter of the function is `hipPointerAttribute_t`. Its ``type`` member variable indicates
|
|
whether the memory pointed to is allocated on the device or the host. For example:
|
|
|
|
.. code-block:: cpp
|
|
|
|
double * ptr;
|
|
hipMalloc(&ptr, sizeof(double));
|
|
hipPointerAttribute_t attr;
|
|
hipPointerGetAttributes(&attr, ptr); /*attr.type is hipMemoryTypeDevice*/
|
|
if(attr.type == hipMemoryTypeDevice)
|
|
std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
|
|
|
|
double* ptrHost;
|
|
hipHostMalloc(&ptrHost, sizeof(double));
|
|
hipPointerAttribute_t attr;
|
|
hipPointerGetAttributes(&attr, ptrHost); /*attr.type is hipMemoryTypeHost*/
|
|
if(attr.type == hipMemorTypeHost)
|
|
std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
|
|
|
|
Note that ``hipMemoryType`` enum values are different from the
|
|
``cudaMemoryType`` enum values.
|
|
|
|
For example, on AMD platform, ``hipMemoryType`` is defined in ``hip_runtime_api.h``:
|
|
|
|
.. code-block:: cpp
|
|
|
|
typedef enum hipMemoryType {
|
|
hipMemoryTypeHost = 0, ///< Memory is physically located on host
|
|
hipMemoryTypeDevice = 1, ///< Memory is physically located on device. (see deviceId for specific device)
|
|
hipMemoryTypeArray = 2, ///< Array memory, physically located on device. (see deviceId for specific device)
|
|
hipMemoryTypeUnified = 3, ///< Not used currently
|
|
hipMemoryTypeManaged = 4 ///< Managed memory, automaticallly managed by the unified memory system
|
|
} hipMemoryType;
|
|
|
|
In the CUDA toolkit, the ``cudaMemoryType`` is defined as following:
|
|
|
|
.. code-block:: cpp
|
|
|
|
enum cudaMemoryType
|
|
{
|
|
cudaMemoryTypeUnregistered = 0, // Unregistered memory.
|
|
cudaMemoryTypeHost = 1, // Host memory.
|
|
cudaMemoryTypeDevice = 2, // Device memory.
|
|
cudaMemoryTypeManaged = 3, // Managed memory
|
|
}
|
|
|
|
.. note::
|
|
``cudaMemoryTypeUnregistered`` is currently not supported as ``hipMemoryType`` enum,
|
|
due to HIP functionality backward compatibility.
|
|
|
|
The memory type translation for ``hipPointerGetAttributes`` needs to
|
|
be handled properly on NVIDIA platform to return the correct memory type in CUDA,
|
|
which is done in the file ``nvidia_hip_runtime_api.h``.
|
|
|
|
In applications that use HIP memory type APIs, you should use ``#ifdef``
|
|
to assign the correct enum values depending on NVIDIA or AMD platform.
|