Sync HIP documentation 2025-10-20 (#1258)

* Add examples to tools folder
* Correct P2P memory access section
* Sync porting guide
* Add HIP Graph tutorial
* Add hint about using amdgpu-dkms for IPC API
* Add a few more env variables
This commit is contained in:
Istvan Kiss
2025-10-29 07:42:06 +01:00
committed by GitHub
parent 8e98b80deb
commit 197f73dac9
89 changed files with 10327 additions and 3486 deletions
+16 -94
@@ -103,66 +103,10 @@ The kernel arguments are listed after the configuration parameters.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " << hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Performs a simple initialization of an array with the thread's index variables.
// This function is only available in device code.
__device__ void init_array(float * const a, const unsigned int arraySize){
// globalIdx uniquely identifies a thread in a 1D launch configuration.
const int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
// Each thread initializes a single element of the array.
if(globalIdx < arraySize){
a[globalIdx] = globalIdx;
}
}
// Computes how many chunks of size `multiple` are needed to cover `number`
// (ceiling division). This function is available in host and device code.
__host__ __device__ constexpr int round_up_to_nearest_multiple(int number, int multiple){
return (number + multiple - 1)/multiple;
}
__global__ void example_kernel(float * const a, const unsigned int N)
{
// Initialize array.
init_array(a, N);
// Perform additional work:
// - work with the array
// - use the array in a different kernel
// - ...
}
int main()
{
constexpr int N = 100000000; // problem size
constexpr int blockSize = 256; //configurable block size
//needed number of blocks for the given problem size
constexpr int gridSize = round_up_to_nearest_multiple(N, blockSize);
float *a;
// allocate memory on the GPU
HIP_CHECK(hipMalloc(&a, sizeof(*a) * N));
std::cout << "Launching kernel." << std::endl;
example_kernel<<<dim3(gridSize), dim3(blockSize), 0/*example doesn't use shared memory*/, 0/*default stream*/>>>(a, N);
// make sure kernel execution is finished by synchronizing. The CPU can also
// execute other instructions during that time
HIP_CHECK(hipDeviceSynchronize());
std::cout << "Kernel execution finished." << std::endl;
HIP_CHECK(hipFree(a));
}
.. literalinclude:: ../tools/example_codes/calling_global_functions.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
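The launch configuration needs enough blocks to cover every element of the problem, which amounts to a ceiling division of the problem size by the block size. The following is a minimal host-only sketch of that arithmetic; ``blocks_needed`` is a hypothetical helper name that is not part of HIP, and the sketch assumes a non-negative problem size and a positive block size.

```cpp
// Hypothetical helper (not part of HIP): number of blocks of size
// `blockSize` required so that every one of `n` elements gets a thread.
// Written as quotient plus remainder check to avoid overflow of n + blockSize.
constexpr int blocks_needed(int n, int blockSize)
{
    return n / blockSize + (n % blockSize != 0 ? 1 : 0);
}

// Compile-time checks of the arithmetic.
static_assert(blocks_needed(1024, 256) == 4, "exact multiple needs no extra block");
static_assert(blocks_needed(1025, 256) == 5, "one leftover element needs one more block");
```

Because the last block can contain threads beyond the end of the array, the kernel still needs a bounds check on the global index.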
Inline qualifiers
--------------------------------------------------------------------------------
@@ -321,28 +265,10 @@ launch has to specify the needed amount of ``extern`` shared memory in the launc
configuration. The statically allocated shared memory is allocated without this
parameter.
.. code-block:: cpp
#include <hip/hip_runtime.h>
extern __shared__ int shared_array[];
__global__ void kernel(){
// initialize shared memory
shared_array[threadIdx.x] = threadIdx.x;
// use shared memory - synchronize to make sure, that all threads of the
// block see all changes to shared memory
__syncthreads();
}
int main(){
//shared memory in this case depends on the configurable block size
constexpr int blockSize = 256;
constexpr int sharedMemSize = blockSize * sizeof(int);
constexpr int gridSize = 2;
kernel<<<dim3(gridSize), dim3(blockSize), sharedMemSize, 0>>>();
}
.. literalinclude:: ../tools/example_codes/extern_shared_memory.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
__managed__
--------------------------------------------------------------------------------
@@ -735,22 +661,18 @@ with the actual frequency.
The difference between the returned values represents the cycles used.
.. code-block:: cpp
__global__ void kernel(){
long long int start = clock64();
// kernel code
long long int stop = clock64();
long long int cycles = stop - start;
}
.. literalinclude:: ../tools/example_codes/timer.hip
:start-after: // [sphinx-kernel-start]
:end-before: // [sphinx-kernel-end]
:language: cpp
``long long int wall_clock64()`` returns the wall clock time on the device, with a constant, fixed frequency.
The frequency is device dependent and can be queried using:
.. code-block:: cpp
int wallClkRate = 0; //in kilohertz
hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId);
.. literalinclude:: ../tools/example_codes/timer.hip
:start-after: // [sphinx-query-start]
:end-before: // [sphinx-query-end]
:language: cpp
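Because the wall clock rate is reported in kilohertz, it equals the number of ticks per millisecond, so the difference between two ``wall_clock64()`` readings divided by the rate gives the elapsed time in milliseconds. The following is a minimal host-side sketch of that conversion; ``wall_clock_ticks_to_ms`` is a hypothetical helper name, and the sketch assumes the rate is positive.

```cpp
// Hypothetical helper (not part of HIP): converts two wall_clock64()
// readings into elapsed milliseconds.
// wallClkRateKHz is the value queried via hipDeviceAttributeWallClockRate;
// a rate in kHz is the same as ticks per millisecond.
constexpr double wall_clock_ticks_to_ms(long long start, long long stop, int wallClkRateKHz)
{
    return static_cast<double>(stop - start) / static_cast<double>(wallClkRateKHz);
}
```

The readings would typically be written to device memory by the kernel and copied back to the host before the conversion.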
.. _atomic functions:
-649
@@ -1,649 +0,0 @@
.. meta::
:description: This chapter presents how to port the CUDA driver API and showcases equivalent operations in HIP.
:keywords: AMD, ROCm, HIP, CUDA, driver API, porting, port
.. _porting_driver_api:
*******************************************************************************
Porting CUDA driver API
*******************************************************************************
CUDA provides separate driver and runtime APIs. The two APIs generally provide
similar functionality and can mostly be used interchangeably. However, the
driver API allows for more fine-grained control over kernel-level
initialization, contexts, and module management, all of which the runtime API
handles implicitly.
* Driver API calls begin with the prefix ``cu``, while runtime API calls begin
with the prefix ``cuda``. For example, the driver API contains
``cuEventCreate``, while the runtime API contains ``cudaEventCreate``, which
has similar functionality.
* The driver API offers two additional low-level functionalities not exposed by
the runtime API: module management ``cuModule*`` and context management
``cuCtx*`` APIs.
HIP does not explicitly provide two different APIs. The corresponding functions
for the CUDA driver API are available in the HIP runtime API and are usually
prefixed with ``hipDrv``. The module and context functionality is available with
the ``hipModule`` and ``hipCtx`` prefixes.
cuModule API
================================================================================
The Module section of the driver API provides additional control over how and
when accelerator code objects are loaded. For example, the driver API enables
code objects to load from files or memory pointers. Symbols for kernels or
global data are extracted from the loaded code objects. In contrast, the runtime
API loads automatically and, if necessary, compiles all the kernels from an
executable binary when it runs. In this mode, kernel code must be compiled using
NVCC so that automatic loading can function correctly.
The Module features are useful in an environment that generates the code objects
directly, such as a new accelerator language front end. NVCC is not used here.
Instead, the environment might have a different kernel language or compilation
flow. Other environments have many kernels and don't want all of them to be
loaded automatically. The Module functions load the generated code objects and
launch kernels. Similar to the cuModule API, HIP defines a hipModule API that
provides similar explicit control over code object management.
.. _context_driver_api:
cuCtx API
================================================================================
The driver API defines "Context" and "Devices" as separate entities.
Contexts contain a single device, and a device can theoretically have multiple contexts.
Each context contains a set of streams and events specific to the context.
Historically, contexts also defined a unique address space for the GPU. This might no longer be the case in unified memory platforms, because the CPU and all the devices in the same process share a single unified address space.
The Context APIs also provide a mechanism to switch between devices, which enables a single CPU thread to send commands to different GPUs.
HIP and recent versions of the CUDA Runtime provide other mechanisms to accomplish this feat, for example, using streams or ``cudaSetDevice``.
The CUDA runtime API unifies the Context API with the Device API. This simplifies the APIs and has little loss of functionality. This is because each context can contain a single device, and the benefits of multiple contexts have been replaced with other interfaces.
HIP provides a Context API to facilitate easy porting from existing Driver code.
In HIP, the ``Ctx`` functions largely provide an alternate syntax for changing the active device.
Most new applications preferentially use ``hipSetDevice`` or the stream APIs. Therefore, HIP has marked the ``hipCtx`` APIs as **deprecated**. Support for these APIs might not be available in future releases. For more details on deprecated APIs, see :doc:`../reference/deprecated_api_list`.
HIP module and Ctx APIs
================================================================================
Rather than present two separate APIs, HIP extends the HIP API with new APIs for
modules and ``Ctx`` control.
hipModule API
--------------------------------------------------------------------------------
Like the CUDA driver API, the Module API provides additional control over how
code is loaded, including options to load code from files or from in-memory
pointers.
NVCC and HIP-Clang target different architectures and use different code object
formats. NVCC supports ``cubin`` or ``ptx`` files, while the HIP-Clang path uses
the ``hsaco`` format.
The external compilers which generate these code objects are responsible for
generating and loading the correct code object for each platform.
Notably, there is no fat binary format that can contain code for both NVCC and
HIP-Clang platforms. The following table summarizes the formats used on each
platform:
.. list-table:: Module formats
:header-rows: 1
* - Format
- APIs
- NVCC
- HIP-CLANG
* - Code object
- ``hipModuleLoad``, ``hipModuleLoadData``
- ``.cubin`` or PTX text
- ``.hsaco``
* - Fat binary
- ``hipModuleLoadFatBin``
- ``.fatbin``
- ``.hip_fatbin``
``hipcc`` uses HIP-Clang or NVCC to compile host code. Both of these compilers can embed code objects into the final executable. These code objects are automatically loaded when the application starts.
The ``hipModule`` API can be used to load additional code objects. When used this way, it extends the capability of the automatically loaded code objects.
HIP-Clang enables both of these capabilities to be used together. Of course, it is possible to create a program with no kernels and no automatic loading.
For module API reference, visit :ref:`module_management_reference`.
hipCtx API
--------------------------------------------------------------------------------
HIP provides a ``Ctx`` API as a thin layer over the existing device functions. The ``Ctx`` API can be used to set the current context or to query properties of the device associated with the context.
The current context is implicitly used by other APIs, such as ``hipStreamCreate``.
For context reference, visit :ref:`context_management_reference`.
HIPIFY translation of CUDA driver API
================================================================================
The HIPIFY tools convert CUDA driver APIs such as streams, events, modules,
devices, memory management, context, and the profiler to the equivalent HIP
calls. For example, ``cuEventCreate`` is translated to :cpp:func:`hipEventCreate`.
HIPIFY tools also convert error codes from the driver namespace and coding
conventions to the equivalent HIP error code. HIP unifies the APIs for these
common functions.
The memory copy API requires additional explanation. The CUDA driver includes
the memory direction in the name of the API (``cuMemcpyHtoD``), while the CUDA
runtime API provides a single memory copy API with a parameter that specifies
the direction. It also supports a "default" direction where the runtime
determines the direction automatically.
HIP provides both versions, for example, :cpp:func:`hipMemcpyHtoD` as well as
:cpp:func:`hipMemcpy`. The first version might be faster in some cases because
it avoids any host overhead to detect the different memory directions.
HIP defines a single error space and uses camel case for all errors (for example, ``hipErrorInvalidValue``).
For further information, visit the :doc:`hipify:index`.
Address spaces
--------------------------------------------------------------------------------
HIP-Clang defines a process-wide address space where the CPU and all devices
allocate addresses from a single unified pool.
This means addresses can be shared between contexts. Unlike the original CUDA
implementation, a new context does not create a new address space for the device.
Using hipModuleLaunchKernel
--------------------------------------------------------------------------------
Both CUDA driver and runtime APIs define a function for launching kernels,
called ``cuLaunchKernel`` or ``cudaLaunchKernel``. The equivalent API in HIP is
``hipModuleLaunchKernel``.
The kernel arguments and the execution configuration (grid dimensions, group
dimensions, dynamic shared memory, and stream) are passed as arguments to the
launch function.
The runtime API additionally provides the ``<<< >>>`` syntax for launching
kernels, which resembles a special function call and is easier to use than the
explicit launch API, especially when handling kernel arguments.
However, this syntax is not standard C++ and is available only when NVCC is used
to compile the host code.
Additional information
--------------------------------------------------------------------------------
HIP-Clang creates a primary context when the HIP API is called. So, in pure
driver API code, HIP-Clang creates a primary context while HIP/NVCC has an empty
context stack. HIP-Clang pushes the primary context to the context stack when it
is empty. This can lead to subtle differences in applications which mix the
runtime and driver APIs.
HIP-Clang implementation notes
================================================================================
.hip_fatbin
--------------------------------------------------------------------------------
HIP-Clang links device code from different translation units together. For each
device target, it generates a code object. ``clang-offload-bundler`` bundles
code objects for different device targets into one fat binary, which is embedded
as the global symbol ``__hip_fatbin`` in the ``.hip_fatbin`` section of the ELF
file of the executable or shared object.
Initialization and termination functions
--------------------------------------------------------------------------------
HIP-Clang generates initialization and termination functions for each
translation unit for host code compilation. The initialization functions call
``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file.
They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register
kernel functions and device-side global variables. The termination functions
call ``__hipUnregisterFatBinary``.
HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**``
with ``linkonce`` linkage and an initial value of 0 for each host translation
unit. Each initialization function checks ``__hip_gpubin_handle`` and registers
the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value
of ``__hipRegisterFatBinary`` to ``__hip_gpubin_handle``. This ensures that the fat
binary is registered once. A similar check is performed in the termination
functions.
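The register-once guard described above can be illustrated with a small host-only sketch. This is not the code that HIP-Clang emits: the ``mock_*`` functions stand in for ``__hipRegisterFatBinary`` and ``__hipUnregisterFatBinary``, ``gpubin_handle`` plays the role of ``__hip_gpubin_handle``, and calling ``module_ctor`` or ``module_dtor`` more than once stands in for several translation units sharing the same handle. All of these names are hypothetical.

```cpp
#include <cassert>

// Counters that let the sketch observe how often registration happens.
static int register_calls = 0;
static int unregister_calls = 0;

// Dummy fat binary and handle storage for the mock runtime.
static int fake_fatbin = 0;
static void* fake_handle_storage = &fake_fatbin;

// Stand-in for __hipRegisterFatBinary: returns a non-null handle.
void** mock_register_fat_binary(const void* /*fatbin*/)
{
    ++register_calls;
    return &fake_handle_storage;
}

// Stand-in for __hipUnregisterFatBinary.
void mock_unregister_fat_binary(void** handle)
{
    assert(handle != nullptr);
    ++unregister_calls;
}

// Plays the role of __hip_gpubin_handle: a single instance with an
// initial value of 0 that every translation unit sees.
void** gpubin_handle = nullptr;

// Initialization function: registers only if no handle has been saved yet.
void module_ctor()
{
    if (gpubin_handle == nullptr) {
        gpubin_handle = mock_register_fat_binary(&fake_fatbin);
    }
}

// Termination function: the same kind of check, so unregistration runs once.
void module_dtor()
{
    if (gpubin_handle != nullptr) {
        mock_unregister_fat_binary(gpubin_handle);
        gpubin_handle = nullptr;
    }
}
```

Calling ``module_ctor()`` twice leaves ``register_calls`` at 1, and calling ``module_dtor()`` twice afterwards leaves ``unregister_calls`` at 1.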
Kernel launching
--------------------------------------------------------------------------------
HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax,
``hipLaunchKernel``, or ``hipLaunchKernelGGL``. The last option is a macro which
expands to the CUDA ``<<<>>>`` syntax by default. It can also be turned into a
template by defining ``HIP_TEMPLATE_KERNEL_LAUNCH``.
When the executable or shared library is loaded by the dynamic linker, the
initialization functions are called. In the initialization functions, the code
objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is
called. When ``__hipRegisterFunction`` is called, the stub functions are
associated with the corresponding kernels in the code objects.
HIP-Clang implements two sets of APIs for launching kernels.
By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code,
it first calls ``hipConfigureCall`` to set up the threads and grids. It then
calls the stub function with the given arguments. The stub function calls
``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr``
with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual
kernel associated with the stub function is launched.
NVCC implementation notes
================================================================================
Interoperation between HIP and CUDA driver
--------------------------------------------------------------------------------
CUDA applications might want to mix CUDA driver code with HIP code (see the
example below). This table shows the equivalence between CUDA and HIP types
required to implement this interaction.
.. list-table:: Equivalence table between HIP and CUDA types
:header-rows: 1
* - HIP type
- CU Driver type
- CUDA Runtime type
* - ``hipModule_t``
- ``CUmodule``
-
* - ``hipFunction_t``
- ``CUfunction``
-
* - ``hipCtx_t``
- ``CUcontext``
-
* - ``hipDevice_t``
- ``CUdevice``
-
* - ``hipStream_t``
- ``CUstream``
- ``cudaStream_t``
* - ``hipEvent_t``
- ``CUevent``
- ``cudaEvent_t``
* - ``hipArray``
- ``CUarray``
- ``cudaArray``
Compilation options
--------------------------------------------------------------------------------
The ``hipModule_t`` interface does not support the ``cuModuleLoadDataEx`` function, which is used to control PTX compilation options.
HIP-Clang does not use PTX, so it does not support these compilation options.
In fact, HIP-Clang code objects contain fully compiled code for a device-specific instruction set and don't require additional compilation as a part of the load step.
The corresponding HIP function ``hipModuleLoadDataEx`` behaves like ``hipModuleLoadData`` on the HIP-Clang path (where compilation options are not used) and like ``cuModuleLoadDataEx`` on the NVCC path.
For example:
.. tab-set::
.. tab-item:: HIP
.. code-block:: cpp
hipModule_t module;
void *imagePtr = ...; // Somehow populate data pointer with code object
const int numOptions = 1;
hipJitOption options[numOptions];
void *optionValues[numOptions];
options[0] = hipJitOptionMaxRegisters;
unsigned maxRegs = 15;
optionValues[0] = (void *)(&maxRegs);
// On the HIP-Clang path, hipModuleLoadData(&module, imagePtr) is called and the
// JIT options are not used. On the NVCC path, cuModuleLoadDataEx(&module,
// imagePtr, numOptions, options, optionValues) is called.
hipModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues);
hipFunction_t k;
hipModuleGetFunction(&k, module, "myKernel");
.. tab-item:: CUDA
.. code-block:: cpp
CUmodule module;
void *imagePtr = ...; // Somehow populate data pointer with code object
const int numOptions = 1;
CUjit_option options[numOptions];
void *optionValues[numOptions];
options[0] = CU_JIT_MAX_REGISTERS;
unsigned maxRegs = 15;
optionValues[0] = (void *)(&maxRegs);
cuModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues);
CUfunction k;
cuModuleGetFunction(&k, module, "myKernel");
The sample below shows how to use ``hipModuleGetFunction``.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <vector>
int main() {
size_t elements = 64*1024;
size_t size_bytes = elements * sizeof(float);
std::vector<float> A(elements), B(elements);
// On NVIDIA platforms the driver runtime needs to be initialized
#ifdef __HIP_PLATFORM_NVIDIA__
hipInit(0);
hipDevice_t device;
hipCtx_t context;
HIPCHECK(hipDeviceGet(&device, 0));
HIPCHECK(hipCtxCreate(&context, 0, device));
#endif
// Allocate device memory
hipDeviceptr_t d_A, d_B;
HIPCHECK(hipMalloc(&d_A, size_bytes));
HIPCHECK(hipMalloc(&d_B, size_bytes));
// Copy data to device
HIPCHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
HIPCHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
// Load module
hipModule_t Module;
// For AMD, the module file has to contain architecture-specific object code
// For NVIDIA, the module file has to contain PTX, found in e.g. "vcpy_isa.ptx"
HIPCHECK(hipModuleLoad(&Module, "vcpy_isa.co"));
// Get kernel function from the module via its name
hipFunction_t Function;
HIPCHECK(hipModuleGetFunction(&Function, Module, "hello_world"));
// Create buffer for kernel arguments
std::vector<void*> argBuffer{&d_A, &d_B};
size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
// Create configuration passed to the kernel as arguments
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes, HIP_LAUNCH_PARAM_END};
int threads_per_block = 128;
int blocks = (elements + threads_per_block - 1) / threads_per_block;
// Actually launch kernel
HIPCHECK(hipModuleLaunchKernel(Function, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
// Copy the results back to the host; the size is given in bytes
HIPCHECK(hipMemcpyDtoH(A.data(), d_A, size_bytes));
HIPCHECK(hipMemcpyDtoH(B.data(), d_B, size_bytes));
#ifdef __HIP_PLATFORM_NVIDIA__
HIPCHECK(hipCtxDetach(context));
#endif
HIPCHECK(hipFree(d_A));
HIPCHECK(hipFree(d_B));
return 0;
}
HIP module and texture Driver API
================================================================================
HIP supports texture driver APIs. However, texture references must be declared
within the host scope. The following code demonstrates the use of texture
references for the ``__HIP_PLATFORM_AMD__`` platform.
.. code-block:: cpp
// Code to generate code object
#include "hip/hip_runtime.h"
extern texture<float, 2, hipReadModeElementType> tex;
__global__ void tex2dKernel(hipLaunchParm lp, float *outputData, int width,
int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
outputData[y * width + x] = tex2D(tex, x, y);
}
.. code-block:: cpp
// Host code:
texture<float, 2, hipReadModeElementType> tex;
void myFunc ()
{
// ...
textureReference* texref;
hipModuleGetTexRef(&texref, Module1, "tex");
hipTexRefSetAddressMode(texref, 0, hipAddressModeWrap);
hipTexRefSetAddressMode(texref, 1, hipAddressModeWrap);
hipTexRefSetFilterMode(texref, hipFilterModePoint);
hipTexRefSetFlags(texref, 0);
hipTexRefSetFormat(texref, HIP_AD_FORMAT_FLOAT, 1);
hipTexRefSetArray(texref, array, HIP_TRSA_OVERRIDE_FORMAT);
// ...
}
Driver entry point access
================================================================================
Starting from HIP version 6.2.0, support for Driver Entry Point Access is
available when using CUDA 12.0 or newer. This feature allows developers to
directly interact with the CUDA driver API, providing more control over GPU
operations.
Driver Entry Point Access provides several features:
* Retrieving the address of a runtime function
* Requesting the default stream version on a per-thread basis
* Accessing new HIP features on older toolkits with a newer driver
For driver entry point access reference, visit :cpp:func:`hipGetProcAddress`.
Address retrieval
--------------------------------------------------------------------------------
The :cpp:func:`hipGetProcAddress` function can be used to obtain the address of
a runtime function. This is demonstrated in the following example:
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <iostream>
typedef hipError_t (*hipInit_t)(unsigned int);
int main() {
// Initialize the HIP runtime
hipError_t res = hipInit(0);
if (res != hipSuccess) {
std::cerr << "Failed to initialize HIP runtime." << std::endl;
return 1;
}
// Get the address of the hipInit function
hipInit_t hipInitFunc;
int hipVersion = HIP_VERSION; // Use the HIP version defined in hip_runtime_api.h
uint64_t flags = 0; // No special flags
hipDriverProcAddressQueryResult symbolStatus;
res = hipGetProcAddress("hipInit", (void**)&hipInitFunc, hipVersion, flags, &symbolStatus);
if (res != hipSuccess) {
std::cerr << "Failed to get address of hipInit()." << std::endl;
return 1;
}
// Call the hipInit function using the obtained address
res = hipInitFunc(0);
if (res == hipSuccess) {
std::cout << "HIP runtime initialized successfully using hipGetProcAddress()." << std::endl;
} else {
std::cerr << "Failed to initialize HIP runtime using hipGetProcAddress()." << std::endl;
}
return 0;
}
Per-thread default stream version request
================================================================================
HIP offers functionality similar to CUDA for managing streams on a per-thread
basis. By using ``hipStreamPerThread``, each host thread gets its own default
stream, so work submitted by different threads does not contend for a single
shared default stream. The following example uses the per-thread default stream
for an asynchronous memory operation.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
int main() {
// Initialize the HIP runtime
hipError_t res = hipInit(0);
if (res != hipSuccess) {
std::cerr << "Failed to initialize HIP runtime." << std::endl;
return 1;
}
// Get the per-thread default stream
hipStream_t stream = hipStreamPerThread;
// Use the stream for some operation
// For example, allocate memory on the device
void* d_ptr;
size_t size = 1024;
res = hipMalloc(&d_ptr, size);
if (res != hipSuccess) {
std::cerr << "Failed to allocate memory." << std::endl;
return 1;
}
// Perform some operation using the stream
// For example, set memory on the device
res = hipMemsetAsync(d_ptr, 0, size, stream);
if (res != hipSuccess) {
std::cerr << "Failed to set memory." << std::endl;
return 1;
}
// Synchronize the stream
res = hipStreamSynchronize(stream);
if (res != hipSuccess) {
std::cerr << "Failed to synchronize stream." << std::endl;
return 1;
}
std::cout << "Operation completed successfully using per-thread default stream." << std::endl;
// Free the allocated memory
hipFree(d_ptr);
return 0;
}
Accessing new HIP features with a newer driver
================================================================================
HIP is designed to be forward compatible, allowing newer features to be utilized
with older toolkits, provided a compatible driver is present. Feature support
can be verified through runtime API functions and version checks. This approach
ensures that applications can benefit from new features and improvements in the
HIP runtime without needing to be recompiled with a newer toolkit. The function
:cpp:func:`hipGetProcAddress` enables dynamic querying and the use of newer
functions offered by the HIP runtime, even if the application was built with an
older toolkit.
An example is provided for a hypothetical ``foo()`` function.
.. code-block:: cpp
// Get the address of the foo function
foo_t fooFunc;
int hipVersion = 60300000; // Request a specific HIP version (here 6.3.0)
uint64_t flags = 0; // No special flags
hipDriverProcAddressQueryResult symbolStatus;
res = hipGetProcAddress("foo", (void**)&fooFunc, hipVersion, flags, &symbolStatus);
The HIP version number is defined as an integer:
.. code-block:: cpp
HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
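The encoding above can be checked with a few lines of plain C++. The following sketch only restates the formula; ``encode_hip_version`` and the ``hip_version_*`` decoders are hypothetical helper names that are not part of HIP, and the decoders assume a minor version below 100 and a patch value below 100000.

```cpp
// Hypothetical helpers (not part of HIP) that mirror the HIP_VERSION formula.
constexpr int encode_hip_version(int major, int minor, int patch)
{
    return major * 10000000 + minor * 100000 + patch;
}

// Inverse operations, valid under the assumptions minor < 100 and patch < 100000.
constexpr int hip_version_major(int version) { return version / 10000000; }
constexpr int hip_version_minor(int version) { return (version / 100000) % 100; }
constexpr int hip_version_patch(int version) { return version % 100000; }

// HIP 6.3.0 encodes to the value used in the hipGetProcAddress example.
static_assert(encode_hip_version(6, 3, 0) == 60300000, "6.3.0 encodes to 60300000");
```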
CU_POINTER_ATTRIBUTE_MEMORY_TYPE
================================================================================
To get the pointer's memory type in HIP, developers should use
:cpp:func:`hipPointerGetAttributes`. The first parameter of the function is a
pointer to a ``hipPointerAttribute_t`` structure. Its ``type`` member indicates
whether the memory pointed to is allocated on the device or the host.
For example:
.. code-block:: cpp
double * ptr;
hipMalloc(&ptr, sizeof(double));
hipPointerAttribute_t attr;
hipPointerGetAttributes(&attr, ptr); /*attr.type is hipMemoryTypeDevice*/
if(attr.type == hipMemoryTypeDevice)
std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
double* ptrHost;
hipHostMalloc(&ptrHost, sizeof(double));
hipPointerAttribute_t attrHost; // separate variable, attr is already declared above
hipPointerGetAttributes(&attrHost, ptrHost); /*attrHost.type is hipMemoryTypeHost*/
if(attrHost.type == hipMemoryTypeHost)
std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
Note that ``hipMemoryType`` enum values are different from the
``cudaMemoryType`` enum values.
For example, on the AMD platform, ``hipMemoryType`` is defined in ``hip_runtime_api.h``:
.. code-block:: cpp
typedef enum hipMemoryType {
hipMemoryTypeHost = 0, ///< Memory is physically located on host
hipMemoryTypeDevice = 1, ///< Memory is physically located on device. (see deviceId for specific device)
hipMemoryTypeArray = 2, ///< Array memory, physically located on device. (see deviceId for specific device)
hipMemoryTypeUnified = 3, ///< Not used currently
hipMemoryTypeManaged = 4 ///< Managed memory, automatically managed by the unified memory system
} hipMemoryType;
The CUDA toolkit defines ``cudaMemoryType`` as follows:
.. code-block:: cpp
enum cudaMemoryType
{
cudaMemoryTypeUnregistered = 0, // Unregistered memory.
cudaMemoryTypeHost = 1, // Host memory.
cudaMemoryTypeDevice = 2, // Device memory.
cudaMemoryTypeManaged = 3, // Managed memory
};
Because the values differ, the memory type reported by
``hipPointerGetAttributes`` has to be translated on the NVIDIA platform to match
the CUDA memory type. This translation is done in the file
``nvidia_hip_runtime_api.h``.
Therefore, HIP applications that use HIP APIs involving memory types should use
``#ifdef`` to assign the correct enum values for the NVIDIA or AMD platform.
For an example, see the `hipMemcpyParam2D unit test <https://github.com/ROCm/hip-tests/tree/develop/catch/unit/memory/hipMemcpyParam2D.cc>`_.
With the ``#ifdef`` condition, HIP APIs work as expected on both AMD and NVIDIA
platforms.
Note that ``cudaMemoryTypeUnregistered`` is currently not supported as a
``hipMemoryType`` enum value, to preserve backward compatibility of HIP functionality.
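The mismatch between the two enums can be made concrete with a small host-only sketch. It is not the code from ``nvidia_hip_runtime_api.h``: the ``Mirror*`` enums are hypothetical local copies of the values quoted above so that the sketch compiles without HIP or CUDA headers, and returning ``kNoHipEquivalent`` for the unregistered type is an assumption based on the note that this type has no ``hipMemoryType`` counterpart.

```cpp
// Hypothetical local mirrors of the enum values quoted in this section.
enum MirrorHipMemoryType {
    mirrorHipMemoryTypeHost    = 0,
    mirrorHipMemoryTypeDevice  = 1,
    mirrorHipMemoryTypeArray   = 2,
    mirrorHipMemoryTypeUnified = 3,
    mirrorHipMemoryTypeManaged = 4
};

enum MirrorCudaMemoryType {
    mirrorCudaMemoryTypeUnregistered = 0,
    mirrorCudaMemoryTypeHost         = 1,
    mirrorCudaMemoryTypeDevice       = 2,
    mirrorCudaMemoryTypeManaged      = 3
};

// Sentinel for CUDA memory types without a HIP counterpart (assumption).
constexpr int kNoHipEquivalent = -1;

// Maps a CUDA memory type to the numeric value of the matching HIP memory type.
constexpr int cuda_to_hip_memory_type(MirrorCudaMemoryType type)
{
    return type == mirrorCudaMemoryTypeHost    ? static_cast<int>(mirrorHipMemoryTypeHost)
         : type == mirrorCudaMemoryTypeDevice  ? static_cast<int>(mirrorHipMemoryTypeDevice)
         : type == mirrorCudaMemoryTypeManaged ? static_cast<int>(mirrorHipMemoryTypeManaged)
                                               : kNoHipEquivalent;
}
```

The sketch shows why comparing raw integer values across platforms is unsafe: the host memory type is 1 in CUDA and 0 in HIP, so the comparison has to go through the named enumerators.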
File diff not shown because it is too large.
+12 -307
@@ -207,319 +207,24 @@ The example codes
.. tab-item:: Sequential
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// GPU Kernels
__global__ void kernelA(double* arrayA, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] += 1.0;}
};
__global__ void kernelB(double* arrayA, double* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayB[x] += arrayA[x] + 3.0;}
};
int main()
{
constexpr int numOfBlocks = 1 << 20;
constexpr int threadsPerBlock = 1024;
constexpr int numberOfIterations = 50;
// The array size is kept small so that the memory copies do not dominate the relatively short kernel launches
constexpr size_t arraySize = 1U << 25;
double *d_dataA;
double *d_dataB;
double initValueA = 0.0;
double initValueB = 2.0;
std::vector<double> vectorA(arraySize, initValueA);
std::vector<double> vectorB(arraySize, initValueB);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
for(int iteration = 0; iteration < numberOfIterations; iteration++)
{
// Host to Device copies
HIP_CHECK(hipMemcpy(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice));
// Launch the GPU kernels
hipLaunchKernelGGL(kernelA, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_dataA, arraySize);
hipLaunchKernelGGL(kernelB, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_dataA, d_dataB, arraySize);
// Device to Host copies
HIP_CHECK(hipMemcpy(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost));
HIP_CHECK(hipMemcpy(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost));
}
// Wait for all operations to complete
HIP_CHECK(hipDeviceSynchronize());
// Verify results
const double expectedA = (double)numberOfIterations;
const double expectedB =
initValueB + (3.0 * numberOfIterations) +
(expectedA * (expectedA + 1.0)) / 2.0;
bool passed = true;
for(size_t i = 0; i < arraySize; ++i){
if(vectorA[i] != expectedA){
passed = false;
std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
break;
}
if(vectorB[i] != expectedB){
passed = false;
std::cerr << "Validation failed! Expected " << expectedB << " got " << vectorB[i] << " at index: " << i << std::endl;
break;
}
}
if(passed){
std::cout << "Sequential execution completed successfully." << std::endl;
}else{
std::cerr << "Sequential execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_dataA));
HIP_CHECK(hipFree(d_dataB));
return 0;
}
.. literalinclude:: ../../tools/example_codes/sequential_kernel_execution.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
.. tab-item:: Asynchronous
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// GPU Kernels
__global__ void kernelA(double* arrayA, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] += 1.0;}
};
__global__ void kernelB(double* arrayA, double* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayB[x] += arrayA[x] + 3.0;}
};
int main()
{
constexpr int numOfBlocks = 1 << 20;
constexpr int threadsPerBlock = 1024;
constexpr int numberOfIterations = 50;
// The array size is kept small so that the memory copies do not dwarf the relatively short kernels
constexpr size_t arraySize = 1U << 25;
double *d_dataA;
double *d_dataB;
double initValueA = 0.0;
double initValueB = 2.0;
std::vector<double> vectorA(arraySize, initValueA);
std::vector<double> vectorB(arraySize, initValueB);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
// Create streams
hipStream_t streamA, streamB;
HIP_CHECK(hipStreamCreate(&streamA));
HIP_CHECK(hipStreamCreate(&streamB));
for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
{
// Stream 1: Host to Device 1
HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
// Stream 2: Host to Device 2
HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
// Stream 1: Kernel 1
hipLaunchKernelGGL(kernelA, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamA, d_dataA, arraySize);
// Wait for streamA finish
HIP_CHECK(hipStreamSynchronize(streamA));
// Stream 2: Kernel 2
hipLaunchKernelGGL(kernelB, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamB, d_dataA, d_dataB, arraySize);
// Stream 1: Device to Host 2 (after Kernel 1)
HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
// Stream 2: Device to Host 2 (after Kernel 2)
HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
}
// Wait for all operations in both streams to complete
HIP_CHECK(hipStreamSynchronize(streamA));
HIP_CHECK(hipStreamSynchronize(streamB));
// Verify results
double expectedA = (double)numberOfIterations;
double expectedB =
initValueB + (3.0 * numberOfIterations) +
(expectedA * (expectedA + 1.0)) / 2.0;
bool passed = true;
for(size_t i = 0; i < arraySize; ++i){
if(vectorA[i] != expectedA){
passed = false;
std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
break;
}
if(vectorB[i] != expectedB){
passed = false;
std::cerr << "Validation failed! Expected " << expectedB << " got " << vectorB[i] << " at index: " << i << std::endl;
break;
}
}
if(passed){
std::cout << "Asynchronous execution completed successfully." << std::endl;
}else{
std::cerr << "Asynchronous execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipStreamDestroy(streamA));
HIP_CHECK(hipStreamDestroy(streamB));
HIP_CHECK(hipFree(d_dataA));
HIP_CHECK(hipFree(d_dataB));
return 0;
}
.. literalinclude:: ../../tools/example_codes/async_kernel_execution.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
.. tab-item:: hipStreamWaitEvent
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// GPU Kernels
__global__ void kernelA(double* arrayA, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] += 1.0;}
};
__global__ void kernelB(double* arrayA, double* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayB[x] += arrayA[x] + 3.0;}
};
int main()
{
constexpr int numOfBlocks = 1 << 20;
constexpr int threadsPerBlock = 1024;
constexpr int numberOfIterations = 50;
// The array size is kept small so that the memory copies do not dwarf the relatively short kernels
constexpr size_t arraySize = 1U << 25;
double *d_dataA;
double *d_dataB;
double initValueA = 0.0;
double initValueB = 2.0;
std::vector<double> vectorA(arraySize, initValueA);
std::vector<double> vectorB(arraySize, initValueB);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
// Create streams
hipStream_t streamA, streamB;
HIP_CHECK(hipStreamCreate(&streamA));
HIP_CHECK(hipStreamCreate(&streamB));
// Create events
hipEvent_t event, eventA, eventB;
HIP_CHECK(hipEventCreate(&event));
HIP_CHECK(hipEventCreate(&eventA));
HIP_CHECK(hipEventCreate(&eventB));
for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
{
// Stream 1: Host to Device 1
HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
// Stream 2: Host to Device 2
HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
// Stream 1: Kernel 1
hipLaunchKernelGGL(kernelA, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamA, d_dataA, arraySize);
// Record event after the GPU kernel in Stream 1
HIP_CHECK(hipEventRecord(event, streamA));
// Stream 2: Wait for event before starting Kernel 2
HIP_CHECK(hipStreamWaitEvent(streamB, event, 0));
// Stream 2: Kernel 2
hipLaunchKernelGGL(kernelB, dim3(numOfBlocks), dim3(threadsPerBlock), 0, streamB, d_dataA, d_dataB, arraySize);
// Stream 1: Device to Host 2 (after Kernel 1)
HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
// Stream 2: Device to Host 2 (after Kernel 2)
HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
// Wait on the host for all operations in both streams to complete.
// hipStreamWaitEvent only orders work on the device, so the host has to
// block on the events before the results can be read.
HIP_CHECK(hipEventRecord(eventA, streamA));
HIP_CHECK(hipEventRecord(eventB, streamB));
HIP_CHECK(hipEventSynchronize(eventA));
HIP_CHECK(hipEventSynchronize(eventB));
}
// Verify results
double expectedA = (double)numberOfIterations;
double expectedB =
initValueB + (3.0 * numberOfIterations) +
(expectedA * (expectedA + 1.0)) / 2.0;
bool passed = true;
for(size_t i = 0; i < arraySize; ++i){
if(vectorA[i] != expectedA){
passed = false;
std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << std::endl;
break;
}
if(vectorB[i] != expectedB){
passed = false;
std::cerr << "Validation failed! Expected " << expectedB << " got " << vectorB[i] << std::endl;
break;
}
}
if(passed){
std::cout << "Asynchronous execution with events completed successfully." << std::endl;
}else{
std::cerr << "Asynchronous execution with events failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipEventDestroy(event));
HIP_CHECK(hipEventDestroy(eventA));
HIP_CHECK(hipEventDestroy(eventB));
HIP_CHECK(hipStreamDestroy(streamA));
HIP_CHECK(hipStreamDestroy(streamB));
HIP_CHECK(hipFree(d_dataA));
HIP_CHECK(hipFree(d_dataB));
return 0;
}
.. literalinclude:: ../../tools/example_codes/event_based_synchronization.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
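The verification step in the three examples above compares the results against closed-form expected values. The following host-only sketch, which makes no HIP calls, replays the two kernel updates for a single array element and checks that the closed form agrees. The names ``ElementState``, ``simulate`` and ``expectedB`` are illustrative and are not part of the example sources.

```cpp
#include <cassert>

// Host-only model of one array element in the examples above.
// No HIP calls are made; all names here are illustrative assumptions.
struct ElementState {
    double a;
    double b;
};

// Applies kernelA (a += 1) and then kernelB (b += a + 3) `iterations` times.
constexpr ElementState simulate(double initA, double initB, int iterations) {
    ElementState s{initA, initB};
    for (int i = 0; i < iterations; ++i) {
        s.a += 1.0;        // kernelA
        s.b += s.a + 3.0;  // kernelB reads the already updated a
    }
    return s;
}

// Closed form used by the verification step: initB + 3n + n(n + 1) / 2.
constexpr double expectedB(double initB, int iterations) {
    const double n = static_cast<double>(iterations);
    return initB + 3.0 * n + n * (n + 1.0) / 2.0;
}

// With initValueA = 0.0, a ends at the iteration count.
static_assert(simulate(0.0, 2.0, 50).a == 50.0, "a equals the iteration count");
static_assert(simulate(0.0, 2.0, 50).b == expectedB(2.0, 50), "closed form matches");
```

The closed form only holds if ``kernelB`` always sees the value written by ``kernelA`` in the same iteration, which is why the asynchronous variants need the explicit ordering between the two streams.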
HIP Graphs
===============================================================================
+8 -78
@@ -33,38 +33,10 @@ You can adjust the call stack size as shown in the following example, allowing
fine-tuning based on specific kernel requirements. This helps prevent stack
overflow errors by ensuring sufficient stack memory is allocated.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
int main()
{
size_t stackSize;
HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
std::cout << "Default stack size: " << stackSize << " bytes" << std::endl;
// Set a new stack size
size_t newStackSize = 1024 * 8; // 8 KiB
HIP_CHECK(hipDeviceSetLimit(hipLimitStackSize, newStackSize));
HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
std::cout << "Updated stack size: " << stackSize << " bytes" << std::endl;
return 0;
}
.. literalinclude:: ../../tools/example_codes/call_stack_management.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Depending on the GPU model, the call stack can consume a significant amount of
memory at full occupancy. For instance, an MI300X with 304 compute units (CU) and up to
@@ -81,49 +53,7 @@ needed for the call stack due to the GPU's inherent parallelism. This can be
achieved by increasing stack size or optimizing code to reduce stack usage. To
detect stack overflow, add proper error handling or use debugging tools.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
__device__ unsigned long long fibonacci(unsigned long long n)
{
if (n == 0 || n == 1)
{
return n;
}
return fibonacci(n - 1) + fibonacci(n - 2);
}
__global__ void kernel(unsigned long long n)
{
unsigned long long result = fibonacci(n);
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if (x == 0)
printf("fibonacci(%llu) = %llu\n", n, result);
}
int main()
{
kernel<<<1, 1>>>(10);
HIP_CHECK(hipDeviceSynchronize());
// When compiled with the -O0 optimization option, the following launch hits the stack limit
// kernel<<<1, 256>>>(2048);
// HIP_CHECK(hipDeviceSynchronize());
return 0;
}
.. literalinclude:: ../../tools/example_codes/device_recursion.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
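To make the stack pressure of the recursive example more concrete, the following host-only sketch instruments the same naive Fibonacci recursion and records how deep the call chain gets. It runs on the CPU and makes no HIP calls. ``RecursionStats`` and the extra ``depth`` parameter are illustrative additions, not part of the device code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Host-only instrumentation of the naive recursion; names are illustrative.
struct RecursionStats {
    std::uint64_t calls = 0;  // total number of invocations
    unsigned maxDepth = 0;    // deepest nesting of simultaneously active calls
};

std::uint64_t fibonacci(unsigned n, RecursionStats& stats, unsigned depth = 1) {
    ++stats.calls;
    stats.maxDepth = std::max(stats.maxDepth, depth);
    if (n == 0 || n == 1) {
        return n;
    }
    return fibonacci(n - 1, stats, depth + 1) + fibonacci(n - 2, stats, depth + 1);
}
```

For ``n >= 1`` the deepest chain of active calls is ``n`` frames long, so the stack needed per thread grows linearly with ``n`` even though the total number of calls grows much faster. On the GPU, that per-thread requirement is multiplied by the number of resident threads.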
+4 -67
@@ -68,70 +68,7 @@ Complete example
The following complete example demonstrates error handling with a simple kernel
that adds two values:
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <algorithm>
#include <vector>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c, size_t size) {
const size_t index = threadIdx.x + blockDim.x * blockIdx.x;
if(index < size) {
c[index] = a[index] + b[index];
}
}
int main() {
constexpr int numOfBlocks = 256;
constexpr int threadsPerBlock = 256;
constexpr size_t arraySize = 1U << 16;
std::vector<int> a(arraySize), b(arraySize), c(arraySize);
int *d_a, *d_b, *d_c;
// Setup input values.
std::fill(a.begin(), a.end(), 1);
std::fill(b.begin(), b.end(), 2);
// Allocate device copies of a, b and c.
HIP_CHECK(hipMalloc(&d_a, arraySize * sizeof(*d_a)));
HIP_CHECK(hipMalloc(&d_b, arraySize * sizeof(*d_b)));
HIP_CHECK(hipMalloc(&d_c, arraySize * sizeof(*d_c)));
// Copy input values to device.
HIP_CHECK(hipMemcpy(d_a, a.data(), arraySize * sizeof(*d_a), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, b.data(), arraySize * sizeof(*d_b), hipMemcpyHostToDevice));
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(numOfBlocks), dim3(threadsPerBlock), 0, 0, d_a, d_b, d_c, arraySize);
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Copy the result back to the host.
HIP_CHECK(hipMemcpy(c.data(), d_c, arraySize * sizeof(*d_c), hipMemcpyDeviceToHost));
// Cleanup allocated memory.
HIP_CHECK(hipFree(d_a));
HIP_CHECK(hipFree(d_b));
HIP_CHECK(hipFree(d_c));
// Print the result.
std::cout << a[0] << " + " << b[0] << " = " << c[0] << std::endl;
return 0;
}
.. literalinclude:: ../../tools/example_codes/error_handling.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
+12 -293
@@ -14,6 +14,10 @@ method via streams. A HIP graph is made up of nodes and edges. The nodes of a
HIP graph represent the operations performed, while the edges mark dependencies
between those operations.
.. hint::
The :ref:`HIP Graph API tutorial <hip_graph_api_tutorial>` demonstrates how
to use HIP graphs in a real-world application.
The nodes can be one of the following:
- empty nodes
@@ -180,124 +184,10 @@ The general flow for using stream capture to create a graph template is:
The following code is an example of how to use the HIP graph API to capture a
graph from a stream.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
__global__ void kernelA(double* arrayA, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] *= 2.0;}
};
__global__ void kernelB(int* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayB[x] = 3;}
};
__global__ void kernelC(double* arrayA, const int* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] += arrayB[x];}
};
struct set_vector_args{
std::vector<double>& h_array;
double value;
};
void set_vector(void* args){
set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
std::vector<double>& vec{h_args.h_array};
vec.assign(vec.size(), h_args.value);
}
int main(){
constexpr int numOfBlocks = 1024;
constexpr int threadsPerBlock = 1024;
constexpr size_t arraySize = 1U << 20;
// This example assumes that kernelA operates on data that needs to be initialized on
// and copied from the host, while kernelB initializes the array that is passed to it.
// Both arrays are then used as input to kernelC, where arrayA is also used as
// output, that is copied back to the host, while arrayB is only read from and not modified.
double* d_arrayA;
int* d_arrayB;
std::vector<double> h_array(arraySize);
constexpr double initValue = 2.0;
hipStream_t captureStream;
HIP_CHECK(hipStreamCreate(&captureStream));
// Start capturing the operations assigned to the stream
HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
// hipMallocAsync and hipMemcpyAsync are needed so that the operations can be assigned to a stream
HIP_CHECK(hipMallocAsync(&d_arrayA, arraySize*sizeof(double), captureStream));
HIP_CHECK(hipMallocAsync(&d_arrayB, arraySize*sizeof(int), captureStream));
// Assign host function to the stream
// Needs a custom struct to pass the arguments
set_vector_args args{h_array, initValue};
HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));
HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));
kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);
HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));
HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));
// Stop capturing
hipGraph_t graph;
HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
// Create an executable graph from the captured graph
hipGraphExec_t graphExec;
HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
// The graph template can be deleted after the instantiation if it's not needed for later use
HIP_CHECK(hipGraphDestroy(graph));
// Actually launch the graph. The stream does not have
// to be the same as the one used for capturing.
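HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
// Wait for the graph to finish before reading the results on the host
HIP_CHECK(hipStreamSynchronize(captureStream));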
// Verify results
constexpr double expected = initValue * 2.0 + 3;
bool passed = true;
for(size_t i = 0; i < arraySize; ++i){
if(h_array[i] != expected){
passed = false;
std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
break;
}
}
if(passed){
std::cerr << "Validation passed." << std::endl;
}
// Free graph and stream resources after usage
HIP_CHECK(hipGraphExecDestroy(graphExec));
HIP_CHECK(hipStreamDestroy(captureStream));
}
.. literalinclude:: ../../tools/example_codes/graph_capture.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Explicit graph creation
================================================================================
@@ -333,178 +223,7 @@ The general flow for explicitly creating a graph is usually:
The following code example demonstrates how to explicitly create nodes in order to create a graph.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
__global__ void kernelA(double* arrayA, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] *= 2.0;}
};
__global__ void kernelB(int* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayB[x] = 3;}
};
__global__ void kernelC(double* arrayA, const int* arrayB, size_t size){
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size){arrayA[x] += arrayB[x];}
};
struct set_vector_args{
std::vector<double>& h_array;
double value;
};
void set_vector(void* args){
set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
std::vector<double>& vec{h_args.h_array};
vec.assign(vec.size(), h_args.value);
}
int main(){
constexpr int numOfBlocks = 1024;
constexpr int threadsPerBlock = 1024;
size_t arraySize = 1U << 20;
// The pointers to the device memory don't need to be declared here,
// they are contained within the hipMemAllocNodeParams as the dptr member
std::vector<double> h_array(arraySize);
constexpr double initValue = 2.0;
// Create an empty graph
hipGraph_t graph;
HIP_CHECK(hipGraphCreate(&graph, 0));
// Parameters to allocate arrays
hipMemAllocNodeParams allocArrayAParams{};
allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
allocArrayAParams.poolProps.location.id = 0; // GPU on which memory resides
allocArrayAParams.bytesize = arraySize * sizeof(double);
hipMemAllocNodeParams allocArrayBParams{};
allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
allocArrayBParams.poolProps.location.id = 0; // GPU on which memory resides
allocArrayBParams.bytesize = arraySize * sizeof(int);
// Add the allocation nodes to the graph. They don't have any dependencies
hipGraphNode_t allocNodeA, allocNodeB;
HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));
// Parameters for the host function
// Needs custom struct to pass the arguments
set_vector_args args{h_array, initValue};
hipHostNodeParams hostParams{};
hostParams.fn = set_vector;
hostParams.userData = static_cast<void*>(&args);
// Add the host node that initializes the host array. It also doesn't have any dependencies
hipGraphNode_t hostNode;
HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));
// Add memory copy node, that copies the initialized host array to the device.
// It has to wait for the host array to be initialized and the device memory to be allocated
hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
hipGraphNode_t cpyToDevNode;
HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));
// Parameters for kernelA
hipKernelNodeParams kernelAParams;
void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
kernelAParams.func = reinterpret_cast<void*>(kernelA);
kernelAParams.gridDim = numOfBlocks;
kernelAParams.blockDim = threadsPerBlock;
kernelAParams.sharedMemBytes = 0;
kernelAParams.kernelParams = kernelAArgs;
kernelAParams.extra = nullptr;
// Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
hipGraphNode_t kernelANode;
HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));
// Parameters for kernelB
hipKernelNodeParams kernelBParams;
void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
kernelBParams.func = reinterpret_cast<void*>(kernelB);
kernelBParams.gridDim = numOfBlocks;
kernelBParams.blockDim = threadsPerBlock;
kernelBParams.sharedMemBytes = 0;
kernelBParams.kernelParams = kernelBArgs;
kernelBParams.extra = nullptr;
// Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
hipGraphNode_t kernelBNode;
HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));
// Parameters for kernelC
hipKernelNodeParams kernelCParams;
void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
kernelCParams.func = reinterpret_cast<void*>(kernelC);
kernelCParams.gridDim = numOfBlocks;
kernelCParams.blockDim = threadsPerBlock;
kernelCParams.sharedMemBytes = 0;
kernelCParams.kernelParams = kernelCArgs;
kernelCParams.extra = nullptr;
// Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
hipGraphNode_t kernelCNode;
hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));
// Copy the results back to the host. Has to wait for kernelC to finish.
hipGraphNode_t cpyToHostNode;
HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost));
// Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it.
hipGraphNode_t freeNodeA;
HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr));
// Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host.
hipGraphNode_t freeNodeB;
HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr));
// Instantiate the graph in order to execute it
hipGraphExec_t graphExec;
HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
// The graph can be freed after the instantiation if it's not needed for other purposes
HIP_CHECK(hipGraphDestroy(graph));
// Actually launch the graph
hipStream_t graphStream;
HIP_CHECK(hipStreamCreate(&graphStream));
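HIP_CHECK(hipGraphLaunch(graphExec, graphStream));
// Wait for the graph to finish before reading the results on the host
HIP_CHECK(hipStreamSynchronize(graphStream));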
// Verify results
constexpr double expected = initValue * 2.0 + 3;
bool passed = true;
for(size_t i = 0; i < arraySize; ++i){
if(h_array[i] != expected){
passed = false;
std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
break;
}
}
if(passed){
std::cerr << "Validation passed." << std::endl;
}
HIP_CHECK(hipGraphExecDestroy(graphExec));
HIP_CHECK(hipStreamDestroy(graphStream));
}
.. literalinclude:: ../../tools/example_codes/graph_creation.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
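The dependency structure that the explicit graph example builds can be modelled on the host without any HIP calls. The sketch below is only an illustration of how nodes and edges determine a valid execution order: ``TaskGraph``, ``addNode``, ``executionOrder`` and ``buildExampleGraph`` are hypothetical names, and the ordering uses Kahn's algorithm, which is not necessarily how the HIP runtime schedules graph nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Host-only model of a dependency graph. This is not the HIP graph API.
struct TaskGraph {
    std::vector<std::string> names;
    std::vector<std::vector<std::size_t>> successors;
    std::vector<std::size_t> dependencyCount;

    // Adds a node that may only run after all nodes listed in `deps`.
    std::size_t addNode(const std::string& name, const std::vector<std::size_t>& deps) {
        const std::size_t id = names.size();
        names.push_back(name);
        successors.emplace_back();
        dependencyCount.push_back(deps.size());
        for (std::size_t dep : deps) {
            successors[dep].push_back(id);
        }
        return id;
    }

    // Kahn's algorithm: repeatedly run a node whose dependencies are all done.
    std::vector<std::string> executionOrder() const {
        std::vector<std::size_t> remaining = dependencyCount;
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < remaining.size(); ++i) {
            if (remaining[i] == 0) { ready.push(i); }
        }
        std::vector<std::string> order;
        while (!ready.empty()) {
            const std::size_t id = ready.front();
            ready.pop();
            order.push_back(names[id]);
            for (std::size_t next : successors[id]) {
                if (--remaining[next] == 0) { ready.push(next); }
            }
        }
        return order;
    }
};

// Mirrors the nodes and dependencies of the explicit graph creation example.
TaskGraph buildExampleGraph() {
    TaskGraph g;
    const std::size_t allocA = g.addNode("allocA", {});
    const std::size_t allocB = g.addNode("allocB", {});
    const std::size_t hostInit = g.addNode("hostInit", {});
    const std::size_t copyToDevice = g.addNode("copyToDevice", {allocA, hostInit});
    const std::size_t kernelA = g.addNode("kernelA", {copyToDevice});
    const std::size_t kernelB = g.addNode("kernelB", {allocB});
    const std::size_t kernelC = g.addNode("kernelC", {kernelA, kernelB});
    const std::size_t copyToHost = g.addNode("copyToHost", {kernelC});
    g.addNode("freeA", {copyToHost});
    g.addNode("freeB", {kernelC});
    return g;
}
```

Calling ``buildExampleGraph().executionOrder()`` yields one valid ordering of the ten nodes. Nodes without a path between them, such as ``kernelA`` and ``kernelB``, have no required order relative to each other, which is what allows a graph runtime to overlap them.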
+4 -18
@@ -66,24 +66,10 @@ which can be used to loop over the available GPUs.
Example code for querying GPUs:
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
int main() {
int deviceCount;
if (hipGetDeviceCount(&deviceCount) == hipSuccess){
for (int i = 0; i < deviceCount; ++i){
hipDeviceProp_t prop;
if ( hipGetDeviceProperties(&prop, i) == hipSuccess)
std::cout << "Device " << i << ": " << prop.name << std::endl;
}
}
return 0;
}
.. literalinclude:: ../../tools/example_codes/simple_device_query.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Setting the GPU
--------------------------------------------------------------------------------
+8 -110
@@ -47,61 +47,10 @@ C++ application.
**Example:** Using pageable host memory in HIP
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <cstring>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
int main()
{
const int element_number = 100;
int *host_input, *host_output;
// Host allocation
host_input = new int[element_number];
host_output = new int[element_number];
// Host data preparation
for (int i = 0; i < element_number; i++) {
host_input[i] = i;
}
memset(host_output, 0, element_number * sizeof(int));
int *device_input, *device_output;
// Device allocation
HIP_CHECK(hipMalloc((int **)&device_input, element_number * sizeof(int)));
HIP_CHECK(hipMalloc((int **)&device_output, element_number * sizeof(int)));
// Device data preparation
HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
// Run the kernel
// ...
// Copy the results back to the host
HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
// Free host memory
delete[] host_input;
delete[] host_output;
// Free device memory
HIP_CHECK(hipFree(device_input));
HIP_CHECK(hipFree(device_output));
}
.. literalinclude:: ../../../tools/example_codes/pageable_host_memory.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
.. note::
@@ -133,61 +82,10 @@ processes, which can negatively impact the overall performance of the host.
**Example:** Using pinned memory in HIP
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <cstring>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
int main()
{
const int element_number = 100;
int *host_input, *host_output;
// Host allocation
HIP_CHECK(hipHostMalloc((int **)&host_input, element_number * sizeof(int)));
HIP_CHECK(hipHostMalloc((int **)&host_output, element_number * sizeof(int)));
// Host data preparation
for (int i = 0; i < element_number; i++) {
host_input[i] = i;
}
memset(host_output, 0, element_number * sizeof(int));
int *device_input, *device_output;
// Device allocation
HIP_CHECK(hipMalloc((int **)&device_input, element_number * sizeof(int)));
HIP_CHECK(hipMalloc((int **)&device_output, element_number * sizeof(int)));
// Device data preparation
HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
// Run the kernel
// ...
// Copy the results back to the host
HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
// Free pinned host memory; it was allocated with hipHostMalloc, not new[]
HIP_CHECK(hipHostFree(host_input));
HIP_CHECK(hipHostFree(host_output));
// Free device memory
HIP_CHECK(hipFree(device_input));
HIP_CHECK(hipFree(device_output));
}
.. literalinclude:: ../../../tools/example_codes/pinned_host_memory.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
.. _memory_allocation_flags:
+29 -272
@@ -37,102 +37,17 @@ Here is how to use stream ordered memory allocation:
.. tab-set::
.. tab-item:: Stream Ordered Memory Allocation
.. code-block:: cpp
#include <iostream>
#include <hip/hip_runtime.h>
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, size_t numElements) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements) {
data[tid] = tid * 2;
}
}
int main() {
// Initialize HIP.
hipInit(0);
// Stream 0.
constexpr hipStream_t streamId = 0;
// Allocate memory with stream ordered semantics.
constexpr size_t numElements = 1024;
int* devData;
hipMallocAsync(&devData, numElements * sizeof(*devData), streamId);
// Launch the kernel to perform computation.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
myKernel<<<gridSize, blockSize>>>(devData, numElements);
// Copy data back to host.
int* hostData = new int[numElements];
hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost);
// Print the array.
for (size_t i = 0; i < numElements; ++i) {
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
}
// Free memory with stream ordered semantics.
hipFreeAsync(devData, streamId);
delete[] hostData;
// Synchronize to ensure completion.
hipDeviceSynchronize();
return 0;
}
.. literalinclude:: ../../../tools/example_codes/stream_ordered_memory_allocation.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
.. tab-item:: Ordinary Allocation
.. code-block:: cpp
#include <iostream>
#include <hip/hip_runtime.h>
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, size_t numElements) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements) {
data[tid] = tid * 2;
}
}
int main() {
// Initialize HIP.
hipInit(0);
// Allocate memory.
constexpr size_t numElements = 1024;
int* devData;
hipMalloc(&devData, numElements * sizeof(*devData));
// Launch the kernel to perform computation.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
myKernel<<<gridSize, blockSize>>>(devData, numElements);
// Copy data back to host.
int* hostData = new int[numElements];
hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost);
// Print the array.
for (size_t i = 0; i < numElements; ++i) {
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
}
// Free memory.
hipFree(devData);
delete[] hostData;
// Synchronize to ensure completion.
hipDeviceSynchronize();
return 0;
}
.. literalinclude:: ../../../tools/example_codes/ordinary_memory_allocation.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
For more details, see :ref:`stream_ordered_memory_allocator_reference`.
@@ -148,121 +63,29 @@ The ``hipMallocAsync()`` function uses the current memory pool and also provides
Unlike NVIDIA CUDA, where stream-ordered memory allocation can be implicit, stream-ordered allocation in ROCm HIP is explicit. You manage memory allocation for each stream yourself, which gives precise control over memory usage and synchronization.
.. code-block:: cpp
#include <iostream>
#include <hip/hip_runtime.h>
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, size_t numElements) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements) {
data[tid] = tid * 2;
}
}
int main() {
// Create a stream.
hipStream_t stream;
hipStreamCreate(&stream);
// Create a memory pool with default properties.
hipMemPoolProps poolProps = {};
poolProps.allocType = hipMemAllocationTypePinned;
poolProps.handleTypes = hipMemHandleTypePosixFileDescriptor;
poolProps.location.type = hipMemLocationTypeDevice;
poolProps.location.id = 0; // Assuming device 0.
hipMemPool_t memPool;
hipMemPoolCreate(&memPool, &poolProps);
// Allocate memory from the pool asynchronously.
constexpr size_t numElements = 1024;
int* devData = nullptr;
hipMallocFromPoolAsync(&devData, numElements * sizeof(*devData), memPool, stream);
// Define grid and block sizes.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
// Launch the kernel to perform computation.
myKernel<<<gridSize, blockSize, 0, stream>>>(devData, numElements);
// Synchronize the stream.
hipStreamSynchronize(stream);
// Copy data back to host.
int* hostData = new int[numElements];
hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost);
// Print the array.
for (size_t i = 0; i < numElements; ++i) {
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
}
// Free the allocated memory.
hipFreeAsync(devData, stream);
// Synchronize the stream again to ensure all operations are complete.
hipStreamSynchronize(stream);
// Destroy the memory pool and stream.
hipMemPoolDestroy(memPool);
hipStreamDestroy(stream);
// Free host memory.
delete[] hostData;
return 0;
}
.. literalinclude:: ../../../tools/example_codes/memory_pool.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Trim pools
----------
The memory allocator allows you to allocate and free memory in stream order. To control memory usage, set the release threshold attribute using ``hipMemPoolAttrReleaseThreshold``. This threshold specifies the amount of reserved memory, in bytes, that the pool holds onto before it tries to release memory back to the operating system.
.. code-block:: cpp
uint64_t threshold = UINT64_MAX;
hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold);
.. literalinclude:: ../../../tools/example_codes/memory_pool_threshold.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
When the amount of memory held in the memory pool exceeds the threshold, the allocator tries to release memory back to the operating system during the next call to stream, event, or context synchronization.
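The effect of the release threshold can be sketched with a small host-only model. This is a simplified illustration and not the HIP runtime's actual algorithm: the function ``releasableBytes`` and its parameters are hypothetical names, and the model assumes that only reserved memory not backing a live allocation can be returned, and that the pool shrinks no further than the threshold.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model (an assumption for illustration, not the HIP runtime's logic):
// estimates how many bytes a pool could hand back to the operating system
// at the next synchronization point.
constexpr std::uint64_t releasableBytes(std::uint64_t reservedBytes,
                                        std::uint64_t usedBytes,
                                        std::uint64_t releaseThreshold) {
    // At or below the threshold, the pool keeps everything it has reserved.
    if (reservedBytes <= releaseThreshold) {
        return 0;
    }
    // Only memory that is reserved but not backing a live allocation can go back.
    const std::uint64_t unusedBytes = reservedBytes - std::min(usedBytes, reservedBytes);
    // The pool only needs to shrink down to the threshold, not below it.
    const std::uint64_t excessBytes = reservedBytes - releaseThreshold;
    return std::min(unusedBytes, excessBytes);
}
```

In this model, a threshold of ``UINT64_MAX`` never releases anything, which is why that value is a common way to keep reserved memory cached across synchronization points.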
To improve performance, it is good practice to adjust the memory pool size using ``hipMemPoolTrimTo()``. It reclaims unused memory from an oversized pool, which reduces the memory footprint of your application.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
int main() {
hipMemPool_t memPool;
hipDevice_t device = 0; // Specify the device index.
// Initialize the device.
hipSetDevice(device);
// Get the default memory pool for the device.
hipDeviceGetDefaultMemPool(&memPool, device);
// Allocate memory from the pool (e.g., 1 MB).
size_t allocSize = 1 * 1024 * 1024;
void* ptr;
hipMalloc(&ptr, allocSize);
// Free the allocated memory.
hipFree(ptr);
// Trim the memory pool to a specific size (e.g., 512 KB).
size_t newSize = 512 * 1024;
hipMemPoolTrimTo(memPool, newSize);
// Clean up.
hipMemPoolDestroy(memPool);
std::cout << "Memory pool trimmed to " << newSize << " bytes." << std::endl;
return 0;
}
.. literalinclude:: ../../../tools/example_codes/memory_pool_trim.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Resource usage statistics
-------------------------
@@ -276,81 +99,10 @@ Resource usage statistics help in optimization. Here is the list of pool attribu
To reset these attributes to the current value, use ``hipMemPoolSetAttribute()``.
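The relationship between a "current" value and its "high" watermark can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the runtime's bookkeeping: ``PoolUsage``, ``recordAlloc``, ``recordFree``, and ``resetHighWatermark`` are hypothetical names introduced here for illustration only.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for one current/high attribute pair, for example
// hipMemPoolAttrUsedMemCurrent and hipMemPoolAttrUsedMemHigh.
struct PoolUsage {
    std::uint64_t current = 0;  // bytes in use right now
    std::uint64_t high = 0;     // largest value of `current` since the last reset
};

// An allocation raises the current value and may push the watermark up.
constexpr PoolUsage recordAlloc(PoolUsage usage, std::uint64_t bytes) {
    usage.current += bytes;
    usage.high = std::max(usage.high, usage.current);
    return usage;
}

// A free lowers the current value but leaves the watermark untouched.
constexpr PoolUsage recordFree(PoolUsage usage, std::uint64_t bytes) {
    usage.current -= std::min(bytes, usage.current);
    return usage;
}

// Resetting the watermark sets it back to the current value.
constexpr PoolUsage resetHighWatermark(PoolUsage usage) {
    usage.high = usage.current;
    return usage;
}
```

In this model, the watermark keeps the peak even after memory is freed, and only a reset brings it back down to the current value.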
.. code-block:: cpp
#include <iostream>
#include <hip/hip_runtime.h>
// Sample helper functions for getting the usage statistics in bulk.
struct usageStatistics {
uint64_t reservedMemCurrent;
uint64_t reservedMemHigh;
uint64_t usedMemCurrent;
uint64_t usedMemHigh;
};
void getUsageStatistics(hipMemPool_t memPool, struct usageStatistics *statistics) {
hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemCurrent, &statistics->reservedMemCurrent);
hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &statistics->reservedMemHigh);
hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemCurrent, &statistics->usedMemCurrent);
hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &statistics->usedMemHigh);
}
// Resetting the watermarks resets them to the current value.
void resetStatistics(hipMemPool_t memPool) {
uint64_t value = 0;
hipMemPoolSetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &value);
hipMemPoolSetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &value);
}
int main() {
hipMemPool_t memPool;
hipDevice_t device = 0; // Specify the device index.
// Initialize the device.
hipSetDevice(device);
// Get the default memory pool for the device.
hipDeviceGetDefaultMemPool(&memPool, device);
// Allocate memory from the pool (e.g., 1 MB).
size_t allocSize = 1 * 1024 * 1024;
void* ptr;
hipMalloc(&ptr, allocSize);
// Free the allocated memory.
hipFree(ptr);
// Trim the memory pool to a specific size (e.g., 512 KB).
size_t newSize = 512 * 1024;
hipMemPoolTrimTo(memPool, newSize);
// Get and print usage statistics before resetting.
usageStatistics statsBefore;
getUsageStatistics(memPool, &statsBefore);
std::cout << "Before resetting statistics:" << std::endl;
std::cout << "Reserved Memory Current: " << statsBefore.reservedMemCurrent << " bytes" << std::endl;
std::cout << "Reserved Memory High: " << statsBefore.reservedMemHigh << " bytes" << std::endl;
std::cout << "Used Memory Current: " << statsBefore.usedMemCurrent << " bytes" << std::endl;
std::cout << "Used Memory High: " << statsBefore.usedMemHigh << " bytes" << std::endl;
// Reset the statistics.
resetStatistics(memPool);
// Get and print usage statistics after resetting.
usageStatistics statsAfter;
getUsageStatistics(memPool, &statsAfter);
std::cout << "After resetting statistics:" << std::endl;
std::cout << "Reserved Memory Current: " << statsAfter.reservedMemCurrent << " bytes" << std::endl;
std::cout << "Reserved Memory High: " << statsAfter.reservedMemHigh << " bytes" << std::endl;
std::cout << "Used Memory Current: " << statsAfter.usedMemCurrent << " bytes" << std::endl;
std::cout << "Used Memory High: " << statsAfter.usedMemHigh << " bytes" << std::endl;
// Clean up.
hipMemPoolDestroy(memPool);
return 0;
}
.. literalinclude:: ../../../tools/example_codes/memory_pool_resource_usage_statistics.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Memory reuse policies
---------------------
@@ -369,6 +121,11 @@ Allocations are initially accessible from the device where they reside.
Interprocess memory handling
=============================
.. attention::
IPC API calls are only supported on systems with an active ``amdgpu-dkms`` driver. Please refer to the
`AMDGPU documentation <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/index.html>`__ for information
on how to install ``amdgpu-dkms``.
Interprocess capable (IPC) memory pools facilitate efficient and secure sharing of GPU memory between processes.
To achieve interprocess memory sharing, you can use either :ref:`device pointer <device-pointer>` or :ref:`shareable handle <shareable-handle>`. Both provide allocator (export) and consumer (import) interfaces.
+28 -373
@@ -303,207 +303,35 @@ explicit memory management example is presented in the last tab.
.. tab-item:: hipMallocManaged()
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/dynamic_unified_memory.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 22-25
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main() {
int *a, *b, *c;
// Allocate memory for a, b and c that is accessible to both device and host codes.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Setup input values.
*a = 1;
*b = 2;
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Print the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
:language: cpp
.. tab-item:: __managed__
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/static_unified_memory.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 19-20
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
// Declare a, b and c as static variables.
__managed__ int a, b, c;
int main() {
// Setup input values.
a = 1;
b = 2;
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, &a, &b, &c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Prints the result.
std::cout << a << " + " << b << " = " << c << std::endl;
return 0;
}
:language: cpp
.. tab-item:: new
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/standard_unified_memory.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 21-24
#include <hip/hip_runtime.h>
#include <iostream>
#include <new>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int* a, int* b, int* c) {
*c = *a + *b;
}
// This example requires HMM support and the environment variable HSA_XNACK needs to be set to 1
int main() {
// Allocate memory with proper alignment for performance
int *a = new(std::align_val_t(128)) int[1];
int *b = new(std::align_val_t(128)) int[1];
int *c = new(std::align_val_t(128)) int[1];
// Setup input values.
*a = 1;
*b = 2;
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Prints the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory with matching aligned delete.
::operator delete[](a, std::align_val_t(128));
::operator delete[](b, std::align_val_t(128));
::operator delete[](c, std::align_val_t(128));
return 0;
}
:language: cpp
.. tab-item:: Explicit Memory Management
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/explicit_memory.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 27-34, 39-40
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main() {
int a, b, c;
int *d_a, *d_b, *d_c;
// Setup input values.
a = 1;
b = 2;
// Allocate device copies of a, b and c.
HIP_CHECK(hipMalloc(&d_a, sizeof(*d_a)));
HIP_CHECK(hipMalloc(&d_b, sizeof(*d_b)));
HIP_CHECK(hipMalloc(&d_c, sizeof(*d_c)));
// Copy input values to device.
HIP_CHECK(hipMemcpy(d_a, &a, sizeof(*d_a), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, &b, sizeof(*d_b), hipMemcpyHostToDevice));
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, d_a, d_b, d_c);
// Copy the result back to the host.
HIP_CHECK(hipMemcpy(&c, d_c, sizeof(*d_c), hipMemcpyDeviceToHost));
// Cleanup allocated memory.
HIP_CHECK(hipFree(d_a));
HIP_CHECK(hipFree(d_b));
HIP_CHECK(hipFree(d_c));
// Prints the result.
std::cout << a << " + " << b << " = " << c << std::endl;
return 0;
}
:language: cpp
.. _using unified memory:
@@ -559,65 +387,11 @@ Data prefetching is a technique used to improve the performance of your
application by moving data to the desired device before it's actually
needed. ``hipCpuDeviceId`` is a special constant to specify the CPU as target.
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/data_prefetching.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 33-36,41-42
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main() {
int *a, *b, *c;
int deviceId;
HIP_CHECK(hipGetDevice(&deviceId)); // Get the current device ID
// Allocate memory for a, b and c that is accessible to both device and host codes.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Setup input values.
*a = 1;
*b = 2;
// Prefetch the data to the GPU device.
HIP_CHECK(hipMemPrefetchAsync(a, sizeof(*a), deviceId, 0));
HIP_CHECK(hipMemPrefetchAsync(b, sizeof(*b), deviceId, 0));
HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), deviceId, 0));
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
// Prefetch the result back to the CPU.
HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), hipCpuDeviceId, 0));
// Wait for the prefetch operations to complete.
HIP_CHECK(hipDeviceSynchronize());
// Prints the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
:language: cpp
Memory advice
--------------------------------------------------------------------------------
@@ -642,71 +416,11 @@ impact on performance can vary based on the specific use case and the system.
The following is the updated version of the example above with memory advice
instead of prefetching.
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/unified_memory_advice.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 29-41
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main() {
int deviceId;
HIP_CHECK(hipGetDevice(&deviceId));
int *a, *b, *c;
// Allocate memory for a, b, and c accessible to both device and host codes.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Set memory advice for a and b to be read, located on and accessed by the GPU.
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetPreferredLocation, deviceId));
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetAccessedBy, deviceId));
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetPreferredLocation, deviceId));
HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetAccessedBy, deviceId));
HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetReadMostly, deviceId));
// Set memory advice for c to be read, located on and accessed by the CPU.
HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetPreferredLocation, hipCpuDeviceId));
HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetAccessedBy, hipCpuDeviceId));
HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetReadMostly, hipCpuDeviceId));
// Setup input values.
*a = 1;
*b = 2;
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Prints the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
:language: cpp
Memory range attributes
--------------------------------------------------------------------------------
@@ -714,70 +428,11 @@ Memory range attributes
:cpp:func:`hipMemRangeGetAttribute()` allows you to query attributes of a given
memory range. The attributes are given in :cpp:enum:`hipMemRangeAttribute`.
.. code-block:: cpp
.. literalinclude:: ../../../tools/example_codes/memory_range_attributes.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 44-49
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess){ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main() {
int *a, *b, *c;
unsigned int attributeValue;
constexpr size_t attributeSize = sizeof(attributeValue);
int deviceId;
HIP_CHECK(hipGetDevice(&deviceId));
// Allocate memory for a, b and c that is accessible to both device and host codes.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Setup input values.
*a = 1;
*b = 2;
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
// Launch add() kernel on GPU.
hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Query an attribute of the memory range.
HIP_CHECK(hipMemRangeGetAttribute(&attributeValue,
attributeSize,
hipMemRangeAttributeReadMostly,
a,
sizeof(*a)));
// Prints the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
std::cout << "The array a is" << (attributeValue == 1 ? "" : " NOT") << " set to hipMemRangeAttributeReadMostly" << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
:language: cpp
Asynchronously attach memory to a stream
--------------------------------------------------------------------------------
+26 -349
@@ -22,43 +22,10 @@ dynamic selections during runtime to ensure optimal performance.
If the application does not define a specific GPU, device 0 is selected.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
int main()
{
int deviceCount;
hipGetDeviceCount(&deviceCount);
std::cout << "Number of devices: " << deviceCount << std::endl;
for (int deviceId = 0; deviceId < deviceCount; ++deviceId)
{
hipDeviceProp_t deviceProp;
hipGetDeviceProperties(&deviceProp, deviceId);
std::cout << "Device " << deviceId << std::endl << " Properties:" << std::endl;
std::cout << " Name: " << deviceProp.name << std::endl;
std::cout << " Total Global Memory: " << deviceProp.totalGlobalMem / (1024 * 1024) << " MiB" << std::endl;
std::cout << " Shared Memory per Block: " << deviceProp.sharedMemPerBlock / 1024 << " KiB" << std::endl;
std::cout << " Registers per Block: " << deviceProp.regsPerBlock << std::endl;
std::cout << " Warp Size: " << deviceProp.warpSize << std::endl;
std::cout << " Max Threads per Block: " << deviceProp.maxThreadsPerBlock << std::endl;
std::cout << " Max Threads per Multiprocessor: " << deviceProp.maxThreadsPerMultiProcessor << std::endl;
std::cout << " Number of Multiprocessors: " << deviceProp.multiProcessorCount << std::endl;
std::cout << " Max Threads Dimensions: ["
<< deviceProp.maxThreadsDim[0] << ", "
<< deviceProp.maxThreadsDim[1] << ", "
<< deviceProp.maxThreadsDim[2] << "]" << std::endl;
std::cout << " Max Grid Size: ["
<< deviceProp.maxGridSize[0] << ", "
<< deviceProp.maxGridSize[1] << ", "
<< deviceProp.maxGridSize[2] << "]" << std::endl;
std::cout << std::endl;
}
return 0;
}
.. literalinclude:: ../../tools/example_codes/device_enumeration.cpp
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
.. _multi_device_selection:
@@ -72,71 +39,10 @@ different GPUs might have different capabilities or workloads. By selecting the
appropriate device, you ensure that the computational tasks are directed to the
correct GPU, optimizing performance and resource utilization.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) { \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
exit(status); \
} \
}
__global__ void simpleKernel(double *data)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = idx * 2.0;
}
int main()
{
double* deviceData0;
double* deviceData1;
size_t size = 1024 * sizeof(*deviceData0);
int deviceId0 = 0;
int deviceId1 = 1;
// Set device 0 and perform operations
HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
simpleKernel<<<1000, 128>>>(deviceData0); // Launch kernel on device 0
HIP_CHECK(hipDeviceSynchronize());
// Set device 1 and perform operations
HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
simpleKernel<<<1000, 128>>>(deviceData1); // Launch kernel on device 1
HIP_CHECK(hipDeviceSynchronize());
// Copy result from device 0
double hostData0[1024];
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
// Copy result from device 1
double hostData1[1024];
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
// Display results from both devices
std::cout << "Device 0 data: " << hostData0[0] << std::endl;
std::cout << "Device 1 data: " << hostData1[0] << std::endl;
// Free device memory
HIP_CHECK(hipFree(deviceData0));
HIP_CHECK(hipFree(deviceData1));
return 0;
}
.. literalinclude:: ../../tools/example_codes/device_selection.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Stream and event behavior
===============================================================================
@@ -151,100 +57,10 @@ conditions and optimizes data flow in multi-GPU systems. Together, streams and
events maximize performance by enabling parallel execution, load balancing, and
effective resource utilization across heterogeneous hardware.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
__global__ void simpleKernel(double *data)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = idx * 2.0;
}
int main()
{
int numDevices;
hipGetDeviceCount(&numDevices);
if (numDevices < 2) {
std::cerr << "This example requires at least two GPUs." << std::endl;
return -1;
}
double *deviceData0, *deviceData1;
size_t size = 1024 * sizeof(*deviceData0);
// Create streams and events for each device
hipStream_t stream0, stream1;
hipEvent_t startEvent0, stopEvent0, startEvent1, stopEvent1;
// Initialize device 0
hipSetDevice(0);
hipStreamCreate(&stream0);
hipEventCreate(&startEvent0);
hipEventCreate(&stopEvent0);
hipMalloc(&deviceData0, size);
// Initialize device 1
hipSetDevice(1);
hipStreamCreate(&stream1);
hipEventCreate(&startEvent1);
hipEventCreate(&stopEvent1);
hipMalloc(&deviceData1, size);
// Record the start event on device 0
hipSetDevice(0);
hipEventRecord(startEvent0, stream0);
// Launch the kernel asynchronously on device 0
simpleKernel<<<1000, 128, 0, stream0>>>(deviceData0);
// Record the stop event on device 0
hipEventRecord(stopEvent0, stream0);
// Wait for the stop event on device 0 to complete
hipEventSynchronize(stopEvent0);
// Record the start event on device 1
hipSetDevice(1);
hipEventRecord(startEvent1, stream1);
// Launch the kernel asynchronously on device 1
simpleKernel<<<1000, 128, 0, stream1>>>(deviceData1);
// Record the stop event on device 1
hipEventRecord(stopEvent1, stream1);
// Wait for the stop event on device 1 to complete
hipEventSynchronize(stopEvent1);
// Calculate elapsed time between the events for both devices
float milliseconds0 = 0, milliseconds1 = 0;
hipEventElapsedTime(&milliseconds0, startEvent0, stopEvent0);
hipEventElapsedTime(&milliseconds1, startEvent1, stopEvent1);
std::cout << "Elapsed time on GPU 0: " << milliseconds0 << " ms" << std::endl;
std::cout << "Elapsed time on GPU 1: " << milliseconds1 << " ms" << std::endl;
// Cleanup for device 0
hipSetDevice(0);
hipEventDestroy(startEvent0);
hipEventDestroy(stopEvent0);
hipStreamSynchronize(stream0);
hipStreamDestroy(stream0);
hipFree(deviceData0);
// Cleanup for device 1
hipSetDevice(1);
hipEventDestroy(startEvent1);
hipEventDestroy(stopEvent1);
hipStreamSynchronize(stream1);
hipStreamDestroy(stream1);
hipFree(deviceData1);
return 0;
}
.. literalinclude:: ../../tools/example_codes/multi_device_synchronization.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
Peer-to-peer memory access
===============================================================================
@@ -257,164 +73,25 @@ applications that require frequent data exchange between GPUs, as it eliminates
the need to transfer data through the host memory.
By adding peer-to-peer access to the example referenced in
:ref:`multi_device_selection`, data can be copied between devices:
:ref:`multi_device_selection`, data can be efficiently copied between devices.
If peer-to-peer access is not activated, the call to :cpp:func:`hipMemcpy`
still works but internally uses a staging buffer in host memory, which incurs a
performance penalty.
.. tab-set::
.. tab-item:: with peer-to-peer
.. code-block:: cpp
:emphasize-lines: 31-37, 51-55
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) { \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
exit(status); \
} \
}
__global__ void simpleKernel(double *data)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = idx * 2.0;
}
int main()
{
double* deviceData0;
double* deviceData1;
size_t size = 1024 * sizeof(*deviceData0);
int deviceId0 = 0;
int deviceId1 = 1;
// Enable peer access to the memory (allocated and future) on the peer device.
// Ensure the device is active before enabling peer access.
hipSetDevice(deviceId0);
hipDeviceEnablePeerAccess(deviceId1, 0);
hipSetDevice(deviceId1);
hipDeviceEnablePeerAccess(deviceId0, 0);
// Set device 0 and perform operations
HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
simpleKernel<<<1000, 128>>>(deviceData0); // Launch kernel on device 0
HIP_CHECK(hipDeviceSynchronize());
// Set device 1 and perform operations
HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
simpleKernel<<<1000, 128>>>(deviceData1); // Launch kernel on device 1
HIP_CHECK(hipDeviceSynchronize());
// Use peer-to-peer access
hipSetDevice(deviceId0);
// Now device 0 can access memory allocated on device 1
hipMemcpy(deviceData0, deviceData1, size, hipMemcpyDeviceToDevice);
// Copy result from device 0
double hostData0[1024];
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
// Copy result from device 1
double hostData1[1024];
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
// Display results from both devices
std::cout << "Device 0 data: " << hostData0[0] << std::endl;
std::cout << "Device 1 data: " << hostData1[0] << std::endl;
// Free device memory
HIP_CHECK(hipFree(deviceData0));
HIP_CHECK(hipFree(deviceData1));
return 0;
}
.. literalinclude:: ../../tools/example_codes/p2p_memory_access.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 43-49, 63-67
:language: cpp
.. tab-item:: without peer-to-peer
.. code-block:: cpp
:emphasize-lines: 43-49, 53, 58
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) { \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
exit(status); \
} \
}
__global__ void simpleKernel(double *data)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = idx * 2.0;
}
int main()
{
double* deviceData0;
double* deviceData1;
size_t size = 1024 * sizeof(*deviceData0);
int deviceId0 = 0;
int deviceId1 = 1;
// Set device 0 and perform operations
HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
simpleKernel<<<1000, 128>>>(deviceData0); // Launch kernel on device 0
HIP_CHECK(hipDeviceSynchronize());
// Set device 1 and perform operations
HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
simpleKernel<<<1000, 128>>>(deviceData1); // Launch kernel on device 1
HIP_CHECK(hipDeviceSynchronize());
// Attempt to use deviceData0 on device 1 (This will not work as deviceData0 is allocated on device 0)
HIP_CHECK(hipSetDevice(deviceId1));
hipError_t err = hipMemcpy(deviceData1, deviceData0, size, hipMemcpyDeviceToDevice); // This should fail
if (err != hipSuccess)
{
std::cout << "Error: Cannot access deviceData0 from device 1, deviceData0 is on device 0" << std::endl;
}
// Copy result from device 0
double hostData0[1024];
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
// Copy result from device 1
double hostData1[1024];
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
// Display results from both devices
std::cout << "Device 0 data: " << hostData0[0] << std::endl;
std::cout << "Device 1 data: " << hostData1[0] << std::endl;
// Free device memory
HIP_CHECK(hipFree(deviceData0));
HIP_CHECK(hipFree(deviceData1));
return 0;
}
.. literalinclude:: ../../tools/example_codes/p2p_memory_access_host_staging.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:emphasize-lines: 55-57
:language: cpp
+2 -2
@@ -38,8 +38,7 @@ The HIP documentation is organized into the following categories:
* {doc}`./how-to/hip_runtime_api`
* {doc}`./how-to/hip_cpp_language_extensions`
* {doc}`./how-to/kernel_language_cpp_support`
* [HIP porting guide](./how-to/hip_porting_guide)
* [HIP porting: driver API guide](./how-to/hip_porting_driver_api)
* {doc}`./how-to/hip_porting_guide`
* {doc}`./how-to/hip_rtc`
* {doc}`./understand/amd_clr`
@@ -66,6 +65,7 @@ The HIP documentation is organized into the following categories:
* [SAXPY tutorial](./tutorial/saxpy)
* [Reduction tutorial](./tutorial/reduction)
* [Cooperative groups tutorial](./tutorial/cooperative_groups_tutorial)
* [HIP Graph API tutorial](./tutorial/graph_api)
:::
+4 -86
@@ -11,92 +11,10 @@ example and comparison table. For a complete list of mappings, visit :ref:`HIPIF
The following CUDA code example illustrates several CUDA API syntaxes.
.. code-block:: cpp
#include <iostream>
#include <vector>
#include <cuda_runtime.h>
__global__ void block_reduction(const float* input, float* output, int num_elements)
{
extern __shared__ float s_data[];
int tid = threadIdx.x;
int global_id = blockDim.x * blockIdx.x + tid;
if (global_id < num_elements)
{
s_data[tid] = input[global_id];
}
else
{
s_data[tid] = 0.0f;
}
__syncthreads();
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
{
if (tid < stride)
{
s_data[tid] += s_data[tid + stride];
}
__syncthreads();
}
if (tid == 0)
{
output[blockIdx.x] = s_data[0];
}
}
int main()
{
int threads = 256;
const int num_elements = 50000;
std::vector<float> h_a(num_elements);
std::vector<float> h_b((num_elements + threads - 1) / threads);
for (int i = 0; i < num_elements; ++i)
{
h_a[i] = rand() / static_cast<float>(RAND_MAX);
}
float *d_a, *d_b;
cudaMalloc(&d_a, h_a.size() * sizeof(float));
cudaMalloc(&d_b, h_b.size() * sizeof(float));
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);
cudaMemcpyAsync(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
cudaEventRecord(start_event, stream);
int blocks = (num_elements + threads - 1) / threads;
block_reduction<<<blocks, threads, threads * sizeof(float), stream>>>(d_a, d_b, num_elements);
cudaMemcpyAsync(h_b.data(), d_b, h_b.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
cudaEventRecord(stop_event, stream);
cudaEventSynchronize(stop_event);
float milliseconds = 0.f;
cudaEventElapsedTime(&milliseconds, start_event, stop_event);
std::cout << "Kernel execution time: " << milliseconds << " ms\n";
cudaFree(d_a);
cudaFree(d_b);
cudaEventDestroy(start_event);
cudaEventDestroy(stop_event);
cudaStreamDestroy(stream);
return 0;
}
.. literalinclude:: ../tools/example_codes/block_reduction.cu
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
The following table maps CUDA API functions to corresponding HIP API functions, as demonstrated in the
preceding code examples.
+4 -114
@@ -337,117 +337,7 @@ The kernel function ``computeDFT`` shows various HIP complex math operations in
The example also demonstrates proper use of complex number handling on both host and device, including
memory allocation, transfer, and validation of results between CPU and GPU implementations.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <hip/hip_complex.h>
#include <iostream>
#include <vector>
#include <cmath>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) { \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
exit(EXIT_FAILURE); \
} \
}
// Kernel to compute DFT
__global__ void computeDFT(const float* input,
hipFloatComplex* output,
const int N)
{
int k = blockIdx.x * blockDim.x + threadIdx.x;
if (k >= N) return;
hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
for (int n = 0; n < N; n++) {
float angle = -2.0f * M_PI * k * n / N;
hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
sum = hipCaddf(sum, hipCmulf(x, w));
}
output[k] = sum;
}
// CPU implementation of DFT for verification
std::vector<hipFloatComplex> cpuDFT(const std::vector<float>& input) {
const int N = input.size();
std::vector<hipFloatComplex> result(N);
for (int k = 0; k < N; k++) {
hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
for (int n = 0; n < N; n++) {
float angle = -2.0f * M_PI * k * n / N;
hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
sum = hipCaddf(sum, hipCmulf(x, w));
}
result[k] = sum;
}
return result;
}
int main() {
const int N = 256; // Signal length
const int blockSize = 256;
// Generate input signal: sum of two sine waves
std::vector<float> signal(N);
for (int i = 0; i < N; i++) {
float t = static_cast<float>(i) / N;
signal[i] = sinf(2.0f * M_PI * 10.0f * t) + // 10 Hz component
0.5f * sinf(2.0f * M_PI * 20.0f * t); // 20 Hz component
}
// Compute reference solution on CPU
std::vector<hipFloatComplex> cpu_output = cpuDFT(signal);
// Allocate device memory
float* d_signal;
hipFloatComplex* d_output;
HIP_CHECK(hipMalloc(&d_signal, N * sizeof(float)));
HIP_CHECK(hipMalloc(&d_output, N * sizeof(hipFloatComplex)));
// Copy input to device
HIP_CHECK(hipMemcpy(d_signal, signal.data(), N * sizeof(float),
hipMemcpyHostToDevice));
// Launch kernel
dim3 grid((N + blockSize - 1) / blockSize);
dim3 block(blockSize);
computeDFT<<<grid, block>>>(d_signal, d_output, N);
HIP_CHECK(hipGetLastError());
// Get GPU results
std::vector<hipFloatComplex> gpu_output(N);
HIP_CHECK(hipMemcpy(gpu_output.data(), d_output, N * sizeof(hipFloatComplex),
hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
const float tolerance = 1e-5f; // Adjust based on precision requirements
for (int i = 0; i < N; i++) {
float diff_real = std::abs(hipCrealf(gpu_output[i]) - hipCrealf(cpu_output[i]));
float diff_imag = std::abs(hipCimagf(gpu_output[i]) - hipCimagf(cpu_output[i]));
if (diff_real > tolerance || diff_imag > tolerance) {
passed = false;
break;
}
}
std::cout << "DFT Verification: " << (passed ? "PASSED" : "FAILED") << "\n";
// Cleanup
HIP_CHECK(hipFree(d_signal));
HIP_CHECK(hipFree(d_output));
return passed ? 0 : 1;
}
.. literalinclude:: ../tools/example_codes/complex_math.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
+9 -1
@@ -27,7 +27,7 @@ and :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`.
-
* - | ``AMD_LOG_MASK``
| Specifies HIP log filters. Here is the ` complete list of log masks <https://github.com/ROCm/clr/blob/develop/rocclr/utils/debug.hpp#L40>`_.
| Specifies HIP log filters. Here is the `complete list of log masks <https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/utils/debug.hpp#L48>`_.
- ``0x7FFFFFFF``
- | 0x1: Log API calls.
| 0x2: Kernel and copy commands and barriers.
@@ -49,8 +49,16 @@ and :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`.
| 0x20000: Memory allocation.
| 0x40000: Memory pool allocation, including memory in graphs.
| 0x80000: Timestamp details.
| 0x100000: Comgr path information print.
| 0xFFFFFFFF: Always log, even if the mask flag is zero.
* - | ``HIP_FORCE_DEV_KERNARG``
| Forces kernel arguments to be stored in device memory to reduce latency.
| Can reduce kernel launch latency by 2-3 µs for some kernels.
- ``1``
- | 0: Disable
| 1: Enable
* - | ``HIP_LAUNCH_BLOCKING``
| Used for serialization on kernel execution.
- ``0``
+9 -1
@@ -14,7 +14,7 @@ environment variables in HIP are collected in the following table.
* - | ``ROCR_VISIBLE_DEVICES``
| A list of device indices or UUIDs that will be exposed to applications.
- :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`Setting the number of compute units <rocm:how-to/setting-cus>`
- Example: ``0,GPU-DEADBEEFDEADBEEF``
- Example: ``0,GPU-4b2c1a9f-8d3e-6f7a-b5c9-2e4d8a1f6c3b``
* - | ``GPU_DEVICE_ORDINAL``
| Device indices exposed to OpenCL and HIP applications.
@@ -25,3 +25,11 @@ environment variables in HIP are collected in the following table.
| Device indices exposed to HIP applications.
- :doc:`GPU isolation <rocm:conceptual/gpu-isolation>`, :doc:`HIP debugging <hip:how-to/debugging>`
- Example: ``0,2``
.. admonition:: Recommendation
* On Linux, use ``ROCR_VISIBLE_DEVICES``.
* On Windows, use ``HIP_VISIBLE_DEVICES``.
* For portability across different vendors, use ``CUDA_VISIBLE_DEVICES``.
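The values accepted by these variables are comma-separated lists, and ``ROCR_VISIBLE_DEVICES`` may mix device indices with UUIDs. The following is a minimal sketch of how an application could split such a value for logging or diagnostics. ``DeviceEntry`` and ``parse_device_list`` are hypothetical names introduced here for illustration; they are not part of HIP, and the runtime's own parsing rules may differ.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type: one entry of a visible-devices list.
struct DeviceEntry
{
    bool is_uuid;     // true for entries that start with "GPU-"
    std::string text; // the entry exactly as written in the variable
};

// Hypothetical helper: splits a value such as "0,GPU-4b2c..." at commas.
// Empty entries (for example from a trailing comma) are skipped.
std::vector<DeviceEntry> parse_device_list(const std::string& value)
{
    std::vector<DeviceEntry> entries;
    std::istringstream stream(value);
    std::string item;
    while (std::getline(stream, item, ','))
    {
        if (item.empty())
        {
            continue;
        }
        // rfind(prefix, 0) == 0 is a "starts with" test.
        const bool is_uuid = item.rfind("GPU-", 0) == 0;
        entries.push_back({is_uuid, item});
    }
    return entries;
}
```

Passing the example value ``0,GPU-4b2c1a9f-8d3e-6f7a-b5c9-2e4d8a1f6c3b`` yields two entries: an index entry ``0`` and a UUID entry.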
+6
@@ -55,6 +55,12 @@ pages:
- | 0: Disable
| 1: Enable
* - | ``GPU_SINGLE_ALLOC_PERCENT``
| Limits the maximum size of a single memory allocation as a percentage of GPU memory.
- ``100``
- | Unit: Percentage
| Prevents single allocations from consuming all available GPU memory.
* - | ``GPU_MAX_HEAP_SIZE``
| Set maximum size of the GPU heap to % of board memory.
- ``100``
+3 -3
@@ -16,19 +16,19 @@ different features in HIP.
- ``--gpu-architecture=gfx906:sramecc+:xnack``, ``-fgpu-rdc``
* - | ``AMD_COMGR_SAVE_TEMPS``
| Controls the deletion of temporary files generated during the compilation of COMGR. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory.
| Controls the deletion of temporary files generated during the compilation of Comgr. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory.
- Unset by default.
- | 0: Temporary files are deleted automatically.
| Non-zero integer: Temporary files are kept.
* - | ``AMD_COMGR_EMIT_VERBOSE_LOGS``
| Sets logging of COMGR to include additional Comgr-specific informational messages.
| Sets logging of Comgr to include additional Comgr-specific informational messages.
- Unset by default.
- | 0: Verbose log disabled.
| Non-zero integer: Verbose log enabled.
* - | ``AMD_COMGR_REDIRECT_LOGS``
| Controls redirect logs of COMGR.
| Controls where Comgr logs are redirected.
- Unset by default.
- | `stdout` / `-`: Redirected to the standard output.
| `stderr`: Redirected to the error stream.
+4 -82
@@ -24,88 +24,10 @@ The following C++ example shows a simplified method for computing ULP difference
HIP and standard C++ math functions by first finding where the maximum absolute error
occurs.
.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) { \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
exit(EXIT_FAILURE); \
} \
}
// Simple ULP difference calculator
int64_t ulp_diff(float a, float b) {
if (a == b) return 0;
union { float f; int32_t i; } ua{a}, ub{b};
// For negative values, convert to a positive-based representation
if (ua.i < 0) ua.i = std::numeric_limits<int32_t>::max() - ua.i;
if (ub.i < 0) ub.i = std::numeric_limits<int32_t>::max() - ub.i;
return std::abs((int64_t)ua.i - (int64_t)ub.i);
}
// Test kernel
__global__ void test_sin(float* out, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
out[i] = sin(x);
}
}
int main() {
const int n = 1000000;
const int blocksize = 256;
std::vector<float> outputs(n);
float* d_out;
HIP_CHECK(hipMalloc(&d_out, n * sizeof(float)));
dim3 threads(blocksize);
dim3 blocks((n + blocksize - 1) / blocksize); // Fixed grid calculation
test_sin<<<blocks, threads>>>(d_out, n);
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipMemcpy(outputs.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost));
// Step 1: Find the maximum absolute error
double max_abs_error = 0.0;
float max_error_output = 0.0;
float max_error_expected = 0.0;
for (int i = 0; i < n; i++) {
float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
float expected = std::sin(x);
double abs_error = std::abs(outputs[i] - expected);
if (abs_error > max_abs_error) {
max_abs_error = abs_error;
max_error_output = outputs[i];
max_error_expected = expected;
}
}
// Step 2: Compute ULP difference based on the max absolute error pair
int64_t max_ulp = ulp_diff(max_error_output, max_error_expected);
// Output results
std::cout << "Max Absolute Error: " << max_abs_error << std::endl;
std::cout << "Max ULP Difference: " << max_ulp << std::endl;
std::cout << "Max Error Values -> Got: " << max_error_output
<< ", Expected: " << max_error_expected << std::endl;
HIP_CHECK(hipFree(d_out));
return 0;
}
.. literalinclude:: ../tools/example_codes/math.hip
:start-after: // [sphinx-start]
:end-before: // [sphinx-end]
:language: cpp
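As a complementary sketch (not the contents of the referenced ``math.hip`` file), the ULP distance between two ``float`` values can be computed on the host by copying the bits with ``std::memcpy`` rather than using a union. The helper names ``ordered_bits`` and ``ulp_distance`` are assumptions made for illustration, and the code assumes a 32-bit IEEE 754 ``float``.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::int32_t), "assumes a 32-bit float");

// Hypothetical helper: maps a float onto an integer line on which
// adjacent representable values differ by exactly 1.
std::int64_t ordered_bits(float value)
{
    std::int32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // well-defined bit copy
    if (bits >= 0)
    {
        return bits; // non-negative floats are already ordered
    }
    // Negative floats are stored as sign + magnitude, so mirror them
    // below zero. The result equals -(magnitude bits); -0.0f maps to 0.
    return static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()) - bits;
}

// Hypothetical helper: number of representable-float steps between a and b.
std::int64_t ulp_distance(float a, float b)
{
    assert(!std::isnan(a) && !std::isnan(b)); // NaN has no meaningful distance
    return std::llabs(ordered_bits(a) - ordered_bits(b));
}
```

For example, ``ulp_distance(1.0f, std::nextafter(1.0f, 2.0f))`` evaluates to 1, because the two arguments are adjacent ``float`` values. The subtraction is done in 64-bit arithmetic so that distances spanning the full positive and negative ranges do not overflow.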
Standard mathematical functions
===============================
+1 -1
@@ -58,7 +58,6 @@ subtrees:
- file: how-to/hip_cpp_language_extensions
- file: how-to/kernel_language_cpp_support
- file: how-to/hip_porting_guide
- file: how-to/hip_porting_driver_api
- file: how-to/hip_rtc
- file: understand/amd_clr
@@ -127,6 +126,7 @@ subtrees:
- file: tutorial/saxpy
- file: tutorial/reduction
- file: tutorial/cooperative_groups_tutorial
- file: tutorial/graph_api
- caption: About
entries:
+95
@@ -0,0 +1,95 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "example_utils.hpp"
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <vector>
///\brief Calculates \p a[i] = \p a[i] + \p b[i] where \p i stands for the thread's index in the grid.
// [sphinx-kernel-start]
__global__ void AddKernel(float* a, const float* b)
{
int global_idx = threadIdx.x + blockIdx.x * blockDim.x;
a[global_idx] += b[global_idx];
}
// [sphinx-kernel-end]
int main()
{
// The number of float elements in each vector.
constexpr unsigned int size = 1 << 20; // == 1'048'576 elements
// Bytes to allocate for each device vector.
constexpr size_t size_bytes = size * sizeof(float);
// Number of threads per kernel block.
constexpr unsigned int threads_per_block = 256;
// Number of blocks per kernel grid. The expression below calculates ceil(size/block_size).
constexpr unsigned int number_of_blocks = ceiling_div(size, threads_per_block);
// Allocate a vector and fill it with an increasing sequence (i.e. 1, 2, 3, 4...)
std::vector<float> h_a(size);
std::iota(h_a.begin(), h_a.end(), 1.f);
// Allocate b vector and fill it with a decreasing sequence (i.e. 1'048'576, 1'048'575, ..., 3, 2, 1)
std::vector<float> h_b(size);
std::iota(h_b.rbegin(), h_b.rend(), 1.f);
// Allocate and copy vectors to device memory.
float* d_a{};
float* d_b{};
HIP_CHECK(hipMalloc(&d_a, size_bytes));
HIP_CHECK(hipMalloc(&d_b, size_bytes));
HIP_CHECK(hipMemcpy(d_a, h_a.data(), size_bytes, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, h_b.data(), size_bytes, hipMemcpyHostToDevice));
std::cout << "Calculating a[i] = a[i] + b[i] over " << size << " elements." << std::endl;
// Launch the kernel on the default stream.
// [sphinx-kernel-launch-start]
AddKernel<<<number_of_blocks, threads_per_block>>>(d_a, d_b);
// [sphinx-kernel-launch-end]
// Check if the kernel launch was successful.
HIP_CHECK(hipGetLastError());
// Copy the results back to the host. This call blocks the host's execution until the copy is finished.
HIP_CHECK(hipMemcpy(h_a.data(), d_a, size_bytes, hipMemcpyDeviceToHost));
// Free device memory.
HIP_CHECK(hipFree(d_b));
HIP_CHECK(hipFree(d_a));
// Print the first few elements of the results:
constexpr size_t elements_to_print = 10;
std::cout << "First " << elements_to_print << " elements of the results: "
<< format_range(h_a.begin(), h_a.begin() + elements_to_print) << std::endl;
return EXIT_SUCCESS;
}
+142
@@ -0,0 +1,142 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// GPU Kernels
__global__ void kernelA(double* arrayA, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] += 1.0;
}
}
__global__ void kernelB(double* arrayA, double* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayB[x] += arrayA[x] + 3.0;
}
}
int main()
{
constexpr int numOfBlocks = 1 << 20;
constexpr int threadsPerBlock = 1024;
constexpr int numberOfIterations = 50;
// Use a smaller array so the memory copies do not dominate the relatively short kernel runs
constexpr std::size_t arraySize = 1U << 25;
double *d_dataA;
double *d_dataB;
double initValueA = 0.0;
double initValueB = 2.0;
std::vector<double> vectorA(arraySize, initValueA);
std::vector<double> vectorB(arraySize, initValueB);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
// Create streams
hipStream_t streamA, streamB;
HIP_CHECK(hipStreamCreate(&streamA));
HIP_CHECK(hipStreamCreate(&streamB));
for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
{
// Stream 1: Host to Device 1
HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
// Stream 2: Host to Device 2
HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
// Stream 1: Kernel 1
kernelA<<<numOfBlocks, threadsPerBlock, 0, streamA>>>(d_dataA, arraySize);
// Wait for streamA to finish so that kernelB reads the d_dataA values written by kernelA
HIP_CHECK(hipStreamSynchronize(streamA));
// Stream 2: Kernel 2
kernelB<<<numOfBlocks, threadsPerBlock, 0, streamB>>>(d_dataA, d_dataB, arraySize);
// Stream 1: Device to Host 1 (after Kernel 1)
HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
// Stream 2: Device to Host 2 (after Kernel 2)
HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
}
// Wait for all operations in both streams to complete
HIP_CHECK(hipStreamSynchronize(streamA));
HIP_CHECK(hipStreamSynchronize(streamB));
// Verify results
double expectedA = (double)numberOfIterations;
double expectedB = initValueB + (3.0 * numberOfIterations) + (expectedA * (expectedA + 1.0)) / 2.0;
bool passed = true;
for(std::size_t i = 0; i < arraySize; ++i)
{
if(vectorA[i] != expectedA)
{
passed = false;
std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
break;
}
if(vectorB[i] != expectedB)
{
passed = false;
std::cerr << "Validation failed! Expected " << expectedB << " got " << vectorB[i] << " at index: " << i << std::endl;
break;
}
}
if(passed)
{
std::cout << "Asynchronous execution completed successfully." << std::endl;
}
else
{
std::cerr << "Asynchronous execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipStreamDestroy(streamA));
HIP_CHECK(hipStreamDestroy(streamB));
HIP_CHECK(hipFree(d_dataA));
HIP_CHECK(hipFree(d_dataB));
return EXIT_SUCCESS;
}
// [sphinx-end]
+110
@@ -0,0 +1,110 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
__global__ void block_reduction(const float* input, float* output, int num_elements)
{
extern __shared__ float s_data[];
int tid = threadIdx.x;
int global_id = blockDim.x * blockIdx.x + tid;
if (global_id < num_elements)
{
s_data[tid] = input[global_id];
}
else
{
s_data[tid] = 0.0f;
}
__syncthreads();
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
{
if (tid < stride)
{
s_data[tid] += s_data[tid + stride];
}
__syncthreads();
}
if (tid == 0)
{
output[blockIdx.x] = s_data[0];
}
}
int main()
{
int threads = 256;
const int num_elements = 50000;
std::vector<float> h_a(num_elements);
std::vector<float> h_b((num_elements + threads - 1) / threads);
for (int i = 0; i < num_elements; ++i)
{
h_a[i] = rand() / static_cast<float>(RAND_MAX);
}
float *d_a, *d_b;
cudaMalloc(&d_a, h_a.size() * sizeof(float));
cudaMalloc(&d_b, h_b.size() * sizeof(float));
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);
cudaMemcpyAsync(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
cudaEventRecord(start_event, stream);
int blocks = (num_elements + threads - 1) / threads;
block_reduction<<<blocks, threads, threads * sizeof(float), stream>>>(d_a, d_b, num_elements);
cudaMemcpyAsync(h_b.data(), d_b, h_b.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
cudaEventRecord(stop_event, stream);
cudaEventSynchronize(stop_event);
float milliseconds = 0.f;
cudaEventElapsedTime(&milliseconds, start_event, stop_event);
std::cout << "Kernel execution time: " << milliseconds << " ms\n";
cudaFree(d_a);
cudaFree(d_b);
cudaEventDestroy(start_event);
cudaEventDestroy(stop_event);
cudaStreamDestroy(stream);
return 0;
}
// [sphinx-end]
+58
@@ -0,0 +1,58 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
int main()
{
std::size_t stackSize;
HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
std::cout << "Default stack size: " << stackSize << " bytes" << std::endl;
// Set a new stack size
std::size_t newStackSize = 1024 * 8; // 8 KiB
HIP_CHECK(hipDeviceSetLimit(hipLimitStackSize, newStackSize));
HIP_CHECK(hipDeviceGetLimit(&stackSize, hipLimitStackSize));
std::cout << "Updated stack size: " << stackSize << " bytes" << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
+89
@@ -0,0 +1,89 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " << hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Performs a simple initialization of an array with the thread's index variables.
// This function is only available in device code.
__device__ void init_array(float * const a, const unsigned int arraySize)
{
// globalIdx uniquely identifies a thread in a 1D launch configuration.
const int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
// Each thread initializes a single element of the array.
if(globalIdx < arraySize)
{
a[globalIdx] = globalIdx;
}
}
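// Despite its name, this returns the ceiling of number / multiple, that is,
// how many blocks of size multiple are needed to cover number elements.
// This function is available in host and device code.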
__host__ __device__ constexpr int round_up_to_nearest_multiple(int number, int multiple)
{
return (number + multiple - 1)/multiple;
}
__global__ void example_kernel(float * const a, const unsigned int N)
{
// Initialize array.
init_array(a, N);
// Perform additional work:
// - work with the array
// - use the array in a different kernel
// - ...
}
int main()
{
constexpr int N = 100000000; // problem size
constexpr int blockSize = 256; // configurable block size
// number of blocks needed for the given problem size
constexpr int gridSize = round_up_to_nearest_multiple(N, blockSize);
float *a;
// allocate memory on the GPU
HIP_CHECK(hipMalloc(&a, sizeof(*a) * N));
std::cout << "Launching kernel." << std::endl;
example_kernel<<<dim3(gridSize), dim3(blockSize), 0/*example doesn't use shared memory*/, 0/*default stream*/>>>(a, N);
// Synchronize to make sure the kernel has finished. The launch is asynchronous,
// so the CPU could execute other work before this call.
HIP_CHECK(hipDeviceSynchronize());
std::cout << "Kernel execution finished." << std::endl;
HIP_CHECK(hipFree(a));
}
// [sphinx-end]
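The grid-size calculation in the example above is a ceiling division rather than a rounding to a multiple. The sketch below separates the two operations so the difference is visible. Both helpers are assumptions written for this illustration; in particular, the ``ceiling_div`` used by the other example comes from ``example_utils.hpp``, which is not shown, so its actual definition may differ.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks of block_size threads needed to
// cover count elements, i.e. ceil(count / block_size) in integer math.
constexpr int ceiling_div(int count, int block_size)
{
    return (count + block_size - 1) / block_size;
}

// Hypothetical helper: smallest multiple of `multiple` that is >= number.
constexpr int round_up_to_multiple(int number, int multiple)
{
    return ceiling_div(number, multiple) * multiple;
}

// 1000 elements need 4 blocks of 256 threads, which provide 1024 slots.
static_assert(ceiling_div(1000, 256) == 4, "partial last block counts as one block");
static_assert(round_up_to_multiple(1000, 256) == 1024, "4 blocks cover 1024 slots");
// The problem size used in the example divides evenly by the block size.
static_assert(ceiling_div(100000000, 256) == 390625, "exact division adds no block");
```

Because the last block may be only partially used, a kernel launched with such a grid size still needs a bounds check such as ``if(globalIdx < arraySize)``, as ``init_array`` does.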
+165
@@ -0,0 +1,165 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#define CHECK_RET_CODE(call, ret_code) \
{ \
if ((call) != ret_code) \
{ \
std::cout << "Failed in call: " << #call << std::endl; \
std::abort(); \
} \
}
#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
// source code for hiprtc
static constexpr auto kernel_source{
R"(
extern "C"
__global__ void vector_add(float* output, float* input1, float* input2, size_t size)
{
int i = threadIdx.x;
if (i < size)
{
output[i] = input1[i] + input2[i];
}
}
)"};
int main()
{
hiprtcProgram prog;
auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
kernel_source, // kernel source string
"vector_add.cpp", // Name of the file
0, // Number of headers
nullptr, // Header sources
nullptr); // Name of header file
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to create program" << std::endl;
std::abort();
}
hipDeviceProp_t props;
int device = 0;
HIP_CHECK(hipGetDeviceProperties(&props, device));
auto sarg = std::string{"--gpu-architecture="} + props.gcnArchName; // device for which binary is to be generated
const char* options[] = {sarg.c_str()};
rtc_ret_code = hiprtcCompileProgram(prog, // hiprtcProgram
1, // Number of options
options); // Clang Options
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to compile program" << std::endl;
std::abort();
}
std::size_t logSize;
HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
if (logSize)
{
std::string log(logSize, '\0');
HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
std::abort();
}
std::size_t codeSize;
HIPRTC_CHECK(hiprtcGetCodeSize(prog, &codeSize));
std::vector<char> kernel_binary(codeSize);
HIPRTC_CHECK(hiprtcGetCode(prog, kernel_binary.data()));
HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
hipModule_t module;
hipFunction_t kernel;
HIP_CHECK(hipModuleLoadData(&module, kernel_binary.data()));
HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
constexpr std::size_t ele_size = 256; // total number of items to add
std::vector<float> hinput, output;
hinput.reserve(ele_size);
output.reserve(ele_size);
for (std::size_t i = 0; i < ele_size; i++)
{
hinput.push_back(static_cast<float>(i + 1));
output.push_back(0.0f);
}
float *dinput1, *dinput2, *doutput;
HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
struct
{
float* output;
float* input1;
float* input2;
std::size_t size;
} args{doutput, dinput1, dinput2, ele_size};
auto size = sizeof(args);
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
HIP_LAUNCH_PARAM_END};
HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
for (std::size_t i = 0; i < ele_size; i++)
{
if ((hinput[i] + hinput[i]) != output[i])
{
std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
std::abort();
}
}
std::cout << "Passed" << std::endl;
HIP_CHECK(hipFree(dinput1));
HIP_CHECK(hipFree(dinput2));
HIP_CHECK(hipFree(doutput));
return EXIT_SUCCESS;
}
// [sphinx-stop]
+142
@@ -0,0 +1,142 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <hip/hip_complex.h>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) { \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
exit(EXIT_FAILURE); \
} \
}
// Kernel to compute DFT
__global__ void computeDFT(const float* input, hipFloatComplex* output, const int N)
{
int k = blockIdx.x * blockDim.x + threadIdx.x;
if (k >= N) return;
hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
for (int n = 0; n < N; n++)
{
float angle = -2.0f * M_PI * k * n / N;
hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
sum = hipCaddf(sum, hipCmulf(x, w));
}
output[k] = sum;
}
// CPU implementation of DFT for verification
std::vector<hipFloatComplex> cpuDFT(const std::vector<float>& input)
{
const int N = input.size();
std::vector<hipFloatComplex> result(N);
for (int k = 0; k < N; k++)
{
hipFloatComplex sum = make_hipFloatComplex(0.0f, 0.0f);
for (int n = 0; n < N; n++)
{
float angle = -2.0f * M_PI * k * n / N;
hipFloatComplex w = make_hipFloatComplex(cosf(angle), sinf(angle));
hipFloatComplex x = make_hipFloatComplex(input[n], 0.0f);
sum = hipCaddf(sum, hipCmulf(x, w));
}
result[k] = sum;
}
return result;
}
int main()
{
const int N = 256; // Signal length
const int blockSize = 256;
// Generate input signal: sum of two sine waves
std::vector<float> signal(N);
for (int i = 0; i < N; i++)
{
float t = static_cast<float>(i) / N;
signal[i] = sinf(2.0f * M_PI * 10.0f * t) + // 10 Hz component
0.5f * sinf(2.0f * M_PI * 20.0f * t); // 20 Hz component
}
// Compute reference solution on CPU
std::vector<hipFloatComplex> cpu_output = cpuDFT(signal);
// Allocate device memory
float* d_signal;
hipFloatComplex* d_output;
HIP_CHECK(hipMalloc(&d_signal, N * sizeof(float)));
HIP_CHECK(hipMalloc(&d_output, N * sizeof(hipFloatComplex)));
// Copy input to device
HIP_CHECK(hipMemcpy(d_signal, signal.data(), N * sizeof(float), hipMemcpyHostToDevice));
// Launch kernel
dim3 grid((N + blockSize - 1) / blockSize);
dim3 block(blockSize);
computeDFT<<<grid, block>>>(d_signal, d_output, N);
HIP_CHECK(hipGetLastError());
// Get GPU results
std::vector<hipFloatComplex> gpu_output(N);
HIP_CHECK(hipMemcpy(gpu_output.data(), d_output, N * sizeof(hipFloatComplex), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
const float tolerance = 1e-5f; // Adjust based on precision requirements
for (int i = 0; i < N; i++)
{
float diff_real = std::abs(hipCrealf(gpu_output[i]) - hipCrealf(cpu_output[i]));
float diff_imag = std::abs(hipCimagf(gpu_output[i]) - hipCimagf(cpu_output[i]));
if (diff_real > tolerance || diff_imag > tolerance)
{
passed = false;
break;
}
}
std::cout << "DFT Verification: " << (passed ? "PASSED" : "FAILED") << "\n";
// Cleanup
HIP_CHECK(hipFree(d_signal));
HIP_CHECK(hipFree(d_output));
return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
// [sphinx-end]
+75
@@ -0,0 +1,75 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "example_utils.hpp"
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
// [sphinx-start]
constexpr std::size_t const_array_size = 32;
__constant__ double const_array[const_array_size];
void set_constant_memory(double* values)
{
HIP_CHECK(hipMemcpyToSymbol(const_array, values, const_array_size * sizeof(double)));
}
__global__ void kernel_using_const_memory(double* array)
{
int warpIdx = threadIdx.x / warpSize;
// uniform access of warps to const_array for best performance
array[blockIdx.x] *= const_array[warpIdx];
}
// [sphinx-end]
int main()
{
std::size_t elements = 32;
std::size_t size_bytes = elements * sizeof(double);
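// allocate host array and fill it with example values
double *host_array = new double[elements];
for (std::size_t i = 0; i < elements; ++i)
{
host_array[i] = static_cast<double>(i);
}
// allocate device array
double *device_array = nullptr;
HIP_CHECK(hipMalloc((double**) &device_array, size_bytes));
// copy the input values to the device array and to constant memory
HIP_CHECK(hipMemcpy(device_array, host_array, size_bytes, hipMemcpyHostToDevice));
set_constant_memory(host_array);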
kernel_using_const_memory<<<32, 32>>>(device_array);
// copy from device to host, to e.g. get results from the kernel
HIP_CHECK(hipMemcpy(host_array, device_array, size_bytes, hipMemcpyDeviceToHost));
// free memory when not needed any more
HIP_CHECK(hipFree(device_array));
delete[] host_array;
std::cout << "Success!" << std::endl;
return EXIT_SUCCESS;
}
+84
@@ -0,0 +1,84 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
int main()
{
int *a, *b, *c;
int deviceId;
HIP_CHECK(hipGetDevice(&deviceId)); // Get the current device ID
// Allocate memory for a, b and c that is accessible to both device and host code.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Setup input values.
*a = 1;
*b = 2;
// Prefetch the data to the GPU device.
HIP_CHECK(hipMemPrefetchAsync(a, sizeof(*a), deviceId, 0));
HIP_CHECK(hipMemPrefetchAsync(b, sizeof(*b), deviceId, 0));
HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), deviceId, 0));
// Launch add() kernel on GPU.
add<<<1, 1>>>(a, b, c);
// Prefetch the result back to the CPU.
HIP_CHECK(hipMemPrefetchAsync(c, sizeof(*c), hipCpuDeviceId, 0));
// Wait for the prefetch operations to complete.
HIP_CHECK(hipDeviceSynchronize());
// Print the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
// [sphinx-end]
+61
@@ -0,0 +1,61 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cout << "HIP Error: " << hipGetErrorString(err) \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
__global__ void test_kernel()
{
// [sphinx-start]
//#if __CUDA_ARCH__ >= 130 // does not specify which feature is required; not portable
#if __HIP_ARCH_HAS_DOUBLES__ == 1 // explicitly specifies which feature is required; portable between AMD and NVIDIA GPUs
// device code
#endif
// [sphinx-end]
#if __HIP_ARCH_HAS_DOUBLES__ == 1
printf("Device has double-precision support.\n");
#else
printf("Device does not have double-precision support.\n");
#endif
}
int main()
{
test_kernel<<<1, 1, 0, 0>>>();
HIP_CHECK(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
+74
@@ -0,0 +1,74 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
int main()
{
int deviceCount;
HIP_CHECK(hipGetDeviceCount(&deviceCount));
std::cout << "Number of devices: " << deviceCount << std::endl;
for (int deviceId = 0; deviceId < deviceCount; ++deviceId)
{
hipDeviceProp_t deviceProp;
HIP_CHECK(hipGetDeviceProperties(&deviceProp, deviceId));
std::cout << "Device " << deviceId << std::endl << " Properties:" << std::endl;
std::cout << " Name: " << deviceProp.name << std::endl;
std::cout << " Total Global Memory: " << deviceProp.totalGlobalMem / (1024 * 1024) << " MiB" << std::endl;
std::cout << " Shared Memory per Block: " << deviceProp.sharedMemPerBlock / 1024 << " KiB" << std::endl;
std::cout << " Registers per Block: " << deviceProp.regsPerBlock << std::endl;
std::cout << " Warp Size: " << deviceProp.warpSize << std::endl;
std::cout << " Max Threads per Block: " << deviceProp.maxThreadsPerBlock << std::endl;
std::cout << " Max Threads per Multiprocessor: " << deviceProp.maxThreadsPerMultiProcessor << std::endl;
std::cout << " Number of Multiprocessors: " << deviceProp.multiProcessorCount << std::endl;
std::cout << " Max Threads Dimensions: ["
<< deviceProp.maxThreadsDim[0] << ", "
<< deviceProp.maxThreadsDim[1] << ", "
<< deviceProp.maxThreadsDim[2] << "]" << std::endl;
std::cout << " Max Grid Size: ["
<< deviceProp.maxGridSize[0] << ", "
<< deviceProp.maxGridSize[1] << ", "
<< deviceProp.maxGridSize[2] << "]" << std::endl;
std::cout << std::endl;
}
return EXIT_SUCCESS;
}
// [sphinx-end]
+72
@@ -0,0 +1,72 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
__device__ unsigned long long fibonacci(unsigned long long n)
{
if (n == 0 || n == 1)
{
return n;
}
return fibonacci(n - 1) + fibonacci(n - 2);
}
__global__ void kernel(unsigned long long n)
{
unsigned long long result = fibonacci(n);
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if (x == 0)
printf("fibonacci(%llu) = %llu\n", n, result);
}
int main()
{
kernel<<<1, 1>>>(10);
HIP_CHECK(hipDeviceSynchronize());
// With the -O0 optimization option, the following launch hits the stack limit:
// kernel<<<1, 256>>>(2048);
// HIP_CHECK(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
// [sphinx-end]
+100
@@ -0,0 +1,100 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
__global__ void simpleKernel(double *data, std::size_t elems)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < elems)
data[idx] = idx * 2.0;
}
int main()
{
int deviceCount;
HIP_CHECK(hipGetDeviceCount(&deviceCount));
if(deviceCount < 2)
{
std::cout << "This example requires at least two HIP devices." << std::endl;
return EXIT_SUCCESS;
}
double* deviceData0;
double* deviceData1;
constexpr std::size_t elems = 1024;
constexpr std::size_t size = elems * sizeof(double);
int deviceId0 = 0;
int deviceId1 = 1;
// Set device 0 and perform operations
HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
simpleKernel<<<8, 128>>>(deviceData0, elems); // Launch kernel on device 0
HIP_CHECK(hipDeviceSynchronize());
// Set device 1 and perform operations
HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
simpleKernel<<<8, 128>>>(deviceData1, elems); // Launch kernel on device 1
HIP_CHECK(hipDeviceSynchronize());
// Copy result from device 0
double hostData0[elems];
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
// Copy result from device 1
double hostData1[elems];
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
// Display results from both devices
std::cout << "Device 0 data: " << hostData0[0] << std::endl;
std::cout << "Device 1 data: " << hostData1[0] << std::endl;
// Free device memory
HIP_CHECK(hipFree(deviceData0));
HIP_CHECK(hipFree(deviceData1));
return EXIT_SUCCESS;
}
// [sphinx-end]
+64
@@ -0,0 +1,64 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "example_utils.hpp"
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
// [sphinx-start]
extern __shared__ int dynamic_shared[];
__global__ void kernel(int array1SizeX, int array1SizeY, int array2Size)
{
// At least (array1SizeX * array1SizeY + array2Size) * sizeof(int) bytes of
// dynamic shared memory need to be allocated when the kernel is launched.
int* array1 = dynamic_shared;
// array1 is interpreted as 2D of size:
int array1Size = array1SizeX * array1SizeY;
int* array2 = &(array1[array1Size]);
if(threadIdx.x < array1SizeX && threadIdx.y < array1SizeY)
{
// access array1 with threadIdx.x + threadIdx.y * array1SizeX
}
if(threadIdx.x < array2Size)
{
// access array2 with threadIdx.x
}
}
// [sphinx-end]
int main()
{
// array1 (512 * 1 ints) plus array2 (512 ints), matching the kernel arguments below
std::size_t shared_memory_bytes = (512 * 1 + 512) * sizeof(int);
kernel<<<64, 512, shared_memory_bytes>>>(512, 1, 512);
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipDeviceSynchronize());
std::cout << "Success!" << std::endl;
return EXIT_SUCCESS;
}
+74
@@ -0,0 +1,74 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
int main()
{
int *a, *b, *c;
// Allocate memory for a, b and c that is accessible to both device and host code.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Setup input values.
*a = 1;
*b = 2;
// Launch add() kernel on GPU.
add<<<1, 1>>>(a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Print the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
// [sphinx-end]
+97
@@ -0,0 +1,97 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c, std::size_t size)
{
const std::size_t index = threadIdx.x + blockDim.x * blockIdx.x;
if(index < size)
{
// d_c is never initialized, so assign the sum instead of accumulating into it.
c[index] = a[index] + b[index];
}
}
int main()
{
constexpr int numOfBlocks = 256;
constexpr int threadsPerBlock = 256;
constexpr std::size_t arraySize = 1U << 16;
std::vector<int> a(arraySize), b(arraySize), c(arraySize);
int *d_a, *d_b, *d_c;
// Setup input values.
std::fill(a.begin(), a.end(), 1);
std::fill(b.begin(), b.end(), 2);
// Allocate device copies of a, b and c.
HIP_CHECK(hipMalloc(&d_a, arraySize * sizeof(int)));
HIP_CHECK(hipMalloc(&d_b, arraySize * sizeof(int)));
HIP_CHECK(hipMalloc(&d_c, arraySize * sizeof(int)));
// Copy input values to device.
HIP_CHECK(hipMemcpy(d_a, a.data(), arraySize * sizeof(int), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, b.data(), arraySize * sizeof(int), hipMemcpyHostToDevice));
// Launch add() kernel on GPU.
add<<<numOfBlocks, threadsPerBlock>>>(d_a, d_b, d_c, arraySize);
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Copy the result back to the host.
HIP_CHECK(hipMemcpy(c.data(), d_c, arraySize * sizeof(int), hipMemcpyDeviceToHost));
// Cleanup allocated memory.
HIP_CHECK(hipFree(d_a));
HIP_CHECK(hipFree(d_b));
HIP_CHECK(hipFree(d_c));
// Print the result.
std::cout << a[0] << " + " << b[0] << " = " << c[0] << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
+153
@@ -0,0 +1,153 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// GPU Kernels
__global__ void kernelA(double* arrayA, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] += 1.0;
}
}
__global__ void kernelB(double* arrayA, double* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayB[x] += arrayA[x] + 3.0;
}
}
int main()
{
constexpr int numOfBlocks = 1 << 20;
constexpr int threadsPerBlock = 1024;
constexpr int numberOfIterations = 50;
// The array size is kept small so that the kernel run time is not too short relative to the memory copies
constexpr std::size_t arraySize = 1U << 25;
double *d_dataA;
double *d_dataB;
double initValueA = 0.0;
double initValueB = 2.0;
std::vector<double> vectorA(arraySize, initValueA);
std::vector<double> vectorB(arraySize, initValueB);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
// Create streams
hipStream_t streamA, streamB;
HIP_CHECK(hipStreamCreate(&streamA));
HIP_CHECK(hipStreamCreate(&streamB));
// Create events
hipEvent_t event, eventA, eventB;
HIP_CHECK(hipEventCreate(&event));
HIP_CHECK(hipEventCreate(&eventA));
HIP_CHECK(hipEventCreate(&eventB));
for(unsigned int iteration = 0; iteration < numberOfIterations; iteration++)
{
// Stream 1: Host to Device 1
HIP_CHECK(hipMemcpyAsync(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice, streamA));
// Stream 2: Host to Device 2
HIP_CHECK(hipMemcpyAsync(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice, streamB));
// Stream 1: Kernel 1
kernelA<<<numOfBlocks, threadsPerBlock, 0, streamA>>>(d_dataA, arraySize);
// Record event after the GPU kernel in Stream 1
HIP_CHECK(hipEventRecord(event, streamA));
// Stream 2: Wait for event before starting Kernel 2
HIP_CHECK(hipStreamWaitEvent(streamB, event, 0));
// Stream 2: Kernel 2
kernelB<<<numOfBlocks, threadsPerBlock, 0, streamB>>>(d_dataA, d_dataB, arraySize);
// Stream 1: Device to Host 1 (after Kernel 1)
HIP_CHECK(hipMemcpyAsync(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost, streamA));
// Stream 2: Device to Host 2 (after Kernel 2)
HIP_CHECK(hipMemcpyAsync(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost, streamB));
// Wait for all operations in both streams to complete
HIP_CHECK(hipEventRecord(eventA, streamA));
HIP_CHECK(hipEventRecord(eventB, streamB));
// hipStreamWaitEvent only orders work on the device; block the host until both events have completed
HIP_CHECK(hipEventSynchronize(eventA));
HIP_CHECK(hipEventSynchronize(eventB));
}
// Verify results
double expectedA = (double)numberOfIterations;
double expectedB = initValueB + (3.0 * numberOfIterations) + (expectedA * (expectedA + 1.0)) / 2.0;
bool passed = true;
for(std::size_t i = 0; i < arraySize; ++i)
{
if(vectorA[i] != expectedA)
{
passed = false;
std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << std::endl;
break;
}
if(vectorB[i] != expectedB)
{
passed = false;
std::cerr << "Validation failed! Expected " << expectedB << " got " << vectorB[i] << std::endl;
break;
}
}
if(passed)
{
std::cout << "Asynchronous execution with events completed successfully." << std::endl;
}
else
{
std::cerr << "Asynchronous execution with events failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipEventDestroy(event));
HIP_CHECK(hipEventDestroy(eventA));
HIP_CHECK(hipEventDestroy(eventB));
HIP_CHECK(hipStreamDestroy(streamA));
HIP_CHECK(hipStreamDestroy(streamB));
HIP_CHECK(hipFree(d_dataA));
HIP_CHECK(hipFree(d_dataB));
return 0;
}
// [sphinx-end]
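The verification step above compares every element against a closed-form value. As a host-only sketch (no GPU required), the same arithmetic can be replayed for a single element to see where `expectedA` and `expectedB` come from. The helpers `expected_b` and `simulate_element` are hypothetical names introduced here for illustration; they are not part of the sample, and the closed form assumes that A starts at 0, as `initValueA` does in the sample.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: closed-form value of B after `iterations` rounds,
// assuming A starts at 0 (as initValueA does in the sample above).
double expected_b(double initB, int iterations)
{
    const double n = static_cast<double>(iterations);
    return initB + 3.0 * n + n * (n + 1.0) / 2.0;
}

// Hypothetical helper: replays on the host what kernelA and kernelB do to
// one array element per iteration. Returns the final {A, B} pair.
std::pair<double, double> simulate_element(double initA, double initB, int iterations)
{
    assert(iterations >= 0);
    double a = initA;
    double b = initB;
    for(int i = 0; i < iterations; ++i)
    {
        a += 1.0;     // kernelA: arrayA[x] += 1.0
        b += a + 3.0; // kernelB: arrayB[x] += arrayA[x] + 3.0
    }
    return {a, b};
}
```

With the sample's values (`initValueB = 2.0`, 50 iterations), the replay ends with A equal to 50 and B equal to 1427, which matches `expected_b(2.0, 50)`.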
@@ -0,0 +1,58 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "example_utils.hpp"
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
int main()
{
// [sphinx-start]
std::size_t elements = 1 << 20;
std::size_t size_bytes = elements * sizeof(int);
// allocate host and device memory
int *host_pointer = new int[elements];
int *device_input, *device_result;
HIP_CHECK(hipMalloc(&device_input, size_bytes));
HIP_CHECK(hipMalloc(&device_result, size_bytes));
// copy from host to the device
HIP_CHECK(hipMemcpy(device_input, host_pointer, size_bytes, hipMemcpyHostToDevice));
// use the memory on the device, e.g. execute kernels
// copy from device to host, e.g. to get results from the kernel
HIP_CHECK(hipMemcpy(host_pointer, device_result, size_bytes, hipMemcpyDeviceToHost));
// free memory when not needed any more
HIP_CHECK(hipFree(device_result));
HIP_CHECK(hipFree(device_input));
delete[] host_pointer;
// [sphinx-end]
return EXIT_SUCCESS;
}
@@ -0,0 +1,79 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
int main()
{
int a, b, c;
int *d_a, *d_b, *d_c;
// Setup input values.
a = 1;
b = 2;
// Allocate device copies of a, b and c.
HIP_CHECK(hipMalloc(&d_a, sizeof(*d_a)));
HIP_CHECK(hipMalloc(&d_b, sizeof(*d_b)));
HIP_CHECK(hipMalloc(&d_c, sizeof(*d_c)));
// Copy input values to device.
HIP_CHECK(hipMemcpy(d_a, &a, sizeof(*d_a), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_b, &b, sizeof(*d_b), hipMemcpyHostToDevice));
// Launch add() kernel on GPU.
add<<<1, 1>>>(d_a, d_b, d_c);
// Copy the result back to the host.
HIP_CHECK(hipMemcpy(&c, d_c, sizeof(*d_c), hipMemcpyDeviceToHost));
// Cleanup allocated memory.
HIP_CHECK(hipFree(d_a));
HIP_CHECK(hipFree(d_b));
HIP_CHECK(hipFree(d_c));
// Prints the result.
std::cout << a << " + " << b << " = " << c << std::endl;
return 0;
}
// [sphinx-end]
@@ -0,0 +1,53 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
extern __shared__ int shared_array[];
__global__ void kernel()
{
// initialize shared memory
shared_array[threadIdx.x] = threadIdx.x;
// use shared memory - synchronize to make sure that all threads of the
// block see all changes to shared memory
__syncthreads();
}
int main()
{
// the shared memory size in this case depends on the configurable block size
constexpr int blockSize = 256;
constexpr int sharedMemSize = blockSize * sizeof(int);
constexpr int gridSize = 2;
kernel<<<dim3(gridSize), dim3(blockSize), sharedMemSize, 0>>>();
if(auto err = hipDeviceSynchronize(); err != hipSuccess)
std::cerr << "HIP error " << err << ": " << hipGetErrorString(err) << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
@@ -0,0 +1,606 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "backprojection.hpp"
#include "filtering.hpp"
#include "log_transform.hpp"
#include "normalization.hpp"
#include "phantom.hpp"
#include "projection.hpp"
#include "utility.hpp"
#include "weighting.hpp"
#include "volume.hpp"
#include <hip/hip_runtime.h>
#include <hipfft/hipfft.h>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <numbers>
#include <ostream>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>
auto main() -> int
{
try
{
auto hasTextures = int{0};
hip_check(hipDeviceGetAttribute(&hasTextures, hipDeviceAttributeImageSupport, 0));
// Fetch device properties
auto devProps = hipDeviceProp_t{};
hip_check(hipGetDeviceProperties(&devProps, 0));
auto const numStreams = devProps.asyncEngineCount;
std::cout << "Device has " << numStreams << " asynchronous engines; preprocessing will use "
<< numStreams << " parallel streams." << std::endl;
auto streams = std::vector<hipStream_t>{};
streams.resize(numStreams);
for(auto&& stream : streams)
hip_check(hipStreamCreate(&stream));
auto r = static_cast<float*>(nullptr);
auto R = static_cast<hipfftComplex*>(nullptr);
auto forwardPlans = std::vector<hipfftHandle>{};
auto forwardSizes = std::vector<std::size_t>{};
auto backwardPlans = std::vector<hipfftHandle>{};
auto backwardSizes = std::vector<std::size_t>{};
forwardPlans.resize(numStreams);
forwardSizes.resize(numStreams);
backwardPlans.resize(numStreams);
backwardSizes.resize(numStreams);
auto projections = std::vector<float*>{};
auto projectionPitches = std::vector<std::size_t>{};
auto expandedProjections = std::vector<float*>{};
auto expandedPitches = std::vector<std::size_t>{};
auto transformedProjections = std::vector<hipfftComplex*>{};
auto transformedPitches = std::vector<std::size_t>{};
auto textureProjections = std::vector<hipTextureObject_t>{};
auto projGeom = phantom::make_projectionGeometry();
auto volGeom = phantom::make_volumeGeometry();
auto phantomProjections = phantom::make_projections(projGeom, volGeom, streams);
std::cout << "Initializing... " << std::flush;
auto stream = streams.at(0);
// Create filter kernel
hip_check(hipMalloc(reinterpret_cast<void**>(&r), projGeom.dimFFT.x * sizeof(float)));
auto const creationBlocks = std::max((projGeom.dimFFT.x / 1024u), 1u);
filter_creation_kernel<<<creationBlocks, 1024, 0, stream>>>(r, projGeom.s_dimFFT.x, projGeom.pixelDim.x);
hip_check(hipMalloc(reinterpret_cast<void**>(&R), projGeom.dimTrans.x * sizeof(hipfftComplex)));
auto filterPlan = hipfftHandle{};
hipfft_check(hipfftPlan1d(&filterPlan, projGeom.dimFFT.x, HIPFFT_R2C, 1));
hipfft_check(hipfftSetStream(filterPlan, stream));
hipfft_check(hipfftExecR2C(filterPlan, r, R));
auto absoluteBlocks = (projGeom.dimTrans.x / 1024u) + 1u;
filter_absolute_kernel<<<absoluteBlocks, 1024, 0, stream>>>(R, projGeom.dimTrans.x, projGeom.pixelDim.x);
hip_check(hipStreamSynchronize(stream));
hipfft_check(hipfftDestroy(filterPlan));
hip_check(hipFree(r));
auto const inputProjSingle = projGeom.dim.x * projGeom.dim.y * sizeof(std::uint16_t);
auto const inputProjTotal = inputProjSingle * numStreams;
auto const projSingle = projGeom.dim.x * projGeom.dim.y * sizeof(float);
auto const projTotal = projSingle * numStreams;
auto const expandedSingle = projGeom.dimFFT.x * projGeom.dimFFT.y * sizeof(float);
auto const expandedTotal = expandedSingle * numStreams;
auto const transformedSingle = projGeom.dimTrans.x * projGeom.dimTrans.y * sizeof(hipfftComplex);
auto const transformedTotal = transformedSingle * numStreams;
auto const volumeTotal = volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float);
auto const memTotal = inputProjTotal + projTotal + expandedTotal + transformedTotal + volumeTotal;
auto devMemFree = std::size_t{};
auto devMemTotal = std::size_t{};
hip_check(hipMemGetInfo(&devMemFree, &devMemTotal));
auto memRequired = static_cast<std::size_t>(memTotal);
if(memRequired > devMemFree)
{
std::cerr << "Not enough device memory. Required: " << memRequired
<< ", available: " << devMemFree << std::endl;
return EXIT_FAILURE;
}
std::cout << "Done!" << std::endl;
std::cout << "Volume dimensions: " << volGeom.dim.x << " x "
<< volGeom.dim.y << " x "
<< volGeom.dim.z << std::endl;
// Initialize per-stream data
for(auto streamIdx = 0u; streamIdx < streams.size(); ++streamIdx)
{
std::cout << "Initializing stream " << streamIdx << "... " << std::flush;
auto stream = streams.at(streamIdx);
auto proj = static_cast<float*>(nullptr);
auto projPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&proj), &projPitch, projGeom.dim.x * sizeof(float), projGeom.dim.y
));
projections.push_back(proj);
projectionPitches.push_back(projPitch);
auto expanded = static_cast<float*>(nullptr);
auto expandedPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&expanded),
&expandedPitch,
projGeom.dimFFT.x * sizeof(float),
projGeom.dimFFT.y
));
expandedProjections.push_back(expanded);
expandedPitches.push_back(expandedPitch);
auto transformed = static_cast<hipfftComplex*>(nullptr);
auto transformedPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&transformed),
&transformedPitch,
projGeom.dimTrans.x * sizeof(hipfftComplex),
projGeom.dimTrans.y
));
transformedProjections.push_back(transformed);
transformedPitches.push_back(transformedPitch);
auto& forward = forwardPlans.at(streamIdx);
auto& forwardSize = forwardSizes.at(streamIdx);
auto fw_inembed = static_cast<int>(expandedPitch / sizeof(float));
auto fw_istride = 1;
auto fw_idist = fw_inembed;
auto fw_onembed = static_cast<int>(transformedPitch / sizeof(hipfftComplex));
auto fw_ostride = 1;
auto fw_odist = fw_onembed;
hipfft_check(hipfftCreate(&forward));
hipfft_check(hipfftMakePlanMany(forward, 1, &projGeom.s_dimFFT.x,
&fw_inembed, fw_istride, fw_idist,
&fw_onembed, fw_ostride, fw_odist,
HIPFFT_R2C, projGeom.s_dimFFT.y, &forwardSize));
hipfft_check(hipfftSetStream(forward, stream));
auto& backward = backwardPlans.at(streamIdx);
auto& backwardSize = backwardSizes.at(streamIdx);
auto bw_inembed = fw_onembed;
auto bw_istride = fw_ostride;
auto bw_idist = fw_odist;
auto bw_onembed = fw_inembed;
auto bw_ostride = fw_istride;
auto bw_odist = fw_idist;
hipfft_check(hipfftCreate(&backward));
hipfft_check(hipfftMakePlanMany(backward, 1, &projGeom.s_dimFFT.x,
&bw_inembed, bw_istride, bw_idist,
&bw_onembed, bw_ostride, bw_odist,
HIPFFT_C2R, projGeom.s_dimFFT.y, &backwardSize));
hipfft_check(hipfftSetStream(backward, stream));
if(hasTextures)
{
// create a HIP texture from the projection
auto resDesc = hipResourceDesc{};
resDesc.resType = hipResourceTypePitch2D;
resDesc.res.pitch2D.desc = hipCreateChannelDesc<float>();
resDesc.res.pitch2D.devPtr = static_cast<void*>(proj);
resDesc.res.pitch2D.width = projGeom.dim.x;
resDesc.res.pitch2D.height = projGeom.dim.y;
resDesc.res.pitch2D.pitchInBytes = projPitch;
auto texDesc = hipTextureDesc{};
texDesc.addressMode[0] = hipAddressModeBorder;
texDesc.addressMode[1] = hipAddressModeBorder;
texDesc.readMode = hipReadModeElementType;
texDesc.borderColor[0] = 0.f;
texDesc.borderColor[1] = 0.f;
texDesc.filterMode = hipFilterModeLinear;
texDesc.normalizedCoords = 0;
auto& projTex = textureProjections.emplace_back();
hip_check(hipCreateTextureObject(&projTex, &resDesc, &texDesc, nullptr));
}
std::cout << "Done!" << std::endl;
}
create_volume("volume.tif");
auto hostVolPtr = static_cast<float*>(nullptr);
hip_check(hipHostMalloc(
reinterpret_cast<void**>(&hostVolPtr),
volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float),
hipHostMallocDefault
));
auto hostVol = make_hipPitchedPtr(
hostVolPtr, volGeom.dim.x * sizeof(float), volGeom.dim.x, volGeom.dim.y
);
auto vol = hipPitchedPtr{};
auto volExt = make_hipExtent(volGeom.dim.x * sizeof(float), volGeom.dim.y, volGeom.dim.z);
hip_check(hipMalloc3D(&vol, volExt));
hip_check(hipMemset3D(vol, 0, volExt));
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// MAIN LOOP
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// [sphinx-graph-vars-start]
auto graphCreated = false;
auto graphExec = hipGraphExec_t{};
auto graphFinalCreated = false;
auto graphExecFinal = hipGraphExec_t{};
auto graphStream = hipStream_t{};
hip_check(hipStreamCreate(&graphStream));
// [sphinx-graph-vars-end]
auto start = std::chrono::steady_clock::now();
auto projIdx = 0u;
while(projIdx < projGeom.numProj)
{
// [sphinx-begin-capture-start]
// Capture the current batch into a graph template
auto graph = hipGraph_t{};
hip_check(hipStreamBeginCapture(streams.at(0), hipStreamCaptureModeGlobal));
// [sphinx-begin-capture-end]
auto batchSize = std::min(numStreams, static_cast<int>(projGeom.numProj - projIdx));
// [sphinx-fork-start]
// Fork: Record events on stream 0, then have other streams wait
for(auto streamIdx = 1; streamIdx < batchSize; ++streamIdx)
{
auto forkEvent = hipEvent_t{};
hip_check(hipEventCreate(&forkEvent));
hip_check(hipEventRecord(forkEvent, streams.at(0)));
hip_check(hipStreamWaitEvent(streams.at(streamIdx), forkEvent, 0));
hip_check(hipEventDestroy(forkEvent)); // Can destroy after wait is recorded
}
// [sphinx-fork-end]
// Launch batch in parallel streams
for(auto streamIdx = 0; streamIdx < batchSize; ++streamIdx, ++projIdx)
{
auto stream = streams.at(streamIdx);
auto threadsPerBlock = dim3{32, 32, 1};
auto blocksPerGrid = dim3{
(projGeom.dim.x / threadsPerBlock.x) + 1, (projGeom.dim.y / threadsPerBlock.y) + 1, 1
};
auto inputPitchedPtr = phantomProjections.at(projIdx);
auto input = static_cast<std::uint16_t*>(inputPitchedPtr.ptr);
auto inputPitch = inputPitchedPtr.pitch;
auto proj = projections.at(streamIdx);
auto projPitch = projectionPitches.at(streamIdx);
normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
input, inputPitch, proj, projPitch, projGeom.dim, projGeom.bps
);
log_transformation_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(proj, projPitch, projGeom.dim);
weighting_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
proj,
projPitch,
projGeom.dim,
projGeom.d_sd,
projGeom.d_so,
projGeom.minCoord,
projGeom.pixelDim
);
// Expand projection to filter length
auto expanded = expandedProjections.at(streamIdx);
auto expandedPitch = expandedPitches.at(streamIdx);
hip_check(hipMemset2DAsync(
expanded, expandedPitch, 0, projGeom.dimFFT.x * sizeof(float), projGeom.dimFFT.y, stream
));
hip_check(hipMemcpy2DAsync(
expanded,
expandedPitch,
proj,
projPitch,
projGeom.dim.x * sizeof(float),
projGeom.dim.y,
hipMemcpyDeviceToDevice,
stream
));
// R2C Fourier-transform projection
auto transformed = transformedProjections.at(streamIdx);
auto transformedPitch = transformedPitches.at(streamIdx);
hip_check(hipMemset2DAsync(
transformed,
transformedPitch,
0,
projGeom.dimTrans.x * sizeof(hipfftComplex),
projGeom.dimTrans.y,
stream
));
auto& forward = forwardPlans.at(streamIdx);
hipfft_check(hipfftExecR2C(forward, expanded, transformed));
// Apply filter
auto filterBlocksPerGrid = dim3{
(projGeom.dimTrans.x / threadsPerBlock.x) + 1,
(projGeom.dimTrans.y / threadsPerBlock.y) + 1,
1
};
filter_application_kernel<<<filterBlocksPerGrid, threadsPerBlock, 0, stream>>>(
transformed, transformedPitch, R, projGeom.dimTrans
);
auto& backward = backwardPlans.at(streamIdx);
hipfft_check(hipfftExecC2R(backward, transformed, expanded));
// Shrink projection to original size and normalize
hip_check(hipMemcpy2DAsync(
proj,
projPitch,
expanded,
expandedPitch,
projGeom.dim.x * sizeof(float),
projGeom.dim.y,
hipMemcpyDeviceToDevice,
stream
));
filter_normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
proj, projPitch, projGeom.dimFFT.x, projGeom.dim
);
// Backprojection
auto thetaDeg = projGeom.thetaSign * projGeom.thetaStep * projIdx; // Current angle
auto thetaRad = thetaDeg * std::numbers::pi_v<float> / 180.f; // Convert to radians
auto sinTheta = std::sin(thetaRad);
auto cosTheta = std::cos(thetaRad);
auto bpBlockSize = dim3{32u, 8u, 4u};
auto bpBlocks = dim3{
static_cast<std::uint32_t>(volGeom.dim.x / bpBlockSize.x + 1),
static_cast<std::uint32_t>(volGeom.dim.y / bpBlockSize.y + 1),
static_cast<std::uint32_t>(volGeom.dim.z / bpBlockSize.z + 1)
};
if(hasTextures)
{
auto& projTex = textureProjections.at(streamIdx);
backprojection_kernel<<<bpBlocks, bpBlockSize, 0, stream>>>(
static_cast<float*>(vol.ptr),
vol.pitch,
volGeom.dim,
volGeom.voxelDim,
projTex,
projGeom.minCoord,
sinTheta,
cosTheta,
projGeom.pixelDim,
projGeom.d_sd,
projGeom.d_so
);
}
else
{
// Fallback for devices without support for texture instructions
backprojection_kernel_no_tex<<<bpBlocks, bpBlockSize, 0, stream>>>(
static_cast<float*>(vol.ptr),
vol.pitch,
volGeom.dim,
volGeom.voxelDim,
proj,
projPitch,
projGeom.dim,
projGeom.minCoord,
sinTheta,
cosTheta,
projGeom.pixelDim,
projGeom.d_sd,
projGeom.d_so
);
}
}
// [sphinx-join-start]
// Join: Record events on all streams except stream 0, then have stream 0 wait
for(auto streamIdx = 1; streamIdx < batchSize; ++streamIdx)
{
auto joinEvent = hipEvent_t{};
hip_check(hipEventCreate(&joinEvent));
hip_check(hipEventRecord(joinEvent, streams.at(streamIdx)));
hip_check(hipStreamWaitEvent(streams.at(0), joinEvent, 0));
hip_check(hipEventDestroy(joinEvent)); // Can destroy after wait is recorded
}
// [sphinx-join-end]
// [sphinx-stop-capture-start]
// Stop capturing -- this will stop capturing on all streams
hip_check(hipStreamEndCapture(streams.at(0), &graph));
// [sphinx-stop-capture-end]
// Instantiate and launch the graph
if(batchSize == numStreams)
{
// [sphinx-graph-instantiate-start]
if(!graphCreated)
{
hip_check(hipGraphDebugDotPrint(graph, "graph_capture.dot", hipGraphDebugDotFlagsVerbose));
hip_check(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
hip_check(hipGraphDestroy(graph));
hip_check(hipGraphLaunch(graphExec, graphStream));
graphCreated = true;
}
// [sphinx-graph-instantiate-end]
// [sphinx-graph-update-start]
else
{
// Update existing executable graph after each iteration with new input data
auto result = hipGraphExecUpdateResult{};
auto errorNode = hipGraphNode_t{};
hip_check(hipGraphExecUpdate(graphExec, graph, &errorNode, &result));
if(result != hipGraphExecUpdateSuccess)
{
auto msg = std::string{"Failed to update graph: "};
switch(result)
{
case hipGraphExecUpdateError:
msg += "Invalid value.";
break;
case hipGraphExecUpdateErrorFunctionChanged:
msg += "Function of kernel node changed.";
break;
case hipGraphExecUpdateErrorNodeTypeChanged:
msg += "Type of node changed.";
break;
case hipGraphExecUpdateErrorNotSupported:
msg += "Something about the node is not supported.";
break;
case hipGraphExecUpdateErrorParametersChanged:
msg += "Unsupported parameter change.";
break;
case hipGraphExecUpdateErrorTopologyChanged:
msg += "Graph topology changed.";
break;
case hipGraphExecUpdateErrorUnsupportedFunctionChange:
msg += "Unsupported change of kernel node function.";
break;
default:
msg += "Unknown error.";
break;
}
throw std::runtime_error{msg};
}
hip_check(hipGraphDestroy(graph));
hip_check(hipGraphLaunch(graphExec, graphStream));
}
// [sphinx-graph-update-end]
}
else
{
// [sphinx-graph-final-start]
hip_check(hipGraphDebugDotPrint(graph, "graph_capture_final.dot", hipGraphDebugDotFlagsVerbose));
// Incomplete batch: topology changed, must instantiate new executable graph
hip_check(hipGraphInstantiate(&graphExecFinal, graph, nullptr, nullptr, 0));
hip_check(hipGraphDestroy(graph));
hip_check(hipGraphLaunch(graphExecFinal, graphStream));
// [sphinx-graph-final-end]
graphFinalCreated = true;
}
}
// Obtain reconstruction time before copying back the result
auto stop = std::chrono::steady_clock::time_point{};
hip_check(hipLaunchHostFunc(graphStream, [](void* data)
{
auto& stop = *(static_cast<std::chrono::steady_clock::time_point*>(data));
stop = std::chrono::steady_clock::now();
}, static_cast<void*>(&stop)));
// Copy volume back to host and save
auto memcpyParams = hipMemcpy3DParms{};
std::memset(&memcpyParams, 0, sizeof(hipMemcpy3DParms));
memcpyParams.dstPos = make_hipPos(0, 0, 0);
memcpyParams.dstPtr = hostVol;
memcpyParams.srcPos = make_hipPos(0, 0, 0);
memcpyParams.srcPtr = vol;
memcpyParams.extent = volExt;
memcpyParams.kind = hipMemcpyDeviceToHost;
hip_check(hipMemcpy3DAsync(&memcpyParams, graphStream));
auto saveVolArgs = new save_volume_args
{
"volume.tif",
hostVolPtr,
volGeom.dim.x, volGeom.dim.y, volGeom.dim.z,
volGeom.voxelDim.x, volGeom.voxelDim.y
};
hip_check(hipLaunchHostFunc(graphStream, save_volume, saveVolArgs));
std::cout << "All work items enqueued, waiting for completion... " << std::flush;
hip_check(hipStreamSynchronize(graphStream));
std::cout << "Done!" << std::endl;
auto const elapsed = std::chrono::duration<double>{stop - start};
std::cout << "Reconstruction time: " << elapsed.count() << 's' << std::endl;
// Cleanup
if(graphFinalCreated)
hip_check(hipGraphExecDestroy(graphExecFinal));
// graphExec is only instantiated once a full batch has been captured
if(graphCreated)
hip_check(hipGraphExecDestroy(graphExec));
hip_check(hipStreamDestroy(graphStream));
hip_check(hipFree(vol.ptr));
hip_check(hipFreeHost(hostVolPtr));
if(hasTextures)
{
for(auto&& tex : textureProjections)
hip_check(hipDestroyTextureObject(tex));
}
for(auto&& plan : backwardPlans)
hipfft_check(hipfftDestroy(plan));
for(auto&& plan : forwardPlans)
hipfft_check(hipfftDestroy(plan));
for(auto&& p : transformedProjections)
hip_check(hipFree(p));
for(auto&& p : expandedProjections)
hip_check(hipFree(p));
for(auto&& p : projections)
hip_check(hipFree(p));
for(auto&& p : phantomProjections)
hip_check(hipFree(p.ptr));
hip_check(hipFree(R));
for(auto&& stream : streams)
hip_check(hipStreamDestroy(stream));
hip_check(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
catch(std::runtime_error const& e)
{
std::cerr << "Caught runtime error: " << e.what() << std::endl;
return EXIT_FAILURE;
}
}
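The launch configurations in this file size each grid as `(extent / blockSize) + 1`. That always covers the extent, but it launches one extra block per dimension whenever the extent is an exact multiple of the block size. This is harmless only if the kernels bounds-check their indices, which is an assumption here because the kernel sources are not shown. A host-only sketch comparing the two rounding strategies follows; the helper names `blocks_plus_one` and `blocks_ceil` are hypothetical and do not appear in the sample.

```cpp
#include <cassert>
#include <cstdint>

// Strategy used in the sample: always one more block than the integer quotient.
constexpr std::uint32_t blocks_plus_one(std::uint32_t extent, std::uint32_t blockSize)
{
    return extent / blockSize + 1u;
}

// Ceiling division: the smallest block count that still covers `extent`.
// Note: extent + blockSize - 1 can wrap around for extents close to 2^32.
constexpr std::uint32_t blocks_ceil(std::uint32_t extent, std::uint32_t blockSize)
{
    return (extent + blockSize - 1u) / blockSize;
}

// Extent that is not a multiple of the block size: both strategies agree.
static_assert(blocks_plus_one(1000u, 32u) == 32u, "31 full blocks plus one partial block");
static_assert(blocks_ceil(1000u, 32u) == 32u, "31 full blocks plus one partial block");

// Extent that is an exact multiple: the sample's strategy launches one idle block.
static_assert(blocks_plus_one(1024u, 32u) == 33u, "one block more than needed");
static_assert(blocks_ceil(1024u, 32u) == 32u, "exactly enough blocks");
```

Either strategy is valid as long as every kernel guards its accesses against the real extent; ceiling division just avoids scheduling blocks that have no work.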
@@ -0,0 +1,804 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "backprojection.hpp"
#include "filtering.hpp"
#include "log_transform.hpp"
#include "normalization.hpp"
#include "phantom.hpp"
#include "projection.hpp"
#include "utility.hpp"
#include "weighting.hpp"
#include "volume.hpp"
#include <hip/hip_runtime.h>
#include <hipfft/hipfft.h>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <iterator>
#include <numbers>
#include <ostream>
#include <set>
#include <stdexcept>
#include <vector>
auto main() -> int
{
try
{
auto hasTextures = int{0};
hip_check(hipDeviceGetAttribute(&hasTextures, hipDeviceAttributeImageSupport, 0));
// Fetch device properties
auto devProps = hipDeviceProp_t{};
hip_check(hipGetDeviceProperties(&devProps, 0));
auto const numBranches = devProps.asyncEngineCount;
std::cout << "Device supports " << numBranches << " asynchronous engines; preprocessing will use "
<< numBranches << " parallel branches." << std::endl;
// For interoperability with hipFFT we require streams and events
auto streams = std::vector<hipStream_t>{};
streams.resize(numBranches);
for(auto&& stream : streams)
hip_check(hipStreamCreate(&stream));
auto r = static_cast<float*>(nullptr);
auto R = static_cast<hipfftComplex*>(nullptr);
auto forwardPlans = std::vector<hipfftHandle>{};
auto forwardSizes = std::vector<std::size_t>{};
auto backwardPlans = std::vector<hipfftHandle>{};
auto backwardSizes = std::vector<std::size_t>{};
forwardPlans.resize(numBranches);
forwardSizes.resize(numBranches);
backwardPlans.resize(numBranches);
backwardSizes.resize(numBranches);
auto projections = std::vector<float*>{};
auto projectionPitches = std::vector<std::size_t>{};
auto expandedProjections = std::vector<float*>{};
auto expandedPitches = std::vector<std::size_t>{};
auto transformedProjections = std::vector<hipfftComplex*>{};
auto transformedPitches = std::vector<std::size_t>{};
auto textureProjections = std::vector<hipTextureObject_t>{};
auto projGeom = phantom::make_projectionGeometry();
auto volGeom = phantom::make_volumeGeometry();
auto phantomProjections = phantom::make_projections(projGeom, volGeom, streams);
std::cout << "Initializing... " << std::flush;
auto stream = streams.at(0);
// Create filter kernel
hip_check(hipMalloc(reinterpret_cast<void**>(&r), projGeom.dimFFT.x * sizeof(float)));
auto const creationBlocks = std::max((projGeom.dimFFT.x / 1024u), 1u);
filter_creation_kernel<<<creationBlocks, 1024, 0, stream>>>(r, projGeom.s_dimFFT.x, projGeom.pixelDim.x);
hip_check(hipMalloc(reinterpret_cast<void**>(&R), projGeom.dimTrans.x * sizeof(hipfftComplex)));
auto filterPlan = hipfftHandle{};
hipfft_check(hipfftPlan1d(&filterPlan, projGeom.dimFFT.x, HIPFFT_R2C, 1));
hipfft_check(hipfftSetStream(filterPlan, stream));
hipfft_check(hipfftExecR2C(filterPlan, r, R));
auto absoluteBlocks = (projGeom.dimTrans.x / 1024u) + 1u;
filter_absolute_kernel<<<absoluteBlocks, 1024, 0, stream>>>(R, projGeom.dimTrans.x, projGeom.pixelDim.x);
hip_check(hipStreamSynchronize(stream));
hipfft_check(hipfftDestroy(filterPlan));
hip_check(hipFree(r));
auto const inputProjSingle = projGeom.dim.x * projGeom.dim.y * sizeof(std::uint16_t);
auto const inputProjTotal = inputProjSingle * numBranches;
auto const projSingle = projGeom.dim.x * projGeom.dim.y * sizeof(float);
auto const projTotal = projSingle * numBranches;
auto const expandedSingle = projGeom.dimFFT.x * projGeom.dimFFT.y * sizeof(float);
auto const expandedTotal = expandedSingle * numBranches;
auto const transformedSingle = projGeom.dimTrans.x * projGeom.dimTrans.y * sizeof(hipfftComplex);
auto const transformedTotal = transformedSingle * numBranches;
auto const volumeTotal = volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float);
auto const memTotal = inputProjTotal + projTotal + expandedTotal + transformedTotal + volumeTotal;
auto devMemFree = std::size_t{};
auto devMemTotal = std::size_t{};
hip_check(hipMemGetInfo(&devMemFree, &devMemTotal));
auto memRequired = static_cast<std::size_t>(memTotal);
if(memRequired > devMemFree)
{
std::cerr << "Not enough device memory. Required: " << memRequired
<< ", available: " << devMemFree << std::endl;
return EXIT_FAILURE;
}
std::cout << "Done!" << std::endl;
std::cout << "Volume dimensions: " << volGeom.dim.x << " x "
<< volGeom.dim.y << " x "
<< volGeom.dim.z << std::endl;
// Initialize per-branch data
for(auto branchIdx = 0; branchIdx < numBranches; ++branchIdx)
{
std::cout << "Initializing branch #" << branchIdx << "... " << std::flush;
auto stream = streams.at(branchIdx);
auto proj = static_cast<float*>(nullptr);
auto projPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&proj), &projPitch, projGeom.dim.x * sizeof(float), projGeom.dim.y
));
projections.push_back(proj);
projectionPitches.push_back(projPitch);
auto expanded = static_cast<float*>(nullptr);
auto expandedPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&expanded),
&expandedPitch,
projGeom.dimFFT.x * sizeof(float),
projGeom.dimFFT.y
));
expandedProjections.push_back(expanded);
expandedPitches.push_back(expandedPitch);
auto transformed = static_cast<hipfftComplex*>(nullptr);
auto transformedPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&transformed),
&transformedPitch,
projGeom.dimTrans.x * sizeof(hipfftComplex),
projGeom.dimTrans.y
));
transformedProjections.push_back(transformed);
transformedPitches.push_back(transformedPitch);
auto& forward = forwardPlans.at(branchIdx);
auto& forwardSize = forwardSizes.at(branchIdx);
auto fw_inembed = static_cast<int>(expandedPitch / sizeof(float));
auto fw_istride = 1;
auto fw_idist = fw_inembed;
auto fw_onembed = static_cast<int>(transformedPitch / sizeof(hipfftComplex));
auto fw_ostride = 1;
auto fw_odist = fw_onembed;
hipfft_check(hipfftCreate(&forward));
hipfft_check(hipfftMakePlanMany(forward, 1, &projGeom.s_dimFFT.x,
&fw_inembed, 1, fw_idist,
&fw_onembed, 1, fw_odist,
HIPFFT_R2C, projGeom.s_dimFFT.y, &forwardSize));
hipfft_check(hipfftSetStream(forward, stream));
auto& backward = backwardPlans.at(branchIdx);
auto& backwardSize = backwardSizes.at(branchIdx);
auto bw_inembed = fw_onembed;
auto bw_istride = fw_ostride;
auto bw_idist = fw_odist;
auto bw_onembed = fw_inembed;
auto bw_ostride = fw_istride;
auto bw_odist = fw_idist;
hipfft_check(hipfftCreate(&backward));
hipfft_check(hipfftMakePlanMany(backward, 1, &projGeom.s_dimFFT.x,
&bw_inembed, bw_istride, bw_idist,
&bw_onembed, bw_ostride, bw_odist,
HIPFFT_C2R, projGeom.s_dimFFT.y, &backwardSize));
hipfft_check(hipfftSetStream(backward, stream));
if(hasTextures)
{
// create a HIP texture from the projection
auto resDesc = hipResourceDesc{};
resDesc.resType = hipResourceTypePitch2D;
resDesc.res.pitch2D.desc = hipCreateChannelDesc<float>();
resDesc.res.pitch2D.devPtr = static_cast<void*>(proj);
resDesc.res.pitch2D.width = projGeom.dim.x;
resDesc.res.pitch2D.height = projGeom.dim.y;
resDesc.res.pitch2D.pitchInBytes = projPitch;
auto texDesc = hipTextureDesc{};
texDesc.addressMode[0] = hipAddressModeBorder;
texDesc.addressMode[1] = hipAddressModeBorder;
texDesc.readMode = hipReadModeElementType;
texDesc.borderColor[0] = 0.f;
texDesc.borderColor[1] = 0.f;
texDesc.filterMode = hipFilterModeLinear;
texDesc.normalizedCoords = 0;
auto& projTex = textureProjections.emplace_back();
hip_check(hipCreateTextureObject(&projTex, &resDesc, &texDesc, nullptr));
}
std::cout << "Done!" << std::endl;
}
create_volume("volume.tif");
auto hostVolPtr = static_cast<float*>(nullptr);
hip_check(hipHostMalloc(
reinterpret_cast<void**>(&hostVolPtr),
volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float),
hipHostMallocDefault
));
auto hostVol = make_hipPitchedPtr(
hostVolPtr, volGeom.dim.x * sizeof(float), volGeom.dim.x, volGeom.dim.y
);
auto vol = hipPitchedPtr{};
auto volExt = make_hipExtent(volGeom.dim.x * sizeof(float), volGeom.dim.y, volGeom.dim.z);
hip_check(hipMalloc3D(&vol, volExt));
hip_check(hipMemset3D(vol, 0, volExt));
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// MAIN LOOP
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
auto graphCreated = false;
auto graphExec = hipGraphExec_t{};
auto graphFinalCreated = false;
auto graphExecFinal = hipGraphExec_t{};
auto graphStream = hipStream_t{};
hip_check(hipStreamCreate(&graphStream));
auto start = std::chrono::steady_clock::now();
auto projIdx = 0u;
while(projIdx < projGeom.numProj)
{
auto batchSize = std::min(numBranches, static_cast<int>(projGeom.numProj - projIdx));
// Create graph for current batch
auto graph = hipGraph_t{};
hip_check(hipGraphCreate(&graph, 0));
// Add nodes for each projection in batch
for(auto branchIdx = 0; branchIdx < batchSize; ++branchIdx, ++projIdx)
{
auto stream = streams.at(branchIdx);
auto threadsPerBlock = dim3{32, 32, 1};
auto blocksPerGrid = dim3{
(projGeom.dim.x / threadsPerBlock.x) + 1, (projGeom.dim.y / threadsPerBlock.y) + 1, 1
};
auto inputPitchedPtr = phantomProjections.at(projIdx);
auto input = static_cast<std::uint16_t*>(inputPitchedPtr.ptr);
auto inputPitch = inputPitchedPtr.pitch;
auto proj = projections.at(branchIdx);
auto projPitch = projectionPitches.at(branchIdx);
void* normalizationKernelParams[] =
{
static_cast<void*>(&input),
static_cast<void*>(&inputPitch),
static_cast<void*>(&proj),
static_cast<void*>(&projPitch),
static_cast<void*>(&projGeom.dim),
static_cast<void*>(&projGeom.bps)
};
auto normalizationKernelNodeParams = hipKernelNodeParams{};
normalizationKernelNodeParams.blockDim = threadsPerBlock;
normalizationKernelNodeParams.extra = nullptr;
normalizationKernelNodeParams.func = reinterpret_cast<void*>(normalization_kernel);
normalizationKernelNodeParams.gridDim = blocksPerGrid;
normalizationKernelNodeParams.kernelParams = normalizationKernelParams;
normalizationKernelNodeParams.sharedMemBytes = 0;
auto normalizationKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&normalizationKernelNode, graph, nullptr, 0, &normalizationKernelNodeParams
));
void* logTransformationKernelParams[] =
{
static_cast<void*>(&proj),
static_cast<void*>(&projPitch),
static_cast<void*>(&projGeom.dim)
};
auto logTransformationKernelNodeParams = hipKernelNodeParams{};
logTransformationKernelNodeParams.blockDim = threadsPerBlock;
logTransformationKernelNodeParams.extra = nullptr;
logTransformationKernelNodeParams.func = reinterpret_cast<void*>(log_transformation_kernel);
logTransformationKernelNodeParams.gridDim = blocksPerGrid;
logTransformationKernelNodeParams.kernelParams = logTransformationKernelParams;
logTransformationKernelNodeParams.sharedMemBytes = 0;
auto logTransformationKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&logTransformationKernelNode,
graph,
&normalizationKernelNode,
1,
&logTransformationKernelNodeParams
));
// [sphinx-weighting-node-start]
void* weightingKernelParams[] =
{
static_cast<void*>(&proj),
static_cast<void*>(&projPitch),
static_cast<void*>(&projGeom.dim),
static_cast<void*>(&projGeom.d_sd),
static_cast<void*>(&projGeom.d_so),
static_cast<void*>(&projGeom.minCoord),
static_cast<void*>(&projGeom.pixelDim)
};
auto weightingKernelNodeParams = hipKernelNodeParams{};
weightingKernelNodeParams.blockDim = threadsPerBlock;
weightingKernelNodeParams.extra = nullptr;
weightingKernelNodeParams.func = reinterpret_cast<void*>(weighting_kernel);
weightingKernelNodeParams.gridDim = blocksPerGrid;
weightingKernelNodeParams.kernelParams = weightingKernelParams;
weightingKernelNodeParams.sharedMemBytes = 0;
auto weightingKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&weightingKernelNode, graph, &logTransformationKernelNode, 1, &weightingKernelNodeParams
));
// [sphinx-weighting-node-end]
// Expand projection to filter length
auto expanded = expandedProjections.at(branchIdx);
auto expandedPitch = expandedPitches.at(branchIdx);
// [sphinx-memset-node-start]
auto expandedMemsetNodeParams = hipMemsetParams{};
expandedMemsetNodeParams.dst = static_cast<void*>(expanded);
expandedMemsetNodeParams.elementSize = sizeof(float);
expandedMemsetNodeParams.height = projGeom.dimFFT.y;
expandedMemsetNodeParams.pitch = expandedPitch;
expandedMemsetNodeParams.value = 0;
expandedMemsetNodeParams.width = projGeom.dimFFT.x;
auto expandedMemsetNode = hipGraphNode_t{};
hip_check(hipGraphAddMemsetNode(
&expandedMemsetNode, graph, &weightingKernelNode, 1, &expandedMemsetNodeParams
));
// [sphinx-memset-node-end]
auto copyProjToExpandedNodeParams = hipMemcpy3DParms{};
std::memset(&copyProjToExpandedNodeParams, 0, sizeof(hipMemcpy3DParms));
copyProjToExpandedNodeParams.srcPos = make_hipPos(0, 0, 0);
copyProjToExpandedNodeParams.srcPtr = make_hipPitchedPtr(
static_cast<void*>(proj), projPitch, projGeom.dim.x, projGeom.dim.y);
copyProjToExpandedNodeParams.dstPos = make_hipPos(0, 0, 0);
copyProjToExpandedNodeParams.dstPtr = make_hipPitchedPtr(
static_cast<void*>(expanded), expandedPitch, projGeom.dimFFT.x, projGeom.dimFFT.y);
copyProjToExpandedNodeParams.extent = make_hipExtent(
projGeom.dim.x * sizeof(float), projGeom.dim.y, 1);
copyProjToExpandedNodeParams.kind = hipMemcpyDeviceToDevice;
auto copyProjToExpandedNode = hipGraphNode_t{};
hip_check(hipGraphAddMemcpyNode(
&copyProjToExpandedNode,
graph,
&expandedMemsetNode,
1,
&copyProjToExpandedNodeParams
));
// R2C Fourier-transform projection
auto transformed = transformedProjections.at(branchIdx);
auto transformedPitch = transformedPitches.at(branchIdx);
auto transformedMemsetNodeParams = hipMemsetParams{};
transformedMemsetNodeParams.dst = static_cast<void*>(transformed);
transformedMemsetNodeParams.elementSize = sizeof(float); // elementSize maximum is 4 bytes
transformedMemsetNodeParams.height = projGeom.dimTrans.y;
transformedMemsetNodeParams.pitch = transformedPitch;
transformedMemsetNodeParams.value = 0;
transformedMemsetNodeParams.width = projGeom.dimTrans.x * 2; // hipfftComplex = 2 floats
auto transformedMemsetNode = hipGraphNode_t{};
hip_check(hipGraphAddMemsetNode(
&transformedMemsetNode, graph, &copyProjToExpandedNode, 1, &transformedMemsetNodeParams
));
// [sphinx-before-forward-start]
// Before capturing the FFT operations, obtain the set of nodes already present in the graph
auto nodesBeforeForward = std::vector<hipGraphNode_t>{};
auto numNodesBeforeForward = std::size_t{};
hip_check(hipGraphGetNodes(graph, nullptr, &numNodesBeforeForward));
nodesBeforeForward.resize(numNodesBeforeForward);
hip_check(hipGraphGetNodes(graph, nodesBeforeForward.data(), &numNodesBeforeForward));
auto nodesBeforeForwardSorted = std::set<hipGraphNode_t>{
std::begin(nodesBeforeForward), std::end(nodesBeforeForward)
};
// [sphinx-before-forward-end]
// [sphinx-hipfft-start]
hip_check(hipStreamBeginCaptureToGraph(
stream, graph, &transformedMemsetNode, nullptr, 1, hipStreamCaptureModeGlobal));
auto& forward = forwardPlans.at(branchIdx);
hipfft_check(hipfftExecR2C(forward, expanded, transformed));
hip_check(hipStreamEndCapture(stream, &graph));
// [sphinx-hipfft-end]
// [sphinx-is-leaf-start]
auto is_leaf = [](hipGraphNode_t node)
{
auto numDependentNodes = std::size_t{};
hip_check(hipGraphNodeGetDependentNodes(node, nullptr, &numDependentNodes));
return numDependentNodes == 0;
};
// [sphinx-is-leaf-end]
// [sphinx-after-forward-start]
// Obtain the nodes in the graph again; the newly added nodes are the dependencies for the next step
auto nodesAfterForward = std::vector<hipGraphNode_t>{};
auto numNodesAfterForward = std::size_t{};
hip_check(hipGraphGetNodes(graph, nullptr, &numNodesAfterForward));
nodesAfterForward.resize(numNodesAfterForward);
hip_check(hipGraphGetNodes(graph, nodesAfterForward.data(), &numNodesAfterForward));
auto nodesAfterForwardSorted = std::set<hipGraphNode_t>{
std::begin(nodesAfterForward), std::end(nodesAfterForward)
};
// [sphinx-after-forward-end]
// [sphinx-node-difference-start]
// Compute difference between both sets
auto forwardFFTNodes = std::vector<hipGraphNode_t>{};
std::set_difference(std::begin(nodesAfterForwardSorted), std::end(nodesAfterForwardSorted),
std::begin(nodesBeforeForwardSorted), std::end(nodesBeforeForwardSorted),
std::back_inserter(forwardFFTNodes));
// [sphinx-node-difference-end]
// [sphinx-find-leaf-start]
// Find leaf node in difference set
auto forwardLeafNode = *(std::find_if(std::begin(forwardFFTNodes), std::end(forwardFFTNodes), is_leaf));
// [sphinx-find-leaf-end]
// Apply filter
auto filterBlocksPerGrid = dim3{
(projGeom.dimTrans.x / threadsPerBlock.x) + 1,
(projGeom.dimTrans.y / threadsPerBlock.y) + 1,
1
};
void* filterApplicationKernelParams[] =
{
static_cast<void*>(&transformed),
static_cast<void*>(&transformedPitch),
static_cast<void*>(&R),
static_cast<void*>(&projGeom.dimTrans)
};
auto filterApplicationKernelNodeParams = hipKernelNodeParams{};
filterApplicationKernelNodeParams.blockDim = threadsPerBlock;
filterApplicationKernelNodeParams.extra = nullptr;
filterApplicationKernelNodeParams.func = reinterpret_cast<void*>(filter_application_kernel);
filterApplicationKernelNodeParams.gridDim = filterBlocksPerGrid;
filterApplicationKernelNodeParams.kernelParams = filterApplicationKernelParams;
filterApplicationKernelNodeParams.sharedMemBytes = 0;
auto filterApplicationKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&filterApplicationKernelNode, graph, &forwardLeafNode, 1, &filterApplicationKernelNodeParams
));
// C2R Fourier-transform projection - same node counting procedure as above
auto nodesBeforeBackward = std::vector<hipGraphNode_t>{};
auto numNodesBeforeBackward = std::size_t{};
hip_check(hipGraphGetNodes(graph, nullptr, &numNodesBeforeBackward));
nodesBeforeBackward.resize(numNodesBeforeBackward);
hip_check(hipGraphGetNodes(graph, nodesBeforeBackward.data(), &numNodesBeforeBackward));
auto nodesBeforeBackwardSorted = std::set<hipGraphNode_t>{
std::begin(nodesBeforeBackward), std::end(nodesBeforeBackward)
};
hip_check(hipStreamBeginCaptureToGraph(
stream, graph, &filterApplicationKernelNode, nullptr, 1, hipStreamCaptureModeGlobal
));
auto& backward = backwardPlans.at(branchIdx);
hipfft_check(hipfftExecC2R(backward, transformed, expanded));
hip_check(hipStreamEndCapture(stream, &graph));
auto nodesAfterBackward = std::vector<hipGraphNode_t>{};
auto numNodesAfterBackward = std::size_t{};
hip_check(hipGraphGetNodes(graph, nullptr, &numNodesAfterBackward));
nodesAfterBackward.resize(numNodesAfterBackward);
hip_check(hipGraphGetNodes(graph, nodesAfterBackward.data(), &numNodesAfterBackward));
auto nodesAfterBackwardSorted = std::set<hipGraphNode_t>{
std::begin(nodesAfterBackward), std::end(nodesAfterBackward)
};
auto backwardFFTNodes = std::vector<hipGraphNode_t>{};
std::set_difference(std::begin(nodesAfterBackwardSorted), std::end(nodesAfterBackwardSorted),
std::begin(nodesBeforeBackwardSorted), std::end(nodesBeforeBackwardSorted),
std::back_inserter(backwardFFTNodes));
auto backwardLeafNode = *(
std::find_if(std::begin(backwardFFTNodes), std::end(backwardFFTNodes), is_leaf
));
// Shrink projection to original size and normalize
auto copyExpandedToProjNodeParams = hipMemcpy3DParms{};
std::memset(&copyExpandedToProjNodeParams, 0, sizeof(hipMemcpy3DParms));
copyExpandedToProjNodeParams.srcPos = make_hipPos(0, 0, 0);
copyExpandedToProjNodeParams.srcPtr = make_hipPitchedPtr(
static_cast<void*>(expanded), expandedPitch, projGeom.dimFFT.x, projGeom.dimFFT.y
);
copyExpandedToProjNodeParams.dstPos = make_hipPos(0, 0, 0);
copyExpandedToProjNodeParams.dstPtr = make_hipPitchedPtr(
static_cast<void*>(proj), projPitch, projGeom.dim.x, projGeom.dim.y
);
copyExpandedToProjNodeParams.extent = make_hipExtent(
projGeom.dim.x * sizeof(float), projGeom.dim.y, 1);
copyExpandedToProjNodeParams.kind = hipMemcpyDeviceToDevice;
auto copyExpandedToProjNode = hipGraphNode_t{};
hip_check(hipGraphAddMemcpyNode(
&copyExpandedToProjNode, graph, &backwardLeafNode, 1, &copyExpandedToProjNodeParams
));
void* filterNormalizationKernelParams[] =
{
static_cast<void*>(&proj),
static_cast<void*>(&projPitch),
static_cast<void*>(&projGeom.dimFFT.x),
static_cast<void*>(&projGeom.dim)
};
auto filterNormalizationKernelNodeParams = hipKernelNodeParams{};
filterNormalizationKernelNodeParams.blockDim = threadsPerBlock;
filterNormalizationKernelNodeParams.extra = nullptr;
filterNormalizationKernelNodeParams.func = reinterpret_cast<void*>(filter_normalization_kernel);
filterNormalizationKernelNodeParams.gridDim = blocksPerGrid;
filterNormalizationKernelNodeParams.kernelParams = filterNormalizationKernelParams;
filterNormalizationKernelNodeParams.sharedMemBytes = 0;
auto filterNormalizationKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&filterNormalizationKernelNode,
graph,
&copyExpandedToProjNode,
1,
&filterNormalizationKernelNodeParams));
// Backprojection
auto thetaDeg = projGeom.thetaSign * projGeom.thetaStep * projIdx; // Current angle
auto thetaRad = thetaDeg * std::numbers::pi_v<float> / 180.f; // Convert to radians
auto sinTheta = std::sin(thetaRad);
auto cosTheta = std::cos(thetaRad);
auto bpBlockSize = dim3{32u, 8u, 4u};
auto bpBlocks = dim3{
static_cast<std::uint32_t>(volGeom.dim.x / bpBlockSize.x + 1),
static_cast<std::uint32_t>(volGeom.dim.y / bpBlockSize.y + 1),
static_cast<std::uint32_t>(volGeom.dim.z / bpBlockSize.z + 1)
};
if(hasTextures)
{
auto& projTex = textureProjections.at(branchIdx);
void* backprojectionKernelParams[] =
{
&vol.ptr,
static_cast<void*>(&vol.pitch),
static_cast<void*>(&volGeom.dim),
static_cast<void*>(&volGeom.voxelDim),
static_cast<void*>(&projTex),
static_cast<void*>(&projGeom.minCoord),
static_cast<void*>(&sinTheta),
static_cast<void*>(&cosTheta),
static_cast<void*>(&projGeom.pixelDim),
static_cast<void*>(&projGeom.d_sd),
static_cast<void*>(&projGeom.d_so)
};
auto backprojectionKernelNodeParams = hipKernelNodeParams{};
backprojectionKernelNodeParams.blockDim = bpBlockSize;
backprojectionKernelNodeParams.extra = nullptr;
backprojectionKernelNodeParams.func = reinterpret_cast<void*>(backprojection_kernel);
backprojectionKernelNodeParams.gridDim = bpBlocks;
backprojectionKernelNodeParams.kernelParams = backprojectionKernelParams;
backprojectionKernelNodeParams.sharedMemBytes = 0;
auto backprojectionKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&backprojectionKernelNode,
graph,
&filterNormalizationKernelNode,
1,
&backprojectionKernelNodeParams
));
}
else
{
// Fallback for devices without support for texture instructions
void* backprojectionKernelParams[] =
{
&vol.ptr,
static_cast<void*>(&vol.pitch),
static_cast<void*>(&volGeom.dim),
static_cast<void*>(&volGeom.voxelDim),
static_cast<void*>(&proj),
static_cast<void*>(&projPitch),
static_cast<void*>(&projGeom.dim),
static_cast<void*>(&projGeom.minCoord),
static_cast<void*>(&sinTheta),
static_cast<void*>(&cosTheta),
static_cast<void*>(&projGeom.pixelDim),
static_cast<void*>(&projGeom.d_sd),
static_cast<void*>(&projGeom.d_so)
};
auto backprojectionKernelNodeParams = hipKernelNodeParams{};
backprojectionKernelNodeParams.blockDim = bpBlockSize;
backprojectionKernelNodeParams.extra = nullptr;
backprojectionKernelNodeParams.func = reinterpret_cast<void*>(backprojection_kernel_no_tex);
backprojectionKernelNodeParams.gridDim = bpBlocks;
backprojectionKernelNodeParams.kernelParams = backprojectionKernelParams;
backprojectionKernelNodeParams.sharedMemBytes = 0;
auto backprojectionKernelNode = hipGraphNode_t{};
hip_check(hipGraphAddKernelNode(
&backprojectionKernelNode,
graph,
&filterNormalizationKernelNode,
1,
&backprojectionKernelNodeParams
));
}
}
// Instantiate and launch the graph
if(batchSize == numBranches)
{
if(!graphCreated)
{
hip_check(hipGraphDebugDotPrint(graph, "graph_creation.dot", hipGraphDebugDotFlagsVerbose));
hip_check(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
hip_check(hipGraphDestroy(graph));
hip_check(hipGraphLaunch(graphExec, graphStream));
graphCreated = true;
}
else
{
// Update existing executable graph after each iteration with new input data
auto result = hipGraphExecUpdateResult{};
auto errorNode = hipGraphNode_t{};
hip_check(hipGraphExecUpdate(graphExec, graph, &errorNode, &result));
if(result != hipGraphExecUpdateSuccess)
{
auto msg = std::string{"Failed to update graph: "};
switch(result)
{
case hipGraphExecUpdateError:
msg += "Invalid value.";
break;
case hipGraphExecUpdateErrorFunctionChanged:
msg += "Function of kernel node changed.";
break;
case hipGraphExecUpdateErrorNodeTypeChanged:
msg += "Type of node changed.";
break;
case hipGraphExecUpdateErrorNotSupported:
msg += "Something about the node is not supported.";
break;
case hipGraphExecUpdateErrorParametersChanged:
msg += "Unsupported parameter change.";
break;
case hipGraphExecUpdateErrorTopologyChanged:
msg += "Graph topology changed.";
break;
case hipGraphExecUpdateErrorUnsupportedFunctionChange:
msg += "Unsupported change of kernel node function.";
break;
default:
msg += "Unknown error.";
break;
}
throw std::runtime_error{msg};
}
hip_check(hipGraphDestroy(graph));
hip_check(hipGraphLaunch(graphExec, graphStream));
}
}
else
{
hip_check(hipGraphDebugDotPrint(graph, "graph_creation_final.dot", hipGraphDebugDotFlagsVerbose));
// Incomplete batch: topology changed, must instantiate new executable graph
hip_check(hipGraphInstantiate(&graphExecFinal, graph, nullptr, nullptr, 0));
hip_check(hipGraphDestroy(graph));
hip_check(hipGraphLaunch(graphExecFinal, graphStream));
graphFinalCreated = true;
}
}
// Obtain reconstruction time before copying back the result
auto stop = std::chrono::steady_clock::time_point{};
hip_check(hipLaunchHostFunc(graphStream, [](void* data)
{
auto& stop = *(static_cast<std::chrono::steady_clock::time_point*>(data));
stop = std::chrono::steady_clock::now();
}, static_cast<void*>(&stop)));
// Copy volume back to host and save
auto memcpyParams = hipMemcpy3DParms{};
std::memset(&memcpyParams, 0, sizeof(hipMemcpy3DParms));
memcpyParams.dstPos = make_hipPos(0, 0, 0);
memcpyParams.dstPtr = hostVol;
memcpyParams.srcPos = make_hipPos(0, 0, 0);
memcpyParams.srcPtr = vol;
memcpyParams.extent = volExt;
memcpyParams.kind = hipMemcpyDeviceToHost;
hip_check(hipMemcpy3DAsync(&memcpyParams, graphStream));
auto saveVolArgs = new save_volume_args
{
"volume.tif",
hostVolPtr,
volGeom.dim.x, volGeom.dim.y, volGeom.dim.z,
volGeom.voxelDim.x, volGeom.voxelDim.y
};
hip_check(hipLaunchHostFunc(graphStream, save_volume, saveVolArgs));
std::cout << "All work items enqueued, waiting for completion... " << std::flush;
hip_check(hipStreamSynchronize(graphStream));
std::cout << "Done!" << std::endl;
auto const elapsed = std::chrono::duration<double>{stop - start};
std::cout << "Reconstruction time: " << elapsed.count() << 's' << std::endl;
// Cleanup
if(graphFinalCreated)
hip_check(hipGraphExecDestroy(graphExecFinal));
hip_check(hipGraphExecDestroy(graphExec));
hip_check(hipStreamDestroy(graphStream));
hip_check(hipFree(vol.ptr));
hip_check(hipFreeHost(hostVolPtr));
if(hasTextures)
{
for(auto&& tex : textureProjections)
hip_check(hipDestroyTextureObject(tex));
}
for(auto&& plan : backwardPlans)
hipfft_check(hipfftDestroy(plan));
for(auto&& plan : forwardPlans)
hipfft_check(hipfftDestroy(plan));
for(auto&& p : transformedProjections)
hip_check(hipFree(p));
for(auto&& p : expandedProjections)
hip_check(hipFree(p));
for(auto&& p : projections)
hip_check(hipFree(p));
for(auto&& p : phantomProjections)
hip_check(hipFree(p.ptr));
hip_check(hipFree(R));
for(auto&& stream : streams)
hip_check(hipStreamDestroy(stream));
hip_check(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
catch(std::runtime_error const& e)
{
std::cerr << "Caught runtime error: " << e.what() << std::endl;
return EXIT_FAILURE;
}
}
@@ -0,0 +1,521 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "backprojection.hpp"
#include "filtering.hpp"
#include "log_transform.hpp"
#include "normalization.hpp"
#include "phantom.hpp"
#include "projection.hpp"
#include "utility.hpp"
#include "weighting.hpp"
#include "volume.hpp"
#include <hip/hip_runtime.h>
#include <hipfft/hipfft.h>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <numbers>
#include <ostream>
#include <set>
#include <stdexcept>
#include <vector>
auto main() -> int
{
try
{
auto hasTextures = int{0};
hip_check(hipDeviceGetAttribute(&hasTextures, hipDeviceAttributeImageSupport, 0));
// [sphinx-async-engine-start]
// Fetch device properties
auto devProps = hipDeviceProp_t{};
hip_check(hipGetDeviceProperties(&devProps, 0));
auto const numStreams = devProps.asyncEngineCount;
std::cout << "Device has " << numStreams << " asynchronous engines; preprocessing will use "
<< numStreams << " parallel streams." << std::endl;
auto streams = std::vector<hipStream_t>{};
streams.resize(numStreams);
for(auto&& stream : streams)
hip_check(hipStreamCreate(&stream));
// [sphinx-async-engine-end]
auto r = static_cast<float*>(nullptr);
auto R = static_cast<hipfftComplex*>(nullptr);
auto forwardPlans = std::vector<hipfftHandle>{};
auto forwardSizes = std::vector<std::size_t>{};
auto backwardPlans = std::vector<hipfftHandle>{};
auto backwardSizes = std::vector<std::size_t>{};
forwardPlans.resize(numStreams);
forwardSizes.resize(numStreams);
backwardPlans.resize(numStreams);
backwardSizes.resize(numStreams);
auto projections = std::vector<float*>{};
auto projectionPitches = std::vector<std::size_t>{};
auto expandedProjections = std::vector<float*>{};
auto expandedPitches = std::vector<std::size_t>{};
auto transformedProjections = std::vector<hipfftComplex*>{};
auto transformedPitches = std::vector<std::size_t>{};
auto textureProjections = std::vector<hipTextureObject_t>{};
auto projGeom = phantom::make_projectionGeometry();
auto volGeom = phantom::make_volumeGeometry();
auto phantomProjections = phantom::make_projections(projGeom, volGeom, streams);
std::cout << "Initializing... " << std::flush;
auto stream = streams.at(0);
// Create filter kernel
hip_check(hipMalloc(reinterpret_cast<void**>(&r), projGeom.dimFFT.x * sizeof(float)));
auto const creationBlocks = std::max((projGeom.dimFFT.x / 1024u), 1u);
filter_creation_kernel<<<creationBlocks, 1024, 0, stream>>>(r, projGeom.s_dimFFT.x, projGeom.pixelDim.x);
hip_check(hipMalloc(reinterpret_cast<void**>(&R), projGeom.dimTrans.x * sizeof(hipfftComplex)));
auto filterPlan = hipfftHandle{};
hipfft_check(hipfftPlan1d(&filterPlan, projGeom.dimFFT.x, HIPFFT_R2C, 1));
hipfft_check(hipfftSetStream(filterPlan, stream));
hipfft_check(hipfftExecR2C(filterPlan, r, R));
auto absoluteBlocks = (projGeom.dimTrans.x / 1024u) + 1u;
filter_absolute_kernel<<<absoluteBlocks, 1024, 0, stream>>>(R, projGeom.dimTrans.x, projGeom.pixelDim.x);
hip_check(hipStreamSynchronize(stream));
hipfft_check(hipfftDestroy(filterPlan));
hip_check(hipFree(r));
auto const inputProjSingle = projGeom.dim.x * projGeom.dim.y * sizeof(std::uint16_t);
auto const inputProjTotal = inputProjSingle * numStreams;
auto const projSingle = projGeom.dim.x * projGeom.dim.y * sizeof(float);
auto const projTotal = projSingle * numStreams;
auto const expandedSingle = projGeom.dimFFT.x * projGeom.dimFFT.y * sizeof(float);
auto const expandedTotal = expandedSingle * numStreams;
auto const transformedSingle = projGeom.dimTrans.x * projGeom.dimTrans.y * sizeof(hipfftComplex);
auto const transformedTotal = transformedSingle * numStreams;
auto const volumeTotal = volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float);
auto const memTotal = inputProjTotal + projTotal + expandedTotal + transformedTotal + volumeTotal;
auto devMemFree = std::size_t{};
auto devMemTotal = std::size_t{};
hip_check(hipMemGetInfo(&devMemFree, &devMemTotal));
auto memRequired = static_cast<std::size_t>(memTotal);
if(memRequired > devMemFree)
{
std::cerr << "Not enough device memory. Required: " << memRequired
<< ", available: " << devMemFree << std::endl;
return EXIT_FAILURE;
}
std::cout << "Done!" << std::endl;
std::cout << "Volume dimensions: " << volGeom.dim.x << " x "
<< volGeom.dim.y << " x "
<< volGeom.dim.z << std::endl;
// Initialize per-stream data
for(auto streamIdx = 0u; streamIdx < streams.size(); ++streamIdx)
{
std::cout << "Initializing stream " << streamIdx << "... " << std::flush;
auto stream = streams.at(streamIdx);
auto proj = static_cast<float*>(nullptr);
auto projPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&proj), &projPitch, projGeom.dim.x * sizeof(float), projGeom.dim.y
));
projections.push_back(proj);
projectionPitches.push_back(projPitch);
auto expanded = static_cast<float*>(nullptr);
auto expandedPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&expanded),
&expandedPitch,
projGeom.dimFFT.x * sizeof(float),
projGeom.dimFFT.y
));
expandedProjections.push_back(expanded);
expandedPitches.push_back(expandedPitch);
auto transformed = static_cast<hipfftComplex*>(nullptr);
auto transformedPitch = std::size_t{};
hip_check(hipMallocPitch(
reinterpret_cast<void**>(&transformed),
&transformedPitch,
projGeom.dimTrans.x * sizeof(hipfftComplex),
projGeom.dimTrans.y
));
transformedProjections.push_back(transformed);
transformedPitches.push_back(transformedPitch);
auto& forward = forwardPlans.at(streamIdx);
auto& forwardSize = forwardSizes.at(streamIdx);
auto fw_inembed = static_cast<int>(expandedPitch / sizeof(float));
auto fw_istride = 1;
auto fw_idist = fw_inembed;
auto fw_onembed = static_cast<int>(transformedPitch / sizeof(hipfftComplex));
auto fw_ostride = 1;
auto fw_odist = fw_onembed;
hipfft_check(hipfftCreate(&forward));
hipfft_check(hipfftMakePlanMany(forward, 1, &projGeom.s_dimFFT.x,
&fw_inembed, 1, fw_idist,
&fw_onembed, 1, fw_odist,
HIPFFT_R2C, projGeom.s_dimFFT.y, &forwardSize));
hipfft_check(hipfftSetStream(forward, stream));
auto& backward = backwardPlans.at(streamIdx);
auto& backwardSize = backwardSizes.at(streamIdx);
auto bw_inembed = fw_onembed;
auto bw_istride = fw_ostride;
auto bw_idist = fw_odist;
auto bw_onembed = fw_inembed;
auto bw_ostride = fw_istride;
auto bw_odist = fw_idist;
hipfft_check(hipfftCreate(&backward));
hipfft_check(hipfftMakePlanMany(backward, 1, &projGeom.s_dimFFT.x,
&bw_inembed, bw_istride, bw_idist,
&bw_onembed, bw_ostride, bw_odist,
HIPFFT_C2R, projGeom.s_dimFFT.y, &backwardSize));
hipfft_check(hipfftSetStream(backward, stream));
if(hasTextures)
{
// create a HIP texture from the projection
auto resDesc = hipResourceDesc{};
resDesc.resType = hipResourceTypePitch2D;
resDesc.res.pitch2D.desc = hipCreateChannelDesc<float>();
resDesc.res.pitch2D.devPtr = static_cast<void*>(proj);
resDesc.res.pitch2D.width = projGeom.dim.x;
resDesc.res.pitch2D.height = projGeom.dim.y;
resDesc.res.pitch2D.pitchInBytes = projPitch;
auto texDesc = hipTextureDesc{};
texDesc.addressMode[0] = hipAddressModeBorder;
texDesc.addressMode[1] = hipAddressModeBorder;
texDesc.readMode = hipReadModeElementType;
texDesc.borderColor[0] = 0.f;
texDesc.borderColor[1] = 0.f;
texDesc.filterMode = hipFilterModeLinear;
texDesc.normalizedCoords = 0;
auto& projTex = textureProjections.emplace_back();
hip_check(hipCreateTextureObject(&projTex, &resDesc, &texDesc, nullptr));
}
std::cout << "Done!" << std::endl;
}
create_volume("volume.tif");
auto hostVolPtr = static_cast<float*>(nullptr);
hip_check(hipHostMalloc(
reinterpret_cast<void**>(&hostVolPtr),
volGeom.dim.x * volGeom.dim.y * volGeom.dim.z * sizeof(float),
hipHostMallocDefault
));
auto hostVol = make_hipPitchedPtr(
hostVolPtr, volGeom.dim.x * sizeof(float), volGeom.dim.x, volGeom.dim.y
);
auto vol = hipPitchedPtr{};
auto volExt = make_hipExtent(volGeom.dim.x * sizeof(float), volGeom.dim.y, volGeom.dim.z);
hip_check(hipMalloc3D(&vol, volExt));
hip_check(hipMemset3D(vol, 0, volExt));
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// MAIN LOOP
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
auto start = std::chrono::steady_clock::now();
// [sphinx-batch-start]
auto projIdx = 0u;
while(projIdx < projGeom.numProj)
{
auto batchSize = std::min(numStreams, static_cast<int>(projGeom.numProj - projIdx));
// Launch batch in parallel streams
for(auto streamIdx = 0; streamIdx < batchSize; ++streamIdx, ++projIdx)
{
auto stream = streams.at(streamIdx);
// [sphinx-batch-end]
auto threadsPerBlock = dim3{32, 32, 1};
auto blocksPerGrid = dim3{
(projGeom.dim.x / threadsPerBlock.x) + 1, (projGeom.dim.y / threadsPerBlock.y) + 1, 1
};
auto inputPitchedPtr = phantomProjections.at(projIdx);
auto input = static_cast<std::uint16_t*>(inputPitchedPtr.ptr);
auto inputPitch = inputPitchedPtr.pitch;
// [sphinx-preprocessing-start]
////////////////////////////////////////////////////////////////////////////////////////////////////
// START HERE
////////////////////////////////////////////////////////////////////////////////////////////////////
auto proj = projections.at(streamIdx);
auto projPitch = projectionPitches.at(streamIdx);
normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
input, inputPitch, proj, projPitch, projGeom.dim, projGeom.bps
);
log_transformation_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(proj, projPitch, projGeom.dim);
weighting_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
proj,
projPitch,
projGeom.dim,
projGeom.d_sd,
projGeom.d_so,
projGeom.minCoord,
projGeom.pixelDim
);
// [sphinx-preprocessing-end]
// [sphinx-proj-to-expanded-start]
// Expand projection to filter length
auto expanded = expandedProjections.at(streamIdx);
auto expandedPitch = expandedPitches.at(streamIdx);
hip_check(hipMemset2DAsync(
expanded, expandedPitch, 0, projGeom.dimFFT.x * sizeof(float), projGeom.dimFFT.y, stream
));
hip_check(hipMemcpy2DAsync(
expanded,
expandedPitch,
proj,
projPitch,
projGeom.dim.x * sizeof(float),
projGeom.dim.y,
hipMemcpyDeviceToDevice,
stream
));
// [sphinx-proj-to-expanded-end]
// [sphinx-forward-start]
// R2C Fourier-transform projection
auto transformed = transformedProjections.at(streamIdx);
auto transformedPitch = transformedPitches.at(streamIdx);
hip_check(hipMemset2DAsync(
transformed,
transformedPitch,
0,
projGeom.dimTrans.x * sizeof(hipfftComplex),
projGeom.dimTrans.y,
stream
));
auto& forward = forwardPlans.at(streamIdx);
hipfft_check(hipfftExecR2C(forward, expanded, transformed));
// [sphinx-forward-end]
// [sphinx-filter-start]
// Apply filter
auto filterBlocksPerGrid = dim3{
(projGeom.dimTrans.x / threadsPerBlock.x) + 1,
(projGeom.dimTrans.y / threadsPerBlock.y) + 1,
1
};
filter_application_kernel<<<filterBlocksPerGrid, threadsPerBlock, 0, stream>>>(
transformed, transformedPitch, R, projGeom.dimTrans
);
auto& backward = backwardPlans.at(streamIdx);
hipfft_check(hipfftExecC2R(backward, transformed, expanded));
// [sphinx-filter-end]
// [sphinx-expanded-to-proj-start]
// Shrink projection to original size and normalize
hip_check(hipMemcpy2DAsync(
proj,
projPitch,
expanded,
expandedPitch,
projGeom.dim.x * sizeof(float),
projGeom.dim.y,
hipMemcpyDeviceToDevice,
stream
));
filter_normalization_kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(
proj, projPitch, projGeom.dimFFT.x, projGeom.dim
);
// [sphinx-expanded-to-proj-end]
// [sphinx-bp-start]
// Backprojection
auto thetaDeg = projGeom.thetaSign * projGeom.thetaStep * projIdx; // Current angle
auto thetaRad = thetaDeg * std::numbers::pi_v<float> / 180.f; // Convert to radians
auto sinTheta = std::sin(thetaRad);
auto cosTheta = std::cos(thetaRad);
auto bpBlockSize = dim3{32u, 8u, 4u};
auto bpBlocks = dim3{
static_cast<std::uint32_t>(volGeom.dim.x / bpBlockSize.x + 1),
static_cast<std::uint32_t>(volGeom.dim.y / bpBlockSize.y + 1),
static_cast<std::uint32_t>(volGeom.dim.z / bpBlockSize.z + 1)
};
if(hasTextures)
{
auto& projTex = textureProjections.at(streamIdx);
backprojection_kernel<<<bpBlocks, bpBlockSize, 0, stream>>>(
static_cast<float*>(vol.ptr),
vol.pitch,
volGeom.dim,
volGeom.voxelDim,
projTex,
projGeom.minCoord,
sinTheta,
cosTheta,
projGeom.pixelDim,
projGeom.d_sd,
projGeom.d_so
);
}
else
{
// Fallback for devices without support for texture instructions
backprojection_kernel_no_tex<<<bpBlocks, bpBlockSize, 0, stream>>>(
static_cast<float*>(vol.ptr),
vol.pitch,
volGeom.dim,
volGeom.voxelDim,
proj,
projPitch,
projGeom.dim,
projGeom.minCoord,
sinTheta,
cosTheta,
projGeom.pixelDim,
projGeom.d_sd,
projGeom.d_so
);
}
// [sphinx-bp-end]
}
}
// [sphinx-sync-start]
// First stream waits for other streams to complete
auto completionEvents = std::vector<hipEvent_t>{};
for(auto streamIdx = 1u; streamIdx < streams.size(); ++streamIdx)
{
auto event = hipEvent_t{};
hip_check(hipEventCreate(&event));
hip_check(hipEventRecord(event, streams.at(streamIdx)));
completionEvents.push_back(event);
}
for(auto&& event : completionEvents)
hip_check(hipStreamWaitEvent(streams.at(0), event, 0));
// [sphinx-sync-end]
// Obtain reconstruction time before copying back the result
auto stop = std::chrono::steady_clock::time_point{};
hip_check(hipLaunchHostFunc(streams.at(0), [](void* data)
{
auto& stop = *(static_cast<std::chrono::steady_clock::time_point*>(data));
stop = std::chrono::steady_clock::now();
}, static_cast<void*>(&stop)));
// Copy volume back to host and save
auto memcpyParams = hipMemcpy3DParms{};
std::memset(&memcpyParams, 0, sizeof(hipMemcpy3DParms));
memcpyParams.dstPos = make_hipPos(0, 0, 0);
memcpyParams.dstPtr = hostVol;
memcpyParams.srcPos = make_hipPos(0, 0, 0);
memcpyParams.srcPtr = vol;
memcpyParams.extent = volExt;
memcpyParams.kind = hipMemcpyDeviceToHost;
hip_check(hipMemcpy3DAsync(&memcpyParams, streams.at(0)));
auto saveVolArgs = new save_volume_args
{
"volume.tif",
hostVolPtr,
volGeom.dim.x, volGeom.dim.y, volGeom.dim.z,
volGeom.voxelDim.x, volGeom.voxelDim.y
};
hip_check(hipLaunchHostFunc(streams.at(0), save_volume, saveVolArgs));
std::cout << "All work items enqueued, waiting for completion... " << std::flush;
hip_check(hipStreamSynchronize(streams.at(0)));
std::cout << "Done!" << std::endl;
auto const elapsed = std::chrono::duration<double>{stop - start};
std::cout << "Reconstruction time: " << elapsed.count() << 's' << std::endl;
for(auto&& event : completionEvents)
hip_check(hipEventDestroy(event));
hip_check(hipFree(vol.ptr));
hip_check(hipFreeHost(hostVolPtr));
if(hasTextures)
{
for(auto&& tex : textureProjections)
hip_check(hipDestroyTextureObject(tex));
}
for(auto&& plan : backwardPlans)
hipfft_check(hipfftDestroy(plan));
for(auto&& plan : forwardPlans)
hipfft_check(hipfftDestroy(plan));
for(auto&& p : transformedProjections)
hip_check(hipFree(p));
for(auto&& p : expandedProjections)
hip_check(hipFree(p));
for(auto&& p : projections)
hip_check(hipFree(p));
for(auto&& p : phantomProjections)
hip_check(hipFree(p.ptr));
hip_check(hipFree(R));
for(auto&& stream : streams)
hip_check(hipStreamDestroy(stream));
hip_check(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
catch(std::runtime_error const& e)
{
std::cerr << "Caught runtime error: " << e.what() << std::endl;
return EXIT_FAILURE;
}
}
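The launch configurations in the file above size each grid dimension as `extent / block + 1`. That always covers the data, but it launches one extra, fully idle block whenever the extent is an exact multiple of the block size. The following host-only sketch compares that formula with a ceiling division. `blocks_plus_one` and `blocks_ceil_div` are hypothetical helper names introduced for illustration only; they are not part of the example or of HIP.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the formula used in the example: always adds one block.
constexpr std::uint32_t blocks_plus_one(std::uint32_t extent, std::uint32_t block)
{
    assert(block > 0);
    return extent / block + 1;
}

// Ceiling division: the smallest block count that still covers `extent`.
// Assumes extent + block - 1 does not overflow std::uint32_t.
constexpr std::uint32_t blocks_ceil_div(std::uint32_t extent, std::uint32_t block)
{
    assert(block > 0);
    return (extent + block - 1) / block;
}

// Exact multiple: the "+ 1" formula launches one surplus block.
static_assert(blocks_plus_one(1024, 32) == 33, "one idle block");
static_assert(blocks_ceil_div(1024, 32) == 32, "tight fit");
// Non-multiple: both formulas agree.
static_assert(blocks_plus_one(1000, 32) == 32, "covers the remainder");
static_assert(blocks_ceil_div(1000, 32) == 32, "covers the remainder");
```

Either formula is correct as long as the kernels bounds-check their indices; the ceiling division only avoids scheduling blocks that have nothing to do.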
+168
@@ -0,0 +1,168 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
__global__ void kernelA(double* arrayA, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] *= 2.0;
}
}
__global__ void kernelB(int* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayB[x] = 3;
}
}
__global__ void kernelC(double* arrayA, const int* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] += arrayB[x];
}
}
struct set_vector_args
{
std::vector<double>& h_array;
double value;
};
void set_vector(void* args)
{
set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
std::vector<double>& vec{h_args.h_array};
vec.assign(vec.size(), h_args.value);
}
int main()
{
constexpr int numOfBlocks = 1024;
constexpr int threadsPerBlock = 1024;
constexpr std::size_t arraySize = 1U << 20;
// This example assumes that kernelA operates on data that is initialized on
// and copied from the host, while kernelB initializes the array passed to it.
// Both arrays are then used as input to kernelC. arrayA also serves as its
// output and is copied back to the host, whereas arrayB is only read.
double* d_arrayA;
int* d_arrayB;
std::vector<double> h_array(arraySize);
constexpr double initValue = 2.0;
hipStream_t captureStream;
HIP_CHECK(hipStreamCreate(&captureStream));
// Start capturing the operations assigned to the stream
HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
// hipMallocAsync and hipMemcpyAsync are required so that the allocations and copies can be assigned to a stream
HIP_CHECK(hipMallocAsync(reinterpret_cast<void**>(&d_arrayA), arraySize*sizeof(double), captureStream));
HIP_CHECK(hipMallocAsync(reinterpret_cast<void**>(&d_arrayB), arraySize*sizeof(int), captureStream));
// Assign host function to the stream
// Needs a custom struct to pass the arguments
set_vector_args args{h_array, initValue};
HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));
HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));
kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);
HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));
HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));
// Stop capturing
hipGraph_t graph;
HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
// Create an executable graph from the captured graph
hipGraphExec_t graphExec;
HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
// The graph template can be deleted after the instantiation if it's not needed for later use
HIP_CHECK(hipGraphDestroy(graph));
// Actually launch the graph. The stream does not have
// to be the same as the one used for capturing.
HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
HIP_CHECK(hipStreamSynchronize(captureStream));
// Verify results
constexpr double expected = initValue * 2.0 + 3;
bool passed = true;
for(std::size_t i = 0; i < arraySize; ++i)
{
if(h_array[i] != expected)
{
passed = false;
std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
break;
}
}
if(passed)
{
std::cout << "Validation passed." << std::endl;
}
// Free graph and stream resources after usage
HIP_CHECK(hipGraphExecDestroy(graphExec));
HIP_CHECK(hipStreamDestroy(captureStream));
return EXIT_SUCCESS;
}
// [sphinx-end]
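`hipLaunchHostFunc` passes the callback a single `void*`, which is why the example bundles the vector reference and the fill value into `set_vector_args`. The pattern can be tried without a GPU. In the sketch below, `run_callback` and `fill_via_callback` are hypothetical stand-ins for the runtime invoking the function; unlike the runtime, they call it immediately instead of in stream order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same argument bundle as in the example: a reference to the host vector
// plus the value to fill it with.
struct set_vector_args
{
    std::vector<double>& h_array;
    double value;
};

// Callback with the void(void*) shape that host-function launches expect.
void set_vector(void* args)
{
    assert(args != nullptr);
    auto* h_args = static_cast<set_vector_args*>(args);
    h_args->h_array.assign(h_args->h_array.size(), h_args->value);
}

// Hypothetical stand-in for the runtime: invokes the callback right away.
using host_fn = void (*)(void*);
void run_callback(host_fn fn, void* user_data)
{
    fn(user_data);
}

// Fills a vector of n elements through the type-erased callback.
std::vector<double> fill_via_callback(std::size_t n, double value)
{
    std::vector<double> data(n);
    set_vector_args args{data, value};
    run_callback(set_vector, &args);
    return data;
}
```

Because only a pointer is stored, the argument object has to stay alive until the callback has run. In the example, `args` lives in `main` until after the stream is synchronized.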
+226
@@ -0,0 +1,226 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
__global__ void kernelA(double* arrayA, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] *= 2.0;
}
}
__global__ void kernelB(int* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayB[x] = 3;
}
}
__global__ void kernelC(double* arrayA, const int* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] += arrayB[x];
}
}
struct set_vector_args
{
std::vector<double>& h_array;
double value;
};
void set_vector(void* args)
{
set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
std::vector<double>& vec{h_args.h_array};
vec.assign(vec.size(), h_args.value);
}
int main()
{
constexpr int numOfBlocks = 1024;
constexpr int threadsPerBlock = 1024;
std::size_t arraySize = 1U << 20;
// The pointers to the device memory don't need to be declared here,
// they are contained within the hipMemAllocNodeParams as the dptr member
std::vector<double> h_array(arraySize);
constexpr double initValue = 2.0;
// Create an empty graph
hipGraph_t graph;
HIP_CHECK(hipGraphCreate(&graph, 0));
// Parameters to allocate arrays
hipMemAllocNodeParams allocArrayAParams{};
allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
allocArrayAParams.poolProps.location.id = 0; // GPU on which memory resides
allocArrayAParams.bytesize = arraySize * sizeof(double);
hipMemAllocNodeParams allocArrayBParams{};
allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
allocArrayBParams.poolProps.location.id = 0; // GPU on which memory resides
allocArrayBParams.bytesize = arraySize * sizeof(int);
// Add the allocation nodes to the graph. They don't have any dependencies
hipGraphNode_t allocNodeA, allocNodeB;
HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));
// Parameters for the host function
// Needs custom struct to pass the arguments
set_vector_args args{h_array, initValue};
hipHostNodeParams hostParams{};
hostParams.fn = set_vector;
hostParams.userData = static_cast<void*>(&args);
// Add the host node that initializes the host array. It also doesn't have any dependencies
hipGraphNode_t hostNode;
HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));
// Add a memory copy node that copies the initialized host array to the device.
// It has to wait for the host array to be initialized and the device memory to be allocated
hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
hipGraphNode_t cpyToDevNode;
HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));
// Parameters for kernelA
hipKernelNodeParams kernelAParams{};
void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
kernelAParams.func = reinterpret_cast<void*>(kernelA);
kernelAParams.gridDim = numOfBlocks;
kernelAParams.blockDim = threadsPerBlock;
kernelAParams.sharedMemBytes = 0;
kernelAParams.kernelParams = kernelAArgs;
kernelAParams.extra = nullptr;
// Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
hipGraphNode_t kernelANode;
HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));
// Parameters for kernelB
hipKernelNodeParams kernelBParams{};
void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
kernelBParams.func = reinterpret_cast<void*>(kernelB);
kernelBParams.gridDim = numOfBlocks;
kernelBParams.blockDim = threadsPerBlock;
kernelBParams.sharedMemBytes = 0;
kernelBParams.kernelParams = kernelBArgs;
kernelBParams.extra = nullptr;
// Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
hipGraphNode_t kernelBNode;
HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));
// Parameters for kernelC
hipKernelNodeParams kernelCParams{};
void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
kernelCParams.func = reinterpret_cast<void*>(kernelC);
kernelCParams.gridDim = numOfBlocks;
kernelCParams.blockDim = threadsPerBlock;
kernelCParams.sharedMemBytes = 0;
kernelCParams.kernelParams = kernelCArgs;
kernelCParams.extra = nullptr;
// Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
hipGraphNode_t kernelCNode;
hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));
// Copy the results back to the host. Has to wait for kernelC to finish.
hipGraphNode_t cpyToHostNode;
HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost));
// Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it.
hipGraphNode_t freeNodeA;
HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr));
// Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host.
hipGraphNode_t freeNodeB;
HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr));
// Instantiate the graph in order to execute it
hipGraphExec_t graphExec;
HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
// The graph can be freed after the instantiation if it's not needed for other purposes
HIP_CHECK(hipGraphDestroy(graph));
// Actually launch the graph
hipStream_t graphStream;
HIP_CHECK(hipStreamCreate(&graphStream));
HIP_CHECK(hipGraphLaunch(graphExec, graphStream));
HIP_CHECK(hipStreamSynchronize(graphStream));
// Verify results
constexpr double expected = initValue * 2.0 + 3;
bool passed = true;
for(std::size_t i = 0; i < arraySize; ++i)
{
if(h_array[i] != expected)
{
passed = false;
std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
break;
}
}
if(passed)
{
std::cout << "Validation passed." << std::endl;
}
HIP_CHECK(hipGraphExecDestroy(graphExec));
HIP_CHECK(hipStreamDestroy(graphStream));
return EXIT_SUCCESS;
}
// [sphinx-end]
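The explicit graph above encodes its ordering purely through the dependency lists passed to each `hipGraphAdd*Node` call. As a rough host-only model of what those lists imply, the sketch below rebuilds the same ten nodes and nine edges and derives one valid execution order with Kahn's algorithm. `node_graph`, `make_example_graph`, and `topological_order` are hypothetical names for this illustration. The sketch says nothing about how the HIP runtime actually schedules nodes, which may run independent nodes concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct node_graph
{
    std::vector<std::string> names;
    // Each edge is {prerequisite, dependent}.
    std::vector<std::pair<std::size_t, std::size_t>> edges;
};

// Dependencies as declared in the explicit graph example.
node_graph make_example_graph()
{
    constexpr std::size_t allocA = 0, allocB = 1, host = 2, cpyToDev = 3, kernelA = 4,
                          kernelB = 5, kernelC = 6, cpyToHost = 7, freeA = 8, freeB = 9;
    node_graph g;
    g.names = {"allocA", "allocB", "host", "cpyToDev", "kernelA",
               "kernelB", "kernelC", "cpyToHost", "freeA", "freeB"};
    g.edges = {{allocA, cpyToDev}, {host, cpyToDev}, {cpyToDev, kernelA},
               {allocB, kernelB}, {kernelA, kernelC}, {kernelB, kernelC},
               {kernelC, cpyToHost}, {cpyToHost, freeA}, {kernelC, freeB}};
    return g;
}

// Kahn's algorithm: repeatedly emit nodes whose prerequisites are all done.
// Returns fewer than names.size() entries if the graph contains a cycle.
std::vector<std::size_t> topological_order(const node_graph& g)
{
    const std::size_t n = g.names.size();
    std::vector<std::size_t> pending(n, 0);
    std::vector<std::vector<std::size_t>> dependents(n);
    for (const auto& e : g.edges)
    {
        assert(e.first < n && e.second < n);
        dependents[e.first].push_back(e.second);
        ++pending[e.second];
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty())
    {
        const std::size_t node = ready.front();
        ready.pop();
        order.push_back(node);
        for (const std::size_t d : dependents[node])
            if (--pending[d] == 0) ready.push(d);
    }
    return order;
}
```

Printing `g.names[i]` for each index in the returned order gives one admissible sequence; any order in which every prerequisite precedes its dependents is equally valid.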
+59
@@ -0,0 +1,59 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cout << "HIP Error: " << hipGetErrorString(err) \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
int main()
{
int deviceCount;
HIP_CHECK(hipGetDeviceCount(&deviceCount));
int device = 0; // Query first available GPU. Can be replaced with any
// integer up to, not including, deviceCount
hipDeviceProp_t deviceProp;
HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));
std::cout << "The queried device ";
if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
std::cout << "supports";
else
std::cout << "does not support";
std::cout << " shared int32 atomic operations" << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
+59
@@ -0,0 +1,59 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cout << "HIP Error: " << hipGetErrorString(err) \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
int main()
{
int deviceCount;
HIP_CHECK(hipGetDeviceCount(&deviceCount));
int device = 0; // Query first available GPU. Can be replaced with any
// integer up to, not including, deviceCount
hipDeviceProp_t deviceProp;
HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));
std::cout << "The queried device ";
if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
std::cout << "supports";
else
std::cout << "does not support";
std::cout << " shared int32 atomic operations" << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
+48
@@ -0,0 +1,48 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstdlib>
int main()
{
// [sphinx-amd-start]
#ifdef __HIP_PLATFORM_AMD__
// This code path is compiled when amdclang++ is used for compilation
#endif
// [sphinx-amd-end]
// [sphinx-nvidia-start]
#ifdef __HIP_PLATFORM_NVIDIA__
// This code path is compiled when nvcc is used for compilation
// Could be compiling with CUDA language extensions enabled (for example, a .cu file)
// Could be in pass-through mode to an underlying host compiler (for example, a .cpp file)
#endif
// [sphinx-nvidia-end]
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_PLATFORM_NVIDIA__)
# error "No compatible HIP platform defined!"
#endif
return EXIT_SUCCESS;
}
+52
@@ -0,0 +1,52 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
__host__ __device__ void call_func()
{
#ifdef __HIP_DEVICE_COMPILE__
printf("device\n");
#else
std::cout << "host" << std::endl;
#endif
}
__global__ void test_kernel()
{
call_func();
}
int main()
{
test_kernel<<<1, 1, 0, 0>>>();
if(auto err = hipDeviceSynchronize(); err != hipSuccess)
std::cerr << "HIP error " << err << ": " << hipGetErrorString(err) << std::endl;
call_func();
return EXIT_SUCCESS;
}
// [sphinx-end]
+75
@@ -0,0 +1,75 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "example_utils.hpp"
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
// [sphinx-kernel-start]
__global__ void kernel_memory_allocation()
{
// The pointer is stored in shared memory, so that all
// threads of the block can access the pointer
__shared__ int *memory;
std::size_t blockSize = blockDim.x;
constexpr std::size_t elementsPerThread = 1024;
if(threadIdx.x == 0)
{
// allocate memory in one contiguous block
memory = new int[blockDim.x * elementsPerThread];
}
__syncthreads();
// load pointer into thread-local variable to avoid
// unnecessary accesses to shared memory
int *localPtr = memory;
// work with allocated memory, e.g. initialization
for(int i = 0; i < elementsPerThread; ++i)
{
// access in a contiguous way
localPtr[i * blockSize + threadIdx.x] = i;
}
// synchronize to make sure no thread is accessing the memory before freeing
__syncthreads();
if(threadIdx.x == 0)
{
delete[] memory;
}
}
// [sphinx-kernel-end]
int main()
{
kernel_memory_allocation<<<64, 1024>>>();
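// Check for launch errors, then wait for the kernel to finish so that
// errors raised during execution are reported as well.
HIP_CHECK(hipGetLastError());
HIP_CHECK(hipDeviceSynchronize());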
std::cout << "Success!" << std::endl;
return EXIT_SUCCESS;
}
+91
@@ -0,0 +1,91 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " << hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Performs a simple initialization of an array with the thread's index variables.
// This function is only available in device code.
__device__ void init_array(float * const a, const unsigned int arraySize)
{
// globalIdx uniquely identifies a thread in a 1D launch configuration.
const int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
// Each thread initializes a single element of the array.
if(globalIdx < arraySize)
{
a[globalIdx] = globalIdx;
}
}
// Returns how many multiples are needed to cover number (ceiling division).
// This function is available in host and device code.
__host__ __device__ constexpr int round_up_to_nearest_multiple(int number, int multiple)
{
return (number + multiple - 1)/multiple;
}
__global__
__launch_bounds__(512, 4) // At most 512 threads per block; the compiler should allow at least 4 warps per execution unit.
void example_kernel(float * const a, const unsigned int N)
{
// Initialize array.
init_array(a, N);
// Perform additional work:
// - work with the array
// - use the array in a different kernel
// - ...
}
int main()
{
constexpr int N = 100000000; // problem size
constexpr int blockSize = 256; // configurable block size
// number of blocks needed for the given problem size
constexpr int gridSize = round_up_to_nearest_multiple(N, blockSize);
float *a;
// allocate memory on the GPU
HIP_CHECK(hipMalloc(&a, sizeof(*a) * N));
std::cout << "Launching kernel." << std::endl;
example_kernel<<<dim3(gridSize), dim3(blockSize), 0/*example doesn't use shared memory*/, 0/*default stream*/>>>(a, N);
// Make sure kernel execution has finished by synchronizing. Before this call,
// the CPU is free to execute other instructions while the kernel runs.
HIP_CHECK(hipDeviceSynchronize());
std::cout << "Kernel execution finished." << std::endl;
HIP_CHECK(hipFree(a));
}
// [sphinx-end]
+200
@@ -0,0 +1,200 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#define CHECK_RET_CODE(call, ret_code) \
{ \
if ((call) != ret_code) \
{ \
std::cout << "Failed in call: " << #call << std::endl; \
std::abort(); \
} \
}
#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
// source code for hiprtc
static constexpr auto kernel_source{
R"(
extern "C"
__global__ void vector_add(float* output, float* input1, float* input2, size_t size)
{
int i = threadIdx.x;
if (i < size)
{
output[i] = input1[i] + input2[i];
}
}
)"};
int main()
{
hiprtcProgram prog;
auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
kernel_source, // kernel source string
"vector_add.cpp", // Name of the file
0, // Number of headers
nullptr, // Header sources
nullptr); // Name of header file
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to create program" << std::endl;
std::abort();
}
// [sphinx-options-start]
auto sarg = std::string{"-fgpu-rdc"};
const char* compile_options[] = {sarg.c_str()};
rtc_ret_code = hiprtcCompileProgram(prog, // hiprtcProgram
1, // Number of options
compile_options);
// [sphinx-options-end]
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to compile program" << std::endl;
std::abort();
}
std::size_t logSize;
HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
if (logSize)
{
std::string log(logSize, '\0');
HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
std::abort();
}
// [sphinx-bitcode-start]
std::size_t bitCodeSize;
HIPRTC_CHECK(hiprtcGetBitcodeSize(prog, &bitCodeSize));
std::vector<char> kernel_bitcode(bitCodeSize);
HIPRTC_CHECK(hiprtcGetBitcode(prog, kernel_bitcode.data()));
// [sphinx-bitcode-end]
HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
auto num_options = 0u;
hiprtcJIT_option* options = nullptr;
void* option_vals[] = {nullptr};
auto rtc_link_state = hiprtcLinkState{};
// [sphinx-link-create-start]
HIPRTC_CHECK(hiprtcLinkCreate(num_options, // number of options
options, // Array of options
option_vals, // Array of option values cast to void*
&rtc_link_state)); // HIPRTC link state created upon success
// [sphinx-link-create-end]
auto input_type = HIPRTC_JIT_INPUT_LLVM_BITCODE;
auto bit_code_ptr = kernel_bitcode.data();
auto bit_code_size = bitCodeSize;
// [sphinx-link-add-start]
HIPRTC_CHECK(hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
input_type, // type of the input data or bitcode
bit_code_ptr, // input data which is null terminated
bit_code_size, // size of the input data
"a", // optional name for this input
0, // size of the options
nullptr, // Array of options applied to this input
nullptr)); // Array of option values cast to void*
// [sphinx-link-add-end]
void* binary = nullptr;
auto binarySize = std::size_t{};
// [sphinx-link-complete-start]
HIPRTC_CHECK(hiprtcLinkComplete(rtc_link_state, // HIPRTC link state
&binary, // upon success, points to the output binary
&binarySize)); // size of the binary is stored (optional)
// [sphinx-link-complete-end]
hipModule_t module;
hipFunction_t kernel;
HIP_CHECK(hipModuleLoadData(&module, binary));
HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
HIPRTC_CHECK(hiprtcLinkDestroy(rtc_link_state));
constexpr std::size_t ele_size = 256; // total number of items to add
std::vector<float> hinput, output;
hinput.reserve(ele_size);
output.reserve(ele_size);
for (std::size_t i = 0; i < ele_size; i++)
{
hinput.push_back(static_cast<float>(i + 1));
output.push_back(0.0f);
}
float *dinput1, *dinput2, *doutput;
HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
struct
{
float* output;
float* input1;
float* input2;
std::size_t size;
} args{doutput, dinput1, dinput2, ele_size};
auto size = sizeof(args);
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
HIP_LAUNCH_PARAM_END};
HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
for (std::size_t i = 0; i < ele_size; i++)
{
if ((hinput[i] + hinput[i]) != output[i])
{
std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
std::abort();
}
}
std::cout << "Passed" << std::endl;
HIP_CHECK(hipFree(dinput1));
HIP_CHECK(hipFree(dinput2));
HIP_CHECK(hipFree(doutput));
return EXIT_SUCCESS;
}
// [sphinx-stop]
+219
@@ -0,0 +1,219 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <ios>
#include <iostream>
#include <string>
#include <vector>
#if __has_include(<filesystem>)
#include <filesystem>
namespace fs = std::filesystem;
#elif __has_include(<experimental/filesystem>)
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
#else
static_assert(false, "filesystem not available");
#endif
#define CHECK_RET_CODE(call, ret_code) \
{ \
if ((call) != ret_code) \
{ \
std::cout << "Failed in call: " << #call << std::endl; \
std::abort(); \
} \
}
#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
// source code for hiprtc
static constexpr auto kernel_source{
R"(
extern "C"
__global__ void vector_add(float* output, float* input1, float* input2, size_t size)
{
int i = threadIdx.x;
if (i < size)
{
output[i] = input1[i] + input2[i];
}
}
)"};
int main()
{
hiprtcProgram prog;
auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
kernel_source, // kernel source string
"vector_add.cpp", // Name of the file
0, // Number of headers
nullptr, // Header sources
nullptr); // Name of header file
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to create program" << std::endl;
std::abort();
}
// [sphinx-options-start]
auto sarg = std::string{"-fgpu-rdc"};
const char* compile_options[] = {sarg.c_str()};
rtc_ret_code = hiprtcCompileProgram(prog, // hiprtcProgram
1, // Number of options
compile_options);
// [sphinx-options-end]
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to compile program" << std::endl;
std::abort();
}
std::size_t logSize;
HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
if (logSize)
{
std::string log(logSize, '\0');
HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
std::abort();
}
// [sphinx-bitcode-start]
std::size_t bitCodeSize;
HIPRTC_CHECK(hiprtcGetBitcodeSize(prog, &bitCodeSize));
std::vector<char> kernel_bitcode(bitCodeSize);
HIPRTC_CHECK(hiprtcGetBitcode(prog, kernel_bitcode.data()));
// [sphinx-bitcode-end]
HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
auto num_options = 0u;
hiprtcJIT_option* options = nullptr;
void* option_vals[] = {nullptr};
auto rtc_link_state = hiprtcLinkState{};
// [sphinx-link-create-start]
HIPRTC_CHECK(hiprtcLinkCreate(num_options, // number of options
options, // Array of options
option_vals, // Array of option values cast to void*
&rtc_link_state)); // HIPRTC link state created upon success
// [sphinx-link-create-end]
auto input_type = HIPRTC_JIT_INPUT_LLVM_BITCODE;
auto bc_file_path = std::string{"bitcode.bc"};
auto bc_file = std::fstream{bc_file_path.c_str(), std::ios::binary | std::ios::out};
if(!bc_file.is_open())
{
std::cerr << "Could not open bitcode file for writing!" << std::endl;
std::abort();
}
bc_file.write(kernel_bitcode.data(), bitCodeSize);
bc_file.close();
// [sphinx-link-add-start]
HIPRTC_CHECK(hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
input_type, // type of the input data or bitcode
bc_file_path.c_str(), // path to the input file (null-terminated string)
0, // size of the options
nullptr, // Array of options applied to this input
nullptr)); // Array of option values cast to void*
// [sphinx-link-add-end]
fs::remove(bc_file_path);
void* binary = nullptr;
auto binarySize = std::size_t{};
// [sphinx-link-complete-start]
HIPRTC_CHECK(hiprtcLinkComplete(rtc_link_state, // HIPRTC link state
&binary, // upon success, points to the output binary
&binarySize)); // size of the binary is stored (optional)
// [sphinx-link-complete-end]
hipModule_t module;
hipFunction_t kernel;
HIP_CHECK(hipModuleLoadData(&module, binary));
HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
HIPRTC_CHECK(hiprtcLinkDestroy(rtc_link_state));
constexpr std::size_t ele_size = 256; // total number of items to add
std::vector<float> hinput, output;
hinput.reserve(ele_size);
output.reserve(ele_size);
for (std::size_t i = 0; i < ele_size; i++)
{
hinput.push_back(static_cast<float>(i + 1));
output.push_back(0.0f);
}
float *dinput1, *dinput2, *doutput;
HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
struct
{
float* output;
float* input1;
float* input2;
std::size_t size;
} args{doutput, dinput1, dinput2, ele_size};
auto size = sizeof(args);
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
HIP_LAUNCH_PARAM_END};
HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
for (std::size_t i = 0; i < ele_size; i++)
{
if ((hinput[i] + hinput[i]) != output[i])
{
std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
std::abort();
}
}
std::cout << "Passed" << std::endl;
HIP_CHECK(hipFree(dinput1));
HIP_CHECK(hipFree(dinput2));
HIP_CHECK(hipFree(doutput));
return EXIT_SUCCESS;
}
// [sphinx-stop]
+200
@@ -0,0 +1,200 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#define CHECK_RET_CODE(call, ret_code) \
{ \
if ((call) != ret_code) \
{ \
std::cout << "Failed in call: " << #call << std::endl; \
std::abort(); \
} \
}
#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
// source code for hiprtc
static constexpr auto kernel_source{
R"(
extern "C"
__global__ void vector_add(float* output, float* input1, float* input2, size_t size)
{
int i = threadIdx.x;
if (i < size)
{
output[i] = input1[i] + input2[i];
}
}
)"};
int main()
{
hiprtcProgram prog;
auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
kernel_source, // kernel source string
"vector_add.cpp", // Name of the file
0, // Number of headers
nullptr, // Header sources
nullptr); // Name of header file
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to create program" << std::endl;
std::abort();
}
// [sphinx-options-start]
auto sarg = std::string{"-fgpu-rdc"};
const char* compile_options[] = {sarg.c_str()};
rtc_ret_code = hiprtcCompileProgram(prog, // hiprtcProgram
1, // Number of options
compile_options);
// [sphinx-options-end]
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to compile program" << std::endl;
std::abort();
}
std::size_t logSize;
HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
if (logSize)
{
std::string log(logSize, '\0');
HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
std::abort();
}
// [sphinx-bitcode-start]
std::size_t bitCodeSize;
HIPRTC_CHECK(hiprtcGetBitcodeSize(prog, &bitCodeSize));
std::vector<char> kernel_bitcode(bitCodeSize);
HIPRTC_CHECK(hiprtcGetBitcode(prog, kernel_bitcode.data()));
// [sphinx-bitcode-end]
HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
// [sphinx-link-create-start]
const char* isaopts[] = {"-mllvm", "-inline-threshold=1", "-mllvm", "-inlinehint-threshold=1"};
std::vector<hiprtcJIT_option> jit_options = {HIPRTC_JIT_IR_TO_ISA_OPT_EXT,
HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT};
std::size_t isaoptssize = 4;
void* lopts[] = {reinterpret_cast<void*>(isaopts),
reinterpret_cast<void*>(isaoptssize)};
hiprtcLinkState linkstate;
HIPRTC_CHECK(hiprtcLinkCreate(2u, jit_options.data(), reinterpret_cast<void**>(lopts), &linkstate));
// [sphinx-link-create-end]
auto input_type = HIPRTC_JIT_INPUT_LLVM_BITCODE;
auto bit_code_ptr = kernel_bitcode.data();
auto bit_code_size = bitCodeSize;
// [sphinx-link-add-start]
HIPRTC_CHECK(hiprtcLinkAddData(linkstate, // HIPRTC link state
input_type, // type of the input data or bitcode
bit_code_ptr, // input data which is null terminated
bit_code_size, // size of the input data
"a", // optional name for this input
0, // size of the options
nullptr, // Array of options applied to this input
nullptr)); // Array of option values cast to void*
// [sphinx-link-add-end]
void* binary = nullptr;
auto binarySize = std::size_t{};
// [sphinx-link-complete-start]
HIPRTC_CHECK(hiprtcLinkComplete(linkstate, // HIPRTC link state
&binary, // upon success, points to the output binary
&binarySize)); // size of the binary is stored (optional)
// [sphinx-link-complete-end]
hipModule_t module;
hipFunction_t kernel;
HIP_CHECK(hipModuleLoadData(&module, binary));
HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
HIPRTC_CHECK(hiprtcLinkDestroy(linkstate));
constexpr std::size_t ele_size = 256; // total number of items to add
std::vector<float> hinput, output;
hinput.reserve(ele_size);
output.reserve(ele_size);
for (std::size_t i = 0; i < ele_size; i++)
{
hinput.push_back(static_cast<float>(i + 1));
output.push_back(0.0f);
}
float *dinput1, *dinput2, *doutput;
HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
struct
{
float* output;
float* input1;
float* input2;
std::size_t size;
} args{doutput, dinput1, dinput2, ele_size};
auto size = sizeof(args);
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
HIP_LAUNCH_PARAM_END};
HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
for (std::size_t i = 0; i < ele_size; i++)
{
if ((hinput[i] + hinput[i]) != output[i])
{
std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i] << std::endl;
std::abort();
}
}
std::cout << "Passed" << std::endl;
HIP_CHECK(hipFree(dinput1));
HIP_CHECK(hipFree(dinput2));
HIP_CHECK(hipFree(doutput));
return EXIT_SUCCESS;
}
// [sphinx-stop]
+107
@@ -0,0 +1,107 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cout << "HIP Error: " << hipGetErrorString(err) \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
int main()
{
std::size_t elements = 64*1024;
std::size_t size_bytes = elements * sizeof(float);
std::vector<float> A(elements), B(elements);
// On NVIDIA platforms the driver runtime needs to be initialized
#ifdef __HIP_PLATFORM_NVIDIA__
HIP_CHECK(hipInit(0));
hipDevice_t device;
hipCtx_t context;
HIP_CHECK(hipDeviceGet(&device, 0));
HIP_CHECK(hipCtxCreate(&context, 0, device));
#endif
// Allocate device memory
hipDeviceptr_t d_A, d_B;
HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_A), size_bytes));
HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_B), size_bytes));
// Copy data to device
HIP_CHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
HIP_CHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
// Load module
hipModule_t Module;
// For AMD the module file has to contain architecture specific object code
// For NVIDIA the module file has to contain PTX, found in e.g. "vcpy_isa.ptx"
#ifdef __HIP_PLATFORM_AMD__
HIP_CHECK(hipModuleLoad(&Module, "vcpy_isa.hsaco"));
#elif defined(__HIP_PLATFORM_NVIDIA__)
HIP_CHECK(hipModuleLoad(&Module, "vcpy_isa.ptx"));
#endif
// Get kernel function from the module via its name
hipFunction_t Function;
HIP_CHECK(hipModuleGetFunction(&Function, Module, "hello_world"));
// Create buffer for kernel arguments
std::vector<void*> argBuffer{reinterpret_cast<void*>(d_A), reinterpret_cast<void*>(d_B)};
std::size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
// Create configuration passed to the kernel as arguments
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes,
HIP_LAUNCH_PARAM_END};
int threads_per_block = 128;
int blocks = (elements + threads_per_block - 1) / threads_per_block;
// Actually launch kernel
HIP_CHECK(hipModuleLaunchKernel(Function, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
// Copy the results back to the host; the size is given in bytes
HIP_CHECK(hipMemcpyDtoH(A.data(), d_A, size_bytes));
HIP_CHECK(hipMemcpyDtoH(B.data(), d_B, size_bytes));
HIP_CHECK(hipFree(reinterpret_cast<void*>(d_A)));
HIP_CHECK(hipFree(reinterpret_cast<void*>(d_B)));
#ifdef __HIP_PLATFORM_NVIDIA__
HIP_CHECK(hipCtxDestroy(context));
#endif
return EXIT_SUCCESS;
}
// [sphinx-end]
+145
@@ -0,0 +1,145 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cout << "HIP Error: " << hipGetErrorString(err) \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
void* populate_data_pointer()
{
#ifdef __HIP_PLATFORM_AMD__
auto filename = std::string{"myKernel.hsaco"};
#elif defined(__HIP_PLATFORM_NVIDIA__)
auto filename = std::string{"myKernel.ptx"};
#endif
std::fstream file{filename, std::ios::in | std::ios::binary | std::ios::ate};
if(!file.is_open())
{
std::cerr << "Error opening file " << filename << std::endl;
std::exit(EXIT_FAILURE);
}
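const auto filesize = static_cast<std::size_t>(file.tellg());
// One extra zero-initialized byte keeps textual images such as PTX null-terminated
auto storage = new char[filesize + 1]{};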
file.seekg(0, std::ios::beg);
file.read(storage, filesize);
return storage;
}
int main()
{
std::size_t elements = 64*1024;
std::size_t size_bytes = elements * sizeof(float);
std::vector<float> A(elements), B(elements);
// On NVIDIA platforms the driver runtime needs to be initialized
#ifdef __HIP_PLATFORM_NVIDIA__
HIP_CHECK(hipInit(0));
hipDevice_t device;
hipCtx_t context;
HIP_CHECK(hipDeviceGet(&device, 0));
HIP_CHECK(hipCtxCreate(&context, 0, device));
#endif
// Allocate device memory
hipDeviceptr_t d_A, d_B;
HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_A), size_bytes));
HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_B), size_bytes));
// Copy data to device
HIP_CHECK(hipMemcpyHtoD(d_A, A.data(), size_bytes));
HIP_CHECK(hipMemcpyHtoD(d_B, B.data(), size_bytes));
// Load module
// For AMD the module file has to contain architecture specific object code
// For NVIDIA the module file has to contain PTX, found in e.g. "myKernel.ptx"
// [sphinx-start]
hipModule_t module;
void* imagePtr = populate_data_pointer();
const int numOptions = 1;
hipJitOption options[numOptions];
void *optionValues[numOptions];
options[0] = hipJitOptionMaxRegisters;
unsigned maxRegs = 15;
optionValues[0] = static_cast<void*>(&maxRegs);
// On the HIP-Clang (AMD) path this call forwards to hipModuleLoadData(module, imagePtr) and the JIT options are ignored.
// On the NVCC (NVIDIA) path it forwards to cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues).
HIP_CHECK(hipModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues));
// Get kernel function from the module via its name
hipFunction_t k;
HIP_CHECK(hipModuleGetFunction(&k, module, "myKernel"));
// [sphinx-end]
// Create buffer for kernel arguments
std::vector<void*> argBuffer{reinterpret_cast<void*>(d_A), reinterpret_cast<void*>(d_B)};
std::size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
// Create configuration passed to the kernel as arguments
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes,
HIP_LAUNCH_PARAM_END};
int threads_per_block = 128;
int blocks = (elements + threads_per_block - 1) / threads_per_block;
// Actually launch kernel
HIP_CHECK(hipModuleLaunchKernel(k, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
// Copy the results back to the host; the size is given in bytes
HIP_CHECK(hipMemcpyDtoH(A.data(), d_A, size_bytes));
HIP_CHECK(hipMemcpyDtoH(B.data(), d_B, size_bytes));
HIP_CHECK(hipFree(reinterpret_cast<void*>(d_A)));
HIP_CHECK(hipFree(reinterpret_cast<void*>(d_B)));
#ifdef __HIP_PLATFORM_NVIDIA__
HIP_CHECK(hipCtxDestroy(context));
#endif
delete[] static_cast<char*>(imagePtr);
return EXIT_SUCCESS;
}
+134
@@ -0,0 +1,134 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <cuda.h>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#define CUDA_CHECK(expression) \
{ \
const CUresult err = expression; \
if (err != CUDA_SUCCESS) \
{ \
const char* err_str{nullptr}; \
cuGetErrorString(err, &err_str); \
std::cerr << "CUDA Error: " << err_str \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
void* populate_data_pointer()
{
auto filename = std::string{"myKernel.ptx"};
std::fstream file{filename, std::ios::in | std::ios::binary | std::ios::ate};
if(!file.is_open())
{
std::cerr << "Error opening file " << filename << std::endl;
std::exit(EXIT_FAILURE);
}
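// PTX is handed to the driver as text, so append a terminating NUL character
const auto filesize = static_cast<std::size_t>(file.tellg());
auto storage = new char[filesize + 1];
file.seekg(0, std::ios::beg);
file.read(storage, static_cast<std::streamsize>(filesize));
storage[filesize] = '\0';
return storage;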
}
int main()
{
std::size_t elements = 64*1024;
std::size_t size_bytes = elements * sizeof(float);
std::vector<float> A(elements), B(elements);
// On NVIDIA platforms the driver API needs to be initialized
CUDA_CHECK(cuInit(0));
CUdevice device;
CUcontext context;
CUDA_CHECK(cuDeviceGet(&device, 0));
CUDA_CHECK(cuCtxCreate(&context, 0, device));
// Allocate device memory
CUdeviceptr d_A, d_B;
CUDA_CHECK(cuMemAlloc(&d_A, size_bytes));
CUDA_CHECK(cuMemAlloc(&d_B, size_bytes));
// Copy data to device
CUDA_CHECK(cuMemcpyHtoD(d_A, A.data(), size_bytes));
CUDA_CHECK(cuMemcpyHtoD(d_B, B.data(), size_bytes));
// Load module
// For NVIDIA the module file has to contain PTX, found in e.g. "myKernel.ptx"
// [sphinx-start]
CUmodule module;
void* imagePtr = populate_data_pointer();
const int numOptions = 1;
CUjit_option options[numOptions];
void *optionValues[numOptions];
options[0] = CU_JIT_MAX_REGISTERS;
unsigned maxRegs = 15;
optionValues[0] = (void *)(&maxRegs);
CUDA_CHECK(cuModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues));
// Get kernel function from the module via its name
CUfunction k;
CUDA_CHECK(cuModuleGetFunction(&k, module, "myKernel"));
// [sphinx-end]
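// Create buffer for kernel arguments; it holds the device pointer values themselves,
// not the addresses of the host variables that store them
std::vector<void*> argBuffer{reinterpret_cast<void*>(d_A), reinterpret_cast<void*>(d_B)};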
std::size_t arg_size_bytes = argBuffer.size() * sizeof(void*);
// Create configuration passed to the kernel as arguments
void* config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, argBuffer.data(),
CU_LAUNCH_PARAM_BUFFER_SIZE, &arg_size_bytes, CU_LAUNCH_PARAM_END};
int threads_per_block = 128;
int blocks = (elements + threads_per_block - 1) / threads_per_block;
// Actually launch kernel
CUDA_CHECK(cuLaunchKernel(k, blocks, 1, 1, threads_per_block, 1, 1, 0, 0, NULL, config));
// Copy sizes are given in bytes, not in elements
CUDA_CHECK(cuMemcpyDtoH(A.data(), d_A, size_bytes));
CUDA_CHECK(cuMemcpyDtoH(B.data(), d_B, size_bytes));
CUDA_CHECK(cuMemFree(d_A));
CUDA_CHECK(cuMemFree(d_B));
CUDA_CHECK(cuCtxDestroy(context));
delete[] static_cast<char*>(imagePtr);
return EXIT_SUCCESS;
}
@@ -0,0 +1,111 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_fp16.h>
#include <hip/hip_runtime.h>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>
#define hip_check(hip_call) \
{ \
auto hip_res = hip_call; \
if (hip_res != hipSuccess) { \
std::cerr << "Failed in HIP call: " << #hip_call \
<< " at " << __FILE__ << ":" << __LINE__ \
<< " with error: " << hipGetErrorString(hip_res) << std::endl; \
std::abort(); \
} \
}
__global__ void add_half_precision(__half* in1, __half* in2, float* out, size_t size)
{
int idx = threadIdx.x;
if (idx < size)
{
// Load as half, add in half precision, then convert the sum to float for storage
float sum = __half2float(in1[idx] + in2[idx]);
out[idx] = sum;
}
}
int main()
{
constexpr size_t size = 32;
constexpr float tolerance = 1e-1f; // Allowable numerical difference
// Initialize input vectors as floats
std::vector<float> in1(size), in2(size);
for (size_t i = 0; i < size; i++) {
in1[i] = i + 1.1f;
in2[i] = i + 2.2f;
}
// Compute expected results in full precision on CPU
std::vector<float> cpu_out(size);
for (size_t i = 0; i < size; i++) {
cpu_out[i] = in1[i] + in2[i]; // Direct float addition
}
// Allocate device memory (store input as half, output as float)
__half *d_in1, *d_in2;
float *d_out;
hip_check(hipMalloc(&d_in1, sizeof(__half) * size));
hip_check(hipMalloc(&d_in2, sizeof(__half) * size));
hip_check(hipMalloc(&d_out, sizeof(float) * size));
// Convert input to half and copy to device
std::vector<__half> in1_half(size), in2_half(size);
for (size_t i = 0; i < size; i++)
{
in1_half[i] = __float2half(in1[i]);
in2_half[i] = __float2half(in2[i]);
}
hip_check(hipMemcpy(d_in1, in1_half.data(), sizeof(__half) * size, hipMemcpyHostToDevice));
hip_check(hipMemcpy(d_in2, in2_half.data(), sizeof(__half) * size, hipMemcpyHostToDevice));
// Launch kernel
add_half_precision<<<1, size>>>(d_in1, d_in2, d_out, size);
// Copy result back to host
std::vector<float> gpu_out(size, 0.0f);
hip_check(hipMemcpy(gpu_out.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost));
// Free device memory
hip_check(hipFree(d_in1));
hip_check(hipFree(d_in2));
hip_check(hipFree(d_out));
// Validation with tolerance
for (size_t i = 0; i < size; i++)
{
if (std::fabs(cpu_out[i] - gpu_out[i]) > tolerance)
{
std::cerr << "Mismatch at index " << i
<< ": CPU result = " << cpu_out[i]
<< ", GPU result = " << gpu_out[i] << std::endl;
std::abort();
}
}
std::cout << "Success: CPU and GPU half-precision addition match within tolerance!" << std::endl;
}
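The validation above uses a fixed absolute tolerance of `1e-1`, which is tuned to the value range of this particular input. A tolerance that scales with the magnitude of the result carries over to other ranges more easily. The sketch below is one such check. It assumes same-sign inputs in the normal range of half precision, whose 11 significant bits give a unit roundoff of 2^-11, and it budgets three roundings (two input conversions and one addition). `close_in_half` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>

// Half precision has 11 significant bits, so one rounding step introduces a
// relative error of at most 2^-11.
constexpr float half_unit_roundoff = 1.0f / 2048.0f;

// Hypothetical helper: true if the two values differ by no more than the
// error that `roundings` half-precision rounding steps could introduce.
bool close_in_half(float expected, float actual, int roundings = 3)
{
    const float scale = std::max(std::fabs(expected), std::fabs(actual));
    return std::fabs(expected - actual) <= roundings * half_unit_roundoff * scale;
}
```

In the validation loop, `!close_in_half(cpu_out[i], gpu_out[i])` could replace the comparison against the fixed `tolerance`.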
@@ -0,0 +1,130 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_fp8.h>
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#define hip_check(hip_call) \
{ \
auto hip_res = hip_call; \
if (hip_res != hipSuccess) \
{ \
std::cerr << "Failed in HIP call: " << #hip_call \
<< " at " << __FILE__ << ":" << __LINE__ \
<< " with error: " << hipGetErrorString(hip_res) << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
__device__ __hip_fp8_storage_t d_convert_float_to_fp8(float in, __hip_fp8_interpretation_t interpret, __hip_saturation_t sat)
{
return __hip_cvt_float_to_fp8(in, sat, interpret);
}
__global__ void float_to_fp8_to_float(float *in, __hip_fp8_interpretation_t interpret, __hip_saturation_t sat, float *out, size_t size)
{
int i = threadIdx.x;
if (i < size)
{
auto fp8 = d_convert_float_to_fp8(in[i], interpret, sat);
// __hip_fp8_storage_t is an 8-bit unsigned integer type, so this stores the raw
// fp8 encoding as a number; it does not decode the value back to the original float
out[i] = fp8;
}
}
__hip_fp8_storage_t convert_float_to_fp8(float in, /* Input val */
__hip_fp8_interpretation_t interpret, /* interpretation of number E4M3/E5M2 */
__hip_saturation_t sat /* Saturation behavior */
)
{
return __hip_cvt_float_to_fp8(in, sat, interpret);
}
int main()
{
constexpr size_t size = 32;
hipDeviceProp_t prop;
hip_check(hipGetDeviceProperties(&prop, 0));
bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); // gfx94x
if(!is_supported)
{
std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl;
std::cerr << "No device conversions are supported, only host conversions are supported." << std::endl;
return EXIT_SUCCESS;
}
const __hip_fp8_interpretation_t interpret = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos)
? __HIP_E4M3_FNUZ // gfx94x
: __HIP_E4M3;
constexpr __hip_saturation_t sat = __HIP_SATFINITE;
std::vector<float> in;
in.reserve(size);
for (size_t i = 0; i < size; i++)
in.push_back(i + 1.1f);
std::cout << "Converting float to fp8 and back..." << std::endl;
// CPU convert
std::vector<float> cpu_out;
cpu_out.reserve(size);
for (const auto &fval : in)
{
auto fp8 = convert_float_to_fp8(fval, interpret, sat);
// __hip_fp8_storage_t is an 8-bit unsigned integer type, so this stores the raw
// fp8 encoding as a number; it does not decode the value back to the original float
cpu_out.push_back(fp8);
}
// GPU convert
float *d_in, *d_out;
hip_check(hipMalloc(&d_in, sizeof(float) * size));
hip_check(hipMalloc(&d_out, sizeof(float) * size));
hip_check(hipMemcpy(d_in, in.data(), sizeof(float) * in.size(), hipMemcpyHostToDevice));
float_to_fp8_to_float<<<1, size>>>(d_in, interpret, sat, d_out, size);
std::vector<float> gpu_out(size, 0.0f);
hip_check(hipMemcpy(gpu_out.data(), d_out, sizeof(float) * gpu_out.size(), hipMemcpyDeviceToHost));
hip_check(hipFree(d_in));
hip_check(hipFree(d_out));
// Validation
for (size_t i = 0; i < size; i++)
{
if (cpu_out[i] != gpu_out[i])
{
std::cerr << "cpu round trip result: " << cpu_out[i]
<< " - gpu round trip result: " << gpu_out[i] << std::endl;
return EXIT_FAILURE;
}
}
std::cout << "...CPU and GPU round trip convert matches." << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
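The example compares the fp8 encodings produced on the host and on the device. To see which real number an encoding stands for, it can be decoded by hand. The following host-only sketch decodes the E4M3 FNUZ interpretation, assuming the commonly documented layout: 1 sign bit, 4 exponent bits with bias 8, 3 mantissa bits, no infinities, no negative zero, and `0x80` as the only NaN. `decode_e4m3_fnuz` is a hypothetical helper and not part of `hip_fp8.h`.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: decodes one E4M3 FNUZ byte into a float, assuming
// exponent bias 8 and 0x80 as the single NaN encoding.
float decode_e4m3_fnuz(std::uint8_t bits)
{
    if (bits == 0x80)
        return std::numeric_limits<float>::quiet_NaN();

    const int sign     = (bits >> 7) & 0x1;
    const int exponent = (bits >> 3) & 0xF;
    const int mantissa = bits & 0x7;

    float magnitude;
    if (exponent == 0)
        // Subnormal: (mantissa / 8) * 2^(1 - 8) == mantissa * 2^-10
        magnitude = std::ldexp(static_cast<float>(mantissa), -10);
    else
        // Normal: (1 + mantissa / 8) * 2^(exponent - 8)
        magnitude = std::ldexp(1.0f + mantissa / 8.0f, exponent - 8);

    return sign ? -magnitude : magnitude;
}
```

Under these assumptions the largest finite value is `decode_e4m3_fnuz(0x7F)`, which is 240. This is the value that saturating conversions such as `__HIP_SATFINITE` clamp to for this interpretation.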
@@ -0,0 +1,202 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#define CHECK_RET_CODE(call, ret_code) \
{ \
if ((call) != ret_code) \
{ \
std::cout << "Failed in call: " << #call << std::endl; \
std::abort(); \
} \
}
#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
// [sphinx-source-start]
static constexpr const char gpu_program[] {
R"(
__device__ int V1; // set from host code
static __global__ void f1(int *result)
{
*result = V1 + 10;
}
namespace N1
{
namespace N2
{
__constant__ int V2; // set from host code
__global__ void f2(int *result)
{
*result = V2 + 20;
}
}
}
template<typename T>
__global__ void f3(int *result)
{
*result = sizeof(T);
}
)"};
// [sphinx-source-end]
int main()
{
using namespace std::string_literals;
hiprtcProgram prog;
HIPRTC_CHECK(hiprtcCreateProgram(&prog, gpu_program, "gpu_source.cpp", 0, nullptr, nullptr));
std::vector<std::string> kernel_names;
std::vector<std::string> variable_names;
std::vector<int> initial_values;
std::vector<int> expected_results;
initial_values.emplace_back(100);
initial_values.emplace_back(200);
expected_results.emplace_back(110);
expected_results.emplace_back(220);
expected_results.emplace_back(static_cast<int>(sizeof(int)));
// [sphinx-add-expression-start]
kernel_names.emplace_back("&f1"s);
kernel_names.emplace_back("N1::N2::f2"s);
kernel_names.emplace_back("f3<int>"s);
for(auto&& name : kernel_names)
HIPRTC_CHECK(hiprtcAddNameExpression(prog, name.c_str()));
variable_names.emplace_back("&V1"s);
variable_names.emplace_back("&N1::N2::V2"s);
for(auto&& name : variable_names)
HIPRTC_CHECK(hiprtcAddNameExpression(prog, name.c_str()));
// [sphinx-add-expression-end]
hipDeviceProp_t props;
int device = 0;
HIP_CHECK(hipGetDeviceProperties(&props, device));
auto sarg = std::string{"--gpu-architecture="} + props.gcnArchName; // device for which binary is to be generated
const char* options[] = {sarg.c_str()};
HIPRTC_CHECK(hiprtcCompileProgram(prog, 1, options));
std::size_t logSize;
HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
if (logSize)
{
std::string log(logSize, '\0');
HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
std::cerr << "Compilation failed or produced warnings: " << log << std::endl;
std::abort();
}
std::size_t codeSize;
HIPRTC_CHECK(hiprtcGetCodeSize(prog, &codeSize));
std::vector<char> kernel_binary(codeSize);
HIPRTC_CHECK(hiprtcGetCode(prog, kernel_binary.data()));
std::vector<std::string> lowered_kernel_names;
std::vector<std::string> lowered_variable_names;
// [sphinx-get-kernel-name-start]
for(auto&& name : kernel_names)
{
const char* lowered_name = nullptr;
HIPRTC_CHECK(hiprtcGetLoweredName(prog, name.c_str(), &lowered_name));
lowered_kernel_names.emplace_back(lowered_name);
}
// [sphinx-get-kernel-name-end]
// [sphinx-get-variable-name-start]
for(auto&& name : variable_names)
{
const char* lowered_name = nullptr;
HIPRTC_CHECK(hiprtcGetLoweredName(prog, name.c_str(), &lowered_name));
lowered_variable_names.emplace_back(lowered_name);
}
// [sphinx-get-variable-name-end]
HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
hipModule_t module;
HIP_CHECK(hipModuleLoadData(&module, kernel_binary.data()));
for(auto i = std::size_t{0}; i < initial_values.size(); ++i)
{
auto name = lowered_variable_names.at(i);
auto initial_value = initial_values.at(i);
// [sphinx-update-variable-start]
hipDeviceptr_t variable_addr;
std::size_t bytes{};
HIP_CHECK(hipModuleGetGlobal(&variable_addr, &bytes, module, name.c_str()));
HIP_CHECK(hipMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value)));
// [sphinx-update-variable-end]
}
hipDeviceptr_t d_result;
auto h_result = 0;
HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&d_result), sizeof(h_result)));
HIP_CHECK(hipMemcpyHtoD(d_result, &h_result, sizeof(h_result)));
struct
{
hipDeviceptr_t ptr;
} args{d_result};
auto args_size = sizeof(args);
void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args,
HIP_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
HIP_LAUNCH_PARAM_END};
for(auto i = std::size_t{0}; i < lowered_kernel_names.size(); ++i)
{
auto name = lowered_kernel_names.at(i);
auto expected = expected_results.at(i);
// [sphinx-launch-kernel-start]
hipFunction_t kernel;
HIP_CHECK(hipModuleGetFunction(&kernel, module, name.c_str()));
HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config));
// [sphinx-launch-kernel-end]
HIP_CHECK(hipMemcpyDtoH(&h_result, d_result, sizeof(h_result)));
if(expected != h_result)
{
std::cerr << "Validation failed. expected = " << expected << ", h_result = " << h_result << std::endl;
return EXIT_FAILURE;
}
}
std::cout << "Validation passed." << std::endl;
HIP_CHECK(hipFree(reinterpret_cast<void*>(d_result)));
return EXIT_SUCCESS;
}
@@ -0,0 +1,118 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
exit(EXIT_FAILURE); \
} \
}
// Simple ULP difference calculator
int64_t ulp_diff(float a, float b)
{
if (a == b)
return 0;
union
{
float f;
int32_t i;
} ua{a}, ub{b};
// For negative values, remap the bits so that integer order matches float order.
// min() - i yields the negated magnitude; max() - i would overflow int32_t.
if (ua.i < 0) ua.i = std::numeric_limits<int32_t>::min() - ua.i;
if (ub.i < 0) ub.i = std::numeric_limits<int32_t>::min() - ub.i;
return std::abs((int64_t)ua.i - (int64_t)ub.i);
}
// Test kernel
__global__ void test_sin(float* out, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
{
float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
out[i] = sinf(x);
}
}
int main()
{
const int n = 1000000;
const int blocksize = 256;
std::vector<float> outputs(n);
float* d_out;
HIP_CHECK(hipMalloc(&d_out, n * sizeof(float)));
dim3 threads(blocksize);
dim3 blocks((n + blocksize - 1) / blocksize); // Round up so that all n elements are covered
test_sin<<<blocks, threads>>>(d_out, n);
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipMemcpy(outputs.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost));
// Step 1: Find the maximum absolute error
double max_abs_error = 0.0;
float max_error_output = 0.0;
float max_error_expected = 0.0;
for (int i = 0; i < n; i++)
{
float x = -M_PI + (2.0f * M_PI * i) / (n - 1);
float expected = std::sin(x);
double abs_error = std::abs(outputs[i] - expected);
if (abs_error > max_abs_error)
{
max_abs_error = abs_error;
max_error_output = outputs[i];
max_error_expected = expected;
}
}
// Step 2: Compute the ULP difference of the pair with the largest absolute error
// (this is not necessarily the largest ULP difference over all inputs)
int64_t max_ulp = ulp_diff(max_error_output, max_error_expected);
// Output results
std::cout << "Max Absolute Error: " << max_abs_error << std::endl;
std::cout << "Max ULP Difference: " << max_ulp << std::endl;
std::cout << "Max Error Values -> Got: " << max_error_output
<< ", Expected: " << max_error_expected << std::endl;
HIP_CHECK(hipFree(d_out));
return 0;
}
// [sphinx-end]
@@ -0,0 +1,109 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, std::size_t numElements)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements)
{
data[tid] = tid * 2;
}
}
int main()
{
// Create a stream.
hipStream_t stream;
HIP_CHECK(hipStreamCreate(&stream));
// Create a memory pool for pinned device memory on device 0 that can be
// shared through POSIX file descriptors.
hipMemPoolProps poolProps = {};
poolProps.allocType = hipMemAllocationTypePinned;
poolProps.handleTypes = hipMemHandleTypePosixFileDescriptor;
poolProps.location.type = hipMemLocationTypeDevice;
poolProps.location.id = 0; // Assuming device 0.
hipMemPool_t memPool;
HIP_CHECK(hipMemPoolCreate(&memPool, &poolProps));
// Allocate memory from the pool asynchronously.
constexpr std::size_t numElements = 1024;
int* devData = nullptr;
HIP_CHECK(hipMallocFromPoolAsync(reinterpret_cast<void**>(&devData),
numElements * sizeof(*devData),
memPool,
stream));
// Define grid and block sizes.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
// Launch the kernel to perform computation.
myKernel<<<gridSize, blockSize, 0, stream>>>(devData, numElements);
// Synchronize the stream.
HIP_CHECK(hipStreamSynchronize(stream));
// Copy data back to host.
int* hostData = new int[numElements];
HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
// Print the array.
for (std::size_t i = 0; i < numElements; ++i)
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
// Free the allocated memory.
HIP_CHECK(hipFreeAsync(devData, stream));
// Synchronize the stream again to ensure all operations are complete.
HIP_CHECK(hipStreamSynchronize(stream));
// Destroy the memory pool and stream.
HIP_CHECK(hipMemPoolDestroy(memPool));
HIP_CHECK(hipStreamDestroy(stream));
// Free host memory.
delete[] hostData;
return EXIT_SUCCESS;
}
// [sphinx-end]
@@ -0,0 +1,115 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
// Sample helper functions for getting the usage statistics in bulk.
struct usageStatistics
{
std::uint64_t reservedMemCurrent;
std::uint64_t reservedMemHigh;
std::uint64_t usedMemCurrent;
std::uint64_t usedMemHigh;
};
void getUsageStatistics(hipMemPool_t memPool, struct usageStatistics *statistics)
{
HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemCurrent, &statistics->reservedMemCurrent));
HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &statistics->reservedMemHigh));
HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemCurrent, &statistics->usedMemCurrent));
HIP_CHECK(hipMemPoolGetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &statistics->usedMemHigh));
}
// Resetting the watermarks resets them to the current value.
void resetStatistics(hipMemPool_t memPool)
{
std::uint64_t value = 0;
HIP_CHECK(hipMemPoolSetAttribute(memPool, hipMemPoolAttrReservedMemHigh, &value));
HIP_CHECK(hipMemPoolSetAttribute(memPool, hipMemPoolAttrUsedMemHigh, &value));
}
int main()
{
hipMemPool_t memPool;
hipDevice_t device = 0; // Specify the device index.
// Initialize the device.
HIP_CHECK(hipSetDevice(device));
// Get the default memory pool for the device.
HIP_CHECK(hipDeviceGetDefaultMemPool(&memPool, device));
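// Allocate memory from the pool (e.g., 1 MB). hipMallocAsync draws from the
// device's current pool (the default pool unless changed); hipMalloc does not.
std::size_t allocSize = 1 * 1024 * 1024;
void* ptr;
HIP_CHECK(hipMallocAsync(&ptr, allocSize, nullptr));
// Return the allocation to the pool and wait until the free has completed.
HIP_CHECK(hipFreeAsync(ptr, nullptr));
HIP_CHECK(hipStreamSynchronize(nullptr));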
// Trim the memory pool to a specific size (e.g., 512 KB).
std::size_t newSize = 512 * 1024;
HIP_CHECK(hipMemPoolTrimTo(memPool, newSize));
// Get and print usage statistics before resetting.
usageStatistics statsBefore;
getUsageStatistics(memPool, &statsBefore);
std::cout << "Before resetting statistics:" << std::endl;
std::cout << "Reserved Memory Current: " << statsBefore.reservedMemCurrent << " bytes" << std::endl;
std::cout << "Reserved Memory High: " << statsBefore.reservedMemHigh << " bytes" << std::endl;
std::cout << "Used Memory Current: " << statsBefore.usedMemCurrent << " bytes" << std::endl;
std::cout << "Used Memory High: " << statsBefore.usedMemHigh << " bytes" << std::endl;
// Reset the statistics.
resetStatistics(memPool);
// Get and print usage statistics after resetting.
usageStatistics statsAfter;
getUsageStatistics(memPool, &statsAfter);
std::cout << "After resetting statistics:" << std::endl;
std::cout << "Reserved Memory Current: " << statsAfter.reservedMemCurrent << " bytes" << std::endl;
std::cout << "Reserved Memory High: " << statsAfter.reservedMemHigh << " bytes" << std::endl;
std::cout << "Used Memory Current: " << statsAfter.usedMemCurrent << " bytes" << std::endl;
std::cout << "Used Memory High: " << statsAfter.usedMemHigh << " bytes" << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
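The pool attributes queried above come in pairs: a current value and a high watermark, where resetting a watermark sets it back to the current value. The toy model below illustrates only that bookkeeping on the host. It is an illustration, not how the HIP runtime is implemented, and `UsageTracker` is a hypothetical class.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model of a "current" counter with a high watermark.
class UsageTracker
{
public:
    // Account for an allocation and raise the watermark if needed.
    void allocate(std::uint64_t bytes)
    {
        current_ += bytes;
        high_ = std::max(high_, current_);
    }

    // Account for a release; the watermark is left untouched.
    void release(std::uint64_t bytes) { current_ -= std::min(bytes, current_); }

    // Resetting the watermark sets it to the current value, not to zero.
    void reset_high() { high_ = current_; }

    std::uint64_t current() const { return current_; }
    std::uint64_t high() const { return high_; }

private:
    std::uint64_t current_ = 0;
    std::uint64_t high_    = 0;
};
```

In this model, after allocating 1536 bytes and releasing 1024 of them, `current()` is 512 while `high()` stays at 1536 until `reset_high()` is called.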
@@ -0,0 +1,115 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, std::size_t numElements)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements)
{
data[tid] = tid * 2;
}
}
int main()
{
// Create a stream.
hipStream_t stream;
HIP_CHECK(hipStreamCreate(&stream));
// Create a memory pool for pinned device memory on device 0 that can be
// shared through POSIX file descriptors.
hipMemPoolProps poolProps = {};
poolProps.allocType = hipMemAllocationTypePinned;
poolProps.handleTypes = hipMemHandleTypePosixFileDescriptor;
poolProps.location.type = hipMemLocationTypeDevice;
poolProps.location.id = 0; // Assuming device 0.
hipMemPool_t memPool;
HIP_CHECK(hipMemPoolCreate(&memPool, &poolProps));
// [sphinx-start]
std::uint64_t threshold = std::numeric_limits<std::uint64_t>::max();
HIP_CHECK(hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold));
// [sphinx-end]
// Allocate memory from the pool asynchronously.
constexpr std::size_t numElements = 1024;
int* devData = nullptr;
HIP_CHECK(hipMallocFromPoolAsync(reinterpret_cast<void**>(&devData),
numElements * sizeof(*devData),
memPool,
stream));
// Define grid and block sizes.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
// Launch the kernel to perform computation.
myKernel<<<gridSize, blockSize, 0, stream>>>(devData, numElements);
// Synchronize the stream.
HIP_CHECK(hipStreamSynchronize(stream));
// Copy data back to host.
int* hostData = new int[numElements];
HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
// Print the array.
for (std::size_t i = 0; i < numElements; ++i)
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
// Free the allocated memory.
HIP_CHECK(hipFreeAsync(devData, stream));
// Synchronize the stream again to ensure all operations are complete.
HIP_CHECK(hipStreamSynchronize(stream));
// Destroy the memory pool and stream.
HIP_CHECK(hipMemPoolDestroy(memPool));
HIP_CHECK(hipStreamDestroy(stream));
// Free host memory.
delete[] hostData;
return 0;
}
+69
View file
@@ -0,0 +1,69 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
int main()
{
hipMemPool_t memPool;
hipDevice_t device = 0; // Specify the device index.
// Initialize the device.
HIP_CHECK(hipSetDevice(device));
// Get the default memory pool for the device.
HIP_CHECK(hipDeviceGetDefaultMemPool(&memPool, device));
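// Allocate memory from the pool (e.g., 1 MB). Stream-ordered allocations
// are served from the device's current pool, which is the default pool here.
std::size_t allocSize = 1 * 1024 * 1024;
void* ptr;
HIP_CHECK(hipMallocAsync(&ptr, allocSize, nullptr));
// Return the allocation to the pool and wait for the free to complete.
HIP_CHECK(hipFreeAsync(ptr, nullptr));
HIP_CHECK(hipStreamSynchronize(nullptr));
// Trim the pool: release unused reserved memory back to the system,
// using 512 KB as the amount of reserved memory to keep.
std::size_t newSize = 512 * 1024;
HIP_CHECK(hipMemPoolTrimTo(memPool, newSize));
std::cout << "Memory pool trimmed with a keep size of " << newSize << " bytes." << std::endl;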
return EXIT_SUCCESS;
}
// [sphinx-end]
+90
View file
@@ -0,0 +1,90 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
int main()
{
int *a, *b, *c;
unsigned int attributeValue;
constexpr std::size_t attributeSize = sizeof(attributeValue);
int deviceId;
HIP_CHECK(hipGetDevice(&deviceId));
// Allocate memory for a, b and c that is accessible to both device and host code.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Setup input values.
*a = 1;
*b = 2;
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
// Launch add() kernel on GPU.
add<<<1, 1>>>(a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Query an attribute of the memory range.
HIP_CHECK(hipMemRangeGetAttribute(&attributeValue,
attributeSize,
hipMemRangeAttributeReadMostly,
a,
sizeof(*a)));
// Print the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
std::cout << "The memory range at a is" << (attributeValue == 1 ? "" : " NOT") << " marked as read-mostly" << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
// [sphinx-end]
+136
View file
@@ -0,0 +1,136 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
__global__ void simpleKernel(double *data, std::size_t elems)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < elems)
data[idx] = idx * 2.0;
}
int main()
{
int numDevices;
HIP_CHECK(hipGetDeviceCount(&numDevices));
if (numDevices < 2)
{
std::cout << "This example requires at least two HIP devices." << std::endl;
return EXIT_SUCCESS;
}
double *deviceData0;
double *deviceData1;
constexpr std::size_t elems = 1024;
constexpr std::size_t size = elems * sizeof(double);
// Create streams and events for each device
hipStream_t stream0, stream1;
hipEvent_t startEvent0, stopEvent0, startEvent1, stopEvent1;
// Initialize device 0
HIP_CHECK(hipSetDevice(0));
HIP_CHECK(hipStreamCreate(&stream0));
HIP_CHECK(hipEventCreate(&startEvent0));
HIP_CHECK(hipEventCreate(&stopEvent0));
HIP_CHECK(hipMalloc(&deviceData0, size));
// Initialize device 1
HIP_CHECK(hipSetDevice(1));
HIP_CHECK(hipStreamCreate(&stream1));
HIP_CHECK(hipEventCreate(&startEvent1));
HIP_CHECK(hipEventCreate(&stopEvent1));
HIP_CHECK(hipMalloc(&deviceData1, size));
// Record the start event on device 0
HIP_CHECK(hipSetDevice(0));
HIP_CHECK(hipEventRecord(startEvent0, stream0));
// Launch the kernel asynchronously on device 0
simpleKernel<<<8, 128, 0, stream0>>>(deviceData0, elems);
// Record the stop event on device 0
HIP_CHECK(hipEventRecord(stopEvent0, stream0));
// Wait for the stop event on device 0 to complete
HIP_CHECK(hipEventSynchronize(stopEvent0));
// Record the start event on device 1
HIP_CHECK(hipSetDevice(1));
HIP_CHECK(hipEventRecord(startEvent1, stream1));
// Launch the kernel asynchronously on device 1
simpleKernel<<<8, 128, 0, stream1>>>(deviceData1, elems);
// Record the stop event on device 1
HIP_CHECK(hipEventRecord(stopEvent1, stream1));
// Wait for the stop event on device 1 to complete
HIP_CHECK(hipEventSynchronize(stopEvent1));
// Calculate elapsed time between the events for both devices
float milliseconds0 = 0, milliseconds1 = 0;
HIP_CHECK(hipEventElapsedTime(&milliseconds0, startEvent0, stopEvent0));
HIP_CHECK(hipEventElapsedTime(&milliseconds1, startEvent1, stopEvent1));
std::cout << "Elapsed time on GPU 0: " << milliseconds0 << " ms" << std::endl;
std::cout << "Elapsed time on GPU 1: " << milliseconds1 << " ms" << std::endl;
// Cleanup for device 0
HIP_CHECK(hipSetDevice(0));
HIP_CHECK(hipEventDestroy(startEvent0));
HIP_CHECK(hipEventDestroy(stopEvent0));
HIP_CHECK(hipStreamSynchronize(stream0));
HIP_CHECK(hipStreamDestroy(stream0));
HIP_CHECK(hipFree(deviceData0));
// Cleanup for device 1
HIP_CHECK(hipSetDevice(1));
HIP_CHECK(hipEventDestroy(startEvent1));
HIP_CHECK(hipEventDestroy(stopEvent1));
HIP_CHECK(hipStreamSynchronize(stream1));
HIP_CHECK(hipStreamDestroy(stream1));
HIP_CHECK(hipFree(deviceData1));
return EXIT_SUCCESS;
}
// [sphinx-end]
+81
View file
@@ -0,0 +1,81 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, std::size_t numElements)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements)
{
data[tid] = tid * 2;
}
}
int main()
{
// Allocate memory.
constexpr std::size_t numElements = 1024;
int* devData;
HIP_CHECK(hipMalloc(&devData, numElements * sizeof(*devData)));
// Launch the kernel to perform computation.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
myKernel<<<gridSize, blockSize>>>(devData, numElements);
// Copy data back to host.
int* hostData = new int[numElements];
HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
// Print the array.
for (std::size_t i = 0; i < numElements; ++i)
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
// Free memory.
HIP_CHECK(hipFree(devData));
delete[] hostData;
// Synchronize to ensure completion.
HIP_CHECK(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
// [sphinx-end]
+114
View file
@@ -0,0 +1,114 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
__global__ void simpleKernel(double *data, std::size_t elems)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < elems)
data[idx] = idx * 2.0;
}
int main()
{
int deviceCount;
HIP_CHECK(hipGetDeviceCount(&deviceCount));
if(deviceCount < 2)
{
std::cout << "This example requires at least two HIP devices." << std::endl;
return EXIT_SUCCESS;
}
double* deviceData0;
double* deviceData1;
constexpr std::size_t elems = 1024;
constexpr std::size_t size = elems * sizeof(double);
int deviceId0 = 0;
int deviceId1 = 1;
// Enable peer access to the memory (allocated and future) on the peer device.
// Ensure the device is active before enabling peer access.
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipDeviceEnablePeerAccess(deviceId1, 0));
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipDeviceEnablePeerAccess(deviceId0, 0));
// Set device 0 and perform operations
HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
simpleKernel<<<8, 128>>>(deviceData0, elems); // Launch kernel on device 0
HIP_CHECK(hipDeviceSynchronize());
// Set device 1 and perform operations
HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
simpleKernel<<<8, 128>>>(deviceData1, elems); // Launch kernel on device 1
HIP_CHECK(hipDeviceSynchronize());
// Use peer-to-peer access
HIP_CHECK(hipSetDevice(deviceId0));
// Now device 0 can access memory allocated on device 1
HIP_CHECK(hipMemcpy(deviceData0, deviceData1, size, hipMemcpyDeviceToDevice));
// Copy result from device 0
double hostData0[elems];
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
// Copy result from device 1
double hostData1[elems];
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
// Display results from both devices
std::cout << "Device 0 data: " << hostData0[0] << std::endl;
std::cout << "Device 1 data: " << hostData1[0] << std::endl;
// Free device memory
HIP_CHECK(hipFree(deviceData0));
HIP_CHECK(hipFree(deviceData1));
return EXIT_SUCCESS;
}
// [sphinx-end]
+104
View file
@@ -0,0 +1,104 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
__global__ void simpleKernel(double *data, std::size_t elems)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < elems)
data[idx] = idx * 2.0;
}
int main()
{
int deviceCount;
HIP_CHECK(hipGetDeviceCount(&deviceCount));
if(deviceCount < 2)
{
std::cout << "This example requires at least two HIP devices." << std::endl;
return EXIT_SUCCESS;
}
double* deviceData0;
double* deviceData1;
constexpr std::size_t elems = 1024;
constexpr std::size_t size = elems * sizeof(double);
int deviceId0 = 0;
int deviceId1 = 1;
// Set device 0 and perform operations
HIP_CHECK(hipSetDevice(deviceId0)); // Set device 0 as current
HIP_CHECK(hipMalloc(&deviceData0, size)); // Allocate memory on device 0
simpleKernel<<<8, 128>>>(deviceData0, elems); // Launch kernel on device 0
HIP_CHECK(hipDeviceSynchronize());
// Set device 1 and perform operations
HIP_CHECK(hipSetDevice(deviceId1)); // Set device 1 as current
HIP_CHECK(hipMalloc(&deviceData1, size)); // Allocate memory on device 1
simpleKernel<<<8, 128>>>(deviceData1, elems); // Launch kernel on device 1
HIP_CHECK(hipDeviceSynchronize());
// Use deviceData0 on device 1. This works but incurs a performance penalty.
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(deviceData1, deviceData0, size, hipMemcpyDeviceToDevice));
// Copy result from device 0
double hostData0[elems];
HIP_CHECK(hipSetDevice(deviceId0));
HIP_CHECK(hipMemcpy(hostData0, deviceData0, size, hipMemcpyDeviceToHost));
// Copy result from device 1
double hostData1[elems];
HIP_CHECK(hipSetDevice(deviceId1));
HIP_CHECK(hipMemcpy(hostData1, deviceData1, size, hipMemcpyDeviceToHost));
// Display results from both devices
std::cout << "Device 0 data: " << hostData0[0] << std::endl;
std::cout << "Device 1 data: " << hostData1[0] << std::endl;
// Free device memory
HIP_CHECK(hipFree(deviceData0));
HIP_CHECK(hipFree(deviceData1));
return EXIT_SUCCESS;
}
// [sphinx-end]
+80
View file
@@ -0,0 +1,80 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstring>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
int main()
{
const int element_number = 100;
int *host_input, *host_output;
// Host allocation
host_input = new int[element_number];
host_output = new int[element_number];
// Host data preparation
for (int i = 0; i < element_number; i++) {
host_input[i] = i;
}
std::memset(host_output, 0, element_number * sizeof(int));
int *device_input, *device_output;
// Device allocation
HIP_CHECK(hipMalloc(&device_input, element_number * sizeof(int)));
HIP_CHECK(hipMalloc(&device_output, element_number * sizeof(int)));
// Device data preparation
HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
// Run the kernel
// ...
// Copy the result back to the host
HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
// Free host memory
delete[] host_input;
delete[] host_output;
// Free device memory
HIP_CHECK(hipFree(device_input));
HIP_CHECK(hipFree(device_output));
}
// [sphinx-end]
+78
View file
@@ -0,0 +1,78 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
int main()
{
// Initialize the HIP runtime
if (auto err = hipInit(0); err != hipSuccess)
{
std::cerr << "Failed to initialize HIP runtime." << std::endl;
return EXIT_FAILURE;
}
// Get the per-thread default stream
hipStream_t stream = hipStreamPerThread;
// Use the stream for some operation
// For example, allocate memory on the device
void* d_ptr;
std::size_t size = 1024;
if (auto err = hipMalloc(&d_ptr, size); err != hipSuccess)
{
std::cerr << "Failed to allocate memory." << std::endl;
return EXIT_FAILURE;
}
// Perform some operation using the stream
// For example, set memory on the device
if (auto err = hipMemsetAsync(d_ptr, 0, size, stream); err != hipSuccess)
{
std::cerr << "Failed to set memory." << std::endl;
return EXIT_FAILURE;
}
// Synchronize the stream
if (auto err = hipStreamSynchronize(stream); err != hipSuccess)
{
std::cerr << "Failed to synchronize stream." << std::endl;
return EXIT_FAILURE;
}
// Free the allocated memory
if(auto err = hipFree(d_ptr); err != hipSuccess)
{
std::cerr << "Failed to free memory." << std::endl;
return EXIT_FAILURE;
}
std::cout << "Operation completed successfully using per-thread default stream." << std::endl;
return EXIT_SUCCESS;
}
// [sphinx-end]
+81
View file
@@ -0,0 +1,81 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstring>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
int main()
{
const int element_number = 100;
int *host_input, *host_output;
// Host allocation
HIP_CHECK(hipHostMalloc(&host_input, element_number * sizeof(int)));
HIP_CHECK(hipHostMalloc(&host_output, element_number * sizeof(int)));
// Host data preparation
for (int i = 0; i < element_number; i++)
{
host_input[i] = i;
}
std::memset(host_output, 0, element_number * sizeof(int));
int *device_input, *device_output;
// Device allocation
HIP_CHECK(hipMalloc(&device_input, element_number * sizeof(int)));
HIP_CHECK(hipMalloc(&device_output, element_number * sizeof(int)));
// Device data preparation
HIP_CHECK(hipMemcpy(device_input, host_input, element_number * sizeof(int), hipMemcpyHostToDevice));
HIP_CHECK(hipMemset(device_output, 0, element_number * sizeof(int)));
// Run the kernel
// ...
// Copy the result back to the host
HIP_CHECK(hipMemcpy(host_output, device_output, element_number * sizeof(int), hipMemcpyDeviceToHost));
// Free host memory
HIP_CHECK(hipFreeHost(host_input));
HIP_CHECK(hipFreeHost(host_output));
// Free device memory
HIP_CHECK(hipFree(device_input));
HIP_CHECK(hipFree(device_output));
}
// [sphinx-end]
+61
View file
@@ -0,0 +1,61 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if (err != hipSuccess) \
{ \
std::cout << "HIP Error: " << hipGetErrorString(err) \
<< " at line " << __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
int main()
{
// [sphinx-start]
double * ptr;
HIP_CHECK(hipMalloc(&ptr, sizeof(double)));
hipPointerAttribute_t attr;
HIP_CHECK(hipPointerGetAttributes(&attr, ptr)); /*attr.type is hipMemoryTypeDevice*/
if(attr.type == hipMemoryTypeDevice)
std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
double* ptrHost;
HIP_CHECK(hipHostMalloc(&ptrHost, sizeof(double)));
hipPointerAttribute_t attrHost;
HIP_CHECK(hipPointerGetAttributes(&attrHost, ptrHost)); /*attrHost.type is hipMemoryTypeHost*/
if(attrHost.type == hipMemoryTypeHost)
std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
// [sphinx-end]
HIP_CHECK(hipFreeHost(ptrHost));
HIP_CHECK(hipFree(ptr));
return EXIT_SUCCESS;
}
+79
View file
@@ -0,0 +1,79 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#define CHECK_RET_CODE(call, ret_code) \
{ \
if ((call) != ret_code) \
{ \
std::cout << "Failed in call: " << #call << std::endl; \
std::abort(); \
} \
}
#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
int main()
{
// Deliberately invalid source, so that the compilation below fails.
const char* kernel_source = "adafsfgadascvsfgsadfbdt";
hiprtcProgram prog;
auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
kernel_source, // kernel source string
"vector_add.cpp", // Name of the file
0, // Number of headers
nullptr, // Header sources
nullptr); // Name of header file
if (rtc_ret_code != HIPRTC_SUCCESS)
{
std::cerr << "Failed to create program" << std::endl;
std::abort();
}
hipDeviceProp_t props;
int device = 0;
HIP_CHECK(hipGetDeviceProperties(&props, device));
auto sarg = std::string{"--gpu-architecture="} + props.gcnArchName; // device for which binary is to be generated
const char* opts[] = {sarg.c_str()};
// [sphinx-start]
hiprtcResult result;
result = hiprtcCompileProgram(prog, 1, opts);
if (result != HIPRTC_SUCCESS)
{
std::cout << "hiprtcCompileProgram failed with error " << hiprtcGetErrorString(result) << std::endl;
}
// [sphinx-end]
HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
return EXIT_SUCCESS;
}
+131
@@ -0,0 +1,131 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// GPU Kernels
__global__ void kernelA(double* arrayA, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayA[x] += 1.0;
}
}
__global__ void kernelB(double* arrayA, double* arrayB, std::size_t size)
{
const std::size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x < size)
{
arrayB[x] += arrayA[x] + 3.0;
}
}
int main()
{
constexpr int numOfBlocks = 1 << 20;
constexpr int threadsPerBlock = 1024;
constexpr int numberOfIterations = 50;
// The array is smaller than the launch grid so that the memory copies do not dwarf the relatively short kernel executions.
constexpr std::size_t arraySize = 1U << 25;
double *d_dataA;
double *d_dataB;
double initValueA = 0.0;
double initValueB = 2.0;
std::vector<double> vectorA(arraySize, initValueA);
std::vector<double> vectorB(arraySize, initValueB);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_dataA, arraySize * sizeof(*d_dataA)));
HIP_CHECK(hipMalloc(&d_dataB, arraySize * sizeof(*d_dataB)));
for(int iteration = 0; iteration < numberOfIterations; iteration++)
{
// Host to Device copies
HIP_CHECK(hipMemcpy(d_dataA, vectorA.data(), arraySize * sizeof(*d_dataA), hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_dataB, vectorB.data(), arraySize * sizeof(*d_dataB), hipMemcpyHostToDevice));
// Launch the GPU kernels
kernelA<<<numOfBlocks, threadsPerBlock>>>(d_dataA, arraySize);
kernelB<<<numOfBlocks, threadsPerBlock>>>(d_dataA, d_dataB, arraySize);
// Device to Host copies
HIP_CHECK(hipMemcpy(vectorA.data(), d_dataA, arraySize * sizeof(*vectorA.data()), hipMemcpyDeviceToHost));
HIP_CHECK(hipMemcpy(vectorB.data(), d_dataB, arraySize * sizeof(*vectorB.data()), hipMemcpyDeviceToHost));
}
// Wait for all operations to complete
HIP_CHECK(hipDeviceSynchronize());
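// Verify results.
// Each iteration re-uploads the host arrays, so after iteration k: A = k and B gains (k + 3).
// After N iterations: A = N and B = initValueB + 3 * N + N * (N + 1) / 2.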
const double expectedA = (double)numberOfIterations;
const double expectedB = initValueB + (3.0 * numberOfIterations) + (expectedA * (expectedA + 1.0)) / 2.0;
bool passed = true;
for(std::size_t i = 0; i < arraySize; ++i)
{
if(vectorA[i] != expectedA)
{
passed = false;
std::cerr << "Validation failed! Expected " << expectedA << " got " << vectorA[i] << " at index: " << i << std::endl;
break;
}
if(vectorB[i] != expectedB)
{
passed = false;
std::cerr << "Validation failed! Expected " << expectedB << " got " << vectorB[i] << " at index: " << i << std::endl;
break;
}
}
if(passed)
{
std::cout << "Sequential execution completed successfully." << std::endl;
}
else
{
std::cerr << "Sequential execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_dataA));
HIP_CHECK(hipFree(d_dataB));
return EXIT_SUCCESS;
}
// [sphinx-end]
+47
@@ -0,0 +1,47 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
// [sphinx-start]
__constant__ int const_array[8];
void set_constant_memory()
{
int host_data[8] {1,2,3,4,5,6,7,8};
if(auto err = hipMemcpyToSymbol(const_array, host_data, sizeof(int) * 8); err != hipSuccess)
std::cerr << "HIP error " << err << ": " << hipGetErrorString(err) << std::endl;
// call kernel that accesses const_array
}
// [sphinx-end]
int main()
{
set_constant_memory();
std::cout << "Success!" << std::endl;
return EXIT_SUCCESS;
}
+42
@@ -0,0 +1,42 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
int main()
{
int deviceCount;
if (hipGetDeviceCount(&deviceCount) == hipSuccess)
{
for (int i = 0; i < deviceCount; ++i)
{
hipDeviceProp_t prop;
if (hipGetDeviceProperties(&prop, i) == hipSuccess)
std::cout << "Device " << i << ": " << prop.name << std::endl;
}
}
return 0;
}
// [sphinx-end]
+73
@@ -0,0 +1,73 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
// This example requires HMM support, and the environment variable HSA_XNACK must be set to 1.
int main()
{
// Allocate memory for a, b, and c.
int *a = new int[1];
int *b = new int[1];
int *c = new int[1];
// Setup input values.
*a = 1;
*b = 2;
// Launch add() kernel on GPU.
add<<<1, 1>>>(a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Print the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
delete[] c;
delete[] b;
delete[] a;
return 0;
}
// [sphinx-end]
+46
@@ -0,0 +1,46 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "example_utils.hpp"
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
// [sphinx-start]
__global__ void kernel()
{
__shared__ int array[128];
__shared__ double result;
}
// [sphinx-end]
int main()
{
kernel<<<64, 512>>>();
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipDeviceSynchronize());
std::cout << "Success!" << std::endl;
return EXIT_SUCCESS;
}
+65
@@ -0,0 +1,65 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
// Declare a, b, and c as static managed variables.
__managed__ int a, b, c;
int main()
{
// Setup input values.
a = 1;
b = 2;
// Launch add() kernel on GPU.
add<<<1, 1>>>(&a, &b, &c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Print the result.
std::cout << a << " + " << b << " = " << c << std::endl;
return 0;
}
// [sphinx-end]
+85
@@ -0,0 +1,85 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if (status != hipSuccess) \
{ \
std::cerr << "HIP error " << status \
<< ": " << hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
std::exit(EXIT_FAILURE); \
} \
}
// Kernel to perform some computation on allocated memory.
__global__ void myKernel(int* data, std::size_t numElements)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < numElements)
{
data[tid] = tid * 2;
}
}
int main()
{
// Stream 0.
constexpr hipStream_t streamId = 0;
// Allocate memory with stream ordered semantics.
constexpr std::size_t numElements = 1024;
int* devData;
HIP_CHECK(hipMallocAsync(reinterpret_cast<void**>(&devData), numElements * sizeof(*devData), streamId));
// Launch the kernel to perform computation.
dim3 blockSize(256);
dim3 gridSize((numElements + blockSize.x - 1) / blockSize.x);
myKernel<<<gridSize, blockSize>>>(devData, numElements);
// Copy data back to host.
int* hostData = new int[numElements];
HIP_CHECK(hipMemcpy(hostData, devData, numElements * sizeof(*devData), hipMemcpyDeviceToHost));
// Print the array.
for (std::size_t i = 0; i < numElements; ++i)
std::cout << "Element " << i << ": " << hostData[i] << std::endl;
// Free memory with stream ordered semantics.
HIP_CHECK(hipFreeAsync(devData, streamId));
delete[] hostData;
// Synchronize to ensure completion.
HIP_CHECK(hipDeviceSynchronize());
return EXIT_SUCCESS;
}
// [sphinx-end]
+168 -145
@@ -20,16 +20,23 @@
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "popcount.hpp"
#include <hip/hip_runtime.h>
#include <type_traits>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <random>
#include <vector>
#include <type_traits>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
@@ -39,169 +46,185 @@
}
// [Sphinx template warp size block reduction kernel start]
template<uint32_t WarpSize>
using lane_mask_t = typename std::conditional<WarpSize == 32, uint32_t, uint64_t>::type;
template<std::uint32_t WarpSize>
using lane_mask_t = typename std::conditional<WarpSize == 32, std::uint32_t, std::uint64_t>::type;
template<uint32_t WarpSize>
__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size) {
extern __shared__ int shared[];
template<std::uint32_t WarpSize>
__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size)
{
extern __shared__ int shared[];
// Read of input with bounds check
auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
{
lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
};
// Read of input with bounds check
auto read_global_safe = [&](const std::uint32_t i, const std::uint32_t lane_id, const std::uint32_t mask_id)
{
lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
};
const uint32_t tid = threadIdx.x,
lid = threadIdx.x % WarpSize,
wid = threadIdx.x / WarpSize,
bid = blockIdx.x,
gid = bid * blockDim.x + tid;
const std::uint32_t tid = threadIdx.x,
lid = threadIdx.x % WarpSize,
wid = threadIdx.x / WarpSize,
bid = blockIdx.x,
gid = bid * blockDim.x + tid;
// Read input buffer to shared
shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
__syncthreads();
// Shared reduction
for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
// Read input buffer to shared
shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
__syncthreads();
}
// Use local variable in warp reduction
int result = shared[tid];
__syncthreads();
// Shared reduction
for (std::uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
__syncthreads();
}
// This loop would be unrolled the same with the runtime warpSize.
#pragma unroll
for (uint32_t i = WarpSize/2; i >= 1; i /= 2) {
result = result + __shfl_down(result, i);
}
// Use local variable in warp reduction
int result = shared[tid];
__syncthreads();
// Write result to output buffer
if (tid == 0)
output[bid] = result;
};
// This loop would be unrolled the same with the runtime warpSize.
#pragma unroll
for (std::uint32_t i = WarpSize/2; i >= 1; i /= 2)
{
result = result + __shfl_down(result, i);
}
// Write result to output buffer
if (tid == 0)
output[bid] = result;
}
// [Sphinx template warp size block reduction kernel end]
// [Sphinx template warp size mask generation start]
template<uint32_t WarpSize>
template<std::uint32_t WarpSize>
void generate_and_copy_mask(
void *d_mask,
std::vector<int>& vectorExpected,
int numOfBlocks,
int numberOfWarp,
int mask_size,
int mask_element_size) {
std::random_device rd;
std::mt19937_64 eng(rd());
void *d_mask,
std::vector<int>& vectorExpected,
int numOfBlocks,
int numberOfWarp,
int mask_size,
int mask_element_size)
{
std::random_device rd;
std::mt19937_64 eng(rd());
// Host side mask vector
std::vector<lane_mask_t<WarpSize>> mask(mask_size);
// Define uniform unsigned int distribution
std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
// Fill up the mask
for(int i=0; i < numOfBlocks; i++) {
int count = 0;
for(int j=0; j < numberOfWarp; j++) {
int mask_index = i * numberOfWarp + j;
mask[mask_index] = distr(eng);
if constexpr(WarpSize == 32)
count += __builtin_popcount(mask[mask_index]);
else
count += __builtin_popcountll(mask[mask_index]);
// Host side mask vector
std::vector<lane_mask_t<WarpSize>> mask(mask_size);
// Define uniform unsigned int distribution
std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
// Fill up the mask
for(int i=0; i < numOfBlocks; i++)
{
int count = 0;
for(int j=0; j < numberOfWarp; j++)
{
int mask_index = i * numberOfWarp + j;
mask[mask_index] = distr(eng);
if constexpr(WarpSize == 32)
count += popcount(static_cast<std::uint32_t>(mask[mask_index]));
else
count += popcount(mask[mask_index]);
}
vectorExpected[i]= count;
}
vectorExpected[i]= count;
}
// Copy the mask array
HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
// Copy the mask array
HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
}
// [Sphinx template warp size mask generation end]
int main() {
int main()
{
int deviceId = 0;
int warpSizeHost;
HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
std::cout << "Warp size: " << warpSizeHost << std::endl;
int deviceId = 0;
int warpSizeHost;
HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
std::cout << "Warp size: " << warpSizeHost << std::endl;
constexpr int numOfBlocks = 16;
constexpr int threadsPerBlock = 1024;
const int numberOfWarp = threadsPerBlock / warpSizeHost;
const int mask_element_size = warpSizeHost == 32 ? sizeof(std::uint32_t) : sizeof(std::uint64_t);
const int mask_size = numOfBlocks * numberOfWarp;
constexpr std::size_t arraySize = numOfBlocks * threadsPerBlock;
constexpr int numOfBlocks = 16;
constexpr int threadsPerBlock = 1024;
const int numberOfWarp = threadsPerBlock / warpSizeHost;
const int mask_element_size = warpSizeHost == 32 ? sizeof(uint32_t) : sizeof(uint64_t);
const int mask_size = numOfBlocks * numberOfWarp;
constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
int *d_data, *d_results;
void *d_mask;
int initValue = 1;
std::vector<int> vectorInput(arraySize, initValue);
std::vector<int> vectorOutput(numOfBlocks);
std::vector<int> vectorExpected(numOfBlocks);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
// Host to Device copy of the input array
HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
int *d_data, *d_results;
void *d_mask;
int initValue = 1;
std::vector<int> vectorInput(arraySize, initValue);
std::vector<int> vectorOutput(numOfBlocks);
std::vector<int> vectorExpected(numOfBlocks);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
// Host to Device copy of the input array
HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
// [Sphinx template warp size select kernel start]
// Fill up the mask variable, copy to device and select the right kernel.
if(warpSizeHost == 32) {
// Generate and copy mask arrays
generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
// [Sphinx template warp size select kernel start]
// Fill up the mask variable, copy to device and select the right kernel.
if(warpSizeHost == 32)
{
// Generate and copy mask arrays
generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
// Start the kernel
block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
static_cast<uint32_t*>(d_mask),
d_results,
arraySize);
} else if(warpSizeHost == 64) {
// Generate and copy mask arrays
generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
// Start the kernel
block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
static_cast<uint64_t*>(d_mask),
d_results,
arraySize);
} else {
std::cerr << "Unsupported warp size." << std::endl;
return 0;
}
// [Sphinx template warp size select kernel end]
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Device to Host copy of the result
HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
for(size_t i = 0; i < numOfBlocks; ++i) {
if(vectorOutput[i] != vectorExpected[i]) {
passed = false;
std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
// Start the kernel
block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
static_cast<std::uint32_t*>(d_mask),
d_results,
arraySize);
}
}
if(passed){
std::cout << "Execution completed successfully." << std::endl;
}else{
std::cerr << "Execution failed." << std::endl;
}
else if(warpSizeHost == 64)
{
// Generate and copy mask arrays
generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
// Cleanup
HIP_CHECK(hipFree(d_data));
HIP_CHECK(hipFree(d_mask));
HIP_CHECK(hipFree(d_results));
return 0;
}
// Start the kernel
block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
static_cast<std::uint64_t*>(d_mask),
d_results,
arraySize);
}
else
{
std::cerr << "Unsupported warp size." << std::endl;
return EXIT_FAILURE;
}
// [Sphinx template warp size select kernel end]
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Device to Host copy of the result
HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
for(std::size_t i = 0; i < numOfBlocks; ++i)
{
if(vectorOutput[i] != vectorExpected[i])
{
passed = false;
std::cerr << "Validation failed! Expected " << vectorExpected[i]
<< " got " << vectorOutput[i] << " at index: " << i << std::endl;
}
}
if(passed)
{
std::cout << "Execution completed successfully." << std::endl;
}
else
{
std::cerr << "Execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_data));
HIP_CHECK(hipFree(d_mask));
HIP_CHECK(hipFree(d_results));
return EXIT_SUCCESS;
}
+66
@@ -0,0 +1,66 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// [sphinx-kernel-start]
__global__ void kernel()
{
long long int start = clock64();
// kernel code
long long int stop = clock64();
long long int cycles = stop - start;
}
// [sphinx-kernel-end]
int main()
{
int deviceId = 0;
// [sphinx-query-start]
int wallClkRate = 0; // in kilohertz
HIP_CHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
// [sphinx-query-end]
kernel<<<dim3{1, 1, 1}, dim3{32,1,1}>>>();
HIP_CHECK(hipDeviceSynchronize());
std::cout << "Device's wall clock rate is " << wallClkRate << " kHz." << std::endl;
return EXIT_SUCCESS;
}
+89
@@ -0,0 +1,89 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
// [sphinx-start]
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t err = expression; \
if(err != hipSuccess) \
{ \
std::cerr << "HIP error: " \
<< hipGetErrorString(err) \
<< " at " << __LINE__ << "\n"; \
} \
}
// Addition of two values.
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
int main()
{
int deviceId;
HIP_CHECK(hipGetDevice(&deviceId));
int *a, *b, *c;
// Allocate memory for a, b, and c accessible to both device and host codes.
HIP_CHECK(hipMallocManaged(&a, sizeof(*a)));
HIP_CHECK(hipMallocManaged(&b, sizeof(*b)));
HIP_CHECK(hipMallocManaged(&c, sizeof(*c)));
// Advise that a and b are read-mostly, preferably located on the GPU, and accessed by the GPU.
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetPreferredLocation, deviceId));
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetAccessedBy, deviceId));
HIP_CHECK(hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, deviceId));
HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetPreferredLocation, deviceId));
HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetAccessedBy, deviceId));
HIP_CHECK(hipMemAdvise(b, sizeof(*b), hipMemAdviseSetReadMostly, deviceId));
// Advise that c is read-mostly, preferably located on the CPU, and accessed by the CPU.
HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetPreferredLocation, hipCpuDeviceId));
HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetAccessedBy, hipCpuDeviceId));
HIP_CHECK(hipMemAdvise(c, sizeof(*c), hipMemAdviseSetReadMostly, hipCpuDeviceId));
// Setup input values.
*a = 1;
*b = 2;
// Launch add() kernel on GPU.
add<<<1, 1>>>(a, b, c);
// Wait for GPU to finish before accessing on host.
HIP_CHECK(hipDeviceSynchronize());
// Print the result.
std::cout << *a << " + " << *b << " = " << *c << std::endl;
// Cleanup allocated memory.
HIP_CHECK(hipFree(a));
HIP_CHECK(hipFree(b));
HIP_CHECK(hipFree(c));
return 0;
}
// [sphinx-end]
+154 -129
@@ -20,16 +20,23 @@
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
#include "popcount.hpp"
#include <hip/hip_runtime.h>
#include <type_traits>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <random>
#include <vector>
#include <type_traits>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
if(status != hipSuccess) \
{ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
@@ -39,146 +46,164 @@
}
// [Sphinx HIP warp size block reduction kernel start]
__global__ void block_reduce(int* input, uint64_t* mask, int* output, size_t size){
extern __shared__ int shared[];
// Read of input with bounds check
auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
{
uint64_t warp_mask = 1ull << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
};
const uint32_t tid = threadIdx.x,
lid = threadIdx.x % warpSize,
wid = threadIdx.x / warpSize,
bid = blockIdx.x,
gid = bid * blockDim.x + tid;
// Read input buffer to shared
shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
__syncthreads();
// Shared reduction
for (uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
__global__ void block_reduce(int* input, std::uint64_t* mask, int* output, std::size_t size)
{
extern __shared__ int shared[];
// Read of input with bounds check
auto read_global_safe = [&](const std::uint32_t i, const std::uint32_t lane_id, const std::uint32_t mask_id)
{
std::uint64_t warp_mask = 1ull << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
};
const std::uint32_t tid = threadIdx.x,
lid = threadIdx.x % warpSize,
wid = threadIdx.x / warpSize,
bid = blockIdx.x,
gid = bid * blockDim.x + tid;
// Read input buffer to shared
shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid);
__syncthreads();
}
// Use local variable in warp reduction
int result = shared[tid];
__syncthreads();
// Shared reduction
for (std::uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2)
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
__syncthreads();
}
// This loop would be unrolled the same with the compile-time WarpSize.
#pragma unroll
for (uint32_t i = warpSize/2; i >= 1; i /= 2) {
result = result + __shfl_down(result, i);
}
// Use local variable in warp reduction
int result = shared[tid];
__syncthreads();
// Write result to output buffer
if (tid == 0)
output[bid] = result;
};
// This loop would be unrolled the same with the compile-time WarpSize.
#pragma unroll
for (std::uint32_t i = warpSize/2; i >= 1; i /= 2) {
result = result + __shfl_down(result, i);
}
// Write result to output buffer
if (tid == 0)
output[bid] = result;
}
// [Sphinx HIP warp size block reduction kernel end]
// [Sphinx HIP warp size mask generation start]
void generate_and_copy_mask(
uint64_t *d_mask,
std::vector<int>& vectorExpected,
int warpSizeHost,
int numOfBlocks,
int numberOfWarp,
int mask_size,
int mask_element_size) {
std::random_device rd;
std::mt19937_64 eng(rd());
std::uint64_t *d_mask,
std::vector<int>& vectorExpected,
int warpSizeHost,
int numOfBlocks,
int numberOfWarp,
int mask_size,
int mask_element_size)
{
std::random_device rd;
std::mt19937_64 eng(rd());
// Host side mask vector
std::vector<uint64_t> mask(mask_size);
// Define uniform unsigned int distribution
std::uniform_int_distribution<uint64_t> distr;
// Fill up the mask
for(int i=0; i < numOfBlocks; i++) {
int count = 0;
for(int j=0; j < numberOfWarp; j++) {
int mask_index = i * numberOfWarp + j;
mask[mask_index] = distr(eng);
if(warpSizeHost == 32)
count += __builtin_popcount(mask[mask_index]);
else
count += __builtin_popcountll(mask[mask_index]);
// Host side mask vector
std::vector<std::uint64_t> mask(mask_size);
// Define uniform unsigned int distribution
std::uniform_int_distribution<std::uint64_t> distr;
// Fill up the mask
for(int i=0; i < numOfBlocks; i++)
{
int count = 0;
for(int j=0; j < numberOfWarp; j++)
{
int mask_index = i * numberOfWarp + j;
mask[mask_index] = distr(eng);
if(warpSizeHost == 32)
count += popcount(static_cast<std::uint32_t>(mask[mask_index]));
else
count += popcount(mask[mask_index]);
}
vectorExpected[i]= count;
}
vectorExpected[i]= count;
}
// Copy the mask array
HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
// Copy the mask array
HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
}
// [Sphinx HIP warp size mask generation end]
int main() {
int deviceId = 0;
int warpSizeHost;
HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
std::cout << "Warp size: " << warpSizeHost << std::endl;
constexpr int numOfBlocks = 16;
constexpr int threadsPerBlock = 1024;
const int numberOfWarp = threadsPerBlock / warpSizeHost;
const int mask_element_size = sizeof(uint64_t);
const int mask_size = numOfBlocks * numberOfWarp;
constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
int *d_data, *d_results;
uint64_t *d_mask;
int initValue = 1;
std::vector<int> vectorInput(arraySize, initValue);
std::vector<int> vectorOutput(numOfBlocks);
std::vector<int> vectorExpected(numOfBlocks);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
// Host to Device copy of the input array
HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
int main()
{
int deviceId = 0;
int warpSizeHost;
HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
std::cout << "Warp size: " << warpSizeHost << std::endl;
constexpr int numOfBlocks = 16;
constexpr int threadsPerBlock = 1024;
const int numberOfWarp = threadsPerBlock / warpSizeHost;
const int mask_element_size = sizeof(std::uint64_t);
const int mask_size = numOfBlocks * numberOfWarp;
constexpr std::size_t arraySize = numOfBlocks * threadsPerBlock;
int *d_data, *d_results;
std::uint64_t *d_mask;
int initValue = 1;
std::vector<int> vectorInput(arraySize, initValue);
std::vector<int> vectorOutput(numOfBlocks);
std::vector<int> vectorExpected(numOfBlocks);
// Allocate device memory
HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
// Host to Device copy of the input array
HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));
// [Sphinx HIP warp size select kernel start]
// Generate and copy mask arrays
generate_and_copy_mask(
d_mask,
vectorExpected,
warpSizeHost,
numOfBlocks,
numberOfWarp,
mask_size,
mask_element_size);
// [Sphinx HIP warp size select kernel start]
// Generate and copy mask arrays
generate_and_copy_mask(
d_mask,
vectorExpected,
warpSizeHost,
numOfBlocks,
numberOfWarp,
mask_size,
mask_element_size);
// Start the kernel
block_reduce<<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
d_mask,
d_results,
arraySize);
// [Sphinx HIP warp size select kernel end]
// Start the kernel
block_reduce<<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
d_data,
d_mask,
d_results,
arraySize);
// [Sphinx HIP warp size select kernel end]
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Device to Host copy of the result
HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
for(size_t i = 0; i < numOfBlocks; ++i) {
if(vectorOutput[i] != vectorExpected[i]) {
passed = false;
std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
// Check the kernel launch
HIP_CHECK(hipGetLastError());
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Device to Host copy of the result
HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));
// Verify results
bool passed = true;
for(std::size_t i = 0; i < numOfBlocks; ++i)
{
if(vectorOutput[i] != vectorExpected[i])
{
passed = false;
std::cerr << "Validation failed! Expected " << vectorExpected[i]
<< " got " << vectorOutput[i] << " at index: " << i << std::endl;
}
}
}
if(passed){
std::cout << "Execution completed successfully." << std::endl;
}else{
std::cerr << "Execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_data));
HIP_CHECK(hipFree(d_mask));
HIP_CHECK(hipFree(d_results));
return 0;
}
if(passed)
{
std::cout << "Execution completed successfully." << std::endl;
}
else
{
std::cerr << "Execution failed." << std::endl;
}
// Cleanup
HIP_CHECK(hipFree(d_data));
HIP_CHECK(hipFree(d_mask));
HIP_CHECK(hipFree(d_results));
return EXIT_SUCCESS;
}
+313 -2
@@ -21,5 +21,316 @@
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/develop/HIP-Basic/opengl_interop/main.hip", "docs/tools/example_codes/opengl_interop.hip")
urllib.request.urlretrieve("https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/develop/HIP-Basic/vulkan_interop/main.hip", "docs/tools/example_codes/external_interop.hip")
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Basic/opengl_interop/main.hip",
"docs/tools/example_codes/opengl_interop.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Basic/vulkan_interop/main.hip",
"docs/tools/example_codes/external_interop.hip"
)
# HIP-C%2B%2B-Language-Extensions
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/calling_global_functions/main.hip",
"docs/tools/example_codes/calling_global_functions.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/extern_shared_memory/main.hip",
"docs/tools/example_codes/extern_shared_memory.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/launch_bounds/main.hip",
"docs/tools/example_codes/launch_bounds.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/set_constant_memory/main.hip",
"docs/tools/example_codes/set_constant_memory.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/template_warp_size_reduction/main.hip",
"docs/tools/example_codes/template_warp_size_reduction.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/timer/main.hip",
"docs/tools/example_codes/timer.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-C%2B%2B-Language-Extensions/warp_size_reduction/main.hip",
"docs/tools/example_codes/warp_size_reduction.hip"
)
# HIP-Porting-Guide
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/device_code_feature_identification/main.hip",
"docs/tools/example_codes/device_code_feature_identification.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/host_code_feature_identification/main.cpp",
"docs/tools/example_codes/host_code_feature_identification.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/identifying_compilation_target_platform/main.cpp",
"docs/tools/example_codes/identifying_compilation_target_platform.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/HIP-Porting-Guide/identifying_host_device_compilation_pass/main.hip",
"docs/tools/example_codes/identifying_host_device_compilation_pass.hip"
)
# Introduction-to-the-HIP-Programming-Model
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Introduction-to-the-HIP-Programming-Model/add_kernel/main.hip",
"docs/tools/example_codes/add_kernel.hip"
)
# Porting-CUDA-Driver-API
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/load_module/main.cpp",
"docs/tools/example_codes/load_module.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/load_module_ex/main.cpp",
"docs/tools/example_codes/load_module_ex.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/load_module_ex_cuda/main.cpp",
"docs/tools/example_codes/load_module_ex_cuda.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/per_thread_default_stream/main.cpp",
"docs/tools/example_codes/per_thread_default_stream.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Porting-CUDA-Driver-API/pointer_memory_type/main.cpp",
"docs/tools/example_codes/pointer_memory_type.cpp"
)
# Programming-for-HIP-Runtime-Compiler
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/compilation_apis/main.cpp",
"docs/tools/example_codes/compilation_apis.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/linker_apis/main.cpp",
"docs/tools/example_codes/linker_apis.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/linker_apis_file/main.cpp",
"docs/tools/example_codes/linker_apis_file.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/linker_apis_options/main.cpp",
"docs/tools/example_codes/linker_apis_options.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/lowered_names/main.cpp",
"docs/tools/example_codes/lowered_names.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Programming-for-HIP-Runtime-Compiler/rtc_error_handling/main.cpp",
"docs/tools/example_codes/rtc_error_handling.cpp"
)
# Using-HIP-Runtime-API
# Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution/async_kernel_execution/main.hip",
"docs/tools/example_codes/async_kernel_execution.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution/event_based_synchronization/main.hip",
"docs/tools/example_codes/event_based_synchronization.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/refs/heads/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Asynchronous-Concurrent-Execution/sequential_kernel_execution/main.hip",
"docs/tools/example_codes/sequential_kernel_execution.hip"
)
# Using-HIP-Runtime-API / Call-Stack
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Call-Stack/call_stack_management/main.cpp",
"docs/tools/example_codes/call_stack_management.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Call-Stack/device_recursion/main.hip",
"docs/tools/example_codes/device_recursion.hip"
)
# Using-HIP-Runtime-API / Error-Handling
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Error-Handling/error_handling/main.hip",
"docs/tools/example_codes/error_handling.hip"
)
# Using-HIP-Runtime-API / HIP-Graphs
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/HIP-Graphs/graph_capture/main.hip",
"docs/tools/example_codes/graph_capture.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/HIP-Graphs/graph_creation/main.hip",
"docs/tools/example_codes/graph_creation.hip"
)
# Using-HIP-Runtime-API / Initialization
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Initialization/simple_device_query/main.cpp",
"docs/tools/example_codes/simple_device_query.cpp"
)
# Using-HIP-Runtime-API / Memory-Management / Device-Memory
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/constant_memory/main.hip",
"docs/tools/example_codes/constant_memory_device.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/dynamic_shared_memory/main.hip",
"docs/tools/example_codes/dynamic_shared_memory_device.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/explicit_copy/main.cpp",
"docs/tools/example_codes/explicit_copy.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/kernel_memory_allocation/main.hip",
"docs/tools/example_codes/kernel_memory_allocation.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Device-Memory/static_shared_memory/main.hip",
"docs/tools/example_codes/static_shared_memory_device.hip"
)
# Using-HIP-Runtime-API / Memory-Management / Host-Memory
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Host-Memory/pageable_host_memory/main.cpp",
"docs/tools/example_codes/pageable_host_memory.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Host-Memory/pinned_host_memory/main.cpp",
"docs/tools/example_codes/pinned_host_memory.cpp"
)
# Using-HIP-Runtime-API / Memory-Management / SOMA
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/stream_ordered_memory_allocation/main.hip",
"docs/tools/example_codes/stream_ordered_memory_allocation.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/ordinary_memory_allocation/main.hip",
"docs/tools/example_codes/ordinary_memory_allocation.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool/main.hip",
"docs/tools/example_codes/memory_pool.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool_resource_usage_statistics/main.cpp",
"docs/tools/example_codes/memory_pool_resource_usage_statistics.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool_threshold/main.hip",
"docs/tools/example_codes/memory_pool_threshold.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/SOMA/memory_pool_trim/main.cpp",
"docs/tools/example_codes/memory_pool_trim.cpp"
)
# Using-HIP-Runtime-API / Memory-Management / Unified-Memory-Management
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/data_prefetching/main.hip",
"docs/tools/example_codes/data_prefetching.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/dynamic_unified_memory/main.hip",
"docs/tools/example_codes/dynamic_unified_memory.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/explicit_memory/main.hip",
"docs/tools/example_codes/explicit_memory.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/memory_range_attributes/main.hip",
"docs/tools/example_codes/memory_range_attributes.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/standard_unified_memory/main.hip",
"docs/tools/example_codes/standard_unified_memory.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/static_unified_memory/main.hip",
"docs/tools/example_codes/static_unified_memory.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Memory-Management/Unified-Memory-Management/unified_memory_advice/main.hip",
"docs/tools/example_codes/unified_memory_advice.hip"
)
# Using-HIP-Runtime-API / Multi-Device-Management
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/device_enumeration/main.cpp",
"docs/tools/example_codes/device_enumeration.cpp"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/device_selection/main.hip",
"docs/tools/example_codes/device_selection.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/multi_device_synchronization/main.hip",
"docs/tools/example_codes/multi_device_synchronization.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/p2p_memory_access/main.hip",
"docs/tools/example_codes/p2p_memory_access.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Programming-Guide/Using-HIP-Runtime-API/Multi-Device-Management/p2p_memory_access_host_staging/main.hip",
"docs/tools/example_codes/p2p_memory_access_host_staging.hip"
)
# Reference examples from HIP-Doc / Reference
# CUDA-to-HIP-API-Function-Comparison
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/CUDA-to-HIP-API-Function-Comparison/block_reduction/main.cu",
"docs/tools/example_codes/block_reduction.cu"
)
# HIP-Complex-Math-API
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/HIP-Complex-Math-API/complex_math/main.hip",
"docs/tools/example_codes/complex_math.hip"
)
# HIP-Math-API
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/HIP-Math-API/math/main.hip",
"docs/tools/example_codes/math.hip"
)
# Low-Precision-Floating-Point-Types
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/Low-Precision-Floating-Point-Types/low_precision_float_fp8/main.hip",
"docs/tools/example_codes/low_precision_float_fp8.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Reference/Low-Precision-Floating-Point-Types/low_precision_float_fp16/main.hip",
"docs/tools/example_codes/low_precision_float_fp16.hip"
)
# Tutorial codes from HIP-Doc / Tutorials
# graph_api
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Tutorials/graph_api/src/main_streams.hip",
"docs/tools/example_codes/graph_api_tutorial_main_streams.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Tutorials/graph_api/src/main_graph_capture.hip",
"docs/tools/example_codes/graph_api_tutorial_main_graph_capture.hip"
)
urllib.request.urlretrieve(
"https://raw.githubusercontent.com/ROCm/rocm-examples/amd-staging/HIP-Doc/Tutorials/graph_api/src/main_graph_creation.hip",
"docs/tools/example_codes/graph_api_tutorial_main_graph_creation.hip"
)
+667
@@ -0,0 +1,667 @@
.. meta::
:description: HIP graph API tutorial
:keywords: AMD, ROCm, HIP, graph API, tutorial
.. _hip_graph_api_tutorial:
*******************************************************************************
HIP Graph API Tutorial
*******************************************************************************
**Time to complete**: 60 minutes | **Difficulty**: Intermediate | **Domain**: Medical Imaging
Introduction
============
Imagine you are directing a movie. In traditional GPU programming with streams, you are like a director who must call
"action!" for every single shot, waiting between each take. With HIP graphs, you pre-plan the entire scene sequence and
then call "action!" just once to film everything in one go. This tutorial will show you how to transform your GPU
applications from repeated direction to choreographed performance.
Modeling dependencies between GPU operations
--------------------------------------------
Most movies in the world follow a plot where certain scenes must happen before the following scenes; otherwise the
movie might not make much sense. If a scene *A* must happen before scenes *B* and *C*, *B* and *C* depend on *A*. If
*B* and *C* contain different stories that (at this point) are unrelated to each other, *B* and *C* are independent and
can be shown to the audience in any order. However, both scenes might be a prerequisite for the final scene *D*, so *D*
depends on both of them. When you represent scenes as *nodes* and dependencies as *edges*, you can create a graph, and
the graph representing your imaginary movie script will have a diamond-like shape:
.. figure:: ../data/tutorial/graph_api/diamond.svg
:alt: Diagram showing a graph with diamond-like shape. Nodes represent movie scenes and edges represent dependencies
between scenes.
:align: center
You can think about GPU operations in a similar way. For example, most kernels require at least one data buffer to work
with, so they will depend on a preceding copy or ``memset`` operation. Others might process the results of preceding
kernels. Real-world applications typically involve multiple GPU operations with dependencies between them. HIP offers
two ways to think about and model these dependencies: streams and graphs.
Streams
^^^^^^^
Streams are HIP's default model for organizing and launching GPU operations on the device. They are sequential sets of
operations, similar to CPU threads. Adding operation *A* before operation *B* to a stream ensures *A* happens before
*B*, regardless of any interdependencies (or lack thereof) between them. A stream can be thought of as a first-in,
first-out (FIFO) queue of operations.
Multiple streams operate independently, and manual synchronization is required when dependencies cross stream
boundaries. Additionally, each operation in a stream is scheduled independently. Depending on the complexity of the
enqueued operation, this can lead to noticeable CPU launch overhead and kernel dispatch latency, especially for
workloads with many small kernels. In return, streams are well suited to workloads that are dynamic and
unpredictable.
For more information about HIP streams, see :ref:`asynchronous_how-to`.
Graphs
^^^^^^
HIP graphs model operations and their dependencies as the nodes and edges of a graph. Each node in the graph
represents an operation, and each edge represents a dependency between two nodes. If no path of edges connects two
nodes, they are independent and can execute in any order.
Because dependency information is built into the graph, the HIP runtime automatically inserts the necessary
synchronization points. Launching all operations in a graph requires only a single API call, which greatly reduces
per-operation launch overhead and dispatch latency. This is especially beneficial for workloads with many small
kernels, where launch overhead can dominate overall execution time.
Graphs must be defined once before use, making them ideal for fixed workflows that run repeatedly. While node
parameters can be updated between executions, the graph structure itself cannot change after instantiation. This
structural immutability is the primary trade-off compared to the flexibility of streams.
For more information about HIP graphs, see :ref:`how_to_HIP_graph`.
When to use graphs
^^^^^^^^^^^^^^^^^^
This table shows when to use graphs in your application.
.. list-table::
:header-rows: 1
:class: decision-matrix
* - ✅ **Use Graphs When**
- ❌ **Avoid Graphs When**
* - Workflow is fixed and repetitive
- Workflow changes dynamically
* - Same kernels execute many times
- One-shot operations
* - Launch overhead is significant (many small kernels)
- Kernels are long-running
Transitioning a CT reconstruction pipeline
------------------------------------------
In this tutorial, you will modify an existing GPU-accelerated stream-based image processing pipeline that reconstructs
computed tomography (CT) data (the classic Shepp-Logan phantom [ShLo74]_). The pipeline transforms raw X-ray
projections into clear cross-sectional images used in medical diagnosis.
.. figure:: ../data/tutorial/graph_api/ct_reconstruction_overview.png
:alt: Diagram showing raw projection data being transformed into a reconstructed CT slice
:align: center
.. note::
The tutorial application generates a phantom volume and forward projections. This GPU-accelerated operation uses
multiple streams and appears in the traces. You can ignore the dataset generation — it is not relevant to this
tutorial.
The reconstruction pipeline consists of:
1. **Load** projection data into GPU memory
2. **Preprocess** the projection through six stages:
a. Logarithmic transformation (convert X-ray intensities)
b. Pixel weighting (correct for cone-beam geometry)
c. Forward FFT (transform to frequency domain)
d. Shepp-Logan filtering (enhance edges and improve contrast)
e. Inverse FFT (return to spatial domain)
f. Normalization (account for unnormalized FFT)
3. **Reconstruct** the 3D volume using the Feldkamp-Davis-Kress (FDK) algorithm [FeDK84]_
**Why HIP graphs?** CT scanners process hundreds of projections per scan. By capturing this fixed workflow as a graph,
you will reduce the number of API calls required for launching the workflow on a GPU to 1 per projection, thus reducing
launch overhead and dispatch latency to near-zero.
What you will learn
-------------------
After completing this tutorial, you will be able to:
* Convert a stream-based HIP application to a graph-based application via stream capturing
* Create graphs manually for fine-grained control
* Integrate graph-safe libraries like hipFFT into your graphs
* Understand when graphs provide performance benefits
* Apply graph concepts to your own workflows
Before you begin
----------------
Required knowledge
^^^^^^^^^^^^^^^^^^
You should be comfortable writing and debugging HIP kernels, understand basic GPU memory management concepts like
device allocation and host-to-device transfers, be familiar with HIP streams and events, and have experience using
CMake to build C++ projects. This tutorial assumes you have written at least a few HIP programs before and understand
concepts like grid dimensions and thread blocks.
Hardware and software requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Your system needs ROCm 6.2 or later with the hipFFT library installed. The tutorial works on
all :doc:`supported AMD GPUs <rocm-install-on-linux:reference/system-requirements>`, though at least 4 GiB of GPU
memory is recommended for comfortable performance with the reconstruction workload. You will also need
`git <https://git-scm.com/>`__ to check out the code repository, `CMake <https://www.cmake.org>`__ 3.21 or later to
build the code, along with a CMake generator that supports the HIP language such as GNU Make or Ninja.
.. note::
Visual Studio generators currently do not support HIP. The (optional) ``rocprofv3`` tool is currently supported on
Linux only.
To save the output volume, you need a recent version of `libTIFF <https://libtiff.gitlab.io/libtiff/>`__. If CMake
cannot find libTIFF on your system, it automatically downloads and builds it.
To view both the input projections and the output volume produced by this tutorial, install a scientific image viewer
that can display 16-bit and 32-bit grayscale data, such as `Fiji <https://imagej.net/software/fiji/downloads>`__.
Standard image viewers may be unable to correctly display the output.
Optional knowledge
^^^^^^^^^^^^^^^^^^
While not required, familiarity with Fast Fourier Transform (FFT) operations will help you understand the filtering
steps. Similarly, knowledge of medical imaging or CT reconstruction is helpful for understanding the application
context. If you have worked with signal processing or image filtering before, you will recognize some of the applied
concepts.
.. note::
You can skip the reconstruction algorithm and concentrate on the stream and graph implementations in the files
prefixed with ``main_``.
Step 1: Build the tutorial code
===============================
The full code for this tutorial is part of the `ROCm examples repository <https://github.com/ROCm/rocm-examples>`__.
Check out the repository:
.. code-block:: bash
git clone https://github.com/ROCm/rocm-examples.git
Then navigate to ``rocm-examples/HIP-Doc/Tutorials/graph_api/``. The code can be found in the ``src`` subdirectory.
Create a separate ``build`` directory inside ``rocm-examples/HIP-Doc/Tutorials/graph_api/``. Then
configure the project (adjust ``CMAKE_HIP_ARCHITECTURES`` to match your GPU):
.. code-block:: bash
cd build
cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_BUILD_TYPE=Release -DCMAKE_HIP_ARCHITECTURES=gfx1100 -DCMAKE_HIP_PLATFORM=amd -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_C_COMPILER=amdclang -DCMAKE_HIP_COMPILER=amdclang++ ..
Now you can build the three variants of the tutorial code:
.. code-block:: bash
cmake --build . --target hip_graph_api_tutorial_streams hip_graph_api_tutorial_graph_capture hip_graph_api_tutorial_graph_creation
.. note::
The ``graph_capture`` variant is currently not supported on Windows and the build target is therefore unavailable.
Step 2: Examining the stream-based baseline application
=======================================================
Open ``src/main_streams.hip`` in your editor. You will explore how this application processes data.
Understanding batched processing
--------------------------------
The application processes multiple projections simultaneously to maximize GPU utilization.
Determining parallel capacity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
At the beginning of ``main()``, the program queries the GPU for its number of asynchronous engines to determine how
many streams it can create, indicating how many data transfer or compute operations can run in parallel.
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-async-engine-start]
:end-before: // [sphinx-async-engine-end]
:language: cuda
:dedent:
.. tip::
Each asynchronous engine executes operations independently. More engines mean more parallelism.
Processing projections in batches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Find the ``MAIN LOOP`` comment. Here the application groups projections into parallel batches:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-batch-start]
:end-before: // [sphinx-batch-end]
:language: cuda
:dedent:
Notice how each batch size equals the stream count — this ensures every stream stays busy.
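
The batching scheme can be sketched in plain C++. This is a simplified host-only illustration rather than the
tutorial's code: ``Batch``, ``make_batches``, and the parameter names are hypothetical, and the sketch assumes that a
trailing batch can be smaller when the projection count is not a multiple of the stream count.

.. code-block:: cpp

   #include <algorithm>
   #include <cstddef>
   #include <vector>

   // Hypothetical description of one batch: projections [first, first + count).
   struct Batch
   {
       std::size_t first;
       std::size_t count;
   };

   // Splits numProjections into batches of at most numStreams projections so
   // that every projection of a batch can be assigned to its own stream.
   // Only the last batch can be smaller than numStreams.
   std::vector<Batch> make_batches(std::size_t numProjections, std::size_t numStreams)
   {
       std::vector<Batch> batches;
       if(numStreams == 0)
           return batches; // nothing can be scheduled without a stream

       for(std::size_t first = 0; first < numProjections; first += numStreams)
       {
           const std::size_t count = std::min(numStreams, numProjections - first);
           batches.push_back({first, count});
       }
       return batches;
   }

With 10 projections and 4 streams, ``make_batches(10, 4)`` yields three batches that contain 4, 4, and 2 projections.
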
Synchronization
^^^^^^^^^^^^^^^
Each projection is processed independently, so you only need to synchronize once at the end.
The :cpp:func:`hipStreamWaitEvent` function makes the first stream wait for all other streams to complete.
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-sync-start]
:end-before: // [sphinx-sync-end]
:language: cuda
:dedent:
Exploring the processing pipeline
---------------------------------
Next, examine what happens to each projection. Find the ``START HERE`` comment to see the reconstruction pipeline's
first steps:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-preprocessing-start]
:end-before: // [sphinx-preprocessing-end]
:language: cuda
:dedent:
This is a typical pattern found across many HIP applications: multiple kernels executing in sequence with data
dependencies. In the next step, the weighted projections need to be transformed into Fourier space and filtered. For
optimal performance, it is recommended to execute a 1D FFT on a buffer whose size is a power of two. Copy the weighted
projection to another buffer where the row length is a power of two equal to or larger than the projection's row
length:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-proj-to-expanded-start]
:end-before: // [sphinx-proj-to-expanded-end]
:language: cuda
:dedent:
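
The padding rule described above can be sketched on the host as follows. This is an illustration only: the function
names are hypothetical, and filling the additional values with zeros is an assumption made for this sketch.

.. code-block:: cpp

   #include <algorithm>
   #include <cstddef>
   #include <vector>

   // Returns the smallest power of two that is greater than or equal to n.
   // Assumes that the result fits into std::size_t.
   constexpr std::size_t next_power_of_two(std::size_t n)
   {
       std::size_t result = 1;
       while(result < n)
           result *= 2;
       return result;
   }

   static_assert(next_power_of_two(512) == 512, "powers of two stay unchanged");
   static_assert(next_power_of_two(513) == 1024, "other lengths are rounded up");

   // Copies a row-major image with `cols` values per row into a buffer whose
   // rows have next_power_of_two(cols) values. The additional values are set to
   // zero (an assumption made for this sketch).
   std::vector<float> expand_rows(const std::vector<float>& image, std::size_t rows, std::size_t cols)
   {
       const std::size_t paddedCols = next_power_of_two(cols);
       std::vector<float> expanded(rows * paddedCols, 0.0f);
       for(std::size_t r = 0; r < rows; ++r)
       {
           std::copy_n(image.begin() + r * cols, cols, expanded.begin() + r * paddedCols);
       }
       return expanded;
   }

A 2 × 3 image is expanded to 2 × 4 values, with one padding value at the end of each row.
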
Next, transform the expanded projection into Fourier space for filtering:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-forward-start]
:end-before: // [sphinx-forward-end]
:language: cuda
:dedent:
.. tip::
Some hipFFT operations are graph-safe: as long as they operate on the capturing stream, they are captured into the
graph as well. Refer to :ref:`hipFFT's documentation <hipfft:hipfft-api-usage>` for more
information on its graph-safe operations.
In Fourier space, apply the Shepp-Logan filter, then transform back:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-filter-start]
:end-before: // [sphinx-filter-end]
:language: cuda
:dedent:
Shrink to original size and normalize the FFT output:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-expanded-to-proj-start]
:end-before: // [sphinx-expanded-to-proj-end]
:language: cuda
:dedent:
Finally, back-project the filtered projection into the 3D volume using ``atomicAdd`` operations to accumulate voxel
values from multiple kernels:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_streams.hip
:start-after: // [sphinx-bp-start]
:end-before: // [sphinx-bp-end]
:language: cuda
:dedent:
.. note::
The preprocessing kernels process 512 × 512 pixels (:math:`\mathcal{O}(n^2)`), while the back-projection kernel
processes 512 × 512 × 512 voxels (:math:`\mathcal{O}(n^3)`). This cubic complexity makes back-projection the
computational bottleneck.
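
For the sizes stated in the note above, the element counts work out as follows:

.. math::

   512^2 = 262\,144 \ \text{pixels}, \qquad
   512^3 = 134\,217\,728 \ \text{voxels}, \qquad
   \frac{512^3}{512^2} = 512

Each back-projection kernel therefore visits 512 times as many elements as a single preprocessing kernel.
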
Creating a trace file
^^^^^^^^^^^^^^^^^^^^^
Inside the ``build`` directory you will now generate a trace:
.. code-block:: bash
rocprofv3 -o streams -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_streams
.. note::
For more information on the ``rocprofv3`` tool, please refer to its
:ref:`documentation <rocprofiler-sdk:using-rocprofv3>`.
Analyzing the trace
^^^^^^^^^^^^^^^^^^^
Open the trace file to see what is really happening:
1. Navigate to your ``build/outDir`` directory
2. Open ``streams_results.pftrace`` in `Perfetto <https://ui.perfetto.dev>`__
3. Click the arrow next to your executable name under ``System``
4. Focus on the kernel execution pattern on the right
.. figure:: ../data/tutorial/graph_api/streams_trace.png
:alt: Stream execution showing gaps between kernel launches
:align: center
While projections are processed in parallel, there are visible gaps between operations. These gaps represent overhead caused
by scheduling and launching the operations. In the next section, you will eliminate these gaps by capturing streams into
a graph.
Step 3: Converting to graphs via stream capture
===============================================
Stream capture is a feature that allows you to record a sequence of GPU operations (kernel launches, memory copies,
etc.) into a HIP Graph, which can later be executed as a single, optimized unit. Open the file
``src/main_graph_capture.hip``, which contains the code from the previous step, with a few changes that allow you
to capture the streams into a single graph.
Before the main loop, declare graph-specific variables:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-graph-vars-start]
:end-before: // [sphinx-graph-vars-end]
:language: cuda
:dedent:
``graphExec`` and ``graphExecFinal`` will be instances of the graph template that you will create in the following
steps. You will typically instantiate a graph template once and update its parameters for repeated launches. If the
graph topology changes, you will need a new instance. The ``graphStream`` will launch the final graph instances.
Inside the main loop, activate capture mode on the first stream:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-begin-capture-start]
:end-before: // [sphinx-begin-capture-end]
:language: cuda
:dedent:
.. admonition:: What happens during capture?
After :cpp:func:`hipStreamBeginCapture` is called, the stream no longer executes the operations enqueued into it. Instead, it
records them into a graph template (``graph`` in the code shown here).
To capture multiple streams, use events to implement the fork-join pattern:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-fork-start]
:end-before: // [sphinx-fork-end]
:language: cuda
:dedent:
This creates dependencies between streams, activating capture mode on the additional streams and ensuring they are all
part of the same graph.
**The processing pipeline itself remains unchanged.**
After recording all operations of the current batch, join the streams:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-join-start]
:end-before: // [sphinx-join-end]
:language: cuda
:dedent:
Then stop capturing:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-stop-capture-start]
:end-before: // [sphinx-stop-capture-end]
:language: cuda
:dedent:
The graph template is now complete. In order to execute the recorded operations, you need to instantiate the graph
and execute it on the ``graphStream``. The graph template can be safely destroyed after instantiating:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-graph-instantiate-start]
:end-before: // [sphinx-graph-instantiate-end]
:language: cuda
:dedent:
.. tip::
Use :cpp:func:`hipGraphDebugDotPrint` to save a graph's topology into a ``*.dot`` file. The resulting file
contains a `DOT <https://graphviz.org/doc/info/lang.html>`__ description which can be processed with
`Graphviz <https://graphviz.org/>`__ or visualized with several tools. For example:
.. code-block:: bash
dot -Tpng graph_capture.dot -o graph_capture.png
Instantiating a graph is a relatively costly operation. However, you need to update the parameters whenever a new batch
is processed. Since the graph templates are the same for all batches (i.e., the topology of the resulting graph does
not change), it is sufficient to update the existing graph instance's parameters instead of creating a new instance:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-graph-update-start]
:end-before: // [sphinx-graph-update-end]
:language: cuda
:dedent:
Should the graph's topology change between iterations, it is necessary to create a new graph instance. In your
application's case, this can happen when the number of projections is not evenly divisible by the number of
asynchronous engines:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_capture.hip
:start-after: // [sphinx-graph-final-start]
:end-before: // [sphinx-graph-final-end]
:language: cuda
:dedent:
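
The decision between updating and re-instantiating can be modeled with a small host-only bookkeeping sketch. It does
not call the HIP API: ``LaunchPlan`` and ``plan_launches`` are hypothetical names, and the sketch assumes that the
graph topology depends only on the number of projections in a batch.

.. code-block:: cpp

   #include <cstddef>
   #include <vector>

   // Counts how often a graph would have to be instantiated or merely updated.
   struct LaunchPlan
   {
       std::size_t instantiations = 0;
       std::size_t updates        = 0;
   };

   // batchSizes[i] is the number of projections in batch i. Assumption of this
   // sketch: a batch with the same size as the current instance has the same
   // topology and only needs a parameter update. Any other batch size needs a
   // new instance.
   LaunchPlan plan_launches(const std::vector<std::size_t>& batchSizes)
   {
       LaunchPlan plan;
       std::size_t instantiatedSize = 0; // 0 means "no instance yet"
       for(const std::size_t size : batchSizes)
       {
           if(size == 0)
               continue; // empty batches launch nothing

           if(size == instantiatedSize)
           {
               ++plan.updates;
           }
           else
           {
               ++plan.instantiations;
               instantiatedSize = size;
           }
       }
       return plan;
   }

For batches of 4, 4, and 2 projections, the sketch reports two instantiations (the first batch and the smaller final
batch) and one update.
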
Creating a trace
----------------
Now you have successfully converted the processing pipeline into an executable graph. You can examine the effects of
this change and generate another trace:
.. code-block:: bash
rocprofv3 -o graph_capture -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_graph_capture
Analyzing the trace
-------------------
Opening the resulting trace file ``outDir/graph_capture_results.pftrace`` with Perfetto shows a significant change:
.. figure:: ../data/tutorial/graph_api/capture_trace.png
:alt: Diagram showing a trace of the capturing variant.
:align: center
The gaps have disappeared! By capturing all operations of a batch into a single graph, you have successfully
eliminated the launching and scheduling overhead previously observed in the stream-based variant.
A limitation of stream capture is that it preserves stream ordering even when unnecessary. Operations that could run in
parallel still execute sequentially. Another approach to graphs is manual construction. This is quite verbose but also
offers much more control over dependencies and parallelism.
Step 4: Manual graph creation (advanced)
========================================
Open ``src/main_graph_creation.hip`` and find the main loop. The code here differs from the other variants: rather than
capturing streams into graphs, you will build the graph manually. Consider how the weighting kernel is invoked through
a kernel node:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-weighting-node-start]
:end-before: // [sphinx-weighting-node-end]
:language: cuda
:dedent:
You create an array of ``void*`` pointers containing the kernel parameters. Next, configure the kernel launch
parameters: grid and block dimensions, the kernel function pointer, and the dynamic shared memory size. Finally, add
the kernel node to the graph template. Note the ``&logTransformationKernelNode, 1`` part: this is how you specify a
dependency from the preceding log transformation kernel node to the weighting kernel node.
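
The array of ``void*`` pointers holds the address of each argument, not the argument itself. The following host-only
sketch mimics this convention with a mock launcher. ``scale``, ``launch_scale``, and ``run_scale`` are hypothetical
stand-ins written for this illustration and are not part of HIP.

.. code-block:: cpp

   #include <cstddef>
   #include <vector>

   // Stand-in for a kernel: scales `count` values in place. In a HIP application
   // this would be a __global__ function. Here it is a plain host function.
   void scale(float* data, std::size_t count, float factor)
   {
       for(std::size_t i = 0; i < count; ++i)
           data[i] *= factor;
   }

   // Hypothetical launcher that mimics how parameters are unpacked:
   // args[i] is the ADDRESS of the i-th argument, so each entry is cast to a
   // pointer to the argument's exact type and then dereferenced.
   void launch_scale(void** args)
   {
       float* const data       = *static_cast<float**>(args[0]);
       const std::size_t count = *static_cast<std::size_t*>(args[1]);
       const float factor      = *static_cast<float*>(args[2]);
       scale(data, count, factor);
   }

   // Packs the arguments in declaration order and calls the launcher.
   void run_scale(std::vector<float>& values, float factor)
   {
       float* data       = values.data();
       std::size_t count = values.size();
       void* args[]      = {&data, &count, &factor};
       launch_scale(args);
   }

The order and the types of the entries must match the parameter list of the function being launched, because the
launcher has no other way to interpret the untyped pointers.
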
.. note::
To specify multiple dependencies, pass an array of :cpp:type:`hipGraphNode_t` objects and the number of
nodes inside the array to :cpp:func:`hipGraphAddKernelNode`.
The HIP graph API supports multiple different node types. For example, this is how a ``memset`` node is set up:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-memset-node-start]
:end-before: // [sphinx-memset-node-end]
:language: cuda
:dedent:
.. note::
Despite the different construction method, graph instantiation and updates
work exactly as before. You can find the same patterns at the loop's end.
Adding hipFFT nodes
-------------------
While hipFFT provides graph-safe functionality, it does not support manual node creation. Integrating hipFFT into the
graph requires a workaround using stream capture with additional bookkeeping.
You capture the graph state before and after hipFFT operations, then identify the nodes hipFFT added:
Step 1: Save existing nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Record all current graph nodes in a sorted ``std::set``:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-before-forward-start]
:end-before: // [sphinx-before-forward-end]
:language: cuda
:dedent:
Step 2: Capture hipFFT operations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-hipfft-start]
:end-before: // [sphinx-hipfft-end]
:language: cuda
:dedent:
Step 3: Get updated node list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-after-forward-start]
:end-before: // [sphinx-after-forward-end]
:language: cuda
:dedent:
Step 4: Find new nodes
^^^^^^^^^^^^^^^^^^^^^^
Compute the difference between both node sets:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-node-difference-start]
:end-before: // [sphinx-node-difference-end]
:language: cuda
:dedent:
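
One way to compute such a difference with the standard library is ``std::set_difference``, which requires sorted
input ranges. The sketch below is host-only: ``NodeHandle`` is a hypothetical integer stand-in for a graph node handle,
and ``new_nodes`` is a name chosen for this illustration.

.. code-block:: cpp

   #include <algorithm>
   #include <iterator>
   #include <set>
   #include <vector>

   // Hypothetical stand-in for a graph node handle. An integer keeps this
   // sketch independent of the HIP headers.
   using NodeHandle = int;

   // Returns the nodes that are present in `after` but not in `before`.
   // std::set keeps its elements sorted, which std::set_difference requires.
   std::vector<NodeHandle> new_nodes(const std::set<NodeHandle>& before,
                                     const std::set<NodeHandle>& after)
   {
       std::vector<NodeHandle> added;
       std::set_difference(after.begin(), after.end(),
                           before.begin(), before.end(),
                           std::back_inserter(added));
       return added;
   }

If the graph contained the nodes 1, 2, and 3 before the capture and 1, 2, 3, 7, and 9 afterwards, ``new_nodes``
returns 7 and 9.
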
Step 5: Identify the leaf node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Find hipFFT's final node for dependency tracking:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-find-leaf-start]
:end-before: // [sphinx-find-leaf-end]
:language: cuda
:dedent:
The leaf detection logic checks if a node has no outgoing edges:
.. literalinclude:: ../tools/example_codes/graph_api_tutorial_main_graph_creation.hip
:start-after: // [sphinx-is-leaf-start]
:end-before: // [sphinx-is-leaf-end]
:language: cuda
:dedent:
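
The idea behind the leaf check can be sketched without the HIP API. In HIP, the nodes that depend on a given node can
be queried with ``hipGraphNodeGetDependentNodes``. The sketch below replaces that query with a hypothetical adjacency
map, and ``NodeHandle``, ``Dependents``, ``is_leaf``, and ``find_leaf`` are names introduced only for this
illustration.

.. code-block:: cpp

   #include <map>
   #include <vector>

   // Hypothetical stand-in for a graph node handle.
   using NodeHandle = int;

   // Hypothetical adjacency list: maps a node to the nodes that depend on it.
   using Dependents = std::map<NodeHandle, std::vector<NodeHandle>>;

   // A node is a leaf if no other node depends on it, that is, if it has no
   // outgoing edges.
   bool is_leaf(const Dependents& dependents, NodeHandle node)
   {
       const auto it = dependents.find(node);
       return it == dependents.end() || it->second.empty();
   }

   // Searches the candidates for a leaf. Returns true and stores the first
   // leaf in `leaf` if one exists, otherwise returns false.
   bool find_leaf(const Dependents& dependents,
                  const std::vector<NodeHandle>& candidates,
                  NodeHandle& leaf)
   {
       for(const NodeHandle node : candidates)
       {
           if(is_leaf(dependents, node))
           {
               leaf = node;
               return true;
           }
       }
       return false;
   }

Restricting the candidates to the newly added nodes ensures that the leaf belongs to the captured operations and not
to an unrelated part of the graph.
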
With hipFFT integrated and its leaf node identified, subsequent nodes can establish proper dependencies.
.. note::
You can also capture hipFFT operations into a separate graph template, then add it to the main graph as a child graph
using :cpp:func:`hipGraphAddChildGraphNode`. The approach above adds hipFFT nodes directly to the main graph as
first-class nodes. A child graph acts as a single node that expands recursively into its components. The scheduler
may handle these approaches differently, potentially affecting performance.
Creating a trace
----------------
Now you have manually implemented the processing pipeline with the graph API. You can examine the result by generating
another trace:
.. code-block:: bash
rocprofv3 -o graph_creation -d outDir -f pftrace --hip-trace --kernel-trace --memory-copy-trace --memory-allocation-trace -- ./HIP-Doc/Tutorials/graph_api/src/hip_graph_api_tutorial_graph_creation
Analyzing the trace
-------------------
Opening the resulting trace file ``outDir/graph_creation_results.pftrace`` with Perfetto shows a similar trace to what
you achieved with the capture variant:
.. figure:: ../data/tutorial/graph_api/creation_trace.png
:alt: Diagram showing a trace of the creation variant.
:align: center
Like before, the kernels are executed *en bloc*. By creating nodes for all operations in the processing pipeline, you
avoided the launching and scheduling overhead you previously observed in the stream-based variant.
Updating individual nodes
-------------------------
The code presented in this tutorial updates the entire graph instance for each new batch. Applications that require
updates to only a small subset of nodes might experience excessive overhead. For these cases, the HIP Graph API
provides the following functions for updating individual nodes:
* :cpp:func:`hipGraphExecChildGraphNodeSetParams`
* :cpp:func:`hipGraphExecEventRecordNodeSetEvent`
* :cpp:func:`hipGraphExecEventWaitNodeSetEvent`
* :cpp:func:`hipGraphExecExternalSemaphoresSignalNodeSetParams`
* :cpp:func:`hipGraphExecExternalSemaphoresWaitNodeSetParams`
* :cpp:func:`hipGraphExecHostNodeSetParams`
* :cpp:func:`hipGraphExecKernelNodeSetParams`
* :cpp:func:`hipGraphExecMemcpyNodeSetParams`
* :cpp:func:`hipGraphExecMemcpyNodeSetParams1D`
* :cpp:func:`hipGraphExecMemcpyNodeSetParamsFromSymbol`
* :cpp:func:`hipGraphExecMemcpyNodeSetParamsToSymbol`
* :cpp:func:`hipGraphExecMemsetNodeSetParams`
* :cpp:func:`hipGraphExecNodeSetParams`
Conclusion
==========
When an application has predictable, repetitive workflows, transitioning from streams to graphs can significantly
reduce launch overhead and improve performance. HIP provides two approaches for creating graphs: stream capture and
explicit graph construction.
**Stream capture** converts existing stream-based code into a graph by recording the operations between start and stop
capture calls. This approach minimizes code changes and works well when your application already has a graph-like
structure with clear dependencies.
**Explicit graph construction** involves manually creating nodes and defining edges between them using the graph API.
While this approach requires more code changes and is more verbose, it provides fine-grained control over dependencies
and allows for optimizations that might not be possible with stream capture. This method is ideal when you need precise
control over the graph topology or when working with complex dependency patterns.
.. tip::
Choose stream capture for quick conversions of existing code with minimal changes. Choose explicit construction when
you need maximum control and optimization opportunities.
Resources
=========
* :ref:`HIP Programming Guide's section on HIP graphs <how_to_HIP_graph>`
* :ref:`HIP graph API reference <graph_management_reference>`
.. rubric:: References
.. [FeDK84] L.A. Feldkamp, L.C. Davis and J.W. Kress: "Practical cone-beam algorithm". In *Journal of the Optical Society of America A*, vol. 1, no. 6, pp. 612-619, June 1984, DOI `10.1364/JOSAA.1.000612 <https://dx.doi.org/10.1364/JOSAA.1.000612>`__.
.. [ShLo74] L.A. Shepp and B.F. Logan: "The Fourier reconstruction of a head section". In *IEEE Transactions on Nuclear Science*, vol. 21, no. 3, pp. 21-43, June 1974, DOI `10.1109/TNS.1974.6499235 <https://dx.doi.org/10.1109/TNS.1974.6499235>`__.