CPUs and GPUs have been designed for different purposes. CPUs quickly execute a single thread, decreasing the time for a single operation while increasing the number of sequential instructions that can be executed. This includes fetching data and reducing pipeline stalls where the ALU has to wait for previous instructions to finish.
- Large L1/L2 cache per core, shared by fewer threads (maximum of 2 when hyperthreading is available).
- A disadvantage is switching execution from one thread to another (or context switching) takes a considerable amount of time: the ALU pipeline needs to be emptied, the register file has to be written to memory to free the register for another thread.
- Designed for parallel processing with many simpler cores (hundreds/thousands)
- Lower clock speeds (1-2 GHz)
- Streamlined control logic
- Small caches, more registers
- Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread.
- Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`.
- The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture that allows a `single instruction` to concurrently operate on `multiple data` inputs.
- For branching threads where conditional instructions lead to thread divergence, ALUs still process the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles and should be a consideration in your programming. Keep instructions consistent and leave conditionals out of threads.
- The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued.
HIP performs implicit synchronization on occasions, unlike some
APIs where the responsibility for synchronization is left to the user.
Host programming
----------------
In heterogeneous programming, the CPU is available for processing operations but the host application has the additional task of managing data and computation exchanges between the CPU (host) and GPU (device). The host acts as the application manager, coordinating the overall workflow and directing operations to the appropriate context, handles data preparation and data transfers, and manages GPU tasks and synchronization. Here is a typical sequence of operations:
1. Initialize the HIP runtime and select the GPU: As described in :ref:`initialization`, refers to identifying and selecting a target GPU, setting up a context to let the CPU interact with the GPU.
2. Data preparation: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, preparing input data and transferring it from the host to the device. The data is both transferred to the device, and passed as an input parameter when launching the kernel.
3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, this defines kernel configurations and arguments, launches kernel to run on the GPU device using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``). On the GPU, multiple kernels can run on streams, with a queue of operations. Within the same stream, operations run in the order they were issued, but on multiple streams operations are independent and can execute concurrently. In the HIP runtime, kernels run on the default stream when one is not specified, but specifying a stream for the kernel lets you increase concurrency in task scheduling and resource utilization, and launch and manage multiple kernels from the host program.
4. Synchronization: As described in :ref:`asynchronous_how-to`, kernel execution occurs in the context of device streams, specifically the default (`0`) stream. You can use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations. Wait for events or streams to finish execution and transfer results from the GPU back to the host.
5. Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations. For example, use ``hipGetErrorString`` to retrieve error messages.
6. Cleanup and resource management: Validate results, clean up GPU contexts and resources, and free allocated memory on the host and devices.
This structure allows for efficient use of GPU resources and facilitates the acceleration of compute-intensive tasks while keeping the host CPU available for other tasks.
:alt:Diagram depicting a host CPU and device GPU rectangles of varying color.
There are arrows pointing between the rectangles showing from the Host
to the Device the initialization, data transfer, and Kernel execution
steps, and from the Device back to the Host the returning results.
Interaction of Host and Device in a GPU application
.._device_program:
Device programming
------------------
The device or kernel program acts as workers on the GPU application, distributing operations to be handled quickly and efficiently. Launching a kernel in the host application starts the kernel program running on the GPU, defining the parallel operations to repeat the same instructions across many datasets. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. Threads, blocks, and grids provide a hierarchical approach to parallel operations. Understanding the thread hierarchy is critical to distributing work across the available CUs, managing parallel operations, and optimizing memory access. The general flow of the kernel program looks like this:
1. Thread Grouping: As described in :ref:`inherent_thread_model`, threads are organized into a hierarchy consisting of threads, which are individual instances of parallel operations, blocks that group the threads, and grids that group blocks into the kernel. Each thread runs an instance of the kernel in parallel with other threads in the block.
2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread.
3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. As described in :ref:`memory_hierarchy`, the hierarchy of threads is influenced by the memory subsystem of GPUs. The memory hierarchy includes local memory per-thread with very fast access, shared memory for the block of threads which also supports quick access, and larger amounts of global memory visible to the whole kernel,but accesses are expensive due to high latency. Understanding the memory model is a key concept for kernel programming.
4. Computation: Threads perform the required computations on the input data, and generate any needed output. Each thread of the kernel runs the same instruction simultaneously on the different datasets. This sometimes require multiple iterations when the number of operations exceeds the resources of the CU.
5. Synchronization: When needed, threads synchronize within their block to ensure correct results when working with shared memory.
Kernels are parallel programs that execute the same instruction set across multiple threads, organized in wavefronts, as described below and as demonstrated in the `Hello World tutorial <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hello_world>`_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can also become quite complex, managing hundreds, thousands, or hundreds of thousands of operations with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation <https://rocm.docs.amd.com/en/latest/>`_.
:alt:Diagram depicting the SIMT execution model. There is a red rectangle
which contains the expression a[i] = b[i] + c[i], and below that four
arrows that point to Thread 0,1,2, and 3. Each thread contains different
values for b, c, and a, showing the parallel operations of this equation.
Instruction flow of a sample SIMT program
A kernel follows the same C++ rules as the functions on the host, but it has a special ``__global__`` label to mark it for execution on the device, as shown in the following example:
The following are a few memory access patterns and best practices to improve performance. You can find additional information in :ref:`memory_management` and :doc:`../how-to/performance_guidelines`.
Coalesced memory access in HIP refers to the optimization of memory transactions to maximize throughput when accessing global memory. When a kernel accesses global memory, the memory transactions typically occur in chunks of 32, 64, or 128 bytes, which must be naturally aligned. Coalescing memory accesses means aligning and organizing these accesses so that multiple threads in a warp can combine their memory requests into the fewest possible transactions. If threads access memory in a coalesced manner, meaning consecutive threads read or write consecutive memory locations, the memory controller can merge these accesses into a single transaction. This is crucial because global memory bandwidth is relatively low compared to on-chip bandwidths, and non-optimal memory accesses can significantly impact performance. If all the threads in a warp can access consecutive memory locations, memory access is fully coalesced.
To achieve coalesced memory access in HIP, you should:
1. *Align Data*: Use data types that are naturally aligned and ensure that structures and arrays are aligned properly.
2.*Optimize Access Patterns*: Arrange memory accesses so that consecutive threads in a warp access consecutive memory locations. For example, if threads access a 2D array, the array and thread block widths should be multiples of the warp size.
3.*Avoid strided access*: For example array[i * stride] can lead to memory bank conflicts and inefficient access.
4.*Pad Data*: If necessary, pad data structures to ensure alignment and coalescing.
***Shared memory**: Avoiding bank conflicts reduces the serialization of memory transactions.
Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage, ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts.
Texture memory is read-only memory optimized for spatial locality and caching rather than coalescing. Texture memory is cached, unlike standard global memory, and it provides optimized access patterns for 2D and spatially local data. Accessing neighboring values results in cache hits, improving performance. Therefore, instead of worrying about coalescing, optimal memory access patterns involve ensuring that threads access spatially adjacent texture elements, and the memory layout aligns well with the 2D caching mechanism.
***Unified memory**: Structured access reduces the overhead of page migrations.
Unified memory allows the CPU and GPU to share memory seamlessly, but performance depends on access patterns. Unified memory enables automatic page migration between CPU and GPU memory. However, if different threads access different pages, it can lead to expensive page migrations and slow throughput performance. Accessing unified memory in a structured, warp-friendly manner reduces unnecessary page transfers. Ensure threads access memory in a structured, consecutive manner, minimizing page faults. Prefetch data to the GPU before computation by using ``hipMemPrefetchAsync()``. In addition, using small batch transfers as described below, can reduce unexpected page migrations when using unified memory.
***Small batch transfers**: Enable pipelining and improve PCIe bandwidth use.
Memory transfers between the host and the device can become a major bottleneck if not optimized. One method is to use small batch memory transfers where data is transferred in smaller chunks instead of dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance.
Execution model
===============
As previously discussed in :ref:`heterogeneous_programming`, HIP programs consist of two distinct scopes:
* The host-side API running on the host processor.
* The device-side kernels running on GPUs.
Both the host and the device-side APIs have synchronous and asynchronous functions.
The host-side API dealing with device management and their queries are synchronous.
All asynchronous APIs, such as kernel execution, data movement and potentially data
allocation/freeing all happen in the context of device streams, as described in `Managing streams <../how-to/hip_runtime_api/asynchronous.html#managing-streams>`_.
Asynchronous operations between the host and the kernel provide a variety of opportunities,
or challenges, for managing synchronization, as described in :ref:`asynchronous_how-to`.
For instance, a basic model would be to launch an asynchronous operation on a kernel
in a stream, create an event to track the operation, continue operations in the host
program, and when the event shows that the asynchronous operation is complete, synchronize the kernel to return the results.
However, one of the opportunities of asynchronous operation is the pipelining of operations
between launching kernels and transferring memory. In this case, you would be working
with multiple streams running concurrently, or at least overlapping in some regard,
and managing any dependencies between the streams in the host application.
The producer-consumer paradigm can be used to convert a sequential program
into parallel operations to improve performance. This process can employ multiple
streams to kick off asynchronous kernels, provide data to the kernels, perform operations,
and return the results for further processing in the host application.
These asynchronous activities call for stream management strategies. In the case
of the single stream, the only management would be the stream synchronization
when the work was complete. However, with multiple streams you have
overlapping execution of operations and synchronization becomes more complex, as shown
in the variations of the example in `Programmatic dependent launch and synchronization <../how-to/hip_runtime_api/asynchronous.html#programmatic-dependent-launch-and-synchronization>`_.
You need to manage each stream's activities, evaluate the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order.
Multi-GPU and load balancing
----------------------------
For applications requiring additional computational power beyond a single device,
HIP supports utilizing multiple GPUs within a system. Large-scale applications
that need more compute power can use multiple GPUs in the system. This enables
the runtime to distribute workloads across multiple GPUs to balance the load and prevent some GPUs