SWDEV-270961 - Update HIP documents

Change-Id: Iba88d78456b9b190c2af92ca696777e459e4acb0


[ROCm/hip commit: 066fa459bb]
Этот коммит содержится в:
Julia Jiang
2021-03-01 23:21:04 -05:00
коммит произвёл Julia Jiang
родитель a13ca05639
Коммит afe16f6b0a
5 изменённых файлов: 69 добавлений и 54 удалений
+5 -5
Просмотреть файл
@@ -35,7 +35,7 @@ HIP-Clang is the compiler for compiling HIP programs on AMD platform.
HIP-Clang can be built manually:
```
git clone -b rocm-3.10.x https://github.com/RadeonOpenCompute/llvm-project.git
git clone -b rocm-4.2.x https://github.com/RadeonOpenCompute/llvm-project.git
cd llvm-project
mkdir -p build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/opt/rocm/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" ../llvm
@@ -46,7 +46,7 @@ sudo make install
Rocm device library can be manually built as following,
```
export PATH=/opt/rocm/llvm/bin:$PATH
git clone -b rocm-3.10.x https://github.com/RadeonOpenCompute/ROCm-Device-Libs.git
git clone -b rocm-4.2.x https://github.com/RadeonOpenCompute/ROCm-Device-Libs.git
cd ROCm-Device-Libs
mkdir -p build && cd build
CC=clang CXX=clang++ cmake -DLLVM_DIR=/opt/rocm/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_WERROR=1 -DLLVM_ENABLE_ASSERTIONS=1 -DCMAKE_INSTALL_PREFIX=/opt/rocm ..
@@ -77,9 +77,9 @@ ROCclr is defined on AMD platform that HIP use Radeon Open Compute Common Langua
See https://github.com/ROCm-Developer-Tools/ROCclr
```
git clone -b rocm-3.10.x https://github.com/ROCm-Developer-Tools/ROCclr.git
git clone -b rocm-4.2.x https://github.com/ROCm-Developer-Tools/ROCclr.git
export ROCclr_DIR="$(readlink -f ROCclr)"
git clone -b rocm-3.10.x https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime.git
git clone -b rocm-4.2.x https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime.git
export OPENCL_DIR="$(readlink -f ROCm-OpenCL-Runtime)"
cd "$ROCclr_DIR"
mkdir -p build;cd build
@@ -91,7 +91,7 @@ sudo make install
## Build HIP
```
git clone -b rocm-3.10.x https://github.com/ROCm-Developer-Tools/HIP.git
git clone -b rocm-4.2.x https://github.com/ROCm-Developer-Tools/HIP.git
export HIP_DIR="$(readlink -f HIP)"
cd "$HIP_DIR"
mkdir -p build; cd build
+9 -1
Просмотреть файл
@@ -30,6 +30,7 @@
- [Are _shfl_*_sync functions supported on HIP platform?](#are-_shfl_*_sync-functions-supported-on-hip-platform)
- [How to create a guard for code that is specific to the host or the GPU?](#how-to-create-a-guard-for-code-that-is-specific-to-the-host-or-the-gpu)
- [Why _OpenMP is undefined when compiling with -fopenmp?](#why-_openmp-is-undefined-when-compiling-with--fopenmp)
- [Does the HIP-Clang compiler support extern shared declarations?](#does-the-hip-clang-compiler-support-extern-shared-declarations)
<!-- tocstop -->
### What APIs and features does HIP support?
@@ -224,4 +225,11 @@ __shfl_*_sync is not supported on HIP but for nvcc path CUDA 9.0 and above all s
The compiler defines the `__HIP_DEVICE_COMPILE__` macro only when compiling the code for the GPU. It could be used to guard code that is specific to the host or the GPU.
### Why _OpenMP is undefined when compiling with -fopenmp?
When compiling an OpenMP source file with `hipcc -fopenmp`, the compiler may generate error if there is a reference to the `_OPENMP` macro. This is due to a limitation in hipcc that treats any source file type (e.g., `.cpp`) as an HIP translation unit leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language construct, you could workaround this issue by adding the `-x c++` switch to force the compiler to treat the file as regular C++. Another approach would be to guard the OpenMP code with `#ifdef _OPENMP` so that the code block is disabled when compiling for the GPU. The `__HIP_DEVICE_COMPILE__` macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.
When compiling an OpenMP source file with `hipcc -fopenmp`, the compiler may generate error if there is a reference to the `_OPENMP` macro. This is due to a limitation in hipcc that treats any source file type (e.g., `.cpp`) as an HIP translation unit leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language construct, you could workaround this issue by adding the `-x c++` switch to force the compiler to treat the file as regular C++. Another approach would be to guard the OpenMP code with `#ifdef _OPENMP` so that the code block is disabled when compiling for the GPU. The `__HIP_DEVICE_COMPILE__` macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.
### Does the HIP-Clang compiler support extern shared declarations?
Previously, it was essential to declare dynamic shared memory using the HIP_DYNAMIC_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data-races.
Now, the HIP-Clang compiler provides support for extern shared declarations, and the HIP_DYNAMIC_SHARED option is no longer required. You may use the standard extern definition:
extern __shared__ type var[];
+38 -41
Просмотреть файл
@@ -92,7 +92,7 @@ Supported `__host__` functions are
- Executed on the host
- Called from the host
`__host__` can combine with `__device__`, in which case the function compiles for both the host and device. These functions cannot use the HIP grid coordinate functions (for example, "hipThreadIdx_x"). A possible workaround is to pass the necessary coordinate info as an argument to the function.
`__host__` can combine with `__device__`, in which case the function compiles for both the host and device. These functions cannot use the HIP grid coordinate functions (for example, "threadIdx.x"). A possible workaround is to pass the necessary coordinate info as an argument to the function.
`__host__` cannot combine with `__global__`.
@@ -129,19 +129,19 @@ The hipLaunchKernel macro always starts with the five parameters specified above
## Kernel-Launch Example
```
// Example showing device function, __device__ __host__
// <- compile for both device and host
float PlusOne(float x)
// Example showing device function, __device__ __host__
// <- compile for both device and host
float PlusOne(float x)
{
return x + 1.0;
}
__global__
void
__global__
void
MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
const float *a, const float *b, float *c, unsigned N)
{
unsigned gid = hipThreadIdx_x; // <- coordinate index function
unsigned gid = threadIdx.x; // <- coordinate index function
if (gid < N) {
c[gid] = a[gid] + PlusOne(b[gid]);
}
@@ -163,40 +163,43 @@ void callMyKernel()
### `__constant__`
The `__constant__` keyword is supported. The host writes constant memory before launching the kernel; from the GPU, this memory is read-only during kernel execution. The functions for accessing constant memory (hipGetSymbolAddress(), hipGetSymbolSize(), hipMemcpyToSymbol(), hipMemcpyToSymbolAsync(), hipMemcpyFromSymbol(), hipMemcpyFromSymbolAsync()) are available.
### `__shared__`
### `__shared__`
The `__shared__` keyword is supported.
`extern __shared__` allows the host to dynamically allocate shared memory and is specified as a launch parameter. HIP uses an alternate syntax based on the HIP_DYNAMIC_SHARED macro.
`extern __shared__` allows the host to dynamically allocate shared memory and is specified as a launch parameter.
Previously, it was essential to declare dynamic shared memory using the HIP_DYNAMIC_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data-races.
Now, the HIP-Clang compiler provides support for extern shared declarations, and the HIP_DYNAMIC_SHARED option is no longer required..
### `__managed__`
Managed memory, including the `__managed__` keyword, are not supported in HIP.
### `__restrict__`
The `__restrict__` keyword tells the compiler that the associated memory pointer will not alias with any other pointer in the kernel or function. This feature can help the compiler generate better code. In most cases, all pointer arguments must use this keyword to realize the benefit.
The `__restrict__` keyword tells the compiler that the associated memory pointer will not alias with any other pointer in the kernel or function. This feature can help the compiler generate better code. In most cases, all pointer arguments must use this keyword to realize the benefit.
## Built-In Variables
### Coordinate Built-Ins
These built-ins determine the coordinate of the active work item in the execution grid. They are defined in hip_runtime.h (rather than being implicitly defined by the compiler).
These built-ins determine the coordinate of the active work item in the execution grid. They are defined in hip_runtime.h (rather than being implicitly defined by the compiler).
| **HIP Syntax** | **Cuda Syntax** |
| --- | --- |
| hipThreadIdx_x | threadIdx.x |
| hipThreadIdx_y | threadIdx.y |
| hipThreadIdx_z | threadIdx.z |
| threadIdx.x | threadIdx.x |
| threadIdx.y | threadIdx.y |
| threadIdx.z | threadIdx.z |
| | |
| hipBlockIdx_x | blockIdx.x |
| hipBlockIdx_y | blockIdx.y |
| hipBlockIdx_z | blockIdx.z |
| blockIdx.x | blockIdx.x |
| blockIdx.y | blockIdx.y |
| blockIdx.z | blockIdx.z |
| | |
| hipBlockDim_x | blockDim.x |
| hipBlockDim_y | blockDim.y |
| hipBlockDim_z | blockDim.z |
| blockDim.x | blockDim.x |
| blockDim.y | blockDim.y |
| blockDim.z | blockDim.z |
| | |
| hipGridDim_x | gridDim.x |
| hipGridDim_y | gridDim.y |
| hipGridDim_z | gridDim.z |
| gridDim.x | gridDim.x |
| gridDim.y | gridDim.y |
| gridDim.z | gridDim.z |
### warpSize
The warpSize variable is of type int and contains the warp size (in threads) for the target device. Note that all current Nvidia devices return 32 for this variable, and all current AMD devices return 64. Device code should use the warpSize built-in to develop portable wave-aware code.
@@ -480,20 +483,14 @@ Following is the list of supported floating-point intrinsics. Note that intrinsi
| float __cosf ( float x ) <br><sub>Calculate the fast approximate cosine of the input argument.</sub> |
| float __expf ( float x ) <br><sub>Calculate the fast approximate base e exponential of the input argument.</sub> |
| float __frsqrt_rn ( float x ) <br><sub>Compute `1 / √x` in round-to-nearest-even mode.</sub> |
| float __fsqrt_rd ( float x ) <br><sub>Compute `√x` in round-down mode.</sub> |
| float __fsqrt_rn ( float x ) <br><sub>Compute `√x` in round-to-nearest-even mode.</sub> |
| float __fsqrt_ru ( float x ) <br><sub>Compute `√x` in round-up mode.</sub> |
| float __fsqrt_rz ( float x ) <br><sub>Compute `√x` in round-towards-zero mode.</sub> |
| float __log10f ( float x ) <br><sub>Calculate the fast approximate base 10 logarithm of the input argument.</sub> |
| float __log2f ( float x ) <br><sub>Calculate the fast approximate base 2 logarithm of the input argument.</sub> |
| float __logf ( float x ) <br><sub>Calculate the fast approximate base e logarithm of the input argument.</sub> |
| float __powf ( float x, float y ) <br><sub>Calculate the fast approximate of x<sup>y</sup>.</sub> |
| float __sinf ( float x ) <br><sub>Calculate the fast approximate sine of the input argument.</sub> |
| float __tanf ( float x ) <br><sub>Calculate the fast approximate tangent of the input argument.</sub> |
| double __dsqrt_rd ( double x ) <br><sub>Compute `√x` in round-down mode.</sub> |
| double __dsqrt_rn ( double x ) <br><sub>Compute `√x` in round-to-nearest-even mode.</sub> |
| double __dsqrt_ru ( double x ) <br><sub>Compute `√x` in round-up mode.</sub> |
| double __dsqrt_rz ( double x ) <br><sub>Compute `√x` in round-towards-zero mode.</sub> |
## Texture Functions
Texture functions are not supported.
@@ -511,8 +508,7 @@ Returns the value of counter that is incremented every clock cycle on device. Di
## Atomic Functions
Atomic functions execute as read-modify-write operations residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the
instructions are serialized in an undefined order.
Atomic functions execute as read-modify-write operations residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
HIP supports the following atomic operations.
@@ -557,15 +553,16 @@ HIP supports the following atomic operations.
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.
Note that Nvidia and AMD devices have different warp sizes, so portable code should use the warpSize built-ins to query the warp size. Hipified code from the Cuda path requires careful review to ensure it doesnt assume a waveSize of 32. "Wave-aware" code that assumes a waveSize of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources. In addition to the warpSize device function, host code can obtain the warpSize from the device properties:
Note that Nvidia and AMD devices have different warp sizes, so portable code should use the warpSize built-ins to query the warp size. Hipified code from the Cuda path requires careful review to ensure it doesnt assume a waveSize of 32. "Wave-aware" code that assumes a waveSize of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources. WarpSize built-ins should only be used in device functions and its value depends on GPU arch. Users should not assume warpSize to be a compile-time constant. Host functions should use hipGetDeviceProperties to get the default warp size of a GPU device:
```
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
int w = props.warpSize;
// implement portable algorithm based on w (rather than assume 32 or 64)
```
Note that assembly kernels may be built for a warp size which is different than the default warp size.
### Warp Vote and Ballot Functions
@@ -585,7 +582,7 @@ Applications can test whether the target platform supports the any/all instructi
`__ballot` provides a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane. Note that HIP's `__ballot` function supports a 64-bit return value (compared with Cudas 32 bits). Code ported from Cuda should support the larger warp sizes that the HIP version of this instruction supports. Applications can test whether the target platform supports the ballot instruction using the `hasWarpBallot` device property or the HIP_ARCH_HAS_WARP_BALLOT compiler define.
### Warp Shuffle Functions
### Warp Shuffle Functions
Half-float shuffles are not supported. The default width is warpSize---see [Warp Cross-Lane Functions](#warp-cross-lane-functions). Applications should not assume the warpSize is 32 or 64.
@@ -595,8 +592,8 @@ float __shfl (float var, int srcLane, int width=warpSize);
int __shfl_up (int var, unsigned int delta, int width=warpSize);
float __shfl_up (float var, unsigned int delta, int width=warpSize);
int __shfl_down (int var, unsigned int delta, int width=warpSize);
float __shfl_down (float var, unsigned int delta, int width=warpSize) ;
int __shfl_xor (int var, int laneMask, int width=warpSize)
float __shfl_down (float var, unsigned int delta, int width=warpSize);
int __shfl_xor (int var, int laneMask, int width=warpSize);
float __shfl_xor (float var, int laneMask, int width=warpSize);
```
@@ -686,13 +683,13 @@ implementation of malloc and free that can be called from device functions.
## `__launch_bounds__`
GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simulaneously running. Thus GPUs have a complex relationship between resource usage and performance.
GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simulaneously running. Thus GPUs have a complex relationship between resource usage and performance.
__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:
```
__global__ void `__launch_bounds__`(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EU) MyKernel(...) ...
MyKernel(hipGridLaunch lp, ...)
MyKernel(hipGridLaunch lp, ...)
...
```
@@ -729,10 +726,10 @@ MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_
The key differences in the interface are:
- Warps (rather than blocks):
The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
- Execution Units (rather than multiProcessor):
The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiProcessor. The hipDeviceProps has a field executionUnitsPerMultiprocessor.
Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
### maxregcount
@@ -786,7 +783,7 @@ The following C++ features are not supported:
- Try/catch
## Kernel Compilation
hipcc now supports compiling C++/HIP kernels to binary code objects.
hipcc now supports compiling C++/HIP kernels to binary code objects.
The file format for binary is `.co` which means Code Object. The following command builds the code object using `hipcc`.
`hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]`
+1 -1
Просмотреть файл
@@ -150,7 +150,7 @@ All HIP projects target either AMD or NVIDIA platform. The platform affects whic
- `HIP_PLATFORM_AMD` is defined if the HIP platform targets AMD.
Note, `HIP_PLATFORM_HCC` was previously defined if the HIP platform targeted AMD, it is deprecated.
- `HIP_PLATFORM_NVIDA` is defined if the HIP platform targets NVIDIA.
- `HIP_PLATFORM_NVDIA` is defined if the HIP platform targets NVIDIA.
Note, `HIP_PLATFORM_NVCC` was previously defined if the HIP platform targeted NVIDIA, it is deprecated.
### Identifying the Compiler: hip-clang or nvcc
+16 -6
Просмотреть файл
@@ -21,7 +21,7 @@ ROCm defines two coherency options for host memory:
- Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running.  Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries.  This memory is appropriate for high-performance access when fine-grain synchronization is not required.
IP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable:
- hipHostllocCoherent=0, hipHostMallocNonCoherent=0: Use HIP_HOST_COHERENT environment variable:
- hipHostllocCoherent=0, hipHostMallocNonCoherent=0: Use HIP_HOST_COHERENT environment variable:
- If HIP_HOST_COHERENT is 1 or undefined, the host memory allocation is coherent.
- If host memory is `defined and 0: the host memory allocation is non-coherent.
- hipHostMallocCoherent=1, hipHostMallocNonCoherent=0: The host memory allocation will be coherent.  HIP_HOST_COHERENT env variable is ignored.
@@ -29,8 +29,8 @@ IP provides the developer with controls to select which type of memory is used v
- hipHostMallocCoherent=1, hipHostMallocNonCoherent=1: Illegal.
### Visibility of Zero-Copy Host Memory
Coherent host memory is automatically visible at synchronization points.
### Visibility of Zero-Copy Host Memory
Coherent host memory is automatically visible at synchronization points.
Non-coherent
| HIP API | Synchronization Effect | Fence | Coherent Host Memory Visibiity | Non-Coherent Host Memory Visibility|
@@ -41,12 +41,13 @@ Non-coherent
| hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
### hipEventSynchronize
### hipEventSynchronize
Developers can control the release scope for hipEvents:
- By default, the GPU performs a device-scope acquire and release operation with each recorded event.  This will make host and device memory visible to other commands executing on the same device. 
- By default, the GPU performs a device-scope acquire and release operation with each recorded event.  This will make host and device memory visible to other commands executing on the same device.
A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags:
- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded.  This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing.  Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem.
- hipEventDisableTiming: Events created with this flag would not record profiling data and provide best performance if used for synchronization.
### Summary and Recommendations:
@@ -61,9 +62,14 @@ HIP-Clang currenntly doesn't supports device-side malloc and free.
In HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type.
## Use of _Float16 Type
If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.e. its definition is compiled by one compiler but the caller is compiled by a different compiler, _Float16 or aggregates containing _Float16 should not be used as function argument or return type. This is due to lack of stable ABI for _Float16 on x86_64. Passing _Float16 or aggregates containing _Float16 between clang and gcc could cause undefined behavior.
## FMA and contractions
By default HIP-Clang assumes -ffp-contract=fast.
By default HIP-Clang assumes -ffp-contract=fast-honor-pragmas.
Users can use '#pragma clang fp contract(on|off|fast)' to control fp contraction of a block of code.
For x86_64, FMA is off by default since the generic x86_64 target does not
support FMA by default. To turn on FMA on x86_64, either use -mfma or -march=native
on CPU's supporting FMA.
@@ -72,4 +78,8 @@ When contractions are enabled and the CPU has not enabled FMA instructions, the
GPU can produce different numerical results than the CPU for expressions that
can be contracted. Tolerance should be used for floating point comparsions.
## Math functions with special rounding modes
HIP does not support math functions with rounding modes ru (round up), rd (round down), and rz (round towards zero). HIP only supports math function with rounding mode rn (round to nearest). The math functions with postfixes _ru, _rd and _rz are implemented in the same way as math functions with postfix _rn. They serve as a workaround to get programs using them compiled.
## [Supported Clang Options](clang_options.md)