Update HIP docs
Purge all obsoleted information that is only relevant to HCC
Change-Id: Ice6ae174a73ccac9dfe2ca05027d1ecdb32558d7
[ROCm/hip commit: 5df9b3d7cc]
This commit is contained in:
committad av
Siuchi Chan
förälder
2195b7aa82
incheckning
1ed2f693a1
@@ -14,21 +14,16 @@
|
||||
- [What hardware does HIP support?](#what-hardware-does-hip-support)
|
||||
- [Do HIPIFY tools automatically convert all source code?](#do-hipify-tools-automatically-convert-all-source-code)
|
||||
- [What is NVCC?](#what-is-nvcc)
|
||||
- [What is HCC?](#what-is-hcc)
|
||||
- [What is HIP-Clang?](#what-is-hip-clang)
|
||||
- [Why use HIP rather than supporting CUDA directly?](#why-use-hip-rather-than-supporting-cuda-directly)
|
||||
- [Can I develop HIP code on an Nvidia CUDA platform?](#can-i-develop-hip-code-on-an-nvidia-cuda-platform)
|
||||
- [Can I develop HIP code on an AMD HIP-Clang platform?](#can-i-develop-hip-code-on-an-amd-hip-clang-platform)
|
||||
- [Do I need to make code changes in HIP code if switch compiler from HCC to HIP-Clang?](#Do-I-need-to-make-code-changes-in-hip-code-if-switch-compiler-from-hcc-to-hip-clang)
|
||||
- [How to use HIP-Clang to build HIP programs instead of HCC?](#how-to-use-hip-clang-to-build-hip-programs-instead-of-hcc)
|
||||
- [What is ROCclr?](#what-is-rocclr)
|
||||
- [Can a HIP binary run on both AMD and Nvidia platforms?](#can-a-hip-binary-run-on-both-amd-and-nvidia-platforms)
|
||||
- [What's the difference between HIP and hc?](#whats-the-difference-between-hip-and-hc)
|
||||
- [On HIP-Clang, can I link HIP code with host code compiled with another compiler such as gcc, icc, or clang?](#on-HIP-Clang-can-i-link-hip-code-with-host-code-compiled-with-another-compiler-such-as-gcc-icc-or-clang-)
|
||||
- [HIP detected my platform (hip-clang vs nvcc) incorrectly - what should I do?](#hip-detected-my-platform-hip-clang-vs-nvcc-incorrectly---what-should-i-do)
|
||||
- [Can I install both CUDA SDK and HIP-clang on same machine?](#can-i-install-both-cuda-sdk-and-hip-clang-on-same-machine)
|
||||
- [On CUDA, can I mix CUDA code with HIP code?](#on-cuda-can-i-mix-cuda-code-with-hip-code)
|
||||
- [On HIP-Clang, can I use HC functionality with HIP?](#on-hip-clang-can-i-use-hc-functionality-with-hip)
|
||||
- [How do I trace HIP application flow?](#how-do-i-trace-hip-application-flow)
|
||||
- [What if HIP generates an error of "symbol multiply defined!" only on AMD machine?](#what-if-hip-generates-error-of-symbol-multiply-defined-only-on-amd-machine)
|
||||
- [What is maximum limit of Generic kernel launching parameter?](#what-is-maximum-limit-of-generic-kernel-launching-parameter)
|
||||
@@ -67,7 +62,7 @@ See the [API Support Table](CUDA_Runtime_API_functions_supported_by_HIP.md) for
|
||||
- C++-style device-side dynamic memory allocations (free, new, delete) (CUDA 4.0)
|
||||
- Virtual functions, indirect functions and try/catch (CUDA 4.0)
|
||||
- `__prof_trigger`
|
||||
- PTX assembly (CUDA 4.0). HCC supports inline GCN assembly.
|
||||
- PTX assembly (CUDA 4.0). HIP-Clang supports inline GCN assembly.
|
||||
- Several kernel features are under development. See the [HIP Kernel Language](hip_kernel_language.md) for more information. These include:
|
||||
- printf
|
||||
|
||||
@@ -95,11 +90,13 @@ However, we can provide a rough summary of the features included in each CUDA SD
|
||||
- __shfl intriniscs (supported)
|
||||
- CUDA 7.0 :
|
||||
- Per-thread-streams (under development)
|
||||
- C++11 (HCC supports all of C++11, all of C++14 and some C++17 features)
|
||||
- C++11 (Hip-Clang supports all of C++11, all of C++14 and some C++17 features)
|
||||
- CUDA 7.5 :
|
||||
- float16 (supported)
|
||||
- CUDA 8.0 :
|
||||
- Page Migration including cudaMemAdvise, cudaMemPrefetch, other cudaMem* APIs(not supported)
|
||||
- CUDA 9.0 :
|
||||
- Cooperative Launch, Surface Object Management, Version Management
|
||||
|
||||
### What libraries does HIP support?
|
||||
HIP includes growing support for the four key math libraries using hcBlas, hcFft, hcrng and hcsparse, as well as MIOpen for machine intelligence applications.
|
||||
@@ -139,7 +136,7 @@ The tools also struggle with more complex CUDA applications, in particular, thos
|
||||
|
||||
|
||||
### What hardware does HIP support?
|
||||
- For AMD platforms, HIP runs on the same hardware that the HCC "hc" mode supports. See the ROCm documentation for the list of supported platforms.
|
||||
- For AMD platforms, see the [ROCm documentation](https://github.com/RadeonOpenCompute/ROCm#supported-gpus) for the list of supported platforms.
|
||||
- For Nvidia platforms, HIP requires Unified Memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the Nvidia Titan and Tesla K40.
|
||||
|
||||
### Do HIPIFY tools automatically convert all source code?
|
||||
@@ -152,13 +149,8 @@ In general, developers should always expect to perform some platform-specific tu
|
||||
### What is NVCC?
|
||||
NVCC is Nvidia's compiler driver for compiling "CUDA C++" code into PTX or device code for Nvidia GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.
|
||||
|
||||
### What is HCC?
|
||||
HCC is AMD's compiler driver which compiles "heterogeneous C++" code into HSAIL or GCN device code for AMD GPUs. It's an open-source compiler based on recent versions of CLANG/LLVM.
|
||||
|
||||
In ROCM v3.5 release, HCC compiler is deprecated and HIP-Clang compiler is introduced to compile HIP programs.
|
||||
|
||||
### What is HIP-Clang?
|
||||
HIP-Clang is new compiler to emphasize its capability to compile HIP programs which can run on AMD platform.
|
||||
HIP-Clang is a Clang/LLVM based compiler to compile HIP programs which can run on AMD platform.
|
||||
|
||||
### Why use HIP rather than supporting CUDA directly?
|
||||
While HIP is a strong subset of the CUDA, it is a subset. The HIP layer allows that subset to be clearly defined and documented.
|
||||
@@ -174,11 +166,6 @@ In some cases, CUDA has a richer set of modes for some APIs, and some C++ capabi
|
||||
|
||||
### Can I develop HIP code on an AMD HIP-Clang platform?
|
||||
Yes. HIP's HIP-Clang path only exposes the APIs and functions that work on AMD runtime back ends. "Extra" APIs, parameters and features that appear in HIP-Clang but not CUDA will typically cause compile- or run-time errors. Developers must use the HIP API for most accelerator code and bracket any HIP-Clang specific code with preprocessor conditionals. Those concerned about portability should, of course, test their code on both platforms and should tune it for performance. Typically, HIP-Clang supports a more modern set of C++11/C++14/C++17 features, so HIP developers who want portability should be careful when using advanced C++ features on the HIP-Clang path.
|
||||
In ROCM v3.5 release, HCC compiler is deprecated, and the HIP-Clang compiler can be used for compiling HIP programs.
|
||||
|
||||
### Do I need to make code changes in HIP code if switching compiler from HCC to HIP-Clang?
|
||||
For most HIP applications, the transition from HCC to HIP-Clang is transparent as the HIPCC and HIP cmake files automatically choose compiler options for HIP-Clang and hide the difference between the HCC and HIP-Clang code.
|
||||
However, minor changes may be required as HIP-Clang has stricter syntax and semantic checks compared to HCC.
|
||||
|
||||
### How to use HIP-Clang to build HIP programs?
|
||||
The environment variable can be used to set compiler path:
|
||||
@@ -194,12 +181,6 @@ ROCclr (Radeon Open Compute Common Language Runtime) is a virtual device interfa
|
||||
### Can a HIP binary run on both AMD and Nvidia platforms?
|
||||
HIP is a source-portable language that can be compiled to run on either AMD or NVIDIA platform. HIP tools don't create a "fat binary" that can run on either platform, however.
|
||||
|
||||
### What's the difference between HIP and hc?
|
||||
HIP is a portable C++ language that supports a strong subset of the CUDA run-time APIs and device-kernel language. It's designed to simplify CUDA conversion to portable C++. HIP provides a C-compatible run-time API, C-compatible kernel-launch mechanism, C++ kernel language and pointer-based memory management.
|
||||
|
||||
A C++ dialect, hc is supported by the AMD compiler. It provides C++ run time, C++ kernel-launch APIs (parallel_for_each), C++ kernel language, and several memory-management options, including pointers, arrays and array_view (with implicit data synchronization). It's intended to be a leading indicator of the ISO C++ standard.
|
||||
The HCC compiler has been deprecated in the ROCm Release v3.5.
|
||||
|
||||
### On HIP-Clang, can I link HIP code with host code compiled with another compiler such as gcc, icc, or clang ?
|
||||
Yes. HIP generates the object code which conforms to the GCC ABI, and also links with libstdc++. This means you can compile host code with the compiler of your choice and link the generated object code
|
||||
with GPU code compiled with HIP. Larger projects often contain a mixture of accelerator code (initially written in CUDA with nvcc) and host code (compiled with gcc, icc, or clang). These projects
|
||||
@@ -211,16 +192,14 @@ Yes. You can use HIP_PLATFORM to choose which path hipcc targets. This configur
|
||||
|
||||
|
||||
### HIP detected my platform (HIP-Clang vs nvcc) incorrectly - what should I do?
|
||||
HIP will set the platform to hcc and compiler to HIP-Clang if it sees that the AMD graphics driver is installed and has detected an AMD GPU.
|
||||
HIP will set the platform to HIP-Clang if it sees that the AMD graphics driver is installed and has detected an AMD GPU.
|
||||
Sometimes this isn't what you want - you can force HIP to recognize the platform by setting the following,
|
||||
```
|
||||
export HIP_COMPILER=clang
|
||||
export HIP_PLATFORM=hcc
|
||||
export HIP_PLATFORM=rocclr
|
||||
```
|
||||
|
||||
One symptom of this problem is the message "error: 'unknown error'(11) at square.hipref.cpp:56". This can occur if you have a CUDA installation on an AMD platform, and HIP incorrectly detects the platform as nvcc. HIP may be able to compile the application using the nvcc tool-chain but will generate this error at runtime since the platform does not have a CUDA device.
|
||||
The fix is to set HIP_PLATFORM=hcc and rebuild.
|
||||
|
||||
|
||||
### On CUDA, can I mix CUDA code with HIP code?
|
||||
Yes. Most HIP data structures (hipStream_t, hipEvent_t) are typedefs to CUDA equivalents and can be intermixed. Both CUDA and HIP use integer device ids.
|
||||
@@ -232,19 +211,17 @@ hipCUResultTohipError
|
||||
|
||||
If platform portability is important, use #ifdef __HIP_PLATFORM_NVCC__ to guard the CUDA-specific code.
|
||||
|
||||
### On HIP-Clang, can I use HC functionality with HIP?
|
||||
No. HC functionality is not supported by HIP-Clang.
|
||||
|
||||
### How do I trace HIP application flow?
|
||||
See the [HIP Profiling Guide](hip_porting_guide.md) for more information.
|
||||
|
||||
### What if HIP generates error of "symbol multiply defined!" only on AMD machine?
|
||||
Unlike CUDA, in HCC, for functions defined in the header files, the keyword of "__forceinline__" does not imply "static".
|
||||
Thus, if failed to define "static" keyword, you might see a lot of "symbol multiply defined!" errors at compilation.
|
||||
The workaround is to explicitly add the keyword of "static" before any functions that were defined as "__forceinline__".
|
||||
|
||||
### What is maximum limit of kernel launching parameter?
|
||||
Product of block.x, block.y, and block.z should be less than 1024.
|
||||
|
||||
### Are __shfl_*_sync functions supported on HIP platform?
|
||||
__shfl_*_sync is not supported on HIP but for nvcc path CUDA 9.0 and above all shuffle calls get redirected to it's sync version.
|
||||
__shfl_*_sync is not supported on HIP but for nvcc path CUDA 9.0 and above all shuffle calls get redirected to it's sync version.
|
||||
|
||||
### How to create a guard for code that is specific to the host or the GPU?
|
||||
The compiler defines the `__HIP_DEVICE_COMPILE__` macro only when compiling the code for the GPU. It could be used to guard code that is specific to the host or the GPU.
|
||||
|
||||
### Why _OpenMP is undefined when compiling with -fopenmp?
|
||||
When compiling an OpenMP source file with `hipcc -fopenmp`, the compiler may generate error if there is a reference to the `_OPENMP` macro. This is due to a limitation in hipcc that treats any source file type (e.g., `.cpp`) as an HIP translation unit leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language construct, you could workaround this issue by adding the `-x c++` switch to force the compiler to treat the file as regular C++. Another approach would be to guard the OpenMP code with `#ifdef _OPENMP` so that the code block is disabled when compiling for the GPU. The `__HIP_DEVICE_COMPILE__` macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.
|
||||
@@ -96,8 +96,7 @@ Supported `__host__` functions are
|
||||
|
||||
`__host__` cannot combine with `__global__`.
|
||||
|
||||
HIP parses the `__noinline__` and `__forceinline__` keywords and converts them to the appropriate Clang attributes. The hcc compiler, however, currently in-lines all device functions, so they are effectively ignored.
|
||||
|
||||
HIP parses the `__noinline__` and `__forceinline__` keywords and converts them to the appropriate Clang attributes.
|
||||
|
||||
## Calling `__global__` Functions
|
||||
|
||||
@@ -244,7 +243,7 @@ typedef struct dim3 {
|
||||
## Memory-Fence Instructions
|
||||
HIP supports __threadfence() and __threadfence_block().
|
||||
|
||||
HIP provides workaround for threadfence_system() under HCC path.
|
||||
HIP provides workaround for threadfence_system() under the HIP-Clang path.
|
||||
To enable the workaround, HIP should be built with environment variable HIP_COHERENT_HOST_ALLOC enabled.
|
||||
In addition,the kernels that use __threadfence_system() should be modified as follows:
|
||||
- The kernel should only operate on finegrained system memory; which should be allocated with hipHostMalloc().
|
||||
@@ -254,7 +253,7 @@ In addition,the kernels that use __threadfence_system() should be modified as fo
|
||||
The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int) and __syncthreads_or(int) functions are under development.
|
||||
|
||||
## Math Functions
|
||||
hcc supports a set of math operations callable from the device.
|
||||
HIP-Clang supports a set of math operations callable from the device.
|
||||
|
||||
### Single Precision Mathematical Functions
|
||||
Following is the list of supported single precision mathematical functions.
|
||||
@@ -468,9 +467,9 @@ Following is the list of supported integer intrinsics. Note that intrinsics are
|
||||
| int __mul24 ( int x, int y )<br><sub>Multiply two 24bit integers.</sub> |
|
||||
| unsigned int __umul24 ( unsigned int x, unsigned int y )<br><sub>Multiply two 24bit unsigned integers.</sub> |
|
||||
<sub><b id="f3"><sup>[1]</sup></b>
|
||||
The hcc implementation of __ffs() and __ffsll() contains code to add a constant +1 to produce the ffs result format.
|
||||
The HIP-Clang implementation of __ffs() and __ffsll() contains code to add a constant +1 to produce the ffs result format.
|
||||
For the cases where this overhead is not acceptable and programmer is willing to specialize for the platform,
|
||||
hcc provides hc::__lastbit_u32_u32(unsigned int input) and hc::__lastbit_u32_u64(unsigned long long int input).
|
||||
HIP-Clang provides __lastbit_u32_u32(unsigned int input) and __lastbit_u32_u64(unsigned long long int input).
|
||||
The index returned by __lastbit_ instructions starts at -1, while for ffs the index starts at 0.
|
||||
|
||||
### Floating-point Intrinsics
|
||||
@@ -689,8 +688,7 @@ implementation of malloc and free that can be called from device functions.
|
||||
|
||||
GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simulaneously running. Thus GPUs have a complex relationship between resource usage and performance.
|
||||
|
||||
__hip_launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code.
|
||||
__hip_launch_bounds__ is a function attribute that must be attached to a __global__ function:
|
||||
__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:
|
||||
|
||||
```
|
||||
__global__ void `__launch_bounds__`(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EU) MyKernel(...) ...
|
||||
@@ -734,17 +732,16 @@ The key differences in the interface are:
|
||||
The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
|
||||
- Execution Units (rather than multiProcessor):
|
||||
The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiProcessor. The hipDeviceProps has a field executionUnitsPerMultiprocessor.
|
||||
Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HCC platforms, if desired.
|
||||
Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
|
||||
|
||||
|
||||
### maxregcount
|
||||
Unlike nvcc, hcc does not support the "--maxregcount" option. Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than
|
||||
micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file. hip_launch_bounds works on both hcc and nvcc targets.
|
||||
Unlike nvcc, HIP-Clang does not support the "--maxregcount" option. Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than
|
||||
micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file. hip_launch_bounds works on both HIP-Clang and nvcc targets.
|
||||
|
||||
|
||||
## Register Keyword
|
||||
The register keyword is deprecated in C++, and is silently ignored by both nvcc and hcc. To see warnings, you can pass the option `-Wdeprecated-register` to hcc.
|
||||
|
||||
The register keyword is deprecated in C++, and is silently ignored by both nvcc and HIP-Clang. You can pass the option `-Wdeprecated-register` the compiler warning message.
|
||||
|
||||
## Pragma Unroll
|
||||
|
||||
@@ -790,18 +787,15 @@ The following C++ features are not supported:
|
||||
|
||||
## Kernel Compilation
|
||||
hipcc now supports compiling C++/HIP kernels to binary code objects.
|
||||
The user can specify the target for which the binary can be generated. HIP/HCC does not yet support fat binaries so only a single target may be specified.
|
||||
The file format for binary is `.co` which means Code Object. The following command builds the code object using `hipcc`.
|
||||
|
||||
`hipcc --genco --targets [TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]`
|
||||
`hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]`
|
||||
|
||||
```
|
||||
[TARGET GPU] = gfx900 gfx803 gfx701
|
||||
[TARGET GPU] = GPU architecture
|
||||
[INPUT FILE] = Name of the file containing kernels
|
||||
[OUTPUT FILE] = Name of the generated code object file
|
||||
```
|
||||
|
||||
Note that one important fact to remember when using binary code objects is that the number of arguments to the kernel are different on HCC and NVCC path. Refer to the sample in samples/0_Intro/module_api for differences in the arguments to be passed to the kernel.
|
||||
|
||||
## gfx-arch-specific-kernel
|
||||
Clang defined '__gfx*__' macros can be used to execute gfx arch specific codes inside the kernel. Refer to the sample 14_gpu_arch in samples/2_Cookbook.
|
||||
|
||||
@@ -49,18 +49,18 @@ Rather than present two separate APIs, HIP extends the HIP API with new APIs for
|
||||
### hipModule API
|
||||
|
||||
Like the CUDA Driver API, the Module API provides additional control over how code is loaded, including options to load code from files or from in-memory pointers.
|
||||
NVCC and HCC target different architectures and use different code object formats: NVCC is `cubin` or `ptx` files, while the HCC path is the `hsaco` format.
|
||||
NVCC and HIP-Clang target different architectures and use different code object formats: NVCC is `cubin` or `ptx` files, while the HIP-Clang path is the `hsaco` format.
|
||||
The external compilers which generate these code objects are responsible for generating and loading the correct code object for each platform.
|
||||
Notably, there is not a fat binary format that can contain code for both NVCC and HCC platforms. The following table summarizes the formats used on each platform:
|
||||
Notably, there is not a fat binary format that can contain code for both NVCC and HIP-Clang platforms. The following table summarizes the formats used on each platform:
|
||||
|
||||
| Format | APIs | NVCC | HCC | HIP-CLANG |
|
||||
| --- | --- | --- | --- | ---
|
||||
| Code Object | hipModuleLoad, hipModuleLoadData | .cubin or PTX text | .hsaco | .hsaco |
|
||||
| Fat Binary | hipModuleLoadFatBin | .fatbin | Under Development | .hip_fatbin |
|
||||
| Format | APIs | NVCC | HIP-CLANG |
|
||||
| --- | --- | --- | --- |
|
||||
| Code Object | hipModuleLoad, hipModuleLoadData | .cubin or PTX text | .hsaco |
|
||||
| Fat Binary | hipModuleLoadFatBin | .fatbin | .hip_fatbin |
|
||||
|
||||
`hipcc` uses NVCC and HCC to compile host codes. Both of these may embed code objects into the final executable, and these code objects will be automatically loaded when the application starts.
|
||||
`hipcc` uses HIP-Clang or NVCC to compile host codes. Both of these may embed code objects into the final executable, and these code objects will be automatically loaded when the application starts.
|
||||
The hipModule API can be used to load additional code objects, and in this way provides an extended capability to the automatically loaded code objects.
|
||||
HCC allows both of these capabilities to be used together, if desired. Of course it is possible to create a program with no kernels and thus no automatic loading.
|
||||
HIP-Clang allows both of these capabilities to be used together, if desired. Of course it is possible to create a program with no kernels and thus no automatic loading.
|
||||
|
||||
|
||||
### hipCtx API
|
||||
@@ -77,22 +77,16 @@ The first flavor may be faster in some cases since they avoid host overhead to d
|
||||
|
||||
HIP defines a single error space, and uses camel-case for all errors (i.e. `hipErrorInvalidValue`).
|
||||
|
||||
### HCC Implementation Notes
|
||||
#### .hsaco
|
||||
The .hsaco format used by HCC is described in more detail [here](https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc).
|
||||
An example and blog that show how to use the format is [here](http://gpuopen.com/rocm-with-harmony-combining-opencl-hcc-hsa-in-a-single-program). hsaco can be generated by hcc + extractkernel tool, cloc, the GCN assembler, or other tools.
|
||||
|
||||
#### Address Spaces
|
||||
HCC defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool.
|
||||
HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool.
|
||||
Thus addresses may be shared between contexts, and unlike the original CUDA definition a new context does not create a new address space for the device.
|
||||
|
||||
#### Using hipModuleLaunchKernel
|
||||
`hipModuleLaunchKernel` is `cuLaunchKernel` in HIP world. It takes the same arguments as `cuLaunchKernel`. The argument `kernelParams` is not fully implemented for HCC. The workaround for it is, to use platform specific macros for each target. Or, `extra` argument can be used which works on both the platforms.
|
||||
`hipModuleLaunchKernel` is `cuLaunchKernel` in HIP world. It takes the same arguments as `cuLaunchKernel`.
|
||||
|
||||
#### Additional Information
|
||||
- HCC allocates staging buffers (used for unpinned copies) on a per-device basis.
|
||||
- HCC creates a primary context when the HIP API is called. So in a pure driver API code, HIP/HCC will create a primary context while HIP/NVCC will have empty context stack.
|
||||
HIP/HCC will push primary context to context stack when it is empty. This can have subtle differences on applications which mix the runtime and driver APIs.
|
||||
- HIP-Clang creates a primary context when the HIP API is called. So in a pure driver API code, HIP-Clang will create a primary context while HIP/NVCC will have empty context stack.
|
||||
HIP-Clang will push primary context to context stack when it is empty. This can have subtle differences on applications which mix the runtime and driver APIs.
|
||||
|
||||
### hip-clang Implementation Notes
|
||||
#### .hip_fatbin
|
||||
@@ -130,9 +124,9 @@ CUDA applications may want to mix CUDA driver code with HIP code (see example be
|
||||
|
||||
#### Compilation Options
|
||||
The `hipModule_t` interface does not support `cuModuleLoadDataEx` function, which is used to control PTX compilation options.
|
||||
HCC does not use PTX and does not support these compilation options.
|
||||
In fact, HCC code objects always contain fully compiled ISA and do not require additional compilation as a part of the load step.
|
||||
The corresponding HIP function `hipModuleLoadDataEx` behaves as `hipModuleLoadData` on HCC path (compilation options are not used) and as `cuModuleLoadDataEx` on NVCC path.
|
||||
HIP-Clang does not use PTX and does not support these compilation options.
|
||||
In fact, HIP-Clang code objects always contain fully compiled ISA and do not require additional compilation as a part of the load step.
|
||||
The corresponding HIP function `hipModuleLoadDataEx` behaves as `hipModuleLoadData` on HIP-Clang path (compilation options are not used) and as `cuModuleLoadDataEx` on NVCC path.
|
||||
For example (CUDA):
|
||||
```
|
||||
CUmodule module;
|
||||
@@ -164,7 +158,7 @@ options[0] = hipJitOptionMaxRegisters;
|
||||
unsigned maxRegs = 15;
|
||||
optionValues[0] = (void*)(&maxRegs);
|
||||
|
||||
// hipModuleLoadData(module, imagePtr) will be called on HCC path, JIT options will not be used, and
|
||||
// hipModuleLoadData(module, imagePtr) will be called on HIP-Clang path, JIT options will not be used, and
|
||||
// cupModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues) will be called on NVCC path
|
||||
hipModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);
|
||||
|
||||
|
||||
@@ -151,31 +151,13 @@ All HIP projects target either AMD or NVIDIA platform. The platform affects whic
|
||||
|
||||
- `HIP_PLATFORM_NVCC` is defined if the HIP platform targets NVIDIA
|
||||
|
||||
On AMD platform, the compiler was hcc, but is deprecated in ROCM v3.5 release, and HIP-Clang compiler is introduced for compiling HIP programs.
|
||||
|
||||
For most HIP applications, the transition from hcc to HIP-Clang is transparent.
|
||||
HIPCC and HIP cmake files automatically choose compilation options for HIP-Clang and hide the difference between the hcc and hip-clang code.
|
||||
However, minor changes may be required as HIP-Clang has stricter syntax and semantic checks compared to hcc.
|
||||
|
||||
Many projects use a mixture of an accelerator compiler (AMD or NVIDIA) and a standard compiler (e.g. g++). These defines are set for both accelerator and standard compilers and thus are often the best option when writing code that uses conditional compilation.
|
||||
|
||||
|
||||
|
||||
### Identifying the Compiler: hip-clang or nvcc
|
||||
Often, it's useful to know whether the underlying compiler is HIP-Clang or nvcc. This knowledge can guard platform-specific code or aid in platform-specific performance tuning.
|
||||
|
||||
```
|
||||
#ifdef __HIP_PLATFORM_HCC__
|
||||
// Compiled with HIP-Clang
|
||||
|
||||
```
|
||||
|
||||
```
|
||||
#if defined(__HCC__) || (defined(__clang__) && defined(__HIP__))
|
||||
#define __HIP_PLATFORM_HCC__
|
||||
#endif
|
||||
// Compiled with HIP-Clang
|
||||
|
||||
```
|
||||
|
||||
```
|
||||
@@ -236,7 +218,7 @@ Some CUDA code tests `__CUDA_ARCH__` for a specific value to determine whether t
|
||||
#if (__CUDA_ARCH__ >= 130)
|
||||
// doubles are supported
|
||||
```
|
||||
This type of code requires special attention, since hcc/AMD and nvcc/CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported.
|
||||
This type of code requires special attention, since AMD and CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported.
|
||||
|
||||
The `__HIP_ARCH_*` defines can replace comparisons of `__CUDA_ARCH__` values:
|
||||
```
|
||||
@@ -349,12 +331,20 @@ MY_LAUNCH (hipLaunchKernel(vAdd, dim3(1024), dim3(1), 0, 0, Ad), true, "firstCal
|
||||
hipcc is a portable compiler driver that will call nvcc or HIP-Clang (depending on the target system) and attach all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler.
|
||||
The `hipconfig` script may helpful in identifying the target platform, compiler and runtime. It can also help set options appropriately.
|
||||
|
||||
### Option for specifying GPU processor
|
||||
### Compiler options supported on AMD platforms
|
||||
|
||||
`--offload-arch=X` can be used to specify [target ID](https://clang.llvm.org/docs/ClangOffloadBundlerFileFormat.html#target-id).
|
||||
Here are the main compiler options supported on AMD platforms by HIP-Clang.
|
||||
|
||||
For backward compatibility, hipcc also accepts `--amdgpu-target=X` for specifying target ID. However, it will be deprecated
|
||||
in future releases.
|
||||
| Option | Description |
|
||||
| ------ | ----------- |
|
||||
| --amdgpu-target=<gpu_arch> | [DEPRECATED] This option is being replaced by `--offload-arch=<target>`. Generate code for the given GPU target. Supported targets are gfx701, gfx801, gfx802, gfx803, gfx900, gfx906, gfx908, gfx1010, gfx1011, gfx1012, gfx1030, gfx1031. This option could appear multiple times on the same command line to generate a fat binary for multiple targets. |
|
||||
| --fgpu-rdc | Generate relocatable device code, which allows kernels or device functions calling device functions in different translation units. |
|
||||
| -ggdb | Equivalent to `-g` plus tuning for GDB. This is recommended when using ROCm's GDB to debug GPU code. |
|
||||
| --gpu-max-threads-per-block=<num> | Generate code to support up to the specified number of threads per block. |
|
||||
| -O<n> | Specify the optimization level. |
|
||||
| -offload-arch=<target> | Specify the AMD GPU [target ID](https://clang.llvm.org/docs/ClangOffloadBundlerFileFormat.html#target-id). |
|
||||
| -save-temps | Save the compiler generated intermediate files. |
|
||||
| -v | Show the compilation steps. |
|
||||
|
||||
## Linking Issues
|
||||
|
||||
@@ -374,7 +364,7 @@ It also uses a standard compiler (g++) for the rest of the application. nvcc is
|
||||
Code compiled using this tool can employ only the intersection of language features supported by both nvcc and the host compiler.
|
||||
In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported---for example, recent nvcc versions lack Clang host-compiler capability.
|
||||
|
||||
hcc generates both device and host code using the same Clang-based compiler. The code uses the same API as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using hcc can link with code compiled using "standard" compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.
|
||||
HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same API as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using "standard" compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.
|
||||
|
||||
|
||||
### libc++ and libstdc++
|
||||
@@ -385,8 +375,8 @@ If you pass "--stdlib=libc++" to hipcc, hipcc will use the libc++ library. Gene
|
||||
|
||||
When cross-linking C++ code, any C++ functions that use types from the C++ standard library (including std::string, std::vector and other containers) must use the same standard-library implementation. They include the following:
|
||||
|
||||
- Functions or kernels defined in hcc that are called from a standard compiler
|
||||
- Functions defined in a standard compiler that are called from hcc.
|
||||
- Functions or kernels defined in HIP-Clang that are called from a standard compiler
|
||||
- Functions defined in a standard compiler that are called from HIP-Clanng.
|
||||
|
||||
Applications with these interfaces should use the default libstdc++ linking.
|
||||
|
||||
@@ -423,7 +413,7 @@ The hipify-perl script automatically converts "cuda_runtime.h" to "hip_runtime.h
|
||||
|
||||
#### cuda.h
|
||||
|
||||
The hcc path provides an empty cuda.h file. Some existing CUDA programs include this file but don't require any of the functions.
|
||||
The HIP-Clang path provides an empty cuda.h file. Some existing CUDA programs include this file but don't require any of the functions.
|
||||
|
||||
### Choosing HIP File Extensions
|
||||
|
||||
|
||||
@@ -53,71 +53,17 @@ A stronger system-level fence can be specified when the event is created with hi
|
||||
- Coherent host memory is the default and is the easiest to use since the memory is visible to the CPU at typical synchronization points. This memory allows in-kernel synchronization commands such as threadfence_system to work transparently.
|
||||
- HIP/ROCm also supports the ability to cache host memory in the GPU using the "Non-Coherent" host memory allocations. This can provide performance benefit, but care must be taken to use the correct synchronization.
|
||||
|
||||
|
||||
## Unpinned Memory Transfer Optimization
|
||||
Please note that this document lists possible ways for experimenting with HIP stack to gain performance. Performance may vary from platform to platform.
|
||||
|
||||
### On Small BAR Setup
|
||||
|
||||
There are two possible ways to transfer data from host-to-device (H2D) and device-to-host(D2H)
|
||||
* Using Staging Buffers
|
||||
* Using PinInPlace
|
||||
|
||||
### On Large BAR Setup
|
||||
|
||||
There are three possible ways to transfer data from host-to-device (H2D)
|
||||
* Using Staging Buffers
|
||||
* Using PinInPlace
|
||||
* Direct Memcpy
|
||||
|
||||
And there are two possible ways to transfer data from device-to-host (D2H)
|
||||
* Using Staging Buffers
|
||||
* Using PinInPlace
|
||||
|
||||
Some GPUs may not be able to directly access host memory, and in these cases we need to
|
||||
stage the copy through an optimized pinned staging buffer, to implement H2D and D2H copies.The copy is broken into buffer-sized chunks to limit the size of the buffer and also to provide better performance by overlapping the CPU copies with the DMA copies.
|
||||
|
||||
PinInPlace is another algorithm which pins the host memory "in-place", and copies it with the DMA engine.
|
||||
|
||||
Unpinned memory transfer mode can be controlled using environment variable HCC_UNPINNED_COPY_MODE.
|
||||
|
||||
By default HCC_UNPINNED_COPY_MODE is set to 0, which uses default threshold values to decide which transfer way to use based on data size.
|
||||
|
||||
Setting HCC_UNPINNED_COPY_MODE = 1, forces all unpinned transfer to use PinInPlace logic.
|
||||
|
||||
Setting HCC_UNPINNED_COPY_MODE = 2, forces all unpinned transfer to use Staging buffers.
|
||||
|
||||
Setting HCC_UNPINNED_COPY_MODE = 3, forces all unpinned transfer to use direct memcpy on large BAR systems.
|
||||
|
||||
Following environment variables can be used to control the transfer thresholds:
|
||||
|
||||
- HCC_H2D_STAGING_THRESHOLD - Threshold in KB for H2D copy. For sizes smaller than threshold direct copy logic would be used else staging buffers logic. By default it is set to 64.
|
||||
|
||||
- HCC_H2D_PININPLACE_THRESHOLD - Threshold in KB for H2D copy. For sizes smaller than threshold staging buffers logic would be used else PinInPlace logic. By default it is set to 4096.
|
||||
|
||||
- HCC_D2H_PININPLACE_THRESHOLD - Threshold in KB for D2H copy. For sizes smaller than threshold staging buffer logic would be used else PinInPlace logic. By default it is set to 1024.
|
||||
|
||||
## Device-Side Malloc
|
||||
|
||||
hip-hcc and hip-clang supports device-side malloc and free. Users can allocate
|
||||
memory dynamically in a kernel. The allocated memory are in global address
|
||||
space, however, different threads get different memory allocations for the same
|
||||
call of malloc. The allocated memory can be accessed or freed by other threads
|
||||
or other kernels. It persists in the life time of the HIP program until it is
|
||||
freed.
|
||||
|
||||
The memory are allocated in pages. Users can define macro
|
||||
`__HIP_SIZE_OF_PAGE` for controlling the page size in bytes and macro
|
||||
`__HIP_NUM_PAGES` for controlling the total number of pages that can be
|
||||
allocated.
|
||||
HIP-Clang currenntly doesn't supports device-side malloc and free.
|
||||
|
||||
## Use of Long Double Type
|
||||
|
||||
In HCC and HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HCC and HIP-Clang treat long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type.
|
||||
In HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type.
|
||||
|
||||
## FMA and contractions
|
||||
|
||||
By default HIP-Clang assumes -ffp-contract=fast and HCC assumes -ffp-contract=off.
|
||||
By default HIP-Clang assumes -ffp-contract=fast.
|
||||
For x86_64, FMA is off by default since the generic x86_64 target does not
|
||||
support FMA by default. To turn on FMA on x86_64, either use -mfma or -march=native
|
||||
on CPU's supporting FMA.
|
||||
|
||||
@@ -8,17 +8,12 @@ The default device can be set with hipSetDevice.
|
||||
|
||||
- "active host thread" - the thread which is running the HIP APIs.
|
||||
|
||||
- completion_future becomes ready. "Completes".
|
||||
|
||||
- hcc = Heterogeneous Compute Compiler ( https://github.com/RadeonOpenCompute/hcc).
|
||||
Starting from ROCM v3.5 release, hcc compiler is deprecated and HIP-Clang compiler is introduced for compiling HIP programs
|
||||
|
||||
- HIP-Clang - Heterogeneous AMDGPU Compiler, with its capability to compile HIP programs on AMD platform (https://github.com/RadeonOpenCompute/llvm-project).
|
||||
|
||||
- ROCclr - a virtual device interface that compute runtimes interact with different backends such as ROCr on Linux or PAL on Windows.
|
||||
The ROCclr (https://github.com/ROCm-Developer-Tools/ROCclr) is an abstraction layer allowing runtimes to work on both OSes without much effort.
|
||||
|
||||
- hipify tools - tools to convert CUDA(R) code to portable C++ code (https://github.com/ROCm-Developer-Tools/HIPIFY).
|
||||
- hipify tools - tools to convert CUDA code to portable C++ code (https://github.com/ROCm-Developer-Tools/HIPIFY).
|
||||
|
||||
- hipconfig - tool to report various configuration properties of the target platform.
|
||||
|
||||
|
||||
Referens i nytt ärende
Block a user