SWDEV-422771 - Update some documents

Change-Id: I7a4032af44b66b6936e845ecce59e189f5c03664
This commit is contained in:
Julia Jiang
2023-09-22 14:37:33 -04:00
committed by Julia Jiang
parent 2e8e205552
commit a4066ec239
8 changed files with 26 additions and 26 deletions
+2 -1
View File
@@ -11,6 +11,7 @@ This is the full HIP Runtime API reference. The API is organized into
- @ref Error
- @ref Stream
- @ref StreamM
- @ref Event
- @ref Memory
- @ref External
- @ref MemoryM
@@ -29,4 +30,4 @@ This is the full HIP Runtime API reference. The API is organized into
- @ref Graph
- @ref Virtual
- @ref GL
- [Surface Object](#Surface)
- @ref Surface
+5 -5
View File
@@ -6,7 +6,7 @@ HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler,
Before build and run HIP, make sure drivers and pre-build packages are installed properly on the platform.
### AMD platform
Install ROCm packages (see ROCm Installation Guide on AMD public documentation site (https://docs.amd.com/) or install pre-built binary packages using the package manager,
Install ROCm packages or pre-built binary packages using the package manager. Refer to the ROCm Installation Guide at https://rocm.docs.amd.com for more information on installing ROCm.
```shell
sudo apt install mesa-common-dev
@@ -23,14 +23,14 @@ Install Nvidia driver and pre-build packages (see HIP Installation Guide at http
### Branch of repository
Before get HIP source code, set the expected branch of repository at the variable `ROCM_BRANCH`.
For example, for ROCm5.6 release branch, set
For example, for ROCm6.0 release branch, set
```shell
export ROCM_BRANCH=rocm-5.6.x
export ROCM_BRANCH=rocm-6.0.x
```
ROCm5.6 release branch, set
ROCm5.7 release branch, set
```shell
export ROCM_BRANCH=rocm-5.6.x
export ROCM_BRANCH=rocm-5.7.x
```
Similiar format for future branches.
+1
View File
@@ -80,3 +80,4 @@ Should use roctracer/rocTX instead
### hipTexRefSetMipmapLevelBias
### hipTexRefSetMipmapLevelClamp
### hipTexRefSetMipmappedArray
### hipBindTextureToMipmappedArray
+5 -7
View File
@@ -6,7 +6,7 @@ HIP provides a C++ syntax that is suitable for compiling most code that commonly
- Math functions resembling those in the "math.h" header included with standard C++ compilers
- Built-in functions for accessing specific GPU hardware capabilities
This section describes the built-in variables and functions accessible from the HIP kernel. Its intended for readers who are familiar with Cuda kernel syntax and want to understand how HIP is different.
This section describes the built-in variables and functions accessible from the HIP kernel. It is intended for readers familiar with CUDA kernel syntax and wanting to understand how HIP is different from CUDA.
Features are marked with one of the following keywords:
- **Supported**---HIP supports the feature with a Cuda-equivalent function
@@ -26,11 +26,10 @@ Supported `__global__` functions are
- Executed on the device
- Called ("launched") from the host
HIP `__global__` functions must have a `void` return type, and the first parameter to a HIP `__global__` function must have the type `hipLaunchParm`. See [Kernel-Launch Example](#kernel-launch-example).
HIP `__global__` functions must have a `void` return type, and the first parameter to a HIP `__global__` function must have the type `hipLaunchParm`. See [Kernel-Launch Example](#kernel-launch-example).
HIP lacks dynamic-parallelism support, so `__global__ ` functions cannot be called from the device.
(host_attr)=
### `__host__`
Supported `__host__` functions are
- Executed on the host
@@ -68,7 +67,7 @@ MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
```
The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. HIPIFY tools optionally convert Cuda launch syntax to hipLaunchKernelGGL, including conversion of optional arguments in <<< >>> to the five required hipLaunchKernelGGL parameters. The dim3 constructor accepts zero to three arguments and will by default initialize unspecified dimensions to 1. See [dim3](#dim3). The kernel uses the coordinate built-ins (thread*, block*, grid*) to determine coordinate index and coordinate bounds of the work item thats currently executing. See {ref}`coordinate_builtins`.
The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. HIPIFY tools optionally convert Cuda launch syntax to hipLaunchKernelGGL, including conversion of optional arguments in <<< >>> to the five required hipLaunchKernelGGL parameters. The dim3 constructor accepts zero to three arguments and will by default initialize unspecified dimensions to 1. See [dim3](#dim3). The kernel uses the coordinate built-ins (thread*, block*, grid*) to determine coordinate index and coordinate bounds of the work item that's currently executing. See [Coordinate Built-Ins](#Coordinate-Built-Ins).
Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32.
@@ -141,7 +140,6 @@ The warpSize variable is of type int and contains the warp size (in threads) for
Note that these types are defined in hip_runtime.h and are not automatically provided by the compiler.
### Short Vector Types
Short vector types derive from the basic integer and floating-point types. They are structures defined in hip_vector_types.h. The first, second, third and fourth components of the vector are accessible through the ```x```, ```y```, ```z``` and ```w``` fields, respectively. All the short vector types support a constructor function of the form ```make_<type_name>()```. For example, ```float4 make_float4(float x, float y, float z, float w)``` creates a vector of type ```float4``` and value ```(x,y,z,w)```.
@@ -558,7 +556,7 @@ Towards this end, HIP has four extra functions to help developers more precisely
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.
Note that Nvidia and AMD devices have different warp sizes, so portable code should use the warpSize built-ins to query the warp size. Hipified code from the Cuda path requires careful review to ensure it doesnt assume a waveSize of 32. "Wave-aware" code that assumes a waveSize of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources. WarpSize built-ins should only be used in device functions and its value depends on GPU arch. Users should not assume warpSize to be a compile-time constant. Host functions should use hipGetDeviceProperties to get the default warp size of a GPU device:
Note that Nvidia and AMD devices have different warp sizes, so portable code should use the warpSize built-ins to query the warp size. Hipified code from the Cuda path requires careful review to ensure it doesn't assume a waveSize of 32. "Wave-aware" code that assumes a waveSize of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources. WarpSize built-ins should only be used in device functions and its value depends on GPU arch. Users should not assume warpSize to be a compile-time constant. Host functions should use hipGetDeviceProperties to get the default warp size of a GPU device:
```
cudaDeviceProp props;
@@ -584,7 +582,7 @@ Threads in a warp are referred to as *lanes* and are numbered from 0 to warpSize
Applications can test whether the target platform supports the any/all instruction using the `hasWarpVote` device property or the HIP_ARCH_HAS_WARP_VOTE compiler define.
`__ballot` provides a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane. Note that HIP's `__ballot` function supports a 64-bit return value (compared with Cudas 32 bits). Code ported from Cuda should support the larger warp sizes that the HIP version of this instruction supports. Applications can test whether the target platform supports the ballot instruction using the `hasWarpBallot` device property or the HIP_ARCH_HAS_WARP_BALLOT compiler define.
`__ballot` provides a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane. Note that HIP's `__ballot` function supports a 64-bit return value (compared with Cuda's 32 bits). Code ported from Cuda should support the larger warp sizes that the HIP version of this instruction supports. Applications can test whether the target platform supports the ballot instruction using the `hasWarpBallot` device property or the HIP_ARCH_HAS_WARP_BALLOT compiler define.
### Warp Shuffle Functions
+1 -1
View File
@@ -280,7 +280,7 @@ The per-thread default stream is an implicit stream local to both the thread and
The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
In ROCm, a compilation option should be added in order to compile the translation unit with per-thread default stream enabled.
-fgpu-default-stream=per-thread.
"-fgpu-default-stream=per-thread".
Once source is compiled with per-thread default stream enabled, all APIs will be executed on per thread default stream, hence there will not be any implicit synchronization with other streams.
Besides, per-thread default stream be enabled per translation unit, users can compile some files with feature enabled and some with feature disabled. Feature enabled translation unit will have default stream as per thread and there will not be any implicit synchronization done but other modules will have legacy default stream which will do implicit synchronization.
+3 -3
View File
@@ -89,7 +89,7 @@ directory names.
| CUB | rocPRIM | Low Level Optimized Parallel Primitives
| cuDNN | MIOpen | Deep learning Solver Library
| cuRAND | rocRAND | Random Number Generator Library
| EIGEN | EIGEN HIP port | C++ template library for linear algebra: matrices, vectors, numerical solvers,
| EIGEN | EIGEN - HIP port | C++ template library for linear algebra: matrices, vectors, numerical solvers,
| NCCL | RCCL | Communications Primitives Library based on the MPI equivalents
@@ -333,7 +333,7 @@ CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config)
nvcc includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included.
Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers.
If the compilation process reports that it cannot find necessary APIs (for example, "error: identifier ‘hipSetDevice’ is undefined"),
If the compilation process reports that it cannot find necessary APIs (for example, "error: identifier hipSetDevice is undefined"),
ensure that the file includes hip_runtime.h (or hip_runtime_api.h, if appropriate).
The hipify-perl script automatically converts "cuda_runtime.h" to "hip_runtime.h," and it converts "cuda_runtime_api.h" to "hip_runtime_api.h", but it may miss nested headers or macros.
@@ -465,7 +465,7 @@ As an example, please see the code from the [link](github.com/ROCm-Developer-Too
With the #ifdef condition, HIP APIs work as expected on both AMD and NVIDIA platforms.
Note, cudaMemoryType enum value cudaMemoryTypeUnregstered is not supported currently in hipMemoryType, due to HIP functionality backward compatibility.
Note, cudaMemoryTypeUnregstered is currently not supported in hipMemoryType enum, due to HIP functionality backward compatibility.
## threadfence_system
Threadfence_system makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to CPU and other GPU devices.
+6 -6
View File
@@ -29,10 +29,10 @@ Note, Numa policy is so far implemented on Linux, and under development on Windo
### Coherency Controls
ROCm defines two coherency options for host memory:
- Coherent memory : Supports fine-grain synchronization while the kernel is running.  For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs.  Synchronization instructions include threadfence_system and C++11-style atomic operations.
- Coherent memory : Supports fine-grain synchronization while the kernel is running. For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs. Synchronization instructions include threadfence_system and C++11-style atomic operations.
In order to achieve this fine-grained coherence, many AMD GPUs use a limited cache policy, such as leaving these allocations uncached by the GPU, or making them read-only.
- Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running.  Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries.  This memory is appropriate for high-performance access when fine-grain synchronization is not required.
- Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running. Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries. This memory is appropriate for high-performance access when fine-grain synchronization is not required.
HIP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable. By default, the environment variable HIP_HOST_COHERENT is set to 0 in HIP.
The control logic in the current version of HIP is as follows:
@@ -59,10 +59,10 @@ Non-coherent
### hipEventSynchronize
Developers can control the release scope for hipEvents:
- By default, the GPU performs a device-scope acquire and release operation with each recorded event.  This will make host and device memory visible to other commands executing on the same device.
- By default, the GPU performs a device-scope acquire and release operation with each recorded event. This will make host and device memory visible to other commands executing on the same device.
A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags:
- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded.  This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing.  Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem.
- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded. This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing. Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem.
- hipEventDisableTiming: Events created with this flag would not record profiling data and provide best performance if used for synchronization.
### Summary and Recommendations:
@@ -146,7 +146,7 @@ The test codes in the link (https://github.com/ROCm-Developer-Tools/HIP/blob/dev
The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that the command issued to the per-thread default stream by the thread does not implicitly synchronize with other streams (like explicitly created streams), or default per-thread stream on other threads.
The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
The per-thread default stream can be enabled via adding a compilation option,
-fgpu-default-stream=per-thread.
"-fgpu-default-stream=per-thread".
And users can explicitly use "hipStreamPerThread" as per-thread default stream handle as input in API commands. There are test codes as examples in the link (https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/catch/unit/streamperthread).
@@ -193,4 +193,4 @@ Here is an example to create and use static libraries:
hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
```
For more information, please see samples/2_Cookbook/15_static_library/host_functions and samples/2_Cookbook/15_static_library/device_functions.
For more information, please see [HIP samples](https://github.com/ROCm-Developer-Tools/hip-tests/tree/rocm-5.5.x/samples/2_Cookbook/15_static_library/host_functions) and [samples](https://github.com/ROCm-Developer-Tools/hip-tests/tree/rocm-5.5.x/samples/2_Cookbook/15_static_library/device_functions).
+3 -3
View File
@@ -2924,13 +2924,13 @@ hipError_t hipMemPrefetchAsync(const void* dev_ptr,
*
* @returns #hipSuccess, #hipErrorInvalidValue
*
* This HIP advises about the usage to be applied on unified memory allocation in the
* This HIP API advises about the usage to be applied on unified memory allocation in the
* range starting from the pointer address devPtr, with the size of count bytes.
* The memory range must refer to managed memory allocated via the API hipMallocManaged, and the
* range will be handled with proper round down and round up respectively in the driver to
* be aligned to CPU page size.
*
* @note This API is implemented on Linux, under development on Windows.
* @note This API is implemented on Linux and is under development on Windows.
*/
hipError_t hipMemAdvise(const void* dev_ptr,
size_t count,
@@ -4316,7 +4316,7 @@ hipError_t hipMemcpyToArray(hipArray* dst, size_t wOffset, size_t hOffset, const
*
* @param[in] dst Destination memory address
* @param[in] srcArray Source memory address
* @param[in] woffset Source starting X offset
* @param[in] wOffset Source starting X offset
* @param[in] hOffset Source starting Y offset
* @param[in] count Size in bytes to copy
* @param[in] kind Type of transfer