SWDEV-288929 - Add atomic system functions in HIP doc

Change-Id: Ic9ec276c1c486c4e1d3b83e1578081673f97efa2
Этот коммит содержится в:
jujiang
2021-06-03 21:09:27 -04:00
коммит произвёл Julia Jiang
родитель 575be36c2e
Коммит d8b2571d47
2 изменённых файлов: 92 добавлений и 62 удалений
+64 -32
Просмотреть файл
@@ -511,40 +511,72 @@ Returns the value of counter that is incremented every clock cycle on device. Di
Atomic functions execute as read-modify-write operations residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
HIP adds new APIs with _system as suffix to support system scope atomic operations. For example, atomicAnd atomic is dedicated to the GPU device, atomicAnd_system will allow developers to extend the atomic operation to system scope, from the GPU device to other CPUs and GPU devices in the system.
HIP supports the following atomic operations.
| **Function** | **Supported in HIP** | **Supported in CUDA** |
| --- | --- | --- |
| int atomicAdd(int* address, int val) | ✓ | ✓ |
| unsigned int atomicAdd(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicAdd(unsigned long long int* address,unsigned long long int val) | ✓ | ✓ |
| float atomicAdd(float* address, float val) | ✓ | ✓ |
| int atomicSub(int* address, int val) | ✓ | ✓ |
| unsigned int atomicSub(unsigned int* address,unsigned int val) | ✓ | ✓ |
| int atomicExch(int* address, int val) | ✓ | ✓ |
| unsigned int atomicExch(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicExch(unsigned long long int* address,unsigned long long int val) | ✓ | ✓ |
| float atomicExch(float* address, float val) | ✓ | ✓ |
| int atomicMin(int* address, int val) | ✓ | ✓ |
| unsigned int atomicMin(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicMin(unsigned long long int* address,unsigned long long int val) | ✓ | ✓ |
| int atomicMax(int* address, int val) | ✓ | ✓ |
| unsigned int atomicMax(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicMax(unsigned long long int* address,unsigned long long int val) | ✓ | ✓ |
| unsigned int atomicInc(unsigned int* address)| ✗ | ✓ |
| unsigned int atomicDec(unsigned int* address)| ✗ | ✓ |
| int atomicCAS(int* address, int compare, int val) | ✓ | ✓ |
| unsigned int atomicCAS(unsigned int* address,unsigned int compare,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicCAS(unsigned long long int* address,unsigned long long int compare,unsigned long long int val) | ✓ | ✓ |
| int atomicAnd(int* address, int val) | ✓ | ✓ |
| unsigned int atomicAnd(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicAnd(unsigned long long int* address,unsigned long long int val) | ✓ | ✓ |
| int atomicOr(int* address, int val) | ✓ | ✓ |
| unsigned int atomicOr(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicOr(unsigned long long int* address,unsigned long long int val) | ✓ | ✓ |
| int atomicXor(int* address, int val) | ✓ | ✓ |
| unsigned int atomicXor(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned long long int atomicXor(unsigned long long int* address,unsigned long long int val)) | ✓ | ✓ |
| **Function** | **Supported in HIP** | **Supported in CUDA** |
| -------------------------------------------------------------------------------------------------------------------- | --------------------- | ---------------------- |
| int atomicAdd(int* address, int val) | | ✓ |
| int atomicAdd_system(int* address, int val) | ✓ | ✓ |
| unsigned int atomicAdd(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned int atomicAdd_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicAdd(unsigned long long* address,unsigned long long val) | | ✓ |
| unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val) | ✓ | ✓ |
| float atomicAdd(float* address, float val) | | ✓ |
| float atomicAdd_system(float* address, float val) | | ✓ |
| double atomicAdd(double* address, double val) | ✓ | ✓ |
| double atomicAdd_system(double* address, double val) | ✓ | ✓ |
| int atomicSub(int* address, int val) | | ✓ |
| int atomicSub_system(int* address, int val) | | ✓ |
| unsigned int atomicSub(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned int atomicSub_system(unsigned int* address, unsigned int val) | | ✓ |
| int atomicExch(int* address, int val) | | ✓ |
| int atomicExch_system(int* address, int val) | | ✓ |
| unsigned int atomicExch(unsigned int* address,unsigned int val) | ✓ | |
| unsigned int atomicExch_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicExch(unsigned long long int* address,unsigned long long int val) | | ✓ |
| unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val) | | ✓ |
| unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val) | ✓ | ✓ |
| float atomicExch(float* address, float val) | ✓ | ✓ |
| int atomicMin(int* address, int val) | | ✓ |
| int atomicMin_system(int* address, int val) | | ✓ |
| unsigned int atomicMin(unsigned int* address,unsigned int val) | | ✓ |
| unsigned int atomicMin_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicMin(unsigned long long* address,unsigned long long val) | | ✓ |
| int atomicMax(int* address, int val) | | ✓ |
| int atomicMax_system(int* address, int val) | | ✓ |
| unsigned int atomicMax(unsigned int* address,unsigned int val) | | ✓ |
| unsigned int atomicMax_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicMax(unsigned long long* address,unsigned long long val) | ✓ | ✓ |
| unsigned int atomicInc(unsigned int* address) | ✗ | ✓ |
| unsigned int atomicDec(unsigned int* address) | ✗ | ✓ |
| int atomicCAS(int* address, int compare, int val) | ✓ | ✓ |
| int atomicCAS_system(int* address, int compare, int val) | ✓ | ✓ |
| unsigned int atomicCAS(unsigned int* address,unsigned int compare,unsigned int val) | ✓ | ✓ |
| unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicCAS(unsigned long long* address,unsigned long long compare,unsigned long long val) | ✓ | ✓ |
| unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val) | ✓ | ✓ |
| int atomicAnd(int* address, int val) | ✓ | ✓ |
| int atomicAnd_system(int* address, int val) | ✓ | ✓ |
| unsigned int atomicAnd(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned int atomicAnd_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicAnd(unsigned long long* address,unsigned long long val) | ✓ | ✓ |
| unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val) | ✓ | ✓ |
| int atomicOr(int* address, int val) | ✓ | ✓ |
| int atomicOr_system(int* address, int val) | ✓ | ✓ |
| unsigned int atomicOr(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned int atomicOr_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned int atomicOr_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicOr(unsigned long long int* address,unsigned long long val) | ✓ | ✓ |
| unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val) | ✓ | ✓ |
| int atomicXor(int* address, int val) | ✓ | ✓ |
| int atomicXor_system(int* address, int val) | ✓ | ✓ |
| unsigned int atomicXor(unsigned int* address,unsigned int val) | ✓ | ✓ |
| unsigned int atomicXor_system(unsigned int* address, unsigned int val) | ✓ | ✓ |
| unsigned long long atomicXor(unsigned long long* address,unsigned long long val)) | ✓ | ✓ |
| unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val) | ✓ | ✓ |
### Caveats and Features Under-Development:
+28 -30
Просмотреть файл
@@ -1,40 +1,38 @@
# Table Comparing Syntax for Different Compute APIs
|Term|CUDA|HIP|HC|C++AMP|OpenCL|
|---|---|---|---|---|---|
|Device|`int deviceId`|`int deviceId`|`hc::accelerator`|`concurrency::`<br>`accelerator`|`cl_device`
|Queue|`cudaStream_t`|`hipStream_t`|`hc::`<br>`accelerator_view`|`concurrency::`<br>`accelerator_view`|`cl_command_queue`
|Event|`cudaEvent_t`|`hipEvent_t`|`hc::`<br>`completion_future`|`concurrency::`<br>`completion_future`|`cl_event`
|Memory|`void *`|`void *`|`void *`; `hc::array`; `hc::array_view`|`concurrency::array`;<br>`concurrency::array_view`|`cl_mem`
|Term|CUDA|HIP|OpenCL|
|---|---|---|---|
|Device|`int deviceId`|`int deviceId`|`cl_device`|
|Queue|`cudaStream_t`|`hipStream_t`|`cl_command_queue`|
|Event|`cudaEvent_t`|`hipEvent_t`|`cl_event`|
|Memory|`void *`|`void *`|`cl_mem`|
|||||
| |grid|grid|extent|extent|NDRange
| |block|block|tile|tile|work-group
| |thread|thread|thread|thread|work-item
| |warp|warp|wavefront|N/A|sub-group
| |grid|grid|NDRange|
| |block|block|work-group|
| |thread|thread|work-item|
| |warp|warp|sub-group|
|||||
|Thread-<br>index | threadIdx.x | hipThreadIdx_x | t_idx.local[0] | t_idx.local[0] | get_local_id(0) |
|Block-<br>index | blockIdx.x | hipBlockIdx_x | t_idx.tile[0] | t_idx.tile[0] | get_group_id(0) |
|Block-<br>dim | blockDim.x | hipBlockDim_x | t_ext.tile_dim[0]| t_idx.tile_dim0 | get_local_size(0) |
|Grid-dim | gridDim.x | hipGridDim_x | t_ext[0]| t_ext[0] | get_global_size(0) |
|Thread-<br>index | threadIdx.x | threadIdx.x | get_local_id(0) |
|Block-<br>index | blockIdx.x | blockIdx.x | get_group_id(0) |
|Block-<br>dim | blockDim.x | blockDim.x | get_local_size(0) |
|Grid-dim | gridDim.x | gridDim.x | get_num_groups(0) |
|||||
|Device Kernel|`__global__`|`__global__`|lambda inside `hc::`<br>`parallel_for_each` or [[hc]]|`restrict(amp)`|`__kernel`
|Device Function|`__device__`|`__device__`|`[[hc]]` (detected automatically in many case)|`restrict(amp)`|Implied in device compilation
|Host Function|`__host_` (default)|`__host_` (default)|`[[cpu]]` (default)|`restrict(cpu)` (default)|Implied in host compilation.
|Host + Device Function|`__host__` `__device__`|`__host__` `__device__`| `[[hc]]` `[[cpu]]`|`restrict(amp,cpu)`|No equivalent
|Kernel Launch|`<<< >>>`|`hipLaunchKernel`|`hc::`<br>`parallel_for_each`|`concurrency::`<br>`parallel_for_each`|`clEnqueueNDRangeKernel`
|Device Kernel|`__global__`|`__global__`|`__kernel`|
|Device Function|`__device__`|`__device__`|Implied in device compilation|
|Host Function|`__host_` (default)|`__host_` (default)|Implied in host compilation|
|Host + Device Function|`__host__` `__device__`|`__host__` `__device__`| No equivalent|
|Kernel Launch|`<<< >>>`|`hipLaunchKernel`/`hipLaunchKernelGGL`/`<<< >>>`|`clEnqueueNDRangeKernel`|
||||||
|Global Memory|`__global__`|`__global__`|Unnecessary / Implied|Unnecessary / Implied|`__global`
|Group Memory|`__shared__`|`__shared__`|`tile_static`|`tile_static`|`__local`
|Constant|`__constant__`|`__constant__`|Unnecessary / Implied|Unnecessary / Implied|`__constant`
|Global Memory|`__global__`|`__global__`|`__global`|
|Group Memory|`__shared__`|`__shared__`|`__local`|
|Constant|`__constant__`|`__constant__`|`__constant`|
||||||
||`__syncthreads`|`__syncthreads`|`tile_static.barrier()`|`t_idx.barrier()`|`barrier(CLK_LOCAL_MEMFENCE)`
|Atomic Builtins|`atomicAdd`|`atomicAdd`|`hc::atomic_fetch_add`|`concurrency::`<br>`atomic_fetch_add`|`atomic_add`
|Precise Math|`cos(f)`|`cos(f)`|`hc::`<br>`precise_math::cos(f)`|`concurrency::`<br>`precise_math::cos(f)`|`cos(f)`
|Fast Math|`__cos(f)`|`__cos(f)`|`hc::`<br>`fast_math::cos(f)`|`concurrency::`<br>`fast_math::cos(f)`|`native_cos(f)`
|Vector|`float4`|`float4`|`hc::`<br>`short_vector::float4`|`concurrency::`<br>`graphics::float_4`|`float4`
||`__syncthreads`|`__syncthreads`|`barrier(CLK_LOCAL_MEMFENCE)`|
|Atomic Builtins|`atomicAdd`|`atomicAdd`|`atomic_add`|
|Precise Math|`cos(f)`|`cos(f)`|`cos(f)`|
|Fast Math|`__cos(f)`|`__cos(f)`|`native_cos(f)`|
|Vector|`float4`|`float4`|`float4`|
### Notes
1. For HC and C++AMP, assume a captured _tiled_ext_ named "t_ext" and captured _extent_ named "ext". These languages use captured variables to pass information to the kernel rather than using special built-in functions so the exact variable name may vary.
2. The indexing functions (starting with `thread-index`) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
3. HC allows tile dimensions to be specified at runtime while C++AMP requires that tile dimensions be specified at compile-time. Thus hc syntax for tile dims is `t_ext.tile_dim[0]` while C++AMP is t_ext.tile_dim0.
The indexing functions (starting with `thread-index`) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.