docs/markdown/hip_porting_driver_api.md

# Porting CUDA Driver API

## Introduction to the CUDA Driver and Runtime APIs
CUDA provides a separate CUDA Driver and Runtime APIs.  The two APis have significant overlap in functionality:  
- Both APIs support events, streams, memory management, memory copy, and error handling.
- Both APIs deliver similar performance.  
- Driver APIs calls begin with the prefix `cu` while Runtime APIs begin with the prefix `cuda`.  For example, the Driver API API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
- The Driver API defines a different but largely overlapping error code code space than the Runtime API, and uses a different coding convention.  For example, Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`


The Driver API offers two additional pieces of functionality not provided by the Runtime API: cuModule and cuCtx APIs.

### cuModule API
The Module section of the Driver API provides additional control over how and when accelerator
 code objects are loaded.
For example, the driver API allows code objects to be loaded from files or memory pointers.
Symbols for kernels or global data can be extracted from the loaded code objects.
In contrast, the Runtime API automatically loads and (if necessary) compiles all of the 
kernels from a executable binary when run.
In this mode, NVCC must be used to compile kernel code so the automatic loading can function correctly.

Both Driver and Runtime APIs define a function for launching kernels (called `cuLaunchKernel` or `cudaLaunchKernel`.
The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function.
The Runtime additionally provides the `<<< >>>` syntax for launching kernels, which resembles
a special function call and is easier to use than explicit launch API (in particular with respect to
handling of kernel arguments).
However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code.  

The Module features are useful in an environment which generate the code objects directly, such as a new
accelerator language front-end.  Here, NVCC is not used.   Instead, the environment may have a 
different kernel language or different compilation flow.
Other environments have many kernels and do not want them to be all loaded automatically.
The Module functions can be used to load the generated code objects and launch kernels.
As we will see below, HIP defines a Module API which provides similar explict control over code
object managemenet.

### cuCtx API
The Driver API defines "Context" and "Devices" as separate entities.  
Contexts contain a single device, and a device can theoretically have multiple contexts.  
Each context contains a set of streams and events specific to the context.  
Historically contexts also defined a unique address space for the GPU, though this may not longer be the case in Unified Memory platforms (since the CPU and all the devices in the same process share a single unified address space).  
The Context APIs also provide a mechanism to switch between devices, which allowed
a single CPU thread to send commands to different GPUs.  HIP as well as a recent versions
of CUDA Runtime provide other mechanisms to accomplish this feat - for example using streams or
`cudaSetDevice`.

The CUDA Runtime API unifies the Context API with the Device API.  This simplifies the APIs
and has little loss of functionality since each Context can contain a single device, 
and the benefits of multiple contexts has been replaced with other interfaces.   HIP provides
a context API to facilitate easy porting from existing Driver codes.  
In HIP, the Ctx functions largely provide an alternate syntax for changing the active device. 
Most new applications will prefer to use `hipSetDevice` or the stream APIs.

## HIP Module and Ctx APIs

Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules 
and Ctx control.   

### hipModule API

Like the CUDA Driver API, the Module API provides additional control 
over how code is loaded, including options to load code from files or from in-memory pointers.  
NVCC and HCC target different architectures and use different code object formats : NVCC
is `cubin` or `ptx` files, while the HCC path is the `hsaco` format.   The external
compilers which generate these code objects are responsible for generating and loading
the correct code object for each platform.  Notably, there is not a fat binary format that
can contain code for both NVCC and HCC platforms.     The following table summarizes the
formats used on each platform:

| Format        | APIs                         | NVCC               | HCC               |
| ---           | ---                          | ---                | ---               |
| Code Object   | hipModuleLoad, hipModuleLoadData | .cubin or PTX text | .hsaco            |
| Fat Binary    | hipModuleLoadFatBin          | .fatbin            | Under Development | 

hipcc uses NVCC and HCC to compile host codes.  Both of these may embed code objects
into the final executable, and these code objects will be automatically loaded when
the application starts.
The hipModule API can be used to load additional code objects, and in this way
provides an extended capability to the automatically loaded code objects.  HCC allows
both of these capabilities to be used together, if desired.  Of course it is possible
to create a program with no kernels and thus no automatic loading.


### hipCtx API
HIP provides a `Ctx` API as a thin layer over the existing Device functions.  This Ctx API
can be used to set the current context, or to query properties of the device associated with
the context.  The current context is implicitly used by other APIs such as `hipStreamCreate`.

### hipify translation of CUDA Driver API
The hipify tool will convert CUDA Driver APIs for streams, events, memory management to 
the equivalent HIP driver calls.   For example, `cuEventCreate` will be translated to 
`hipEventCreate`.  Hipify also converts error code from the Driver namespace and coding
convention to the equivalent HIP error code.  Thus, HIP unifies the APis for these common functions.
[hipify support for translating driver API is Under Development]

The memory copy APIs require additional explanation.  The CUDA driver includes the memory
direction in the name of the API (ie `cuMemcpyH2D`) while the CUDA driver API provides
a single memory copy API with a parameter that specifies the direction and additionally
supports a "default" direction where the runtime determines the direction automatically.
HIP provides APis with both styles: for example, `hipMemcpyH2D` as well as `hipMemcpy`.
The first flavor may be faster in some cases since they avoid host overhead to detect the 
different memory directions.

HIP defines a single error space, and uses camel-case for all errors (i.e. `hipErrorInvalidValue`).  

### HCC Implementation Notes
#### .hsaco
The .hsaco format used by HCC is described in more detail [here](https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc).
An example and blog that show how to use the format is [here](http://gpuopen.com/rocm-with-harmony-combining-opencl-hcc-hsa-in-a-single-program).  hsaco can be generated by hcc + extractkernel tool,
cloc, the GCN assembler, or other tools.

#### Address Spaces
HCC defines a process-wide address space where the CPU and all devices allocate
addresses from a single unified pool.  Thus addresses may be shared between contexts, and 
unlike the original CUDA definition a new context does not create a new address space for
the device.   

#### Using hipModuleLaunchKernel
`hipModuleLaunchKernel` is `cuLaunchKernel` in HIP world. It takes the same arguments as `cuLaunchKernel`. The argument `kernelParams` is not fully implemented for HCC. The workaround for it is, to use platform specific macros for each target. Or, `extra` argument can be used which works on both the platforms.

#### Additional Information
- HCC allocates staging buffers (used for unpinned copies) on a per-device basis.
- HCC creates a primary context when the HIP API is called.  So in a pure driver API code, HIP/HCC will create a primary context while HIP/NVCC will have empty context stack.  HIP/HCC will push primary context to context stack when it is empty. This can have subtle differences on applications which mix the runtime and driver APIs.

### NVCC Implementation Notes

#### Interoperation between HIP and CUDA Driver
CUDA applications may want to mix CUDA driver code with HIP code (see example below).  This
table shows the type equivalence to enable this interaction.

|**HIP Type** |**CU Driver Type**|  **CUDA Runtime Type** |
| ----        | ----             | ---- | 
| hipModule   |  CUmodule        |   			|
| hipFunction |  CUfunction      | 			|
| hipCtx_t    |  CUcontext       | 			|
| hipDevice_t |  CUdevice        |  			|
| hipStream_t |  CUstream        | cudaStream_t         |
| hipEvent_t  |  CUevent         | cudaEvent_t          |
| hipArray    |  CUarray         | cudaArray            |


#### Compilation Flags
The hipModule interface does not support the hipModuleLoadEx function, which is used to control PTX compilaton options.
HCC does not use PTX and does not support the same compilation options.
In fact, HCC code objects always contain fully compiled ISA and do not require additional compilation as part of the load step.
Code which requires this functionaly should use platform-specific coding, calling `cuModuleLoadEx` 
on the NVCC path and hipModuleLoad on the hcc path.  For example:

```
hipModule module;
void *imagePtr = ... ;  // Somehow populate data pointer with code object

#ifdef __HIP_PLATFORM_NVCC__
// Use CUDA driver API but write to hipModule since they are same type:
const int numOptions = 1;
CUJit_option options[numOptions];
void * optionValues[numOptions];

options[0] = CU_JIT_MAX_REGISTERS;
unsigned maxRegs=15;
optionValues[0] = (void*) (&maxRegs);

cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);

#else // __HIP_PLATFORM_HCC__

// HCC path does not support or require JIT options, so just load the module.
hipModuleLoadData(&module, imagePtr);

#endif

// Back to unified code - both paths above loaded the "module" variable.
hipFunction k;
hipModuleGetFunction(&k, module, "myKernel");
```

The below sample shows how to use `hipModuleGetFunction`.

```
#include<hip_runtime.h>
#include<hip_runtime_api.h>
#include<iostream>
#include<fstream>
#include<vector>

#define LEN 64
#define SIZE LEN<<2

#ifdef __HIP_PLATFORM_HCC__
#define fileName "vcpy_isa.co"
#endif

#ifdef __HIP_PLATFORM_NVCC__
#define fileName "vcpy_isa.ptx"
#endif

#define kernel_name "hello_world"

int main(){
    float *A, *B;
    hipDeviceptr_t Ad, Bd;
    A = new float[LEN];
    B = new float[LEN];

    for(uint32_t i=0;i<LEN;i++){
        A[i] = i*1.0f;
        B[i] = 0.0f;
        std::cout<<A[i] << " "<<B[i]<<std::endl;
    }


#ifdef __HIP_PLATFORM_NVCC__
          hipInit(0);
          hipDevice_t device;
          hipCtx_t context;
          hipDeviceGet(&device, 0);
          hipCtxCreate(&context, 0, device);
#endif

    hipMalloc((void**)&Ad, SIZE);
    hipMalloc((void**)&Bd, SIZE);

    hipMemcpyHtoD(Ad, A, SIZE);
    hipMemcpyHtoD(Bd, B, SIZE);
    hipModule_t Module;
    hipFunction_t Function;
    hipModuleLoad(&Module, fileName);
    hipModuleGetFunction(&Function, Module, kernel_name);

    std::vector<void*>argBuffer(2);
    memcpy(&argBuffer[0], &Ad, sizeof(void*));
    memcpy(&argBuffer[1], &Bd, sizeof(void*));

    size_t size = argBuffer.size()*sizeof(void*);

    void *config[] = {
      HIP_LAUNCH_PARAM_BUFFER_POINTER, &argBuffer[0],
      HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
      HIP_LAUNCH_PARAM_END
    };

    hipModuleLaunchKernel(Function, 1, 1, 1, LEN, 1, 1, 0, 0, NULL, (void**)&con                                     fig);

    hipMemcpyDtoH(B, Bd, SIZE);
    for(uint32_t i=0;i<LEN;i++){
        std::cout<<A[i]<<" - "<<B[i]<<std::endl;
    }

#ifdef __HIP_PLATFORM_NVCC__
          hipCtxDetach(context);
#endif

    return 0;
}

```
Add docs on porting driver API 2016-08-26 10:46:09 -05:00			`# Porting CUDA Driver API`

			`## Introduction to the CUDA Driver and Runtime APIs`
			`CUDA provides a separate CUDA Driver and Runtime APIs. The two APis have significant overlap in functionality:`
			`- Both APIs support events, streams, memory management, memory copy, and error handling.`
			`- Both APIs deliver similar performance.`
			- Driver APIs calls begin with the prefix `cu` while Runtime APIs begin with the prefix `cuda`. For example, the Driver API API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
			- The Driver API defines a different but largely overlapping error code code space than the Runtime API, and uses a different coding convention. For example, Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`


			`The Driver API offers two additional pieces of functionality not provided by the Runtime API: cuModule and cuCtx APIs.`

			`### cuModule API`
			`The Module section of the Driver API provides additional control over how and when accelerator`
			`code objects are loaded.`
			`For example, the driver API allows code objects to be loaded from files or memory pointers.`
			`Symbols for kernels or global data can be extracted from the loaded code objects.`
			`In contrast, the Runtime API automatically loads and (if necessary) compiles all of the`
			`kernels from a executable binary when run.`
			`In this mode, NVCC must be used to compile kernel code so the automatic loading can function correctly.`

			Both Driver and Runtime APIs define a function for launching kernels (called `cuLaunchKernel` or `cudaLaunchKernel`.
			`The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function.`
			The Runtime additionally provides the `<<< >>>` syntax for launching kernels, which resembles
			`a special function call and is easier to use than explicit launch API (in particular with respect to`
			`handling of kernel arguments).`
			`However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code.`

			`The Module features are useful in an environment which generate the code objects directly, such as a new`
			`accelerator language front-end. Here, NVCC is not used. Instead, the environment may have a`
			`different kernel language or different compilation flow.`
			`Other environments have many kernels and do not want them to be all loaded automatically.`
			`The Module functions can be used to load the generated code objects and launch kernels.`
			`As we will see below, HIP defines a Module API which provides similar explict control over code`
			`object managemenet.`

			`### cuCtx API`
			`The Driver API defines "Context" and "Devices" as separate entities.`
			`Contexts contain a single device, and a device can theoretically have multiple contexts.`
			`Each context contains a set of streams and events specific to the context.`
			`Historically contexts also defined a unique address space for the GPU, though this may not longer be the case in Unified Memory platforms (since the CPU and all the devices in the same process share a single unified address space).`
			`The Context APIs also provide a mechanism to switch between devices, which allowed`
			`a single CPU thread to send commands to different GPUs. HIP as well as a recent versions`
			`of CUDA Runtime provide other mechanisms to accomplish this feat - for example using streams or`
			`cudaSetDevice`.

			`The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs`
			`and has little loss of functionality since each Context can contain a single device,`
			`and the benefits of multiple contexts has been replaced with other interfaces. HIP provides`
			`a context API to facilitate easy porting from existing Driver codes.`
			`In HIP, the Ctx functions largely provide an alternate syntax for changing the active device.`
			Most new applications will prefer to use `hipSetDevice` or the stream APIs.

			`## HIP Module and Ctx APIs`

			`Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules`
			`and Ctx control.`

			`### hipModule API`

			`Like the CUDA Driver API, the Module API provides additional control`
			`over how code is loaded, including options to load code from files or from in-memory pointers.`
			`NVCC and HCC target different architectures and use different code object formats : NVCC`
			is `cubin` or `ptx` files, while the HCC path is the `hsaco` format. The external
			`compilers which generate these code objects are responsible for generating and loading`
			`the correct code object for each platform. Notably, there is not a fat binary format that`
			`can contain code for both NVCC and HCC platforms. The following table summarizes the`
			`formats used on each platform:`

			`\| Format \| APIs \| NVCC \| HCC \|`
			`\| --- \| --- \| --- \| --- \|`
			`\| Code Object \| hipModuleLoad, hipModuleLoadData \| .cubin or PTX text \| .hsaco \|`
			`\| Fat Binary \| hipModuleLoadFatBin \| .fatbin \| Under Development \|`

			`hipcc uses NVCC and HCC to compile host codes. Both of these may embed code objects`
			`into the final executable, and these code objects will be automatically loaded when`
			`the application starts.`
			`The hipModule API can be used to load additional code objects, and in this way`
			`provides an extended capability to the automatically loaded code objects. HCC allows`
			`both of these capabilities to be used together, if desired. Of course it is possible`
			`to create a program with no kernels and thus no automatic loading.`


			`### hipCtx API`
			HIP provides a `Ctx` API as a thin layer over the existing Device functions. This Ctx API
			`can be used to set the current context, or to query properties of the device associated with`
			the context. The current context is implicitly used by other APIs such as `hipStreamCreate`.

			`### hipify translation of CUDA Driver API`
			`The hipify tool will convert CUDA Driver APIs for streams, events, memory management to`
			the equivalent HIP driver calls. For example, `cuEventCreate` will be translated to
			`hipEventCreate`. Hipify also converts error code from the Driver namespace and coding
			`convention to the equivalent HIP error code. Thus, HIP unifies the APis for these common functions.`
			`[hipify support for translating driver API is Under Development]`

			`The memory copy APIs require additional explanation. The CUDA driver includes the memory`
			direction in the name of the API (ie `cuMemcpyH2D`) while the CUDA driver API provides
			`a single memory copy API with a parameter that specifies the direction and additionally`
			`supports a "default" direction where the runtime determines the direction automatically.`
			HIP provides APis with both styles: for example, `hipMemcpyH2D` as well as `hipMemcpy`.
			`The first flavor may be faster in some cases since they avoid host overhead to detect the`
			`different memory directions.`

			HIP defines a single error space, and uses camel-case for all errors (i.e. `hipErrorInvalidValue`).

			`### HCC Implementation Notes`
			`#### .hsaco`
			`The .hsaco format used by HCC is described in more detail [here](https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc).`
			`An example and blog that show how to use the format is [here](http://gpuopen.com/rocm-with-harmony-combining-opencl-hcc-hsa-in-a-single-program). hsaco can be generated by hcc + extractkernel tool,`
			`cloc, the GCN assembler, or other tools.`

			`#### Address Spaces`
			`HCC defines a process-wide address space where the CPU and all devices allocate`
			`addresses from a single unified pool. Thus addresses may be shared between contexts, and`
			`unlike the original CUDA definition a new context does not create a new address space for`
			`the device.`

added docs to how to use hipModuleLaunchKernel 2016-08-29 15:38:04 -05:00			`#### Using hipModuleLaunchKernel`
			`hipModuleLaunchKernel` is `cuLaunchKernel` in HIP world. It takes the same arguments as `cuLaunchKernel`. The argument `kernelParams` is not fully implemented for HCC. The workaround for it is, to use platform specific macros for each target. Or, `extra` argument can be used which works on both the platforms.

Add docs on porting driver API 2016-08-26 10:46:09 -05:00			`#### Additional Information`
Added context related difference note in hip_porting_driver_api.md 2016-09-01 11:35:42 +05:30			`- HCC allocates staging buffers (used for unpinned copies) on a per-device basis.`
			`- HCC creates a primary context when the HIP API is called. So in a pure driver API code, HIP/HCC will create a primary context while HIP/NVCC will have empty context stack. HIP/HCC will push primary context to context stack when it is empty. This can have subtle differences on applications which mix the runtime and driver APIs.`
Add docs on porting driver API 2016-08-26 10:46:09 -05:00
			`### NVCC Implementation Notes`

			`#### Interoperation between HIP and CUDA Driver`
			`CUDA applications may want to mix CUDA driver code with HIP code (see example below). This`
			`table shows the type equivalence to enable this interaction.`

			`\|HIP Type \|CU Driver Type\| CUDA Runtime Type \|`
			`\| ---- \| ---- \| ---- \|`
			`\| hipModule \| CUmodule \| \|`
			`\| hipFunction \| CUfunction \| \|`
			`\| hipCtx_t \| CUcontext \| \|`
			`\| hipDevice_t \| CUdevice \| \|`
			`\| hipStream_t \| CUstream \| cudaStream_t \|`
			`\| hipEvent_t \| CUevent \| cudaEvent_t \|`
			`\| hipArray \| CUarray \| cudaArray \|`



			`#### Compilation Flags`
			`The hipModule interface does not support the hipModuleLoadEx function, which is used to control PTX compilaton options.`
			`HCC does not use PTX and does not support the same compilation options.`
			`In fact, HCC code objects always contain fully compiled ISA and do not require additional compilation as part of the load step.`
			Code which requires this functionaly should use platform-specific coding, calling `cuModuleLoadEx`
			`on the NVCC path and hipModuleLoad on the hcc path. For example:`

			```
			`hipModule module;`
			`void *imagePtr = ... ; // Somehow populate data pointer with code object`

			`#ifdef __HIP_PLATFORM_NVCC__`
			`// Use CUDA driver API but write to hipModule since they are same type:`
			`const int numOptions = 1;`
			`CUJit_option options[numOptions];`
			`void * optionValues[numOptions];`

			`options[0] = CU_JIT_MAX_REGISTERS;`
			`unsigned maxRegs=15;`
			`optionValues[0] = (void*) (&maxRegs);`

			`cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);`

			`#else // __HIP_PLATFORM_HCC__`

			`// HCC path does not support or require JIT options, so just load the module.`
			`hipModuleLoadData(&module, imagePtr);`

			`#endif`

			`// Back to unified code - both paths above loaded the "module" variable.`
			`hipFunction k;`
			`hipModuleGetFunction(&k, module, "myKernel");`
			```

added sample to porting driver guide 2016-08-29 15:25:35 -05:00			The below sample shows how to use `hipModuleGetFunction`.

			```
			`#include<hip_runtime.h>`
			`#include<hip_runtime_api.h>`
			`#include<iostream>`
			`#include<fstream>`
			`#include<vector>`

			`#define LEN 64`
			`#define SIZE LEN<<2`

			`#ifdef __HIP_PLATFORM_HCC__`
			`#define fileName "vcpy_isa.co"`
			`#endif`

			`#ifdef __HIP_PLATFORM_NVCC__`
			`#define fileName "vcpy_isa.ptx"`
			`#endif`

			`#define kernel_name "hello_world"`

			`int main(){`
			`float A, B;`
			`hipDeviceptr_t Ad, Bd;`
			`A = new float[LEN];`
			`B = new float[LEN];`

			`for(uint32_t i=0;i<LEN;i++){`
			`A[i] = i*1.0f;`
			`B[i] = 0.0f;`
			`std::cout<<A[i] << " "<<B[i]<<std::endl;`
			`}`


			`#ifdef __HIP_PLATFORM_NVCC__`
			`hipInit(0);`
			`hipDevice_t device;`
			`hipCtx_t context;`
			`hipDeviceGet(&device, 0);`
			`hipCtxCreate(&context, 0, device);`
			`#endif`

			`hipMalloc((void**)&Ad, SIZE);`
			`hipMalloc((void**)&Bd, SIZE);`

			`hipMemcpyHtoD(Ad, A, SIZE);`
			`hipMemcpyHtoD(Bd, B, SIZE);`
			`hipModule_t Module;`
			`hipFunction_t Function;`
			`hipModuleLoad(&Module, fileName);`
			`hipModuleGetFunction(&Function, Module, kernel_name);`

			`std::vector<void*>argBuffer(2);`
			`memcpy(&argBuffer[0], &Ad, sizeof(void*));`
			`memcpy(&argBuffer[1], &Bd, sizeof(void*));`

			`size_t size = argBuffer.size()sizeof(void);`

			`void *config[] = {`
			`HIP_LAUNCH_PARAM_BUFFER_POINTER, &argBuffer[0],`
			`HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,`
			`HIP_LAUNCH_PARAM_END`
			`};`

			`hipModuleLaunchKernel(Function, 1, 1, 1, LEN, 1, 1, 0, 0, NULL, (void**)&con fig);`

			`hipMemcpyDtoH(B, Bd, SIZE);`
			`for(uint32_t i=0;i<LEN;i++){`
			`std::cout<<A[i]<<" - "<<B[i]<<std::endl;`
			`}`

			`#ifdef __HIP_PLATFORM_NVCC__`
			`hipCtxDetach(context);`
			`#endif`

			`return 0;`
			`}`

			```