dd84bb0d84
Change-Id: I5bd2884a16db51871baa7c19fa2bd63a0bd3adad
258 строки
12 KiB
Markdown
258 строки
12 KiB
Markdown
# Porting CUDA Driver API
|
|
|
|
## Introduction to the CUDA Driver and Runtime APIs
|
|
CUDA provides a separate CUDA Driver and Runtime APIs. The two APis have significant overlap in functionality:
|
|
- Both APIs support events, streams, memory management, memory copy, and error handling.
|
|
- Both APIs deliver similar performance.
|
|
- Driver APIs calls begin with the prefix `cu` while Runtime APIs begin with the prefix `cuda`. For example, the Driver API API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
|
|
- The Driver API defines a different but largely overlapping error code code space than the Runtime API, and uses a different coding convention. For example, Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`
|
|
|
|
|
|
The Driver API offers two additional pieces of functionality not provided by the Runtime API: cuModule and cuCtx APIs.
|
|
|
|
### cuModule API
|
|
The Module section of the Driver API provides additional control over how and when accelerator
|
|
code objects are loaded.
|
|
For example, the driver API allows code objects to be loaded from files or memory pointers.
|
|
Symbols for kernels or global data can be extracted from the loaded code objects.
|
|
In contrast, the Runtime API automatically loads and (if necessary) compiles all of the
|
|
kernels from a executable binary when run.
|
|
In this mode, NVCC must be used to compile kernel code so the automatic loading can function correctly.
|
|
|
|
Both Driver and Runtime APIs define a function for launching kernels (called `cuLaunchKernel` or `cudaLaunchKernel`.
|
|
The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function.
|
|
The Runtime additionally provides the `<<< >>>` syntax for launching kernels, which resembles
|
|
a special function call and is easier to use than explicit launch API (in particular with respect to
|
|
handling of kernel arguments).
|
|
However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code.
|
|
|
|
The Module features are useful in an environment which generate the code objects directly, such as a new
|
|
accelerator language front-end. Here, NVCC is not used. Instead, the environment may have a
|
|
different kernel language or different compilation flow.
|
|
Other environments have many kernels and do not want them to be all loaded automatically.
|
|
The Module functions can be used to load the generated code objects and launch kernels.
|
|
As we will see below, HIP defines a Module API which provides similar explict control over code
|
|
object managemenet.
|
|
|
|
### cuCtx API
|
|
The Driver API defines "Context" and "Devices" as separate entities.
|
|
Contexts contain a single device, and a device can theoretically have multiple contexts.
|
|
Each context contains a set of streams and events specific to the context.
|
|
Historically contexts also defined a unique address space for the GPU, though this may not longer be the case in Unified Memory platforms (since the CPU and all the devices in the same process share a single unified address space).
|
|
The Context APIs also provide a mechanism to switch between devices, which allowed
|
|
a single CPU thread to send commands to different GPUs. HIP as well as a recent versions
|
|
of CUDA Runtime provide other mechanisms to accomplish this feat - for example using streams or
|
|
`cudaSetDevice`.
|
|
|
|
The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs
|
|
and has little loss of functionality since each Context can contain a single device,
|
|
and the benefits of multiple contexts has been replaced with other interfaces. HIP provides
|
|
a context API to facilitate easy porting from existing Driver codes.
|
|
In HIP, the Ctx functions largely provide an alternate syntax for changing the active device.
|
|
Most new applications will prefer to use `hipSetDevice` or the stream APIs.
|
|
|
|
## HIP Module and Ctx APIs
|
|
|
|
Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules
|
|
and Ctx control.
|
|
|
|
### hipModule API
|
|
|
|
Like the CUDA Driver API, the Module API provides additional control
|
|
over how code is loaded, including options to load code from files or from in-memory pointers.
|
|
NVCC and HCC target different architectures and use different code object formats : NVCC
|
|
is `cubin` or `ptx` files, while the HCC path is the `hsaco` format. The external
|
|
compilers which generate these code objects are responsible for generating and loading
|
|
the correct code object for each platform. Notably, there is not a fat binary format that
|
|
can contain code for both NVCC and HCC platforms. The following table summarizes the
|
|
formats used on each platform:
|
|
|
|
| Format | APIs | NVCC | HCC |
|
|
| --- | --- | --- | --- |
|
|
| Code Object | hipModuleLoad, hipModuleLoadData | .cubin or PTX text | .hsaco |
|
|
| Fat Binary | hipModuleLoadFatBin | .fatbin | Under Development |
|
|
|
|
hipcc uses NVCC and HCC to compile host codes. Both of these may embed code objects
|
|
into the final executable, and these code objects will be automatically loaded when
|
|
the application starts.
|
|
The hipModule API can be used to load additional code objects, and in this way
|
|
provides an extended capability to the automatically loaded code objects. HCC allows
|
|
both of these capabilities to be used together, if desired. Of course it is possible
|
|
to create a program with no kernels and thus no automatic loading.
|
|
|
|
|
|
### hipCtx API
|
|
HIP provides a `Ctx` API as a thin layer over the existing Device functions. This Ctx API
|
|
can be used to set the current context, or to query properties of the device associated with
|
|
the context. The current context is implicitly used by other APIs such as `hipStreamCreate`.
|
|
|
|
### hipify translation of CUDA Driver API
|
|
The hipify tool will convert CUDA Driver APIs for streams, events, memory management to
|
|
the equivalent HIP driver calls. For example, `cuEventCreate` will be translated to
|
|
`hipEventCreate`. Hipify also converts error code from the Driver namespace and coding
|
|
convention to the equivalent HIP error code. Thus, HIP unifies the APis for these common functions.
|
|
[hipify support for translating driver API is Under Development]
|
|
|
|
The memory copy APIs require additional explanation. The CUDA driver includes the memory
|
|
direction in the name of the API (ie `cuMemcpyH2D`) while the CUDA driver API provides
|
|
a single memory copy API with a parameter that specifies the direction and additionally
|
|
supports a "default" direction where the runtime determines the direction automatically.
|
|
HIP provides APis with both styles: for example, `hipMemcpyH2D` as well as `hipMemcpy`.
|
|
The first flavor may be faster in some cases since they avoid host overhead to detect the
|
|
different memory directions.
|
|
|
|
HIP defines a single error space, and uses camel-case for all errors (i.e. `hipErrorInvalidValue`).
|
|
|
|
### HCC Implementation Notes
|
|
#### .hsaco
|
|
The .hsaco format used by HCC is described in more detail [here](https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc).
|
|
An example and blog that show how to use the format is [here](http://gpuopen.com/rocm-with-harmony-combining-opencl-hcc-hsa-in-a-single-program). hsaco can be generated by hcc + extractkernel tool,
|
|
cloc, the GCN assembler, or other tools.
|
|
|
|
#### Address Spaces
|
|
HCC defines a process-wide address space where the CPU and all devices allocate
|
|
addresses from a single unified pool. Thus addresses may be shared between contexts, and
|
|
unlike the original CUDA definition a new context does not create a new address space for
|
|
the device.
|
|
|
|
#### Using hipModuleLaunchKernel
|
|
`hipModuleLaunchKernel` is `cuLaunchKernel` in HIP world. It takes the same arguments as `cuLaunchKernel`. The argument `kernelParams` is not fully implemented for HCC. The workaround for it is, to use platform specific macros for each target. Or, `extra` argument can be used which works on both the platforms.
|
|
|
|
#### Additional Information
|
|
- HCC allocates staging buffers (used for unpinned copies) on a per-device basis.
|
|
- HCC creates a primary context when the HIP API is called. So in a pure driver API code, HIP/HCC will create a primary context while HIP/NVCC will have empty context stack. HIP/HCC will push primary context to context stack when it is empty. This can have subtle differences on applications which mix the runtime and driver APIs.
|
|
|
|
### NVCC Implementation Notes
|
|
|
|
#### Interoperation between HIP and CUDA Driver
|
|
CUDA applications may want to mix CUDA driver code with HIP code (see example below). This
|
|
table shows the type equivalence to enable this interaction.
|
|
|
|
|**HIP Type** |**CU Driver Type**| **CUDA Runtime Type** |
|
|
| ---- | ---- | ---- |
|
|
| hipModule | CUmodule | |
|
|
| hipFunction | CUfunction | |
|
|
| hipCtx_t | CUcontext | |
|
|
| hipDevice_t | CUdevice | |
|
|
| hipStream_t | CUstream | cudaStream_t |
|
|
| hipEvent_t | CUevent | cudaEvent_t |
|
|
| hipArray | CUarray | cudaArray |
|
|
|
|
|
|
|
|
#### Compilation Flags
|
|
The hipModule interface does not support the hipModuleLoadEx function, which is used to control PTX compilaton options.
|
|
HCC does not use PTX and does not support the same compilation options.
|
|
In fact, HCC code objects always contain fully compiled ISA and do not require additional compilation as part of the load step.
|
|
Code which requires this functionaly should use platform-specific coding, calling `cuModuleLoadEx`
|
|
on the NVCC path and hipModuleLoad on the hcc path. For example:
|
|
|
|
```
|
|
hipModule module;
|
|
void *imagePtr = ... ; // Somehow populate data pointer with code object
|
|
|
|
#ifdef __HIP_PLATFORM_NVCC__
|
|
// Use CUDA driver API but write to hipModule since they are same type:
|
|
const int numOptions = 1;
|
|
CUJit_option options[numOptions];
|
|
void * optionValues[numOptions];
|
|
|
|
options[0] = CU_JIT_MAX_REGISTERS;
|
|
unsigned maxRegs=15;
|
|
optionValues[0] = (void*) (&maxRegs);
|
|
|
|
cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues);
|
|
|
|
#else // __HIP_PLATFORM_HCC__
|
|
|
|
// HCC path does not support or require JIT options, so just load the module.
|
|
hipModuleLoadData(&module, imagePtr);
|
|
|
|
#endif
|
|
|
|
// Back to unified code - both paths above loaded the "module" variable.
|
|
hipFunction k;
|
|
hipModuleGetFunction(&k, module, "myKernel");
|
|
```
|
|
|
|
The below sample shows how to use `hipModuleGetFunction`.
|
|
|
|
```
|
|
#include<hip_runtime.h>
|
|
#include<hip_runtime_api.h>
|
|
#include<iostream>
|
|
#include<fstream>
|
|
#include<vector>
|
|
|
|
#define LEN 64
|
|
#define SIZE LEN<<2
|
|
|
|
#ifdef __HIP_PLATFORM_HCC__
|
|
#define fileName "vcpy_isa.co"
|
|
#endif
|
|
|
|
#ifdef __HIP_PLATFORM_NVCC__
|
|
#define fileName "vcpy_isa.ptx"
|
|
#endif
|
|
|
|
#define kernel_name "hello_world"
|
|
|
|
int main(){
|
|
float *A, *B;
|
|
hipDeviceptr_t Ad, Bd;
|
|
A = new float[LEN];
|
|
B = new float[LEN];
|
|
|
|
for(uint32_t i=0;i<LEN;i++){
|
|
A[i] = i*1.0f;
|
|
B[i] = 0.0f;
|
|
std::cout<<A[i] << " "<<B[i]<<std::endl;
|
|
}
|
|
|
|
|
|
#ifdef __HIP_PLATFORM_NVCC__
|
|
hipInit(0);
|
|
hipDevice_t device;
|
|
hipCtx_t context;
|
|
hipDeviceGet(&device, 0);
|
|
hipCtxCreate(&context, 0, device);
|
|
#endif
|
|
|
|
hipMalloc((void**)&Ad, SIZE);
|
|
hipMalloc((void**)&Bd, SIZE);
|
|
|
|
hipMemcpyHtoD(Ad, A, SIZE);
|
|
hipMemcpyHtoD(Bd, B, SIZE);
|
|
hipModule_t Module;
|
|
hipFunction_t Function;
|
|
hipModuleLoad(&Module, fileName);
|
|
hipModuleGetFunction(&Function, Module, kernel_name);
|
|
|
|
std::vector<void*>argBuffer(2);
|
|
memcpy(&argBuffer[0], &Ad, sizeof(void*));
|
|
memcpy(&argBuffer[1], &Bd, sizeof(void*));
|
|
|
|
size_t size = argBuffer.size()*sizeof(void*);
|
|
|
|
void *config[] = {
|
|
HIP_LAUNCH_PARAM_BUFFER_POINTER, &argBuffer[0],
|
|
HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
|
|
HIP_LAUNCH_PARAM_END
|
|
};
|
|
|
|
hipModuleLaunchKernel(Function, 1, 1, 1, LEN, 1, 1, 0, 0, NULL, (void**)&con fig);
|
|
|
|
hipMemcpyDtoH(B, Bd, SIZE);
|
|
for(uint32_t i=0;i<LEN;i++){
|
|
std::cout<<A[i]<<" - "<<B[i]<<std::endl;
|
|
}
|
|
|
|
#ifdef __HIP_PLATFORM_NVCC__
|
|
hipCtxDetach(context);
|
|
#endif
|
|
|
|
return 0;
|
|
}
|
|
|
|
```
|