SWDEV-436821 Update hip samples Readme files

Change-Id: I6bf3a72eac4a4242cb2dbf4e6eee73e0e1bef2ef [ROCm/hip-tests commit: 76dd8ea569]
2023-12-11 03:43:30 +00:00
@@ -4,3 +4,26 @@ Show an application written directly in HIP which uses platform-specific check o
 an instruction that only exists on the AMD platform.

 See related [blog](http://gpuopen.com/platform-aware-coding-inside-hip/) demonstrating platform specialization.
+
+- Steps to build this sample:
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute File
+```
+$ ./bit_extract
+
+pch size: 11743288
+__hipGetPCH succeeded!
+info: running on device #0
+info: allocate host mem (  7.63 MB)
+info: allocate device mem (  7.63 MB)
+info: copy Host2Device
+info: launch 'bit_extract_kernel'
+info: copy Device2Host
+info: check result
+PASSED!
+```
@@ -0,0 +1,19 @@
+# module_api
+
+- Steps to build this sample
+
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute Code
+```
+$ ./launchKernelHcc.hip.out
+PASSED!
+$ ./runKernel.hip.out
+PASSED!
+$ ./defaultDriver.hip.out
+PASSED!
+```
@@ -0,0 +1,17 @@
+# module_api_global
+
+- Steps to build this sample
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute Code
+```
+$ ./runKernel1.hip.out
+PASSED!
+Shared Size Bytes = 0
+Num Regs = 3
+PASSED!
+```
@@ -3,19 +3,16 @@
 Simple test below is an example, shows how to use hipify-perl to port CUDA code to HIP:

 - Add hip/bin path to the PATH
-
 ```
 $ export PATH=$PATH:[MYHIP]/bin
 ```

 - Define environment variable
-
 ```
 $ export HIP_PATH=[MYHIP]
 ```

 - Build executable file
-
 ```
 $ cd ~/hip/samples/0_Intro/square
  mkdir -p build && cd build
@@ -0,0 +1,26 @@
+# hipDispatchLatency.cpp
+
+- Steps to build this sample
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute Code
+```
+$ ./hipDispatchEnqueueRateMT 1 0
+Thread ID : 0 , hipModuleLaunchKernel enqueue rate: 0.8 us, std: 0.1 us
+
+$ ./hipDispatchEnqueueRateMT 1 1
+Thread ID : 0 , hipLaunchKernelGGL enqueue rate: 1.0 us, std: 0.1 us
+
+$ ./hipDispatchLatency
+hipModuleLaunchKernel enqueue rate: 0.8 us, std: 0.1 us
+
+hipLaunchKernelGGL enqueue rate: 1.0 us, std: 0.1 us
+
+Timing around single dispatch latency: 8.1 us, std: 4.7 us
+
+Batch dispatch latency: 1.4 us, std: 0.0 us
+```
@@ -4,3 +4,82 @@ Simple tool that prints properties for each device (from hipGetDeviceProperties)
    Properties includes all of the architectural feature flags for each device.

 Also demonstrates how to use platform-specific compilation path (testing `__HIP_PLATFORM_AMD__` or `__HIP_PLATFORM_NVIDIA__`)
+
+
+- Steps to build this sample
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute Code
+```
+$ ./hipInfo
+--------------------------------------------------------------------------------
+device#                           0
+Name:
+pciBusID:                         103
+pciDeviceID:                      0
+pciDomainID:                      0
+multiProcessorCount:              64
+maxThreadsPerMultiProcessor:      2560
+isMultiGpuBoard:                  0
+clockRate:                        1800 Mhz
+memoryClockRate:                  1000 Mhz
+memoryBusWidth:                   4096
+totalGlobalMem:                   31.98 GB
+totalConstMem:                    2147483647
+sharedMemPerBlock:                64.00 KB
+canMapHostMemory:                 1
+regsPerBlock:                     65536
+warpSize:                         64
+l2CacheSize:                      8388608
+computeMode:                      0
+maxThreadsPerBlock:               1024
+maxThreadsDim.x:                  1024
+maxThreadsDim.y:                  1024
+maxThreadsDim.z:                  1024
+maxGridSize.x:                    2147483647
+maxGridSize.y:                    65536
+maxGridSize.z:                    65536
+major:                            9
+minor:                            0
+concurrentKernels:                1
+cooperativeLaunch:                1
+cooperativeMultiDeviceLaunch:     1
+isIntegrated:                     0
+maxTexture1D:                     16384
+maxTexture2D.width:               16384
+maxTexture2D.height:              16384
+maxTexture3D.width:               16384
+maxTexture3D.height:              16384
+maxTexture3D.depth:               8192
+hostNativeAtomicSupported:        1
+isLargeBar:                       1
+asicRevision:                     1
+maxSharedMemoryPerMultiProcessor: 64.00 KB
+clockInstructionRate:             1000.00 Mhz
+arch.hasGlobalInt32Atomics:       1
+arch.hasGlobalFloatAtomicExch:    1
+arch.hasSharedInt32Atomics:       1
+arch.hasSharedFloatAtomicExch:    1
+arch.hasFloatAtomicAdd:           1
+arch.hasGlobalInt64Atomics:       1
+arch.hasSharedInt64Atomics:       1
+arch.hasDoubles:                  1
+arch.hasWarpVote:                 1
+arch.hasWarpBallot:               1
+arch.hasWarpShuffle:              1
+arch.hasFunnelShift:              0
+arch.hasThreadFenceSystem:        1
+arch.hasSyncThreadsExt:           0
+arch.hasSurfaceFuncs:             0
+arch.has3dGrid:                   1
+arch.hasDynamicParallelism:       0
+gcnArchName:                      gfx906:sramecc+:xnack-
+peers:
+non-peers:                        device#0
+memInfo.total:                    31.98 GB
+memInfo.free:                     31.96 GB (100%)
+```
@@ -87,8 +87,19 @@ After, copying the data from device to memory, we will verify it with the one we
 Finally, we will free the memory allocated earlier by using free() for host while for devices we will use `hipFree`.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./MatrixTranspose
+Device name Navi 14 [Radeon Pro W5500]
+PASSED!
+```
+

 ## More Info:
 - [HIP FAQ](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_faq.md)
@@ -45,9 +45,21 @@ Index for the respective operand in the ordered fashion is provided by `%` follo
 Output Constraints are specified by an `"="` prefix as shown above ("=v"). This indicate that assemby will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix - just the constraint code. The constraint string of `"0"` says to use the assigned register for output as an input as well (it being the 0'th constraint).

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
-
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./inline_asm
+Device name
+hipMemcpyHostToDevice time taken  =  1.057ms
+kernel Execution time             =  0.509ms
+hipMemcpyDeviceToHost time taken  =  1.254ms
+PASSED!
+```

 ## More Info:
 - [HIP FAQ](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_faq.md)
@@ -0,0 +1,21 @@
+# texture_driver
+
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./texture2dDrv
+tex2dKernelChar test  PASSED ...
+tex2dKernelShort test  PASSED ...
+tex2dKernelInt test  PASSED ...
+tex2dKernelFloat test  PASSED ...
+tex2dKernelChar4 test  PASSED ...
+tex2dKernelShort4 test  PASSED ...
+tex2dKernelInt4 test  PASSED ...
+tex2dKernelFloat4 test  PASSED ...
+texture2dDrv PASSED ...
+```
@@ -32,20 +32,21 @@ The macro supports specifying CLANG-specific, NVCC-specific compiler options usi
 Common options targeting both compilers can be specificed after the ```HIPCC_OPTIONS``` keyword.

 ## How to build and run:
-Use the following commands to build and execute the sample
-
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+ # For shared lib of hip rt,
+$ cmake ..
+ # Or for static lib of hip rt,
+$ cmake -DCMAKE_PREFIX_PATH="/opt/rocm/llvm/lib/cmake" ..
+$ make
 ```
-mkdir build
-cd build

-For shared lib of hip rt,
-cmake ..
-Or for static lib of hip rt,
-cmake -DCMAKE_PREFIX_PATH="/opt/rocm/llvm/lib/cmake" ..
-
-Then,
-make
-./MatrixTranspose
+- Execute the sample
+```
+$ ./MatrixTranspose
+Device name
+PASSED!
 ```

 ## More Info:
@@ -0,0 +1,24 @@
+# occupancy
+
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./occupancy
+Manual Configuration with block size 32
+kernel Execution time =  0.433ms
+Theoretical Occupancy is 40%
+
+Automatic Configuation based on hipOccupancyMaxPotentialBlockSize
+Suggested blocksize is 1024, Minimum gridsize is 128
+kernel Execution time =  0.037ms
+Theoretical Occupancy is 80%
+
+Manual Test PASSED!
+
+Automatic Test PASSED!
+```
@@ -0,0 +1,15 @@
+# gpu_arch
+
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./gpuarch
+success
+```
+
+## Note: This sample works on architectures gfx908 and above
@@ -1,179 +1,81 @@
-# Emitting Static Library
-
-This sample shows how to generate a static library for a simple HIP application. We will evaluate two types of static libraries: the first type exports host functions in a static library generated with --emit-static-lib and is compatible with host linkers, and second type exports device functions in a static library made with system ar.
-
-Please refer to the hip_programming_guide for limitations.
-
-## Static libraries with host functions
-
-### Source files
-The static library source files may contain host functions and kernel `__global__` and `__device__` functions. Here is an example (please refer to the directory host_functions).
-
-hipOptLibrary.cpp:
-```
-#define HIP_ASSERT(status) assert(status == hipSuccess)
-#define LEN 512
-
-__global__ void copy(uint32_t* A, uint32_t* B) {
-    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
-    B[tid] = A[tid];
-}
-
-void run_test1() {
-    uint32_t *A_h, *B_h, *A_d, *B_d;
-    size_t valbytes = LEN * sizeof(uint32_t);
-
-    A_h = (uint32_t*)malloc(valbytes);
-    B_h = (uint32_t*)malloc(valbytes);
-    for (uint32_t i = 0; i < LEN; i++) {
-        A_h[i] = i;
-        B_h[i] = 0;
-    }
-
-    HIP_ASSERT(hipMalloc((void**)&A_d, valbytes));
-    HIP_ASSERT(hipMalloc((void**)&B_d, valbytes));
-
-    HIP_ASSERT(hipMemcpy(A_d, A_h, valbytes, hipMemcpyHostToDevice));
-    hipLaunchKernelGGL(copy, dim3(LEN/64), dim3(64), 0, 0, A_d, B_d);
-    HIP_ASSERT(hipMemcpy(B_h, B_d, valbytes, hipMemcpyDeviceToHost));
-
-    for (uint32_t i = 0; i < LEN; i++) {
-        assert(A_h[i] == B_h[i]);
-    }
-
-    HIP_ASSERT(hipFree(A_d));
-    HIP_ASSERT(hipFree(B_d));
-    free(A_h);
-    free(B_h);
-    std::cout << "Test Passed!\n";
-}
-```
-
-The above source file can be compiled into a static library, libHipOptLibrary.a, using the --emit-static-lib flag, like so:
-```
-hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
-```
-
-### Main source files
-The main() program source file may link with the above static library using either hipcc or a host compiler (such as g++). A simple source file that calls the host function inside libHipOptLibrary.a:
-
-hipMain1.cpp:
-```
-extern void run_test1();
-
-int main(){
-  run_test1();
-}
-```
-
-To link to the static library:
-
-Using hipcc:
-```
-hipcc hipMain1.cpp -L. -lHipOptLibrary -o test_emit_static_hipcc_linker.out
-```
-Using g++:
-```
 ROCM_PATH is the path where ROCM is installed. default path is /opt/rocm.
-g++ hipMain1.cpp -L. -lHipOptLibrary -L<ROCM_PATH>/hip/lib -lamdhip64 -o test_emit_static_host_linker.out
+# Compile to assembly and create an executable from modified asm
+
+This sample shows how to generate the assembly code for a simple HIP application, then recompile it into a valid HIP executable.
+
+This sample builds on a previous HIP application sample; please see [0_Intro/square](https://github.com/ROCm-Developer-Tools/HIP/blob/master/samples/0_Intro/square).
+
+## Compiling the HIP source into assembly
+Using HIP flags `-c -S` will help generate the host x86_64 and the device AMDGCN assembly code when paired with `--cuda-host-only` and `--cuda-device-only` respectively. In this sample we use these commands:
+```
+<ROCM_PATH>/hip/bin/hipcc -c -S --cuda-host-only -target x86_64-linux-gnu -o square_host.s square.cpp
+<ROCM_PATH>/hip/bin/hipcc -c -S --cuda-device-only --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx1010 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 square.cpp
 ```

-## Static libraries with device functions
+The device assembly will be output into separate files, one per target architecture:
+- square-hip-amdgcn-amd-amdhsa-gfx900.s
+- square-hip-amdgcn-amd-amdhsa-gfx906.s
+- square-hip-amdgcn-amd-amdhsa-gfx908.s
+- square-hip-amdgcn-amd-amdhsa-gfx1010.s
+- square-hip-amdgcn-amd-amdhsa-gfx1030.s
+- square-hip-amdgcn-amd-amdhsa-gfx1100.s
+- square-hip-amdgcn-amd-amdhsa-gfx1101.s
+- square-hip-amdgcn-amd-amdhsa-gfx1102.s
+- square-hip-amdgcn-amd-amdhsa-gfx1103.s

-### Source files
-The static library source files which contain only `__device__` functions need to be created using ar. Here is an example (please refer to the directory device_functions).
+You may modify the `--offload-arch` flag to build for other archs and to enable or disable xnack and sram-ecc.

-hipDevice.cpp:
+**Note:** At this point, you may evaluate the assembly code, and make modifications if you are familiar with the AMDGCN assembly language and architecture.
+
+## Compiling the assembly into a valid HIP executable
+If valid, the modified host and device assembly may be compiled into a HIP executable. The host assembly can be compiled into an object using this command:
 ```
-#include <hip/hip_runtime.h>
-
-__device__ int square_me(int A) {
-  return A*A;
-}
+<ROCM_PATH>/hip/bin/hipcc -c square_host.s -o square_host.o
 ```

-The above source file may be compiled into a static library, libHipDevice.a, by first compiling into a relocatable object, and then placed in an archive using ar:
+However, the device assembly code requires a few extra steps. The device assemblies need to be compiled into device objects, then offload-bundled into a HIP fat binary using the clang-offload-bundler; llvm-mc then embeds the binary inside a host object using the MC directives provided in `hip_obj_gen.mcin`. The output is a host object with an embedded device object. Here are the steps for device-side compilation into an object:
 ```
-hipcc hipDevice.cpp -c -fgpu-rdc -fPIC -o hipDevice.o
-ar rcsD libHipDevice.a hipDevice.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx900 square-hip-amdgcn-amd-amdhsa-gfx900.s -o square-hip-amdgcn-amd-amdhsa-gfx900.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx906 square-hip-amdgcn-amd-amdhsa-gfx906.s -o square-hip-amdgcn-amd-amdhsa-gfx906.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx908 square-hip-amdgcn-amd-amdhsa-gfx908.s -o square-hip-amdgcn-amd-amdhsa-gfx908.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx1010 square-hip-amdgcn-amd-amdhsa-gfx1010.s -o square-hip-amdgcn-amd-amdhsa-gfx1010.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx1030 square-hip-amdgcn-amd-amdhsa-gfx1030.s -o square-hip-amdgcn-amd-amdhsa-gfx1030.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx1100 square-hip-amdgcn-amd-amdhsa-gfx1100.s -o square-hip-amdgcn-amd-amdhsa-gfx1100.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx1101 square-hip-amdgcn-amd-amdhsa-gfx1101.s -o square-hip-amdgcn-amd-amdhsa-gfx1101.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx1102 square-hip-amdgcn-amd-amdhsa-gfx1102.s -o square-hip-amdgcn-amd-amdhsa-gfx1102.o
+<ROCM_PATH>/hip/../llvm/bin/clang -target amdgcn-amd-amdhsa -mcpu=gfx1103 square-hip-amdgcn-amd-amdhsa-gfx1103.s -o square-hip-amdgcn-amd-amdhsa-gfx1103.o
+<ROCM_PATH>/llvm/bin/clang-offload-bundler -type=o -bundle-align=4096 -targets=host-x86_64-unknown-linux,hip-amdgcn-amd-amdhsa-gfx900,hip-amdgcn-amd-amdhsa-gfx906,hip-amdgcn-amd-amdhsa-gfx908,hip-amdgcn-amd-amdhsa-gfx1010,hip-amdgcn-amd-amdhsa-gfx1030,hip-amdgcn-amd-amdhsa-gfx1100,hip-amdgcn-amd-amdhsa-gfx1101,hip-amdgcn-amd-amdhsa-gfx1102,hip-amdgcn-amd-amdhsa-gfx1103 -inputs=/dev/null,square-hip-amdgcn-amd-amdhsa-gfx900.o,square-hip-amdgcn-amd-amdhsa-gfx906.o,square-hip-amdgcn-amd-amdhsa-gfx908.o,square-hip-amdgcn-amd-amdhsa-gfx1010.o,square-hip-amdgcn-amd-amdhsa-gfx1030.o,square-hip-amdgcn-amd-amdhsa-gfx1100.o,square-hip-amdgcn-amd-amdhsa-gfx1101.o,square-hip-amdgcn-amd-amdhsa-gfx1102.o,square-hip-amdgcn-amd-amdhsa-gfx1103.o -outputs=offload_bundle.hipfb
+<ROCM_PATH>/llvm/bin/llvm-mc -triple x86_64-unknown-linux-gnu hip_obj_gen.mcin -o square_device.o --filetype=obj
 ```

-### Main source files
-The main() program source file can link with the static library using hipcc. A simple source file that calls the device function inside libHipDevice.a:
+**Note:** The `-bundle-align=4096` option only works with ROCm 4.0 and newer compilers. Also, the architectures must match those used when compiling to assembly.

-hipMain2.cpp:
+Finally, using the system linker, hipcc, or clang, link the host and device objects into an executable:
 ```
-#include <hip/hip_runtime.h>
-#include <hip/hip_runtime_api.h>
-#include <iostream>
-
-#define HIP_ASSERT(status) assert(status == hipSuccess)
-#define LEN 512
-
-extern __device__ int square_me(int);
-
-__global__ void square_and_save(int* A, int* B) {
-    int tid = threadIdx.x + blockIdx.x * blockDim.x;
-    B[tid] = square_me(A[tid]);
-}
-
-void run_test2() {
-    int *A_h, *B_h, *A_d, *B_d;
-    A_h = new int[LEN];
-    B_h = new int[LEN];
-    for (unsigned i = 0; i < LEN; i++) {
-        A_h[i] = i;
-        B_h[i] = 0;
-    }
-    size_t valbytes = LEN*sizeof(int);
-
-    HIP_ASSERT(hipMalloc((void**)&A_d, valbytes));
-    HIP_ASSERT(hipMalloc((void**)&B_d, valbytes));
-
-    HIP_ASSERT(hipMemcpy(A_d, A_h, valbytes, hipMemcpyHostToDevice));
-    hipLaunchKernelGGL(square_and_save, dim3(LEN/64), dim3(64),
-                       0, 0, A_d, B_d);
-    HIP_ASSERT(hipMemcpy(B_h, B_d, valbytes, hipMemcpyDeviceToHost));
-
-    for (unsigned i = 0; i < LEN; i++) {
-        assert(A_h[i]*A_h[i] == B_h[i]);
-    }
-
-    HIP_ASSERT(hipFree(A_d));
-    HIP_ASSERT(hipFree(B_d));
-    free(A_h);
-    free(B_h);
-    std::cout << "Test Passed!\n";
-}
-
-int main(){
-  // Run test that generates static lib with ar
-  run_test2();
-}
+<ROCM_PATH>/hip/bin/hipcc square_host.o square_device.o -o square_asm.out
 ```

-To link to the static library:
+## How to build and run this sample:
+- Build the sample using cmake
 ```
-hipcc libHipDevice.a hipMain2.cpp -fgpu-rdc -o test_device_static_hipcc.out
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
 ```

-##  How to build and run this sample:
-Use the make command to build the static libraries, link with it, and execute it.
- Change directory to either host or device functions folder.
- To build the static library and link the main executable, use `make all`.
- To execute, run the generated executable `./test_*.out`.
-
-Alternatively, use these CMake commands.
+- Execute sample
 ```
-cd device_functions
-mkdir -p build
-cd build
-cmake ..
-make
-./test_*.out
+$ ./square_asm.out
+info: running on device AMD Radeon Graphics
+info: allocate host mem (  7.63 MB)
+info: allocate device mem (  7.63 MB)
+info: copy Host2Device
+info: launch 'vector_square' kernel
+info: copy Device2Host
+info: check result
+PASSED!
 ```
-It is recommended to use Visual Studio's command prompt for this sample due to requirement of MS Librarian tool - LIB.exe on windows platform.
-Override CMAKE_C_COMPILER and CMAKE_CXX_COMPILER to hipcc as Visual Studio's compiler would use cl.exe as default compiler.
-i.e. cmake.exe -GNinja -DCMAKE_CXX_COMPILER_ID=ROCMClang -DCMAKE_C_COMPILER_ID=ROCMClang -DCMAKE_PREFIX_PATH=%HIP_PATH% -DCMAKE_C_COMPILER=%HIP_PATH%/bin/hipcc.bat -DCMAKE_CXX_COMPILER=%HIP_PATH%/bin/hipcc.bat ..

-## For More Infomation, please refer to the HIP FAQ.
+**Note:** The currently defined archs are `gfx900`, `gfx906`, `gfx908`, `gfx1010`, `gfx1030`, `gfx1100`, `gfx1101`, `gfx1102` and `gfx1103`. Any arch not defined here can be set with the make argument `GPU_ARCHxx`.
+
+## For More Information, please refer to the HIP FAQ.
@@ -56,12 +56,16 @@ Finally, using the system linker, hipcc, or clang, link the host and device obje
 ```

 ## How to build and run this sample:
-Use these make commands to compile into assembly, compile assembly into executable, and execute it.
- To compile the HIP application into host and device assembly: `make src_to_asm`.
- To compile the assembly files into an executable: `make asm_to_exec`.
- To execute, run
+- Build the sample using cmake
 ```
-./square_asm.out
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute sample
+```
+$ ./square_asm.out
 info: running on device AMD Radeon Graphics
 info: allocate host mem (  7.63 MB)
 info: allocate device mem (  7.63 MB)
@@ -84,14 +84,16 @@ Finally, using the system linker, hipcc, or clang, link the host and device obje
 If you haven't modified the GPU archs, this executable should run on the defined `gfx900`, `gfx906`, `gfx908`, `gfx1010`, `gfx1030`, `gfx1100`, `gfx1101`, `gfx1102` and `gfx1103`.

 ## How to build and run this sample:
-Use these make commands to compile into LLVM IR, compile IR into executable, and execute it.
- To compile the HIP application into host and device LLVM IR: `make src_to_ir`.
- To disassembly the LLVM IR bitcode into human readable LLVM IR: `make bc_to_ll`.
- To assembly the human readable LLVM IR bitcode back into LLVM IR bitcode: `make ll_to_bc`.
- To compile the LLVM IR files into an executable: `make ir_to_exec`.
- To execute, run
+- Build the sample using cmake
 ```
-./square_ir.out
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+
+- Execute sample
+```
+$ ./square_ir.out
 info: running on device AMD Radeon Graphics
 info: allocate host mem (  7.63 MB)
 info: allocate device mem (  7.63 MB)
@@ -2,15 +2,14 @@
 I. Build

 ```
-mkdir -p build; cd build
-rm -rf *;
-CXX="$(hipconfig -l)"/clang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
-make
+$ mkdir build; cd build
+$ CXX="$(hipconfig -l)"/clang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
+$ make
 ```

 Note, users may need to add ADMGPU support as command line option, if test failed to run, for example,
 ```
-CXX="$(hipconfig -l)"/clang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS="gfx1102" ..
+$ CXX="$(hipconfig -l)"/clang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS="gfx1102" ..
 ```

 II. Test
@@ -8,28 +8,28 @@ I. Prepare

 II. Build
 ```
-mkdir -p build; cd build
-rm -rf *;
-CXX="$(hipconfig -l)"/clang++ FC=$(which gfortran) cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
-cmake ..
-make
+$ mkdir -p build; cd build
+$ rm -rf *;
+$ CXX="$(hipconfig -l)"/clang++ FC=$(which gfortran) cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
+$ cmake ..
+$ make
 ```

 Note, users may need to add AMD GPU support, if test failed, for example,
 ```
-CXX="$(hipconfig -l)"/clang++ FC=$(which gfortran) cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS="gfx1102" ..
+$ CXX="$(hipconfig -l)"/clang++ FC=$(which gfortran) cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS="gfx1102" ..
 ```
-To enable compiler auto detection of gpu users may need to add ADMGPU support as command line option, 
+To enable compiler auto detection of the GPU, users may need to add AMDGPU support as a command line option,
 if test failed to run, for example,
 ```
-CXX="$(hipconfig -l)"/clang++ FC=$(which gfortran) cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS=native ..
+$ CXX="$(hipconfig -l)"/clang++ FC=$(which gfortran) cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS=native ..
 ```
 III. Test
 ```
-./test_fortran
+$ ./test_fortran
 Succeeded testing Fortran!

-./test_cpp
+$ ./test_cpp
 Device name AMD Radeon Graphics
 PASSED!
 ```
@@ -66,8 +66,21 @@ Here the first parameter will store the time taken value, second parameter is th
 We can print the value of time take comfortably since eventMs is float variable.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./hipEvent
+Device name Navi 14 [Radeon Pro W5500]
+hipMemcpyHostToDevice time taken  =  0.981ms
+kernel Execution time             =  0.539ms
+hipMemcpyDeviceToHost time taken  =  1.220ms
+PASSED!
+```

 ## More Info:
 - [HIP FAQ](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_faq.md)
@@ -2,14 +2,14 @@
 I. Build

 ```
-mkdir -p build; cd build
-rm -rf *;
-CXX="$(hipconfig -l)"/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
-make
+$ mkdir -p build; cd build
+$ rm -rf *;
+$ CXX="$(hipconfig -l)"/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
+$ make
 ```
 To enable compiler auto detection of gpu users may need to add ADMGPU support as command line option, if test failed to run, for example,
 ```
-CXX="$(hipconfig -l)"/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS="gfx1102" ..
+$ CXX="$(hipconfig -l)"/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS="gfx1102" ..
 ```
 II. Test

@@ -2,10 +2,10 @@
 I. Build

 ```
-mkdir -p build; cd build
-rm -rf *;
-cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
-make
+$ mkdir -p build; cd build
+$ rm -rf *;
+$ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
+$ make
 ```

 II. Test
@@ -2,10 +2,10 @@
 I. Build

 ```
-mkdir -p build; cd build
-rm -rf *;
-CXX="$(hipconfig -l)"/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
-make
+$ mkdir -p build; cd build
+$ rm -rf *;
+$ CXX="$(hipconfig -l)"/amdclang++ cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
+$ make
 ```

 II. Test
@@ -28,8 +28,18 @@ Be careful while using shared memory, since all threads within the block can acc
 `  __syncthreads();`

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./sharedMemory
+Device name Navi 14 [Radeon Pro W5500]
+PASSED!
+```

 ## More Info:
 - [HIP FAQ](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_faq.md)
@@ -36,9 +36,18 @@ In this tutorial, we'll use `__shfl()` ops. In the same sourcecode, we used for
 Be careful while using shfl operations, since all exchanges are possible between the threads of corresponding warp only.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
-
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./shfl
+Device name Navi 14 [Radeon Pro W5500]
+PASSED!
+```
 ## requirement for nvidia
 please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add `-gencode arch=compute=30, code=sm_30` nvcc flag in the Makefile while using this application.

@@ -38,9 +38,18 @@ In the same sourcecode, we used for MatrixTranspose. We'll add the following:
 With the help of this application, we can say that kernel code can be converted into  multi-dimensional threads with ease.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
-
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./2dshfl
+Device name Navi 14 [Radeon Pro W5500]
+PASSED!
+```
 ## requirement for nvidia
 please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add `-gencode arch=compute=30, code=sm_30` nvcc flag in the Makefile while using this application.

@@ -38,9 +38,18 @@ The other important change is:
 here we replaced 4th parameter with amount of additional shared memory to allocate when launching the kernel.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
-
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./dynamic_shared
+Device name Navi 14 [Radeon Pro W5500]
+dynamic_shared PASSED!
+```
 ## More Info:
 - [HIP FAQ](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_faq.md)
 - [HIP Kernel Language](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md)
@@ -49,8 +49,17 @@ and while kernel launch, we make the following changes in 5th parameter to hipLa
 here we replaced 4th parameter with amount of additional shared memory to allocate when launching the kernel.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./stream
+stream PASSED!
+```

 ## More Info:
 - [HIP FAQ](https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_faq.md)
@@ -0,0 +1,12 @@
+# peer2peer
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./peer2peer
+Peer2Peer application requires atleast 2 gpu devices
+```
@@ -16,7 +16,7 @@ Programmers familiar with CUDA, OpenCL will be able to quickly learn and start c

 ## Simple Matrix Transpose

-For this tutorial we will be using an example which sums up the row of a 2D matrix and writes it in a 1D array. 
+For this tutorial, we will use an example that sums each row of a 2D matrix and writes the results to a 1D array.

 In this tutorial, we'll use `#pragma unroll`. In the same sourcecode, we used for gpuMatrixRowSum. We'll add it just before the for loop as following:

@@ -31,9 +31,18 @@ Specifying the optional parameter, #pragma unroll value, directs the unroller to
 Specifying #pragma nounroll indicates that the loop should not be unroll. #pragma unroll 1 will show the same behaviour.

 ## How to build and run:
-Use the make command and execute it using ./exe
-Use hipcc to build the application, which is using hcc on AMD and nvcc on nvidia.
-
+- Build the sample using cmake
+```
+$ mkdir build; cd build
+$ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm
+$ make
+```
+- Execute the sample
+```
+$ ./unroll
+Device name
+PASSED
+```
 ## requirement for nvidia
 please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add `-gencode arch=compute=30, code=sm_30` nvcc flag in the Makefile while using this application.

@@ -40,3 +40,5 @@ Note that if you want debug version, add "-DCMAKE_BUILD_TYPE=Debug" in cmake cmd
 cmake ../samples

 make package_samples
+
+## Note: sample 2_Cookbook/22_cmake_hip_lang is currently not included in the top-level cmake. To build this sample from the top-level cmake, uncomment Line 43 inside samples/2_Cookbook/CMakeLists.txt.