hipify nccl-tests to become rccl-tests

[ROCm/rccl-tests commit: a15f771cb2]
2019-04-09 15:51:40 -07:00
@@ -1,26 +1,26 @@
-# NCCL Tests
+# RCCL Tests

-These tests check both the performance and the correctness of NCCL operations. They can be compiled against [NCCL](http://github.com/nvidia/nccl)
+These tests check both the performance and the correctness of RCCL operations. They can be compiled against [RCCL](https://github.com/ROCmSoftwarePlatform/rccl)

 ## Build

 To build the tests, just type `make`.

-If CUDA is not installed in /usr/local/cuda, you may specify CUDA\_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL\_HOME.
+If HIP is not installed in /opt/rocm, you may specify HIP\_HOME. Similarly, if RCCL is not installed in /usr, you may specify RCCL\_HOME.

 ```shell
-$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
+$ make HIP_HOME=/path/to/hip RCCL_HOME=/path/to/rccl
 ```

-NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.
+RCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.

 ```shell
-$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
+$ make MPI=1 MPI_HOME=/path/to/mpi HIP_HOME=/path/to/hip RCCL_HOME=/path/to/rccl
 ```

 ## Usage

-NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).
+RCCL tests can run on multiple processes, multiple threads, and multiple HIP devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=HIP devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).

 ### Quick examples

@@ -51,7 +51,7 @@ All tests support the same set of arguments :
  * Increments can be either fixed or a multiplication factor. Only one of those should be used
    * `-i,--stepbytes <increment size>` fixed increment between sizes. Default : (max-min)/10.
    * `-f,--stepfactor <increment factor>` multiplication factor between sizes. Default : disabled.
-* NCCL operations arguments
+* RCCL operations arguments
  * `-o,--op <sum/prod/min/max/all>` Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum.
  * `-d,--datatype <nccltype/all>` Specify which datatype to use. Default : Float.
  * `-r,--root <root/all>` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
@@ -60,11 +60,11 @@ All tests support the same set of arguments :
  * `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 5.
  * `-m,--agg_iters <aggregation count>` number of operations to aggregate together in each iteration. Default : 1.
 * Test operation
-  * `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel. Default : 0.
+  * `-p,--parallel_init <0/1>` use threads to initialize RCCL in parallel. Default : 0.
  * `-c,--check <0/1>` check correctness of results. This can be quite slow on large numbers of GPUs. Default : 1.
-  * `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
+  * `-z,--blocking <0/1>` Make RCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.

 ## Copyright

-NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
+RCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.

@@ -1,6 +1,6 @@
-# Performance reported by NCCL tests
+# Performance reported by RCCL tests

-NCCL tests report the average operation time in ms, and two bandwidths in GB/s : algorithm bandwidth and bus bandwidth. This page explains what those numbers mean and what you should expect depending on the hardware used.
+RCCL tests report the average operation time in ms, and two bandwidths in GB/s : algorithm bandwidth and bus bandwidth. This page explains what those numbers mean and what you should expect depending on the hardware used.

 # Time

@@ -24,7 +24,7 @@ Algorithm bandwidth is using the most commonly used formula for bandwidth : size
 While the algorithm bandwidth makes sense for point-to-point operations like Send/Receive, it is not always helpful to measure collective operations speed, since the theoretical peak algorithm bandwidth is not equal to the hardware peak bandwidth, usually depending on the number of ranks.
 Most benchmarks only provide time measurements, which is hard to interpret for large sizes. Some others also provide algorithms bandwidth, but see that depending on the number of ranks, that bandwidth varies (and decreases as the number of ranks increase).

-To provide a number which reflects how optimally the hardware is used, NCCL tests introduce the notion of "Bus Bandwidth" ("busbw" column in the tests output).
+To provide a number which reflects how optimally the hardware is used, RCCL tests introduce the notion of "Bus Bandwidth" ("busbw" column in the tests output).
 This number is obtained applying a formula to the algorithm bandwidth to reflect the speed of the inter-GPU communication.
 Using this bus bandwidth, we can compare it with the hardware peak bandwidth, independently of the number of ranks used.

@@ -78,7 +78,7 @@ And the Bus Bandwidth is therefore computed as :

 `B = S/t * (n-1)/n = algbw * (n-1)/n`

-Note that here, S is the size in bytes of the total array, which for NCCL is equal to `recvcount*sizeof(datatype)*n` as the `recvcount` argument is the count per rank.
+Note that here, S is the size in bytes of the total array, which for RCCL is equal to `recvcount*sizeof(datatype)*n` as the `recvcount` argument is the count per rank.

 ### AllGather

@@ -96,7 +96,7 @@ And the Bus Bandwidth is therefore computed as :

 `B = S/t * (n-1)/n = algbw * (n-1)/n`

-Note that here, S is the size in bytes of the total array, which for NCCL is equal to `sendcount*sizeof(datatype)*n` as the `sendcount` argument is the count per rank.
+Note that here, S is the size in bytes of the total array, which for RCCL is equal to `sendcount*sizeof(datatype)*n` as the `sendcount` argument is the count per rank.

 ### Broadcast

@@ -4,41 +4,30 @@
 # See LICENSE.txt for license information
 #

-CUDA_HOME ?= /usr/local/cuda
+ROCM_HOME ?= /opt/rocm
+MPI_HOME ?= /usr/lib/openmpi
 PREFIX ?= /usr/local
 VERBOSE ?= 0
 DEBUG ?= 0

-CUDA_LIB ?= $(CUDA_HOME)/lib64
-CUDA_INC ?= $(CUDA_HOME)/include
-NVCC = $(CUDA_HOME)/bin/nvcc
+HIPCC = $(ROCM_HOME)/hip/bin/hipcc
+CXX = $(HIPCC)

-# Better define NVCC_GENCODE in your environment to the minimal set
-# of archs to reduce compile time.
-NVCC_GENCODE ?= -gencode=arch=compute_30,code=sm_30 \
-		-gencode=arch=compute_35,code=sm_35 \
-                -gencode=arch=compute_50,code=sm_50 \
-		-gencode=arch=compute_60,code=sm_60 \
-                -gencode=arch=compute_61,code=sm_61 \
-		-gencode=arch=compute_70,code=compute_70 \
-		-gencode=arch=compute_70,code=sm_70
-
-NVCUFLAGS  := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
-
-LDFLAGS    := -L${CUDA_LIB} -lcudart -lrt
-NVLDFLAGS  := -L${CUDA_LIB} -lcudart -lrt
+HIPCUFLAGS :=
+HIPCUFLAGS += -I$(ROCM_HOME)/include
+HIPCUFLAGS += -I$(ROCM_HOME)/include/rccl
+HIPCUFLAGS += -I$(ROCM_HOME)/hip/include/hip
+HIPCUFLAGS += -I$(ROCM_HOME)/hiprand/include
+LDFLAGS    := -L$(ROCM_HOME)/lib -lhsa-runtime64 -lrt
+HIPLDFLAGS := -L$(ROCM_HOME)/lib -lhsa-runtime64 -lrt

 ifeq ($(DEBUG), 0)
-NVCUFLAGS += -O3 -g
-CXXFLAGS  += -O3 -g
+HIPCUFLAGS += -O3
 else
-NVCUFLAGS += -O0 -G -g
-CXXFLAGS  += -O0 -g -ggdb3
+HIPCUFLAGS += -O0 -g -ggdb3
 endif

-ifneq ($(VERBOSE), 0)
-NVCUFLAGS += -Xcompiler -Wall,-Wextra,-Wno-unused-parameter
-else
+ifeq ($(VERBOSE), 0)
 .SILENT:
 endif

@@ -46,16 +35,16 @@ endif

 BUILDDIR ?= ../build
 ifneq ($(NCCL_HOME), "")
-NVCUFLAGS += -I$(NCCL_HOME)/include/
-NVLDFLAGS += -L$(NCCL_HOME)/lib
+HIPCUFLAGS += -I$(NCCL_HOME)/include/
+HIPLDFLAGS   += -L$(NCCL_HOME)/lib
 endif

 ifeq ($(MPI), 1)
-NVCUFLAGS += -DMPI_SUPPORT -I$(MPI_HOME)/include
-NVLDFLAGS += -L$(MPI_HOME)/lib -lmpi
+HIPCUFLAGS += -DMPI_SUPPORT -I${MPI_HOME}/include
+HIPLDFLAGS += -L${MPI_HOME}/lib -lmpi
 endif
-LIBRARIES += curand nccl nvToolsExt
-NVLDFLAGS += $(LIBRARIES:%=-l%)
+LIBRARIES += rccl
+HIPLDFLAGS   += $(LIBRARIES:%=-l%)

 DST_DIR := $(BUILDDIR)
 SRC_FILES := $(wildcard *.cu)
@@ -71,10 +60,12 @@ clean:
 ${DST_DIR}/%.o: %.cu common.h
 	@printf "Compiling  %-35s > %s\n" $< $@
 	@mkdir -p ${DST_DIR}
-	$(NVCC) -o $@ $(NVCUFLAGS) -c $<
+	echo "$(HIPCC) -o $@ $(HIPCUFLAGS) -c $<"
+	$(HIPCC) -o $@ $(HIPCUFLAGS) -c $<

 ${DST_DIR}/%_perf:${DST_DIR}/%.o ${DST_DIR}/common.o
 	@printf "Linking  %-35s > %s\n" $< $@
 	@mkdir -p ${DST_DIR}
-	$(NVCC) -o $@ $(NVCUFLAGS) $^ ${NVLDFLAGS}
+	echo "$(HIPCC) -o $@ $(HIPCUFLAGS) $^ ${HIPLDFLAGS}"
+	$(HIPCC) -o $@ $(HIPCUFLAGS) $^ ${HIPLDFLAGS}

@@ -4,7 +4,7 @@
 * See LICENSE.txt for license information
 ************************************************************************/

-#include "cuda_runtime.h"
+#include <hip/hip_runtime.h>
 #include "common.h"

 void print_header() {
@@ -34,15 +34,15 @@ testResult_t AllGatherInitData(struct threadArgs* args, ncclDataType_t type, ncc

  for (int i=0; i<args->nGpus; i++) {
    int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
-    CUDACHECK(cudaSetDevice(gpuid));
+    HIPCHECK(hipSetDevice(gpuid));
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
-    CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
+    HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
    void* data = in_place ? ((char*)args->recvbuffs[i])+rank*args->sendBytes : args->sendbuffs[i];
    TESTCHECK(InitData(data, sendcount, type, rep, rank));
    for (int j=0; j<nranks; j++) {
      TESTCHECK(InitData(((char*)args->expected[i])+args->sendBytes*j, sendcount, type, rep, j));
    }
-    CUDACHECK(cudaDeviceSynchronize());
+    HIPCHECK(hipDeviceSynchronize());
  }
  return testSuccess;
 }
@@ -55,7 +55,7 @@ void AllGatherGetBw(size_t count, int typesize, double sec, double* algBw, doubl
  *busBw = baseBw * factor;
 }

-testResult_t AllGatherRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
+testResult_t AllGatherRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
  NCCLCHECK(ncclAllGather(sendbuff, recvbuff, count, type, comm, stream));
  return testSuccess;
 }
@@ -4,7 +4,7 @@
 * See LICENSE.txt for license information
 ************************************************************************/

-#include "cuda_runtime.h"
+#include <hip/hip_runtime.h>
 #include "common.h"

 void print_header() {
@@ -34,13 +34,13 @@ testResult_t AllReduceInitData(struct threadArgs* args, ncclDataType_t type, ncc

  for (int i=0; i<args->nGpus; i++) {
    int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
-    CUDACHECK(cudaSetDevice(gpuid));
+    HIPCHECK(hipSetDevice(gpuid));
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
-    CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
+    HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
    void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
    TESTCHECK(InitData(data, sendcount, type, rep, rank));
    TESTCHECK(InitDataReduce(args->expected[i], recvcount, 0, type, op, rep, nranks));
-    CUDACHECK(cudaDeviceSynchronize());
+    HIPCHECK(hipDeviceSynchronize());
  }
  return testSuccess;
 }
@@ -53,7 +53,7 @@ void AllReduceGetBw(size_t count, int typesize, double sec, double* algBw, doubl
  *busBw = baseBw * factor;
 }

-testResult_t AllReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
+testResult_t AllReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
  NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, type, op, comm, stream));
  return testSuccess;
 }
@@ -4,7 +4,7 @@
 * See LICENSE.txt for license information
 ************************************************************************/

-#include "cuda_runtime.h"
+#include <hip/hip_runtime.h>
 #include "common.h"

 void print_header() {
@@ -33,13 +33,13 @@ testResult_t BroadcastInitData(struct threadArgs* args, ncclDataType_t type, ncc

  for (int i=0; i<args->nGpus; i++) {
    int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
-    CUDACHECK(cudaSetDevice(gpuid));
+    HIPCHECK(hipSetDevice(gpuid));
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
-    CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
+    HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
    void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
    if (rank == root) TESTCHECK(InitData(data, sendcount, type, rep, rank));
    TESTCHECK(InitData(args->expected[i], recvcount, type, rep, root));
-    CUDACHECK(cudaDeviceSynchronize());
+    HIPCHECK(hipDeviceSynchronize());
  }
  return testSuccess;
 }
@@ -52,7 +52,7 @@ void BroadcastGetBw(size_t count, int typesize, double sec, double* algBw, doubl
  *busBw = baseBw * factor;
 }

-testResult_t BroadcastRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
+testResult_t BroadcastRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
  int rank;
  NCCLCHECK(ncclCommUserRank(comm, &rank));
 #if NCCL_MAJOR >= 2 && NCCL_MINOR >= 2
@@ -1,3 +1,4 @@
+#include "hip/hip_runtime.h"
 /*************************************************************************
 * Copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
 *
@@ -9,7 +10,6 @@
 #include <cstdio>
 #include <getopt.h>
 #include <libgen.h>
-#include "cuda.h"

 #if NCCL_MAJOR >= 2
 ncclDataType_t test_types[ncclNumTypes] = {ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64, ncclHalf, ncclFloat, ncclDouble};
@@ -129,27 +129,27 @@ void deltaKern(void* A_, void* B_, size_t count, double* max) {
 testResult_t CheckDelta(void* expected, void* results, size_t count, ncclDataType_t type, double* devmax) {
  switch (type) {
    case ncclHalf:
-      deltaKern<half, 512><<<1, 512>>>(results, expected, count, devmax); break;
+      hipLaunchKernelGGL((deltaKern<half, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
    case ncclFloat:
-      deltaKern<float, 512><<<1, 512>>>(results, expected, count, devmax); break;
+      hipLaunchKernelGGL((deltaKern<float, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
    case ncclDouble:
-      deltaKern<double, 512><<<1, 512>>>(results, expected, count, devmax); break;
+      hipLaunchKernelGGL((deltaKern<double, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;

    case ncclChar:
 #if NCCL_MAJOR >= 2
    case ncclUint8:
 #endif
-      deltaKern<uint8_t, 512><<<1, 512>>>(results, expected, count, devmax); break;
+      hipLaunchKernelGGL((deltaKern<uint8_t, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
    case ncclInt:
 #if NCCL_MAJOR >= 2
    case ncclUint32:
 #endif
-      deltaKern<uint32_t, 512><<<1, 512>>>(results, expected, count, devmax); break;
+      hipLaunchKernelGGL((deltaKern<uint32_t, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
    case ncclInt64:
    case ncclUint64:
-      deltaKern<uint64_t, 512><<<1, 512>>>(results, expected, count, devmax); break;
+      hipLaunchKernelGGL((deltaKern<uint64_t, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
  }
-  CUDACHECK(cudaDeviceSynchronize());
+  HIPCHECK(hipDeviceSynchronize());
  return testSuccess;
 }

@@ -196,61 +196,63 @@ template<>
 __device__ half ncclOpMin(half a, half b) { return __half2float(a)<__half2float(b) ? a : b; }

 template<typename T, T (*Op)(T, T)>
-__global__ void InitDataReduceKernel(T* data, const size_t N, const size_t offset, const int rep, const int nranks) {
+__global__ void InitDataReduceKernel(void* data, const size_t N, const size_t offset, const int rep, const int nranks) {
  for (size_t o=blockIdx.x*blockDim.x+threadIdx.x; o<N; o+=gridDim.x*blockDim.x) {
    T val = testValue<T>(o+offset, rep, 0);
    for (int i=1; i<nranks; i++) {
      val = Op(val, testValue<T>(o+offset, rep, i));
    }
-    data[o] = val;
+    ((T*)data)[o] = val;
  }
 }

-#define KERN(type, op) (void*)InitDataReduceKernel<type, op<type>>
+typedef void(*redInitKern_t)(void* data, const size_t N, const size_t offset, const int rep, const int nranks);
+
+#define KERN(type, op) InitDataReduceKernel<type, op<type>>
 #define OPS(type) KERN(type, ncclOpSum), KERN(type, ncclOpProd), KERN(type, ncclOpMax), KERN(type, ncclOpMin)

-static void* const redInitDataKerns[ncclNumOps*ncclNumTypes] = {
+static redInitKern_t const redInitDataKerns[ncclNumOps*ncclNumTypes] = {
  OPS(int8_t), OPS(uint8_t), OPS(int32_t), OPS(uint32_t), OPS(int64_t), OPS(uint64_t), OPS(half), OPS(float), OPS(double)
 };

 testResult_t InitDataReduce(void* data, const size_t count, const size_t offset, ncclDataType_t type, ncclRedOp_t op, const int rep, const int nranks) {
  dim3 grid = { 32, 1, 1 };
  dim3 block = { 256, 1, 1 };
-  void* args[5] = { (void*)&data, (void*)&count, (void*)&offset, (void*)&rep, (void*)&nranks };
-  CUDACHECK(cudaLaunchKernel(redInitDataKerns[type*ncclNumOps+op], grid, block, args, 0, cudaStreamDefault));
+  hipLaunchKernelGGL((redInitDataKerns[type*ncclNumOps+op]), grid, block, 0, 0, data, count, offset, rep, nranks);
  return testSuccess;
 }

 template<typename T>
-__global__ void InitDataKernel(T* data, const size_t N, const int rep, const int rank) {
+__global__ void InitDataKernel(void* data, const size_t N, const int rep, const int rank) {
  for (size_t o=blockIdx.x*blockDim.x+threadIdx.x; o<N; o+=gridDim.x*blockDim.x)
-    data[o] = testValue<T>(o, rep, rank);
+    ((T*)data)[o] = testValue<T>(o, rep, rank);
 }

-static void* const initDataKerns[ncclNumTypes] = {
-  (void*)InitDataKernel<  int8_t>,
-  (void*)InitDataKernel< uint8_t>,
-  (void*)InitDataKernel< int32_t>,
-  (void*)InitDataKernel<uint32_t>,
-  (void*)InitDataKernel< int64_t>,
-  (void*)InitDataKernel<uint64_t>,
-  (void*)InitDataKernel<    half>,
-  (void*)InitDataKernel<   float>,
-  (void*)InitDataKernel<  double>
+typedef void(*initDataKern_t)(void* data, const size_t N, const int rep, const int rank);
+
+static initDataKern_t const initDataKerns[ncclNumTypes] = {
+  InitDataKernel<  int8_t>,
+  InitDataKernel< uint8_t>,
+  InitDataKernel< int32_t>,
+  InitDataKernel<uint32_t>,
+  InitDataKernel< int64_t>,
+  InitDataKernel<uint64_t>,
+  InitDataKernel<    half>,
+  InitDataKernel<   float>,
+  InitDataKernel<  double>
 };

 template<typename T>
 testResult_t InitDataType(void* dest, const size_t N, const int rep, const int rank) {
  T* ptr = (T*)dest;
-  InitDataKernel<<<16, 512>>>(ptr, N, rep, rank);
+  hipLaunchKernelGGL((InitDataKernel), dim3(16), dim3(512), 0, 0, ptr, N, rep, rank);
  return testSuccess;
 }

 testResult_t InitData(void* data, const size_t count, ncclDataType_t type, const int rep, const int rank) {
  dim3 grid = { 32, 1, 1 };
  dim3 block = { 256, 1, 1 };
-  void* args[4] = { (void*)&data, (void*)&count, (void*)&rep, (void*)&rank };
-  CUDACHECK(cudaLaunchKernel(initDataKerns[type], grid, block, args, 0, cudaStreamDefault));
+  hipLaunchKernelGGL((initDataKerns[type]), grid, block, 0, 0, data, count, rep, rank);
  return testSuccess;
 }

@@ -279,7 +281,7 @@ testResult_t CheckData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
    int device;
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
    NCCLCHECK(ncclCommCuDevice(args->comms[i], &device));
-    CUDACHECK(cudaSetDevice(device));
+    HIPCHECK(hipSetDevice(device));
    void *data = in_place ? ((void *)((uintptr_t)args->recvbuffs[i] + args->recvInplaceOffset*rank)) : args->recvbuffs[i];
    TESTCHECK(CheckDelta(data , args->expected[i], count, type, args->delta));
    maxDelta = std::max(*(args->deltaHost), maxDelta);
@@ -289,14 +291,14 @@ testResult_t CheckData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
       int *expectedHost = (int *)malloc(args->expectedBytes);
       int *dataHost = (int *)malloc(args->expectedBytes);

-       cudaMemcpy(expectedHost, args->expected[0], args->expectedBytes, cudaMemcpyDeviceToHost);
+       hipMemcpy(expectedHost, args->expected[0], args->expectedBytes, hipMemcpyDeviceToHost);
       printf("\n Expected: ");
       for(int j=0; j<args->expectedBytes/sizeof(int); j++) {
         printf("%d:%d ", j, expectedHost[j]);
       }
       printf("\n");

-       cudaMemcpy(dataHost, data, args->expectedBytes, cudaMemcpyDeviceToHost);
+       hipMemcpy(dataHost, data, args->expectedBytes, hipMemcpyDeviceToHost);
       printf("\n Actual: ");
       for (int j=0; j<args->expectedBytes/sizeof(int); j++) {
         printf("%d:%d ", j, dataHost[j]);
@@ -312,8 +314,8 @@ testResult_t CheckData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
  return testSuccess;
 }

-testResult_t testStreamSynchronize(int ngpus, cudaStream_t* streams, ncclComm_t* comms) {
-  cudaError_t cudaErr;
+testResult_t testStreamSynchronize(int ngpus, hipStream_t* streams, ncclComm_t* comms) {
+  hipError_t hipErr;
  int remaining = ngpus;
  int* done = (int*)malloc(sizeof(int)*ngpus);
  memset(done, 0, sizeof(int)*ngpus);
@@ -322,15 +324,15 @@ testResult_t testStreamSynchronize(int ngpus, cudaStream_t* streams, ncclComm_t*
   for (int i=0; i<ngpus; i++) {
     if (done[i]) continue;

-     cudaErr = cudaStreamQuery(streams[i]);
-     if (cudaErr == cudaSuccess) {
+     hipErr = hipStreamQuery(streams[i]);
+     if (hipErr == hipSuccess) {
       done[i] = 1;
       remaining--;
       idle = 0;
       continue;
     }

-     if (cudaErr != cudaErrorNotReady) CUDACHECK(cudaErr);
+     if (hipErr != hipErrorNotReady) HIPCHECK(hipErr);

 #if NCCL_VERSION_CODE >= NCCL_VERSION(2,4,0)
     if (comms) {
@@ -365,9 +367,9 @@ testResult_t startColl(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
  if (args->nGpus > 1) NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < args->nGpus; i++) {
 #ifndef NCCL_MAJOR
-    int cudaDev;
-    NCCLCHECK(ncclCommCuDevice(args->comms[i], &cudaDev));
-    CUDACHECK(cudaSetDevice(cudaDev));
+    int hipDev;
+    NCCLCHECK(ncclCommCuDevice(args->comms[i], &hipDev));
+    HIPCHECK(hipSetDevice(hipDev));
 #endif
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
    char* recvBuff = ((char*)args->recvbuffs[i]) + shift;
@@ -514,7 +516,7 @@ testResult_t threadRunTests(struct threadArgs* args) {
  // will be done on the current GPU (by default : 0) and if the GPUs are in
  // exclusive mode those operations will fail.
  int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus;
-  CUDACHECK(cudaSetDevice(gpuid));
+  HIPCHECK(hipSetDevice(gpuid));
  TESTCHECK(ncclTestEngine.runTest(args, ncclroot, (ncclDataType_t)nccltype, test_typenames[nccltype], (ncclRedOp_t)ncclop, test_opnames[ncclop]));
  return testSuccess;
 }
@@ -531,7 +533,7 @@ testResult_t threadInit(struct threadArgs* args) {
  for (int i=0; i<args->nGpus; i++) {
    int rank = args->proc*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
    int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
-    CUDACHECK(cudaSetDevice(gpuid));
+    HIPCHECK(hipSetDevice(gpuid));
    NCCLCHECK(ncclCommInitRank(args->comms+i, nranks, args->ncclId, rank));
  }
  NCCLCHECK(ncclGroupEnd());
@@ -555,9 +557,9 @@ testResult_t threadLaunch(struct testThread* thread) {
 }

 testResult_t AllocateBuffs(void **sendbuff, size_t sendBytes, void **recvbuff, size_t recvBytes, void **expected, size_t nbytes, int nranks) {
-    CUDACHECK(cudaMalloc(sendbuff, nbytes));
-    CUDACHECK(cudaMalloc(recvbuff, nbytes));
-    CUDACHECK(cudaMalloc(expected, recvBytes));
+    HIPCHECK(hipMalloc(sendbuff, nbytes));
+    HIPCHECK(hipMalloc(recvbuff, nbytes));
+    HIPCHECK(hipMalloc(expected, recvBytes));
    return testSuccess;
 }

@@ -724,12 +726,12 @@ testResult_t run() {
  char line[MAX_LINE];
  int len = 0;
  for (int i=0; i<nThreads*nGpus; i++) {
-    int cudaDev = localRank*nThreads*nGpus+i;
+    int hipDev = localRank*nThreads*nGpus+i;
    int rank = proc*nThreads*nGpus+i;
-    cudaDeviceProp prop;
-    CUDACHECK(cudaGetDeviceProperties(&prop, cudaDev));
+    hipDeviceProp_t prop;
+    HIPCHECK(hipGetDeviceProperties(&prop, hipDev));
    len += snprintf(line+len, MAX_LINE-len, "#   Rank %2d Pid %6d on %10s device %2d [0x%02x] %s\n",
-                    rank, getpid(), hostname, cudaDev, prop.pciBusID, prop.name);
+                    rank, getpid(), hostname, hipDev, prop.pciBusID, prop.name);
  }

 #if MPI_SUPPORT
@@ -752,7 +754,7 @@ testResult_t run() {
 #ifdef MPI_SUPPORT
  MPI_Bcast(&ncclId, sizeof(ncclId), MPI_BYTE, 0, MPI_COMM_WORLD);
 #endif
-  cudaStream_t streams[nGpus*nThreads];
+  hipStream_t streams[nGpus*nThreads];
  void* sendbuffs[nGpus*nThreads];
  void* recvbuffs[nGpus*nThreads];
  void* expected[nGpus*nThreads];
@@ -761,9 +763,9 @@ testResult_t run() {
  ncclTestEngine.getBuffSize(&sendBytes, &recvBytes, (size_t)maxBytes, (size_t)nProcs*nGpus*nThreads);

  for (int i=0; i<nGpus*nThreads; i++) {
-    CUDACHECK(cudaSetDevice(localRank*nThreads*nGpus+i));
+    HIPCHECK(hipSetDevice(localRank*nThreads*nGpus+i));
    AllocateBuffs(sendbuffs+i, sendBytes, recvbuffs+i, recvBytes, expected+i, (size_t)maxBytes, nProcs*nThreads*nGpus);
-    CUDACHECK(cudaStreamCreateWithFlags(streams+i, cudaStreamNonBlocking));
+    HIPCHECK(hipStreamCreateWithFlags(streams+i, hipStreamNonBlocking));
  }

  //if parallel init is not selected, use main thread to initialize NCCL
@@ -776,7 +778,7 @@ testResult_t run() {
     } else {
       NCCLCHECK(ncclGroupStart());
       for (int i=0; i<nGpus*nThreads; i++) {
-         CUDACHECK(cudaSetDevice(localRank*nThreads*nGpus+i));
+         HIPCHECK(hipSetDevice(localRank*nThreads*nGpus+i));
         NCCLCHECK(ncclCommInitRank(comms+i, nProcs*nThreads*nGpus, ncclId, proc*nThreads*nGpus+i));
       }
       NCCLCHECK(ncclGroupEnd());
@@ -786,7 +788,7 @@ testResult_t run() {
  int errors[nThreads];
  double bw[nThreads];
  double* delta;
-  CUDACHECK(cudaHostAlloc(&delta, sizeof(double)*nThreads, cudaHostAllocPortable | cudaHostAllocMapped));
+  HIPCHECK(hipHostMalloc(&delta, sizeof(double)*nThreads, hipHostMallocPortable | hipHostMallocMapped));
  int bw_count[nThreads];
  for (int t=0; t<nThreads; t++) {
    bw[t] = 0.0;
@@ -860,13 +862,13 @@ testResult_t run() {
    free(comms);
  }

-  // Free off CUDA allocated memory
+  // Free off HIP allocated memory
  for (int i=0; i<nGpus*nThreads; i++) {
-    CUDACHECK(cudaFree(sendbuffs[i]));
-    CUDACHECK(cudaFree(recvbuffs[i]));
-    CUDACHECK(cudaFree(expected[i]));
+    HIPCHECK(hipFree(sendbuffs[i]));
+    HIPCHECK(hipFree(recvbuffs[i]));
+    HIPCHECK(hipFree(expected[i]));
  }
-  CUDACHECK(cudaFreeHost(delta));
+  HIPCHECK(hipHostFree(delta));

  char* str = getenv("NCCL_TESTS_MIN_BW");
  double check_avg_bw = str ? atof(str) : -1;
@@ -879,8 +881,8 @@ testResult_t run() {
  MPI_Finalize();
 #endif

-  // 'cuda-memcheck --leak-check full' requires this
-  cudaDeviceReset();
+  // 'hip-memcheck --leak-check full' requires this
+  hipDeviceReset();

  if (errors[0] || bw[0] < check_avg_bw*(0.9))
    exit(EXIT_FAILURE);
@@ -6,24 +6,23 @@
 #ifndef __COMMON_H__
 #define __COMMON_H__

-#include "nccl.h"
+#include "rccl.h"
 #include <stdio.h>
 #include <algorithm>
-#include <curand.h>
 #ifdef MPI_SUPPORT
 #include "mpi.h"
 #endif
 #include <pthread.h>
 #include "nccl1_compat.h"

-#define CUDACHECK(cmd) do {                         \
-  cudaError_t e = cmd;                              \
-  if( e != cudaSuccess ) {                          \
+#define HIPCHECK(cmd) do {                         \
+  hipError_t e = cmd;                              \
+  if( e != hipSuccess ) {                          \
    char hostname[1024];                            \
    getHostName(hostname, 1024);                    \
-    printf("%s: Test CUDA failure %s:%d '%s'\n",    \
+    printf("%s: Test HIP failure %s:%d '%s'\n",    \
         hostname,                                  \
-        __FILE__,__LINE__,cudaGetErrorString(e));   \
+        __FILE__,__LINE__,hipGetErrorString(e));   \
    return testCudaError;                           \
  }                                                 \
 } while(0)
@@ -71,7 +70,7 @@ struct testColl {
      ncclRedOp_t op, int root, int rep, int in_place);
  void (*getBw)(size_t count, int typesize, double sec, double* algBw, double* busBw, int nranks);
  testResult_t (*runColl)(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type,
-      ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream);
+      ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream);
 };
 extern struct testColl allReduceTest;
 extern struct testColl allGatherTest;
@@ -107,7 +106,7 @@ struct threadArgs {
  size_t recvInplaceOffset;
  ncclUniqueId ncclId;
  ncclComm_t* comms;
-  cudaStream_t* streams;
+  hipStream_t* streams;

  void** expected;
  size_t expectedBytes;
@@ -20,28 +20,28 @@ static ncclResult_t ncclGroupEnd() { return ncclSuccess; }
 #define CHECKCOUNT(count) if (count > INT_MAX) return ncclInvalidArgument;

 static ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype,
-    ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
+    ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
  CHECKCOUNT(count);
  return ncclReduce(sendbuff, recvbuff, (int)count, datatype, op, root, comm, stream);
 }
 static ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
-    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream) {
+    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream) {
  CHECKCOUNT(count);
  return ncclAllReduce(sendbuff, recvbuff, (int)count, datatype, op, comm, stream);
 }
 static ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root,
-    ncclComm_t comm, cudaStream_t stream) {
+    ncclComm_t comm, hipStream_t stream) {
  CHECKCOUNT(count);
  return ncclBcast(buff, (int)count, datatype, root, comm, stream);
 }
 static ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff,
    size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm,
-    cudaStream_t stream) {
+    hipStream_t stream) {
  CHECKCOUNT(recvcount);
  return ncclReduceScatter(sendbuff, recvbuff, (int)recvcount, datatype, op, comm, stream);
 }
 static ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
-    ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream) {
+    ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) {
  CHECKCOUNT(sendcount);
  return ncclAllGather(sendbuff, (int)sendcount, datatype, recvbuff, comm, stream);
 }
@@ -4,7 +4,7 @@
 * See LICENSE.txt for license information
 ************************************************************************/

-#include "cuda_runtime.h"
+#include <hip/hip_runtime.h>
 #include "common.h"

 void print_header() {
@@ -34,14 +34,14 @@ testResult_t ReduceInitData(struct threadArgs* args, ncclDataType_t type, ncclRe

  for (int i=0; i<args->nGpus; i++) {
    int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
-    CUDACHECK(cudaSetDevice(gpuid));
+    HIPCHECK(hipSetDevice(gpuid));
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
-    CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
+    HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
    void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
    TESTCHECK(InitData(data, sendcount, type, rep, rank));
-    CUDACHECK(cudaMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, cudaMemcpyDefault));
+    HIPCHECK(hipMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, hipMemcpyDefault));
    if (rank == root) TESTCHECK(InitDataReduce(args->expected[i], recvcount, 0, type, op, rep, nranks));
-    CUDACHECK(cudaDeviceSynchronize());
+    HIPCHECK(hipDeviceSynchronize());
  }
  return testSuccess;
 }
@@ -52,7 +52,7 @@ void ReduceGetBw(size_t count, int typesize, double sec, double* algBw, double*
  *busBw = baseBw;
 }

-testResult_t ReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
+testResult_t ReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
  NCCLCHECK(ncclReduce(sendbuff, recvbuff, count, type, op, root, comm, stream));
  return testSuccess;
 }
@@ -4,7 +4,7 @@
 * See LICENSE.txt for license information
 ************************************************************************/

-#include "cuda_runtime.h"
+#include <hip/hip_runtime.h>
 #include "common.h"

 void print_header() {
@@ -34,14 +34,14 @@ testResult_t ReduceScatterInitData(struct threadArgs* args, ncclDataType_t type,

  for (int i=0; i<args->nGpus; i++) {
    int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
-    CUDACHECK(cudaSetDevice(gpuid));
+    HIPCHECK(hipSetDevice(gpuid));
    int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
-    CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
+    HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
    void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
    TESTCHECK(InitData(data, sendcount, type, rep, rank));
-    CUDACHECK(cudaMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, cudaMemcpyDefault));
+    HIPCHECK(hipMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, hipMemcpyDefault));
    TESTCHECK(InitDataReduce(args->expected[i], recvcount, rank*recvcount, type, op, rep, nranks));
-    CUDACHECK(cudaDeviceSynchronize());
+    HIPCHECK(hipDeviceSynchronize());
  }
  return testSuccess;
 }
@@ -54,7 +54,7 @@ void ReduceScatterGetBw(size_t count, int typesize, double sec, double* algBw, d
  *busBw = baseBw * factor;
 }

-testResult_t ReduceScatterRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
+testResult_t ReduceScatterRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
  NCCLCHECK(ncclReduceScatter(sendbuff, recvbuff, count, type, op, comm, stream));
  return testSuccess;
 }