hipify nccl-tests to become rccl-tests

[ROCm/rccl-tests commit: a15f771cb2]
Tento commit je obsažen v:
Wenkai Du
2019-04-09 15:51:40 -07:00
rodič 18902f40a7
revize 3c8cfb2d6e
11 změnil soubory, kde provedl 143 přidání a 151 odebrání
+11 -11
Zobrazit soubor
@@ -1,26 +1,26 @@
# NCCL Tests
# RCCL Tests
These tests check both the performance and the correctness of NCCL operations. They can be compiled against [NCCL](http://github.com/nvidia/nccl)
These tests check both the performance and the correctness of RCCL operations. They can be compiled against [RCCL](https://github.com/ROCmSoftwarePlatform/rccl)
## Build
To build the tests, just type `make`.
If CUDA is not installed in /usr/local/cuda, you may specify CUDA\_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL\_HOME.
If HIP is not installed in /opt/rocm, you may specify HIP\_HOME. Similarly, if RCCL is not installed in /usr, you may specify RCCL\_HOME.
```shell
$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
$ make HIP_HOME=/path/to/hip RCCL_HOME=/path/to/rccl
```
NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.
RCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.
```shell
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
$ make MPI=1 MPI_HOME=/path/to/mpi HIP_HOME=/path/to/hip RCCL_HOME=/path/to/rccl
```
## Usage
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).
RCCL tests can run on multiple processes, multiple threads, and multiple HIP devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=HIP devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).
### Quick examples
@@ -51,7 +51,7 @@ All tests support the same set of arguments :
* Increments can be either fixed or a multiplication factor. Only one of those should be used
* `-i,--stepbytes <increment size>` fixed increment between sizes. Default : (max-min)/10.
* `-f,--stepfactor <increment factor>` multiplication factor between sizes. Default : disabled.
* NCCL operations arguments
* RCCL operations arguments
* `-o,--op <sum/prod/min/max/all>` Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum.
* `-d,--datatype <nccltype/all>` Specify which datatype to use. Default : Float.
* `-r,--root <root/all>` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
@@ -60,11 +60,11 @@ All tests support the same set of arguments :
* `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 5.
* `-m,--agg_iters <aggregation count>` number of operations to aggregate together in each iteration. Default : 1.
* Test operation
* `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel. Default : 0.
* `-p,--parallel_init <0/1>` use threads to initialize RCCL in parallel. Default : 0.
* `-c,--check <0/1>` check correctness of results. This can be quite slow on large numbers of GPUs. Default : 1.
* `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
* `-z,--blocking <0/1>` Make RCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
## Copyright
NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
RCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
+5 -5
Zobrazit soubor
@@ -1,6 +1,6 @@
# Performance reported by NCCL tests
# Performance reported by RCCL tests
NCCL tests report the average operation time in ms, and two bandwidths in GB/s : algorithm bandwidth and bus bandwidth. This page explains what those numbers mean and what you should expect depending on the hardware used.
RCCL tests report the average operation time in ms, and two bandwidths in GB/s : algorithm bandwidth and bus bandwidth. This page explains what those numbers mean and what you should expect depending on the hardware used.
# Time
@@ -24,7 +24,7 @@ Algorithm bandwidth is using the most commonly used formula for bandwidth : size
While the algorithm bandwidth makes sense for point-to-point operations like Send/Receive, it is not always helpful to measure collective operations speed, since the theoretical peak algorithm bandwidth is not equal to the hardware peak bandwidth, usually depending on the number of ranks.
Most benchmarks only provide time measurements, which is hard to interpret for large sizes. Some others also provide algorithms bandwidth, but see that depending on the number of ranks, that bandwidth varies (and decreases as the number of ranks increase).
To provide a number which reflects how optimally the hardware is used, NCCL tests introduce the notion of "Bus Bandwidth" ("busbw" column in the tests output).
To provide a number which reflects how optimally the hardware is used, RCCL tests introduce the notion of "Bus Bandwidth" ("busbw" column in the tests output).
This number is obtained applying a formula to the algorithm bandwidth to reflect the speed of the inter-GPU communication.
Using this bus bandwidth, we can compare it with the hardware peak bandwidth, independently of the number of ranks used.
@@ -78,7 +78,7 @@ And the Bus Bandwidth is therefore computed as :
`B = S/t * (n-1)/n = algbw * (n-1)/n`
Note that here, S is the size in bytes of the total array, which for NCCL is equal to `recvcount*sizeof(datatype)*n` as the `recvcount` argument is the count per rank.
Note that here, S is the size in bytes of the total array, which for RCCL is equal to `recvcount*sizeof(datatype)*n` as the `recvcount` argument is the count per rank.
### AllGather
@@ -96,7 +96,7 @@ And the Bus Bandwidth is therefore computed as :
`B = S/t * (n-1)/n = algbw * (n-1)/n`
Note that here, S is the size in bytes of the total array, which for NCCL is equal to `sendcount*sizeof(datatype)*n` as the `sendcount` argument is the count per rank.
Note that here, S is the size in bytes of the total array, which for RCCL is equal to `sendcount*sizeof(datatype)*n` as the `sendcount` argument is the count per rank.
### Broadcast
+24 -33
Zobrazit soubor
@@ -4,41 +4,30 @@
# See LICENSE.txt for license information
#
CUDA_HOME ?= /usr/local/cuda
ROCM_HOME ?= /opt/rocm
MPI_HOME ?= /usr/lib/openmpi
PREFIX ?= /usr/local
VERBOSE ?= 0
DEBUG ?= 0
CUDA_LIB ?= $(CUDA_HOME)/lib64
CUDA_INC ?= $(CUDA_HOME)/include
NVCC = $(CUDA_HOME)/bin/nvcc
HIPCC = $(ROCM_HOME)/hip/bin/hipcc
CXX = $(HIPCC)
# Better define NVCC_GENCODE in your environment to the minimal set
# of archs to reduce compile time.
NVCC_GENCODE ?= -gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_35,code=sm_35 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=compute_70 \
-gencode=arch=compute_70,code=sm_70
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
LDFLAGS := -L${CUDA_LIB} -lcudart -lrt
NVLDFLAGS := -L${CUDA_LIB} -lcudart -lrt
HIPCUFLAGS :=
HIPCUFLAGS += -I$(ROCM_HOME)/include
HIPCUFLAGS += -I$(ROCM_HOME)/include/rccl
HIPCUFLAGS += -I$(ROCM_HOME)/hip/include/hip
HIPCUFLAGS += -I$(ROCM_HOME)/hiprand/include
LDFLAGS := -L$(ROCM_HOME)/lib -lhsa-runtime64 -lrt
HIPLDFLAGS := -L$(ROCM_HOME)/lib -lhsa-runtime64 -lrt
ifeq ($(DEBUG), 0)
NVCUFLAGS += -O3 -g
CXXFLAGS += -O3 -g
HIPCUFLAGS += -O3
else
NVCUFLAGS += -O0 -G -g
CXXFLAGS += -O0 -g -ggdb3
HIPCUFLAGS += -O0 -g -ggdb3
endif
ifneq ($(VERBOSE), 0)
NVCUFLAGS += -Xcompiler -Wall,-Wextra,-Wno-unused-parameter
else
ifeq ($(VERBOSE), 0)
.SILENT:
endif
@@ -46,16 +35,16 @@ endif
BUILDDIR ?= ../build
ifneq ($(NCCL_HOME), "")
NVCUFLAGS += -I$(NCCL_HOME)/include/
NVLDFLAGS += -L$(NCCL_HOME)/lib
HIPCUFLAGS += -I$(NCCL_HOME)/include/
HIPLDFLAGS += -L$(NCCL_HOME)/lib
endif
ifeq ($(MPI), 1)
NVCUFLAGS += -DMPI_SUPPORT -I$(MPI_HOME)/include
NVLDFLAGS += -L$(MPI_HOME)/lib -lmpi
HIPCUFLAGS += -DMPI_SUPPORT -I${MPI_HOME}/include
HIPLDFLAGS += -L${MPI_HOME}/lib -lmpi
endif
LIBRARIES += curand nccl nvToolsExt
NVLDFLAGS += $(LIBRARIES:%=-l%)
LIBRARIES += rccl
HIPLDFLAGS += $(LIBRARIES:%=-l%)
DST_DIR := $(BUILDDIR)
SRC_FILES := $(wildcard *.cu)
@@ -71,10 +60,12 @@ clean:
${DST_DIR}/%.o: %.cu common.h
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(NVCC) -o $@ $(NVCUFLAGS) -c $<
echo "$(HIPCC) -o $@ $(HIPCUFLAGS) -c $<"
$(HIPCC) -o $@ $(HIPCUFLAGS) -c $<
${DST_DIR}/%_perf:${DST_DIR}/%.o ${DST_DIR}/common.o
@printf "Linking %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(NVCC) -o $@ $(NVCUFLAGS) $^ ${NVLDFLAGS}
echo "$(HIPCC) -o $@ $(HIPCUFLAGS) $^ ${HIPLDFLAGS}"
$(HIPCC) -o $@ $(HIPCUFLAGS) $^ ${HIPLDFLAGS}
+5 -5
Zobrazit soubor
@@ -4,7 +4,7 @@
* See LICENSE.txt for license information
************************************************************************/
#include "cuda_runtime.h"
#include <hip/hip_runtime.h>
#include "common.h"
void print_header() {
@@ -34,15 +34,15 @@ testResult_t AllGatherInitData(struct threadArgs* args, ncclDataType_t type, ncc
for (int i=0; i<args->nGpus; i++) {
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
void* data = in_place ? ((char*)args->recvbuffs[i])+rank*args->sendBytes : args->sendbuffs[i];
TESTCHECK(InitData(data, sendcount, type, rep, rank));
for (int j=0; j<nranks; j++) {
TESTCHECK(InitData(((char*)args->expected[i])+args->sendBytes*j, sendcount, type, rep, j));
}
CUDACHECK(cudaDeviceSynchronize());
HIPCHECK(hipDeviceSynchronize());
}
return testSuccess;
}
@@ -55,7 +55,7 @@ void AllGatherGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t AllGatherRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
testResult_t AllGatherRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
NCCLCHECK(ncclAllGather(sendbuff, recvbuff, count, type, comm, stream));
return testSuccess;
}
+5 -5
Zobrazit soubor
@@ -4,7 +4,7 @@
* See LICENSE.txt for license information
************************************************************************/
#include "cuda_runtime.h"
#include <hip/hip_runtime.h>
#include "common.h"
void print_header() {
@@ -34,13 +34,13 @@ testResult_t AllReduceInitData(struct threadArgs* args, ncclDataType_t type, ncc
for (int i=0; i<args->nGpus; i++) {
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
TESTCHECK(InitData(data, sendcount, type, rep, rank));
TESTCHECK(InitDataReduce(args->expected[i], recvcount, 0, type, op, rep, nranks));
CUDACHECK(cudaDeviceSynchronize());
HIPCHECK(hipDeviceSynchronize());
}
return testSuccess;
}
@@ -53,7 +53,7 @@ void AllReduceGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t AllReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
testResult_t AllReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, type, op, comm, stream));
return testSuccess;
}
+5 -5
Zobrazit soubor
@@ -4,7 +4,7 @@
* See LICENSE.txt for license information
************************************************************************/
#include "cuda_runtime.h"
#include <hip/hip_runtime.h>
#include "common.h"
void print_header() {
@@ -33,13 +33,13 @@ testResult_t BroadcastInitData(struct threadArgs* args, ncclDataType_t type, ncc
for (int i=0; i<args->nGpus; i++) {
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
if (rank == root) TESTCHECK(InitData(data, sendcount, type, rep, rank));
TESTCHECK(InitData(args->expected[i], recvcount, type, rep, root));
CUDACHECK(cudaDeviceSynchronize());
HIPCHECK(hipDeviceSynchronize());
}
return testSuccess;
}
@@ -52,7 +52,7 @@ void BroadcastGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t BroadcastRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
testResult_t BroadcastRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
#if NCCL_MAJOR >= 2 && NCCL_MINOR >= 2
+63 -61
Zobrazit soubor
@@ -1,3 +1,4 @@
#include "hip/hip_runtime.h"
/*************************************************************************
* Copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
*
@@ -9,7 +10,6 @@
#include <cstdio>
#include <getopt.h>
#include <libgen.h>
#include "cuda.h"
#if NCCL_MAJOR >= 2
ncclDataType_t test_types[ncclNumTypes] = {ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64, ncclHalf, ncclFloat, ncclDouble};
@@ -129,27 +129,27 @@ void deltaKern(void* A_, void* B_, size_t count, double* max) {
testResult_t CheckDelta(void* expected, void* results, size_t count, ncclDataType_t type, double* devmax) {
switch (type) {
case ncclHalf:
deltaKern<half, 512><<<1, 512>>>(results, expected, count, devmax); break;
hipLaunchKernelGGL((deltaKern<half, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
case ncclFloat:
deltaKern<float, 512><<<1, 512>>>(results, expected, count, devmax); break;
hipLaunchKernelGGL((deltaKern<float, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
case ncclDouble:
deltaKern<double, 512><<<1, 512>>>(results, expected, count, devmax); break;
hipLaunchKernelGGL((deltaKern<double, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
case ncclChar:
#if NCCL_MAJOR >= 2
case ncclUint8:
#endif
deltaKern<uint8_t, 512><<<1, 512>>>(results, expected, count, devmax); break;
hipLaunchKernelGGL((deltaKern<uint8_t, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
case ncclInt:
#if NCCL_MAJOR >= 2
case ncclUint32:
#endif
deltaKern<uint32_t, 512><<<1, 512>>>(results, expected, count, devmax); break;
hipLaunchKernelGGL((deltaKern<uint32_t, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
case ncclInt64:
case ncclUint64:
deltaKern<uint64_t, 512><<<1, 512>>>(results, expected, count, devmax); break;
hipLaunchKernelGGL((deltaKern<uint64_t, 512>), dim3(1), dim3(512), 0, 0, results, expected, count, devmax); break;
}
CUDACHECK(cudaDeviceSynchronize());
HIPCHECK(hipDeviceSynchronize());
return testSuccess;
}
@@ -196,61 +196,63 @@ template<>
__device__ half ncclOpMin(half a, half b) { return __half2float(a)<__half2float(b) ? a : b; }
template<typename T, T (*Op)(T, T)>
__global__ void InitDataReduceKernel(T* data, const size_t N, const size_t offset, const int rep, const int nranks) {
__global__ void InitDataReduceKernel(void* data, const size_t N, const size_t offset, const int rep, const int nranks) {
for (size_t o=blockIdx.x*blockDim.x+threadIdx.x; o<N; o+=gridDim.x*blockDim.x) {
T val = testValue<T>(o+offset, rep, 0);
for (int i=1; i<nranks; i++) {
val = Op(val, testValue<T>(o+offset, rep, i));
}
data[o] = val;
((T*)data)[o] = val;
}
}
#define KERN(type, op) (void*)InitDataReduceKernel<type, op<type>>
typedef void(*redInitKern_t)(void* data, const size_t N, const size_t offset, const int rep, const int nranks);
#define KERN(type, op) InitDataReduceKernel<type, op<type>>
#define OPS(type) KERN(type, ncclOpSum), KERN(type, ncclOpProd), KERN(type, ncclOpMax), KERN(type, ncclOpMin)
static void* const redInitDataKerns[ncclNumOps*ncclNumTypes] = {
static redInitKern_t const redInitDataKerns[ncclNumOps*ncclNumTypes] = {
OPS(int8_t), OPS(uint8_t), OPS(int32_t), OPS(uint32_t), OPS(int64_t), OPS(uint64_t), OPS(half), OPS(float), OPS(double)
};
testResult_t InitDataReduce(void* data, const size_t count, const size_t offset, ncclDataType_t type, ncclRedOp_t op, const int rep, const int nranks) {
dim3 grid = { 32, 1, 1 };
dim3 block = { 256, 1, 1 };
void* args[5] = { (void*)&data, (void*)&count, (void*)&offset, (void*)&rep, (void*)&nranks };
CUDACHECK(cudaLaunchKernel(redInitDataKerns[type*ncclNumOps+op], grid, block, args, 0, cudaStreamDefault));
hipLaunchKernelGGL((redInitDataKerns[type*ncclNumOps+op]), grid, block, 0, 0, data, count, offset, rep, nranks);
return testSuccess;
}
template<typename T>
__global__ void InitDataKernel(T* data, const size_t N, const int rep, const int rank) {
__global__ void InitDataKernel(void* data, const size_t N, const int rep, const int rank) {
for (size_t o=blockIdx.x*blockDim.x+threadIdx.x; o<N; o+=gridDim.x*blockDim.x)
data[o] = testValue<T>(o, rep, rank);
((T*)data)[o] = testValue<T>(o, rep, rank);
}
static void* const initDataKerns[ncclNumTypes] = {
(void*)InitDataKernel< int8_t>,
(void*)InitDataKernel< uint8_t>,
(void*)InitDataKernel< int32_t>,
(void*)InitDataKernel<uint32_t>,
(void*)InitDataKernel< int64_t>,
(void*)InitDataKernel<uint64_t>,
(void*)InitDataKernel< half>,
(void*)InitDataKernel< float>,
(void*)InitDataKernel< double>
typedef void(*initDataKern_t)(void* data, const size_t N, const int rep, const int rank);
static initDataKern_t const initDataKerns[ncclNumTypes] = {
InitDataKernel< int8_t>,
InitDataKernel< uint8_t>,
InitDataKernel< int32_t>,
InitDataKernel<uint32_t>,
InitDataKernel< int64_t>,
InitDataKernel<uint64_t>,
InitDataKernel< half>,
InitDataKernel< float>,
InitDataKernel< double>
};
template<typename T>
testResult_t InitDataType(void* dest, const size_t N, const int rep, const int rank) {
T* ptr = (T*)dest;
InitDataKernel<<<16, 512>>>(ptr, N, rep, rank);
hipLaunchKernelGGL((InitDataKernel), dim3(16), dim3(512), 0, 0, ptr, N, rep, rank);
return testSuccess;
}
testResult_t InitData(void* data, const size_t count, ncclDataType_t type, const int rep, const int rank) {
dim3 grid = { 32, 1, 1 };
dim3 block = { 256, 1, 1 };
void* args[4] = { (void*)&data, (void*)&count, (void*)&rep, (void*)&rank };
CUDACHECK(cudaLaunchKernel(initDataKerns[type], grid, block, args, 0, cudaStreamDefault));
hipLaunchKernelGGL((initDataKerns[type]), grid, block, 0, 0, data, count, rep, rank);
return testSuccess;
}
@@ -279,7 +281,7 @@ testResult_t CheckData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
int device;
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
NCCLCHECK(ncclCommCuDevice(args->comms[i], &device));
CUDACHECK(cudaSetDevice(device));
HIPCHECK(hipSetDevice(device));
void *data = in_place ? ((void *)((uintptr_t)args->recvbuffs[i] + args->recvInplaceOffset*rank)) : args->recvbuffs[i];
TESTCHECK(CheckDelta(data , args->expected[i], count, type, args->delta));
maxDelta = std::max(*(args->deltaHost), maxDelta);
@@ -289,14 +291,14 @@ testResult_t CheckData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
int *expectedHost = (int *)malloc(args->expectedBytes);
int *dataHost = (int *)malloc(args->expectedBytes);
cudaMemcpy(expectedHost, args->expected[0], args->expectedBytes, cudaMemcpyDeviceToHost);
hipMemcpy(expectedHost, args->expected[0], args->expectedBytes, hipMemcpyDeviceToHost);
printf("\n Expected: ");
for(int j=0; j<args->expectedBytes/sizeof(int); j++) {
printf("%d:%d ", j, expectedHost[j]);
}
printf("\n");
cudaMemcpy(dataHost, data, args->expectedBytes, cudaMemcpyDeviceToHost);
hipMemcpy(dataHost, data, args->expectedBytes, hipMemcpyDeviceToHost);
printf("\n Actual: ");
for (int j=0; j<args->expectedBytes/sizeof(int); j++) {
printf("%d:%d ", j, dataHost[j]);
@@ -312,8 +314,8 @@ testResult_t CheckData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
return testSuccess;
}
testResult_t testStreamSynchronize(int ngpus, cudaStream_t* streams, ncclComm_t* comms) {
cudaError_t cudaErr;
testResult_t testStreamSynchronize(int ngpus, hipStream_t* streams, ncclComm_t* comms) {
hipError_t hipErr;
int remaining = ngpus;
int* done = (int*)malloc(sizeof(int)*ngpus);
memset(done, 0, sizeof(int)*ngpus);
@@ -322,15 +324,15 @@ testResult_t testStreamSynchronize(int ngpus, cudaStream_t* streams, ncclComm_t*
for (int i=0; i<ngpus; i++) {
if (done[i]) continue;
cudaErr = cudaStreamQuery(streams[i]);
if (cudaErr == cudaSuccess) {
hipErr = hipStreamQuery(streams[i]);
if (hipErr == hipSuccess) {
done[i] = 1;
remaining--;
idle = 0;
continue;
}
if (cudaErr != cudaErrorNotReady) CUDACHECK(cudaErr);
if (hipErr != hipErrorNotReady) HIPCHECK(hipErr);
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,4,0)
if (comms) {
@@ -365,9 +367,9 @@ testResult_t startColl(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
if (args->nGpus > 1) NCCLCHECK(ncclGroupStart());
for (int i = 0; i < args->nGpus; i++) {
#ifndef NCCL_MAJOR
int cudaDev;
NCCLCHECK(ncclCommCuDevice(args->comms[i], &cudaDev));
CUDACHECK(cudaSetDevice(cudaDev));
int hipDev;
NCCLCHECK(ncclCommCuDevice(args->comms[i], &hipDev));
HIPCHECK(hipSetDevice(hipDev));
#endif
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
char* recvBuff = ((char*)args->recvbuffs[i]) + shift;
@@ -514,7 +516,7 @@ testResult_t threadRunTests(struct threadArgs* args) {
// will be done on the current GPU (by default : 0) and if the GPUs are in
// exclusive mode those operations will fail.
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
TESTCHECK(ncclTestEngine.runTest(args, ncclroot, (ncclDataType_t)nccltype, test_typenames[nccltype], (ncclRedOp_t)ncclop, test_opnames[ncclop]));
return testSuccess;
}
@@ -531,7 +533,7 @@ testResult_t threadInit(struct threadArgs* args) {
for (int i=0; i<args->nGpus; i++) {
int rank = args->proc*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
NCCLCHECK(ncclCommInitRank(args->comms+i, nranks, args->ncclId, rank));
}
NCCLCHECK(ncclGroupEnd());
@@ -555,9 +557,9 @@ testResult_t threadLaunch(struct testThread* thread) {
}
testResult_t AllocateBuffs(void **sendbuff, size_t sendBytes, void **recvbuff, size_t recvBytes, void **expected, size_t nbytes, int nranks) {
CUDACHECK(cudaMalloc(sendbuff, nbytes));
CUDACHECK(cudaMalloc(recvbuff, nbytes));
CUDACHECK(cudaMalloc(expected, recvBytes));
HIPCHECK(hipMalloc(sendbuff, nbytes));
HIPCHECK(hipMalloc(recvbuff, nbytes));
HIPCHECK(hipMalloc(expected, recvBytes));
return testSuccess;
}
@@ -724,12 +726,12 @@ testResult_t run() {
char line[MAX_LINE];
int len = 0;
for (int i=0; i<nThreads*nGpus; i++) {
int cudaDev = localRank*nThreads*nGpus+i;
int hipDev = localRank*nThreads*nGpus+i;
int rank = proc*nThreads*nGpus+i;
cudaDeviceProp prop;
CUDACHECK(cudaGetDeviceProperties(&prop, cudaDev));
hipDeviceProp_t prop;
HIPCHECK(hipGetDeviceProperties(&prop, hipDev));
len += snprintf(line+len, MAX_LINE-len, "# Rank %2d Pid %6d on %10s device %2d [0x%02x] %s\n",
rank, getpid(), hostname, cudaDev, prop.pciBusID, prop.name);
rank, getpid(), hostname, hipDev, prop.pciBusID, prop.name);
}
#if MPI_SUPPORT
@@ -752,7 +754,7 @@ testResult_t run() {
#ifdef MPI_SUPPORT
MPI_Bcast(&ncclId, sizeof(ncclId), MPI_BYTE, 0, MPI_COMM_WORLD);
#endif
cudaStream_t streams[nGpus*nThreads];
hipStream_t streams[nGpus*nThreads];
void* sendbuffs[nGpus*nThreads];
void* recvbuffs[nGpus*nThreads];
void* expected[nGpus*nThreads];
@@ -761,9 +763,9 @@ testResult_t run() {
ncclTestEngine.getBuffSize(&sendBytes, &recvBytes, (size_t)maxBytes, (size_t)nProcs*nGpus*nThreads);
for (int i=0; i<nGpus*nThreads; i++) {
CUDACHECK(cudaSetDevice(localRank*nThreads*nGpus+i));
HIPCHECK(hipSetDevice(localRank*nThreads*nGpus+i));
AllocateBuffs(sendbuffs+i, sendBytes, recvbuffs+i, recvBytes, expected+i, (size_t)maxBytes, nProcs*nThreads*nGpus);
CUDACHECK(cudaStreamCreateWithFlags(streams+i, cudaStreamNonBlocking));
HIPCHECK(hipStreamCreateWithFlags(streams+i, hipStreamNonBlocking));
}
//if parallel init is not selected, use main thread to initialize NCCL
@@ -776,7 +778,7 @@ testResult_t run() {
} else {
NCCLCHECK(ncclGroupStart());
for (int i=0; i<nGpus*nThreads; i++) {
CUDACHECK(cudaSetDevice(localRank*nThreads*nGpus+i));
HIPCHECK(hipSetDevice(localRank*nThreads*nGpus+i));
NCCLCHECK(ncclCommInitRank(comms+i, nProcs*nThreads*nGpus, ncclId, proc*nThreads*nGpus+i));
}
NCCLCHECK(ncclGroupEnd());
@@ -786,7 +788,7 @@ testResult_t run() {
int errors[nThreads];
double bw[nThreads];
double* delta;
CUDACHECK(cudaHostAlloc(&delta, sizeof(double)*nThreads, cudaHostAllocPortable | cudaHostAllocMapped));
HIPCHECK(hipHostMalloc(&delta, sizeof(double)*nThreads, hipHostMallocPortable | hipHostMallocMapped));
int bw_count[nThreads];
for (int t=0; t<nThreads; t++) {
bw[t] = 0.0;
@@ -860,13 +862,13 @@ testResult_t run() {
free(comms);
}
// Free off CUDA allocated memory
// Free off HIP allocated memory
for (int i=0; i<nGpus*nThreads; i++) {
CUDACHECK(cudaFree(sendbuffs[i]));
CUDACHECK(cudaFree(recvbuffs[i]));
CUDACHECK(cudaFree(expected[i]));
HIPCHECK(hipFree(sendbuffs[i]));
HIPCHECK(hipFree(recvbuffs[i]));
HIPCHECK(hipFree(expected[i]));
}
CUDACHECK(cudaFreeHost(delta));
HIPCHECK(hipHostFree(delta));
char* str = getenv("NCCL_TESTS_MIN_BW");
double check_avg_bw = str ? atof(str) : -1;
@@ -879,8 +881,8 @@ testResult_t run() {
MPI_Finalize();
#endif
// 'cuda-memcheck --leak-check full' requires this
cudaDeviceReset();
// 'hip-memcheck --leak-check full' requires this
hipDeviceReset();
if (errors[0] || bw[0] < check_avg_bw*(0.9))
exit(EXIT_FAILURE);
+8 -9
Zobrazit soubor
@@ -6,24 +6,23 @@
#ifndef __COMMON_H__
#define __COMMON_H__
#include "nccl.h"
#include "rccl.h"
#include <stdio.h>
#include <algorithm>
#include <curand.h>
#ifdef MPI_SUPPORT
#include "mpi.h"
#endif
#include <pthread.h>
#include "nccl1_compat.h"
#define CUDACHECK(cmd) do { \
cudaError_t e = cmd; \
if( e != cudaSuccess ) { \
#define HIPCHECK(cmd) do { \
hipError_t e = cmd; \
if( e != hipSuccess ) { \
char hostname[1024]; \
getHostName(hostname, 1024); \
printf("%s: Test CUDA failure %s:%d '%s'\n", \
printf("%s: Test HIP failure %s:%d '%s'\n", \
hostname, \
__FILE__,__LINE__,cudaGetErrorString(e)); \
__FILE__,__LINE__,hipGetErrorString(e)); \
return testCudaError; \
} \
} while(0)
@@ -71,7 +70,7 @@ struct testColl {
ncclRedOp_t op, int root, int rep, int in_place);
void (*getBw)(size_t count, int typesize, double sec, double* algBw, double* busBw, int nranks);
testResult_t (*runColl)(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type,
ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream);
ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream);
};
extern struct testColl allReduceTest;
extern struct testColl allGatherTest;
@@ -107,7 +106,7 @@ struct threadArgs {
size_t recvInplaceOffset;
ncclUniqueId ncclId;
ncclComm_t* comms;
cudaStream_t* streams;
hipStream_t* streams;
void** expected;
size_t expectedBytes;
+5 -5
Zobrazit soubor
@@ -20,28 +20,28 @@ static ncclResult_t ncclGroupEnd() { return ncclSuccess; }
#define CHECKCOUNT(count) if (count > INT_MAX) return ncclInvalidArgument;
static ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype,
ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
CHECKCOUNT(count);
return ncclReduce(sendbuff, recvbuff, (int)count, datatype, op, root, comm, stream);
}
static ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream) {
ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream) {
CHECKCOUNT(count);
return ncclAllReduce(sendbuff, recvbuff, (int)count, datatype, op, comm, stream);
}
static ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root,
ncclComm_t comm, cudaStream_t stream) {
ncclComm_t comm, hipStream_t stream) {
CHECKCOUNT(count);
return ncclBcast(buff, (int)count, datatype, root, comm, stream);
}
static ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff,
size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm,
cudaStream_t stream) {
hipStream_t stream) {
CHECKCOUNT(recvcount);
return ncclReduceScatter(sendbuff, recvbuff, (int)recvcount, datatype, op, comm, stream);
}
static ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream) {
ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) {
CHECKCOUNT(sendcount);
return ncclAllGather(sendbuff, (int)sendcount, datatype, recvbuff, comm, stream);
}
+6 -6
Zobrazit soubor
@@ -4,7 +4,7 @@
* See LICENSE.txt for license information
************************************************************************/
#include "cuda_runtime.h"
#include <hip/hip_runtime.h>
#include "common.h"
void print_header() {
@@ -34,14 +34,14 @@ testResult_t ReduceInitData(struct threadArgs* args, ncclDataType_t type, ncclRe
for (int i=0; i<args->nGpus; i++) {
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
TESTCHECK(InitData(data, sendcount, type, rep, rank));
CUDACHECK(cudaMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, cudaMemcpyDefault));
HIPCHECK(hipMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, hipMemcpyDefault));
if (rank == root) TESTCHECK(InitDataReduce(args->expected[i], recvcount, 0, type, op, rep, nranks));
CUDACHECK(cudaDeviceSynchronize());
HIPCHECK(hipDeviceSynchronize());
}
return testSuccess;
}
@@ -52,7 +52,7 @@ void ReduceGetBw(size_t count, int typesize, double sec, double* algBw, double*
*busBw = baseBw;
}
testResult_t ReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
testResult_t ReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
NCCLCHECK(ncclReduce(sendbuff, recvbuff, count, type, op, root, comm, stream));
return testSuccess;
}
+6 -6
Zobrazit soubor
@@ -4,7 +4,7 @@
* See LICENSE.txt for license information
************************************************************************/
#include "cuda_runtime.h"
#include <hip/hip_runtime.h>
#include "common.h"
void print_header() {
@@ -34,14 +34,14 @@ testResult_t ReduceScatterInitData(struct threadArgs* args, ncclDataType_t type,
for (int i=0; i<args->nGpus; i++) {
int gpuid = args->localRank*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(gpuid));
HIPCHECK(hipSetDevice(gpuid));
int rank = ((args->proc*args->nThreads + args->thread)*args->nGpus + i);
CUDACHECK(cudaMemset(args->recvbuffs[i], 0, args->expectedBytes));
HIPCHECK(hipMemset(args->recvbuffs[i], 0, args->expectedBytes));
void* data = in_place ? args->recvbuffs[i] : args->sendbuffs[i];
TESTCHECK(InitData(data, sendcount, type, rep, rank));
CUDACHECK(cudaMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, cudaMemcpyDefault));
HIPCHECK(hipMemcpy(args->expected[i], args->recvbuffs[i], args->expectedBytes, hipMemcpyDefault));
TESTCHECK(InitDataReduce(args->expected[i], recvcount, rank*recvcount, type, op, rep, nranks));
CUDACHECK(cudaDeviceSynchronize());
HIPCHECK(hipDeviceSynchronize());
}
return testSuccess;
}
@@ -54,7 +54,7 @@ void ReduceScatterGetBw(size_t count, int typesize, double sec, double* algBw, d
*busBw = baseBw * factor;
}
testResult_t ReduceScatterRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
testResult_t ReduceScatterRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream) {
NCCLCHECK(ncclReduceScatter(sendbuff, recvbuff, count, type, op, comm, stream));
return testSuccess;
}