ae0c4092c7
* Adding CPU based execution, fixing typos, adding Fine-grained mem * Exposing sampling factor when generating range of data sizes * Refactoring how Links are launched, now once per thread * Documentation updates
44 lignes
2.6 KiB
INI
44 lignes
2.6 KiB
INI
# Configfile Format:
|
|
# ==================
|
|
# A Link is defined as a uni-directional transfer from src memory location to dst memory location executed by either CPU or GPU
|
|
# Each single line in the configuration file defines a set of Links to run in parallel
|
|
|
|
# There are two ways to specify the configuration file:
|
|
|
|
# 1) Basic
|
|
# The basic specification assumes the same number of threadblocks/CUs used per GPU-executed Link
|
|
# A positive number of Links is specified followed by that number of triplets describing each Link
|
|
|
|
# #Links #CUs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
|
|
|
|
# 2) Advanced
|
|
# The advanced specification allows different number of threadblocks/CUs used per GPU-executed Link
|
|
# A negative number of links is specified, followed by quadruples describing each Link
|
|
# -#Links (srcMem1->Executor1->dstMem1 #CUs1) ... (srcMemL->ExecutorL->dstMemL #CUsL)
|
|
|
|
# Argument Details:
|
|
# #Links : Number of Links to be run in parallel
|
|
# #CUs : Number of threadblocks/CUs to use for a GPU-executed Link
|
|
# srcMemL : Source memory location (Where the data is to be read from). Ignored in memset mode
|
|
# Executor: Executor are specified by a character indicating executor type, followed by device index (0-indexed)
|
|
# - C: CPU-executed (Indexed from 0 to 1)
|
|
# - G: GPU-executed (Indexed from 0 to 3)
|
|
# dstMemL : Destination memory location (Where the data is to be written to)
|
|
|
|
# Memory locations are specified by a character indicating memory type, followed by device index (0-indexed)
|
|
# Supported memory locations are:
|
|
# - C: Pinned host memory (on NUMA node, indexed from 0 to 1)
|
|
# - G: Global device memory (on GPU device indexed from 0 to 3)
|
|
# - F: Fine-grain device memory (on GPU device indexed from 0 to 3)
|
|
|
|
# Examples:
|
|
# 1 4 (G0->G0->G1) Single Link that uses 4 CUs on GPU 0 that reads memory from GPU 0 and copies it to memory on GPU 1
|
|
# 1 4 (G1->C0->G0) Single Link that uses 4 CUs on GPU 0 that reads memory from CPU 1 and copies it to memory on GPU 0
|
|
# 1 4 (C0->G2->G2) Single Link that uses 4 CUs on GPU 2 that reads memory from CPU 0 and copies it to memory on GPU 2
|
|
# 2 4 G0->G0->G1 G1->G1->G0 Runs 2 Links in parallel. GPU 0 - > GPU1, and GP1 -> GPU 0, each with 4 CUs
|
|
# -2 (G0 G0 G1 4) (G1 G1 G0 2) Runs 2 Links in parallel. GPU 0 - > GPU 1 using four CUs, and GPU1 -> GPU 0 using two CUs
|
|
# Round brackets and arrows' ->' may be included for human clarity, but will be ignored and are unnecessary
|
|
|
|
# Single GPU-executed link between GPUs 0 and 1 using 4 CUs
|
|
1 4 (G0->G0->G1)
|