Add Dockerfile to build rccl and rccl-tests (#1011)

* [BUILD] Add Dockerfile for RCCL and RCCL-Tests

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Update docker/Dockerfile.ubuntu

Typo for LD_LIBRARY_PATH

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update docker/Dockerfile.ubuntu

use `-b` for `git clone` instead of additional `git checkout`

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update docker/Dockerfile.ubuntu

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
This commit is contained in:
Nilesh M Negi
2024-09-22 03:53:16 -05:00
Committed by GitHub
Parent 2fe1e9f7db
Commit 707377b3cd
2 files changed, with 143 additions and 3 deletions
+34 -3
@@ -96,11 +96,42 @@ $ make package
$ sudo dpkg -i *.deb
```
RCCL package install requires sudo/root access because it creates a directory called "rccl" under /opt/rocm/. This is an optional step and RCCL can be used directly by including the path containing librccl.so.
RCCL package install requires sudo/root access because it installs under `/opt/rocm/`. This is an optional step as RCCL can instead be used directly by including the path containing `librccl.so`.
## Docker build
Assuming you have Docker installed on your system:
#### To build the Docker image:
By default, the given Dockerfile uses `docker.io/rocm/dev-ubuntu-22.04:latest` as the base Docker image, and then installs RCCL (develop branch) and RCCL-Tests (develop branch).
```shell
$ docker build -t rccl-tests -f Dockerfile.ubuntu --pull .
```
The base Docker image, rccl repo, and rccl-tests repo can be changed by passing `--build-arg` options to the `docker build` command above. For example, to use a different base Docker image:
```shell
$ docker build -t rccl-tests -f Dockerfile.ubuntu --build-arg="ROCM_IMAGE_NAME=rocm/dev-ubuntu-20.04" --build-arg="ROCM_IMAGE_TAG=6.2" --pull .
```
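The two build arguments shown above can be kept in shell variables so that the image tag follows the chosen base image. This is only a sketch: the variable values and the `rccl-tests:<tag>` naming scheme are assumptions, and the command is printed for review rather than executed.

```shell
# Assumed values; substitute the base image you actually want.
ROCM_IMAGE_NAME="rocm/dev-ubuntu-22.04"
ROCM_IMAGE_TAG="latest"

# Assemble the docker build command from the two build arguments
# named in the README (ROCM_IMAGE_NAME and ROCM_IMAGE_TAG).
build_cmd="docker build -t rccl-tests:${ROCM_IMAGE_TAG} -f Dockerfile.ubuntu --build-arg ROCM_IMAGE_NAME=${ROCM_IMAGE_NAME} --build-arg ROCM_IMAGE_TAG=${ROCM_IMAGE_TAG} --pull ."

# Print the command; copy it into a terminal in the docker/ directory to run it.
echo "$build_cmd"
```

Tagging the image with the base image tag is a convenience for keeping several builds side by side; the plain `rccl-tests` tag used elsewhere in this section works just as well.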
#### To start an interactive Docker container on a system with AMD GPUs:
```shell
$ docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rccl-tests /bin/bash
```
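The `docker run` command above maps `/dev/kfd` and `/dev/dri` into the container, so it can help to confirm on the host that both exist before starting it. A minimal sketch, in which the helper name `check_gpu_devices` is hypothetical and not part of RCCL or ROCm:

```shell
# Return 0 only if every path given as an argument exists.
check_gpu_devices() {
  missing=0
  for dev in "$@"; do
    if [ ! -e "$dev" ]; then
      echo "missing device node: $dev" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the two device paths that the docker run command passes through.
if check_gpu_devices /dev/kfd /dev/dri; then
  echo "GPU device nodes found"
else
  echo "GPU device nodes not found; check the ROCm driver installation"
fi
```

This only checks that the paths exist; it does not verify permissions or membership in the `video` group.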
#### To run rccl-tests (all_reduce_perf) on 8 AMD GPUs (inside the Docker container):
```shell
$ mpirun --allow-run-as-root -np 8 --mca pml ucx --mca btl ^openib -x NCCL_DEBUG=VERSION /workspace/rccl-tests/build/all_reduce_perf -b 1 -e 16G -f 2 -g 1
```
For more information on rccl-tests options, refer to the [Usage](https://github.com/ROCm/rccl-tests#usage) section of rccl-tests.
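In the command above, `-b` and `-e` give the minimum and maximum message sizes, `-f` the multiplication factor between successive sizes, and `-g` the number of GPUs per process. The sketch below counts how many message sizes such a sweep covers, assuming each size is the previous one multiplied by the factor until the maximum is exceeded; the variable names are illustrative only.

```shell
min_bytes=1                               # -b 1
max_bytes=$((16 * 1024 * 1024 * 1024))    # -e 16G
factor=2                                  # -f 2

# Walk the sweep the same way the options describe it:
# start at the minimum and multiply by the factor each step.
size=$min_bytes
count=0
while [ "$size" -le "$max_bytes" ]; do
  count=$((count + 1))
  size=$((size * factor))
done

echo "message sizes in sweep: $count"
```

With these values the sweep runs over the powers of two from 1 byte to 16 GiB, which is 35 sizes. Raising `-b` or lowering `-e` shortens the run accordingly.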
## Enabling peer-to-peer transport
In order to enable peer-to-peer access on machines with PCIe-connected GPUs, the HSA environment variable HSA_FORCE_FINE_GRAIN_PCIE=1 is required to be set, on top of requiring GPUs that support peer-to-peer access and proper large BAR addressing support.
To enable peer-to-peer access on machines with PCIe-connected GPUs, set the HSA environment variable `HSA_FORCE_FINE_GRAIN_PCIE=1`. The GPUs must also support peer-to-peer access, and the system must provide proper large BAR addressing support.
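One way to set the variable is to export it in the shell that launches the job. The sketch below does only that and prints the value; the commented lines show how it could be forwarded with Open MPI's `-x` option or Docker's `-e` option, with the remaining arguments left as placeholders.

```shell
# Set the variable for this shell and all processes started from it.
export HSA_FORCE_FINE_GRAIN_PCIE=1

# Confirm the value that child processes will inherit.
echo "HSA_FORCE_FINE_GRAIN_PCIE=${HSA_FORCE_FINE_GRAIN_PCIE}"

# Open MPI: forward the exported variable to every rank, e.g.
#   mpirun -np 8 -x HSA_FORCE_FINE_GRAIN_PCIE <test binary and options>
# Docker: pass it into the container, e.g.
#   docker run -e HSA_FORCE_FINE_GRAIN_PCIE=1 <other options> rccl-tests /bin/bash
```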
## Tests
@@ -111,7 +142,7 @@ rccl unit test names are now of the format:
CollectiveCall.[Type of test]
Filtering of rccl unit tests should be done with environment variable and by passing the --gtest_filter command line flag, for example:
Filter rccl unit tests by setting environment variables and by passing the `--gtest_filter` command-line flag, for example:
```shell
UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"