Update README.md document
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I365acc202442495daf89df1328e58c92457ab10d
[ROCm/rdc commit: 5e1111d4cb]
+88
-191
@@ -1,229 +1,126 @@
# Radeon Data Center Tools
# ROCm<sup>TM</sup> Data Center Tool (RDC)

# Running RDC
The ROCm™ Data Center Tool simplifies administration and addresses key infrastructure challenges for AMD GPUs in cluster and datacenter environments. The main features are:

##### Additional Software Required for Running RDC
In order to run RDC, the following components are required.
Note that the software versions listed are the ones used in development.
Earlier versions are not guaranteed to work:
* ROCm
* gRPC and protoc
  Unfortunately, gRPC must be built from source as no pre-built .deb or .rpm
  packages are available.
  See instructions for building gRPC/protoc below.
* [ROCm SMI Library](https://github.com/RadeonOpenCompute/rocm_smi_lib)
- GPU telemetry
- GPU statistics for jobs
- Integration with third-party tools
- Open source

##### Building gRPC and protoc
gRPC libraries and the protoc compiler must be installed to build RDC and
must also be available on machines where RDC will run. To build and install gRPC,
follow these steps.
- Install the tools required to build gRPC
``$ sudo apt-get install -y automake make g++ unzip``
``$ sudo apt-get install -y build-essential autoconf libtool pkg-config``
``$ sudo apt-get install -y libgflags-dev libgtest-dev``
``$ sudo apt-get install -y clang-5.0 libc++-dev curl``
For a complete list of features and how to start using RDC from pre-built packages, please refer to the [**user guide**](docs/AMD_ROCm_Data_Center_Tool_User_Guide.pdf).

**IMPORTANT** Building gRPC and protocol buffers using this method requires
CMake 3.15 or greater. If you use an earlier version of CMake than this, the
build will succeed (aside from an easily missed CMake message), but when gRPC
is installed, not all the required files will be there, so the RDC program
will fail to run.
- Download and build gRPC
# Supported platforms

    Ubuntu 18.04.5 (Kernel 5.3)
    CentOS v7.7 (Using devtoolset-7 runtime support)
    RHEL v7.7 (Using devtoolset-7 runtime support)
    SLES 15 SP1
    CentOS and RHEL 8.1 (Kernel 4.18.0-147)

# Building RDC from source

## Dependencies

    CMake 3.15 ## 3.15 or greater is required for gRPC
    g++ (5.4.0)
    Doxygen (1.8.11) ## required to build the latest documentation
    LaTeX (pdfTeX 3.14159265-2.6-1.40.16) ## required to build the latest documentation
    gRPC and protoc ## required for communication

AMD ROCm platform (https://github.com/RadeonOpenCompute/ROCm)
* It is recommended to install the complete AMD ROCm platform.
  For installation instructions, see https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html
* At a minimum, these two components are required:
  (i) AMD ROCm SMI Library (https://github.com/RadeonOpenCompute/rocm_smi_lib)
  (ii) AMD ROCk Kernel driver (https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver)

## Building gRPC and protoc
**NOTE:** gRPC and the protoc compiler must be built from source, as pre-built packages are not available. They must be installed to build RDC and must also be available on machines where RDC will run.

**IMPORTANT:** Building gRPC and protocol buffers requires CMake 3.15 or greater. With an older version, the build will quietly succeed with only a *message*. However, not all components of gRPC will be installed, and RDC will ***fail*** to run.

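Because an older CMake only produces an easily missed message, it can help to check the version before starting the gRPC build. The following is a minimal sketch, not part of RDC or gRPC: `version_ge` and `check_cmake` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (as in GNU coreutils).

```shell
#!/bin/sh
# Sketch: fail early when CMake is older than the 3.15 minimum stated above.
# version_ge and check_cmake are hypothetical helpers, not part of RDC or gRPC.

version_ge() {
    # True when version $1 >= version $2.
    # Assumes sort supports -V (version sort), as in GNU coreutils.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

check_cmake() {
    required="3.15"
    # The first line of "cmake --version" looks like "cmake version 3.16.3".
    found="$(cmake --version 2>/dev/null | head -n 1 | awk '{print $3}')"
    if [ -z "$found" ]; then
        echo "cmake not found in PATH" >&2
        return 1
    fi
    if version_ge "$found" "$required"; then
        echo "CMake $found satisfies the $required minimum"
    else
        echo "CMake $found is older than $required; the gRPC install would be incomplete" >&2
        return 1
    fi
}
```

Calling `check_cmake || exit 1` at the top of a build script stops the build before gRPC is configured with an unsuitable CMake.
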
The following tools are required for gRPC build & installation:

    automake make g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang-5.0 libc++-dev curl

Download and build gRPC:

    $ git clone -b v1.28.1 https://github.com/grpc/grpc
    $ cd grpc
    $ git submodule update --init
    $ mkdir -p cmake/build
    $ cd cmake/build

    ## By default (without the CMAKE_INSTALL_PREFIX option), the following will install to the /usr/local lib, include and bin directories

    $ cmake -DgRPC_INSTALL=ON -DBUILD_SHARED_LIBS=ON <-DCMAKE_INSTALL_PREFIX=<install dir>> ../..
    $ make
    $ sudo make install
    $ echo "<install dir>/lib" | sudo tee /etc/ld.so.conf.d/grpc.conf
    $ sudo ldconfig

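Since an incomplete gRPC install only shows up later, when RDC fails to run, a quick look at the install prefix after `make install` may save time. This is a rough sketch under stated assumptions: `check_grpc_install` is a hypothetical helper, and the three file names it looks for (`bin/protoc`, `bin/grpc_cpp_plugin`, `lib/libgrpc++.so`) are assumed to be what a shared-library gRPC build installs; adjust the list if your prefix uses a different layout such as `lib64`.

```shell
#!/bin/sh
# Sketch: check that a few expected gRPC files exist under an install prefix.
# check_grpc_install is a hypothetical helper; the file list is an assumption
# about what "make install" produces with -DBUILD_SHARED_LIBS=ON.

check_grpc_install() {
    prefix="${1:-/usr/local}"   # same default as the CMake install prefix
    status=0
    for f in bin/protoc bin/grpc_cpp_plugin lib/libgrpc++.so; do
        if [ ! -e "$prefix/$f" ]; then
            echo "missing: $prefix/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_grpc_install /usr/local` returns non-zero and lists each missing file when the install is incomplete.
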
### Installation of RDC
RDC packages can be installed with dpkg or yum for various distros. After RDC
is installed, authentication must be set up before attempting to run an RDC
application.
## Building RDC

Clone the RDC source code from GitHub and use CMake to build and install:

    $ git clone https://github.com/RadeonOpenCompute/rdc
    $ cd rdc
    $ mkdir -p build; cd build
    $ cmake -DROCM_DIR=/opt/rocm -DGRPC_ROOT="$GRPC_PROTOC_ROOT" <-DCMAKE_INSTALL_PREFIX=<install dir>> ..
    $ make
    $ make install ## default installation location is /opt/rocm

### Authentication
# Running RDC
RDC supports encrypted communications between clients and servers. The
communication can be configured to be *authenticated* or *not authenticated*. The [**user guide**](docs/AMD_ROCm_Data_Center_Tool_User_Guide.pdf) has information on how to generate and install SSL keys and certificates for authentication. By default, authentication is enabled.

##### Unauthenticated Communications
By default, authentication is enabled. To disable authentication, use the
``--unauth_comm`` flag (or ``-u`` for short) when starting the server. (The
``/lib/systemd/system/rdc.service`` file can be edited to pass arguments
to rdcd on starting.) On the client side,
when calling rdc_channel_create(), the "secure" argument should be set to false.
## Starting ROCm™ Data Center Daemon (RDCD)
For an RDC client application to monitor and/or control a remote system, the RDC server daemon, *rdcd*, must be running on the remote system. *rdcd* can be configured to run with (a) full capabilities, which include the ability to set or change GPU configuration, or (b) monitor-only capabilities, which limit it to monitoring GPU metrics.

##### Public Key Infrastructure (PKI) Authentication
A number of SSL keys and certificates must be generated and installed on the
clients and servers for authentication to work properly. By default, the RDC
server will look under ``/etc/rdc`` for the following keys and certificates:
### Start RDCD from command-line
When *rdcd* is started from the command line, its *capabilities* are determined by the privilege of the *user* starting *rdcd*.

- Servers

    $ cd rdc_install_prefix ## If specified in Building RDC section

    $ sudo tree /etc/rdc
    /etc/rdc
    |-- server
        |-- certs
        |   |-- rdc_cacert.pem
        |   |-- rdc_server_cert.pem
        |-- private
            |-- rdc_server_cert.key
    ## To run with authentication. Ensure SSL keys are set up properly
    $ ./usr/sbin/rdcd ## rdcd is started with monitor-only capabilities
    $ sudo ./usr/sbin/rdcd ## rdcd is started with full capabilities

    ## To run without authentication. SSL keys & certificates are not required.
    $ ./usr/sbin/rdcd -u ## rdcd is started with monitor-only capabilities
    $ sudo ./usr/sbin/rdcd -u ## rdcd is started with full capabilities

### Start RDCD using systemd
*rdcd* can be started by using the systemctl command. systemctl will read /lib/systemd/system/rdc.service, which is installed with rdc. This file has two lines that control the *capabilities* with which *rdcd* will run. If they are left uncommented, *rdcd* will run with full capabilities.

- Clients

    $ sudo tree /etc/rdc
    /etc/rdc
    |-- client
        |-- certs
        |   |-- rdc_cacert.pem
        |   |-- rdc_client_cert.pem
        |-- private
            |-- rdc_client_cert.key

Machines that are both clients and servers will have both directory
structures.

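The server and client layouts above can be checked with a short script before starting rdcd or a client. This is only a sketch: `check_rdc_certs` is a hypothetical helper, not something shipped with RDC, and it only tests that the files exist, not that the certificates are valid.

```shell
#!/bin/sh
# Sketch: confirm the key and certificate files listed above are in place.
# check_rdc_certs is a hypothetical helper, not part of RDC.
# Usage: check_rdc_certs <server|client> [root directory, default /etc/rdc]

check_rdc_certs() {
    role="$1"                 # "server" or "client"
    root="${2:-/etc/rdc}"     # default location RDC looks in
    missing=0
    for f in \
        "certs/rdc_cacert.pem" \
        "certs/rdc_${role}_cert.pem" \
        "private/rdc_${role}_cert.key"
    do
        if [ ! -f "$root/$role/$f" ]; then
            echo "missing: $root/$role/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

On a machine that is both client and server, run it once per role, for example `sudo sh -c '. ./check_rdc_certs.sh; check_rdc_certs server && check_rdc_certs client'` (the script file name here is illustrative).
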
RDC users would normally generate their own keys and certificates. However,
there are scripts in the client installation file hierarchy (normally in
``/opt/rocm/rdc/authentication``) that will generate self-signed
certificates in the RDC source tree, under the "authentication" directory.
The scripts call the openssl command to generate the required keys and
certificates. The openssl command will query the caller for different
identifying information. The calls to openssl will refer to the
openssl.cnf file for configuration information. Included in this file is
a section where default responses to the openssl questions can be
specified. Look for the comment line

    # < ** REPLACE VALUES IN THIS SECTION WITH APPROPRIATE VALUES FOR YOUR ORG. **>

to find this section. It is helpful to modify this section with values
appropriate for your organization if this script will be called many times.

Additionally, the alt_names section needs to be updated for your environment
(instead of the dummy values that are there initially).

To generate the keys and certificates using these scripts, make the following
calls:

    $ 01gen_root_cert.sh
    # provide answers to posed questions
    $ 02gen_ssl_artifacts.sh
    # provide answers to posed questions

At this point, the keys and certificates will be in the newly created
"CA/artifacts" directory. This directory should be deleted if you need to
rerun the scripts.

To install the keys and certificates, cd into the artifacts directory and run
the install.sh script as root, specifying the install location. By default,
RDC will expect this to be in /etc/rdc:

    $ cd CA/artifacts
    $ sudo install_<client|server>.sh /etc/rdc

These files should be copied to and installed on all client and server
machines that are expected to communicate with one another.

##### Current Limitations
There are a few limitations on the authentication capabilities. These
limitations are temporary and will be eliminated when the server has a
configuration file where user preferences can be specified.
* The client and server are hard-coded to look for openssl certificate and key
  files in /etc/rdc.

# Starting RDC Server Daemon (RDCD)
In order for an RDC client application to monitor and/or control a remote
system, the RDC server daemon, rdcd, must be running on the remote system.
rdcd can be configured to run with full capabilities, or with monitoring-only
capabilities. "Full capabilities" includes the ability to set some system
functions exposed by the RDC APIs and tools. Changing a system's configuration
involves writing to system files. When rdcd is configured to run with full
capabilities, it has the ability to write to these system files. Alternatively,
rdcd can be run with control functionality disabled. In this case, rdcd does
not have the ability to write to the control-related system files. Calls to
control-related RDC APIs (or tools that invoke these APIs) will result in a
permission-related failure when rdcd is configured for limited functionality.
This reduced mode can be used to prevent someone from inadvertently or
maliciously putting a system into an unwanted state.

Which configuration is used depends on how rdcd is started, and on certain
settings in the rdc.service systemd configuration file.

### Starting rdcd with systemctl
rdcd can be started by using the systemctl command.

##### Starting rdcd with systemctl
When starting rdcd using systemctl, like this:

    systemctl start rdc

systemctl will read /lib/systemd/system/rdc.service, which is installed with
rdc. This file has two lines that control the capabilities with which rdcd
will run. If left uncommented, rdcd will run with full capabilities, as
shown below:

    ## file: /lib/systemd/system/rdc.service
    ## Comment the following two lines to run with monitor-only capabilities
    CapabilityBoundingSet=CAP_DAC_OVERRIDE
    AmbientCapabilities=CAP_DAC_OVERRIDE

    systemctl start rdc ## start rdc as systemd service

When these lines are commented with ``#``, rdcd will run with monitor-only
capability.
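Commenting out the two capability lines can also be scripted. The sketch below is an illustration only: `disable_rdc_control` is a hypothetical helper, and it assumes the two directives appear in the unit file exactly as shown above. It keeps a `.bak` copy of the file next to the original.

```shell
#!/bin/sh
# Sketch: switch an rdc.service file to monitor-only by commenting out the
# two capability directives. disable_rdc_control is a hypothetical helper,
# not part of RDC.

disable_rdc_control() {
    unit="${1:-/lib/systemd/system/rdc.service}"
    # Prefix each directive with '#'. Lines that are already commented do not
    # match the '^' anchor, so running this twice is harmless.
    sed -i.bak -E \
        's/^(CapabilityBoundingSet|AmbientCapabilities)=CAP_DAC_OVERRIDE/#&/' \
        "$unit"
}
```

After editing the installed unit file (as root), run `sudo systemctl daemon-reload` and `sudo systemctl restart rdc` so that systemd picks up the change.
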
## Invoke RDC using ROCm™ Data Center Interface (RDCI)
RDCI provides a command-line interface to all RDC features. This CLI can be run locally or remotely. Refer to the [**user guide**](docs/AMD_ROCm_Data_Center_Tool_User_Guide) for the current list of features.

    ## sample rdci commands to test RDC functionality
    ## discover devices in a local or remote compute node
    ## NOTE: option -u (for unauthenticated) is required if rdcd was started in this mode

### Starting rdcd Directly
rdcd can also be started by directly invoking rdcd from the command line,
like this:

    $ cd rdc_install_prefix ## If specified in Building RDC section
    ./opt/rocm/rdc/bin/rdci discovery -l <-u> ## list available GPUs in localhost
    ./opt/rocm/rdc/bin/rdci discovery <host> -l <-u> ## list available GPUs in host machine

    # Start as user rdc
    sudo -u rdc rdcd
## Troubleshooting rdcd

or

    # Start as root
    sudo rdcd

Note that rdcd must be started as user "rdc" or root. Other regular user
accounts will not normally work. This is because rdcd will need access to
private SSL keys and certificates, owned by rdc. In order to run rdcd
under a different account, the SSL keys would need to be accessible by
that account.

When run from the command line, the rdc.service file mentioned in the previous
section does not come into play. What determines the level of capability is
the privilege level of the user ID under which rdcd is started. If rdcd is
directly started as root, then rdcd will have monitor and control capability.
If rdcd is directly started with a normal user account, then it will have
monitor-only capability.

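The rule above (root gets monitor and control, any other account gets monitor-only) can be restated as a tiny check to run before launching rdcd by hand. This is a sketch for illustration; `rdcd_mode_for_uid` is a hypothetical helper and does not query rdcd itself.

```shell
#!/bin/sh
# Sketch: report which capability level a directly started rdcd would have,
# based only on the numeric user ID. rdcd_mode_for_uid is a hypothetical
# helper that restates the rule in the text; it does not talk to rdcd.

rdcd_mode_for_uid() {
    if [ "$1" -eq 0 ]; then
        echo "full (monitor and control)"
    else
        echo "monitor-only"
    fi
}

# Report the mode for the account running this script.
echo "rdcd started by this user would run as: $(rdcd_mode_for_uid "$(id -u)")"
```
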
### Troubleshooting rdcd
* When rdcd is started using systemctl, the ``journalctl`` command shows
messages that can help debug problems with either starting rdcd or with
communications with a client. To view rdcd logs, issue the following command on
the server:
Log messages can provide useful debug information.

    ## If rdcd was started as a systemd service, then use journalctl to view rdcd logs
    journalctl -u rdc

* Enable the debug log:

    sudo RDC_LOG=DEBUG ./server/rdcd

* Check the SSL connection in rdci:

    rdcd_hostname=<rdcd hostname> # Set the rdcd hostname to which you want to connect
    openssl s_client -connect $rdcd_hostname:50051 -cert /etc/rdc/client/certs/rdc_client_cert.pem -key /etc/rdc/client/private/rdc_client_cert.key -CAfile /etc/rdc/client/certs/rdc_cacert.pem
    ## To run rdcd with debug log from command-line use
    RDC_LOG=DEBUG ./usr/sbin/rdcd