diff --git a/projects/rdc/README.md b/projects/rdc/README.md
index 80b1677084..5da0717377 100644
--- a/projects/rdc/README.md
+++ b/projects/rdc/README.md
@@ -1,229 +1,126 @@
-# Radeon Data Center Tools
+# ROCm™ Data Center Tool (RDC)
 
-# Running RDC
+The ROCm™ Data Center Tool simplifies administration and addresses key infrastructure challenges of AMD GPUs in cluster and datacenter environments. The main features are:
 
-##### Additional Software Required for Running RDC
-In order to run RDC, the following components are required.
-Note that the software versions listed are what was used in development.
-Earlier versions are not guaranteed to work:
-* ROCm
-* gRPC and protoc
-  Unfortunately, gRPC must be built from source as no pre-built .deb or .rpm
-  packages are available.
-  See instructions for building gRPC/protoc below.
-* [ROCm SMI Library](https://github.com/RadeonOpenCompute/rocm_smi_lib)
+- GPU telemetry
+- GPU statistics for jobs
+- Integration with third-party tools
+- Open source
 
-##### Building gRPC and protoc
-gRPC libraries and the protoc compiler must be installed to both build RDC and
-must also be available machines were RDC will run. To build and install gRPC,
-follow these steps.
-- Get gRPC required tool installation
-    ``$ sudo apt-get install -y automake make g++ unzip``
-    ``$ sudo apt-get install -y build-essential autoconf libtool pkg-config``
-    ``$ sudo apt-get install -y libgflags-dev libgtest-dev``
-    ``$ sudo apt-get install -y clang-5.0 libc++-dev curl``
+For a complete list of features and how to start using RDC from pre-built packages, please refer to the [**user guide**](docs/AMD_ROCm_Data_Center_Tool_User_Guide.pdf)
 
-**IMPORTANT** Building gRPC and protocol buffers using this method requires
-CMake 3.15 or greater. If you use an earlier version of CMake than this, the
-build will succeed (aside from an easily missed CMake message), but when gRPC
-is installed, not all the required files will be there, so the RDC program
-will fail to run.
-- Download and build gRPC
+# Supported platforms
+    Ubuntu 18.04.5 (Kernel 5.3)
+    CentOS v7.7 (Using devtoolset-7 runtime support)
+    RHEL v7.7 (Using devtoolset-7 runtime support)
+    SLES 15 SP1
+    CentOS and RHEL 8.1 (Kernel 4.18.0-147)
+
+# Building RDC from source
+
+## Dependencies
+
+    CMake 3.15                               ## 3.15 or greater is required for gRPC
+    g++ (5.4.0)
+    Doxygen (1.8.11)                         ## required to build the latest documentation
+    LaTeX (pdfTeX 3.14159265-2.6-1.40.16)    ## required to build the latest documentation
+    gRPC and protoc                          ## required for communication
+
+    AMD ROCm platform (https://github.com/RadeonOpenCompute/ROCm)
+        * It is recommended to install the complete AMD ROCm platform.
+          For installation instructions, see https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html
+        * At the minimum, these two components are required
+            (i) AMD ROCm SMI Library (https://github.com/RadeonOpenCompute/rocm_smi_lib)
+            (ii) AMD ROCk Kernel driver (https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver)
+
+## Building gRPC and protoc
+**NOTE:** gRPC and the protoc compiler must be built from source as pre-built packages are not available. They must be installed both to build RDC and on the machines where RDC will run.
+
+**IMPORTANT:** Building gRPC and protocol buffers requires CMake 3.15 or greater. With an older version, the build will quietly succeed with only a *message*. However, not all components of gRPC will be installed and RDC will ***fail*** to run.
+
+The following tools are required for gRPC build & installation
+
+    automake make g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang-5.0 libc++-dev curl
+
+ Download and build gRPC
     $ git clone -b v1.28.1 https://github.com/grpc/grpc
     $ cd grpc
     $ git submodule update --init
     $ mkdir -p cmake/build
     $ cd cmake/build
-    # By default (without using the CMAKE_INSTALL_PREFIX option), the following
-    # will install to /usr/local lib, include and bin directories
+
+    ## By default (without using the CMAKE_INSTALL_PREFIX option), the following will install to /usr/local lib, include and bin directories
+
+
     $ cmake -DgRPC_INSTALL=ON -DBUILD_SHARED_LIBS=ON <-DCMAKE_INSTALL_PREFIX=> ../..
     $ make
     $ sudo make install
-    $ echo "" | sudo tee /etc/ld.so.conf.d/grpc.conf
+    $ echo "/lib" | sudo tee /etc/ld.so.conf.d/grpc.conf
     $ sudo ldconfig
-### Installation of RDC
-RDC packages can be installed with dpkg or yum for various distros. After RDC
-is installed, authentication must be set up before attempting to run an RDC
-application.
+## Building RDC
+
+Clone the RDC source code from GitHub and use CMake to build and install
+
+    $ git clone https://github.com/RadeonOpenCompute/rdc
+    $ cd rdc
+    $ mkdir -p build; cd build
+    $ cmake -DROCM_DIR=/opt/rocm -DGRPC_ROOT="$GRPC_PROTOC_ROOT" <-DCMAKE_INSTALL_PREFIX=> ..
+    $ make
+    $ make install ## default installation location is /opt/rocm
 
-### Authentication
+# Running RDC
 RDC supports encrypted communications between clients and servers. The
-communication can be configured to be authenticated or not authenticated.
+communication can be configured to be *authenticated* or *not authenticated*. The [**user guide**](docs/AMD_ROCm_Data_Center_Tool_User_Guide.pdf) has information on how to generate and install SSL keys and certificates for authentication. By default, authentication is enabled.
 
-##### Unauthenticated Communications
-By default, authentication is enabled. To disable authentication, when starting
-the server use the ``--unauth_comm`` flag (or ``-u`` for short). (The
-``/lib/systemd/system/rdc.service`` file can be edited to pass arguments
-to rdcd on starting.) On the client side,
-when calling rdc_channel_create(), the "secure" argument should be set to false.
+## Starting ROCm™ Data Center Daemon (RDCD)
+For an RDC client application to monitor and/or control a remote system, the RDC server daemon, *rdcd*, must be running on the remote system. *rdcd* can be configured to run with (a) full-capabilities, which includes the ability to set or change GPU configuration, or (b) monitor-only capabilities, which limits it to monitoring GPU metrics.
 
-##### Public Key Infrastructure (PKI) Authentication
- A number of SSL keys and certificates must be generated and installed on the
- clients and servers for authentication to work properly. By default, the RDC
- server will look under ``/etc/rdc`` for the following keys and certificates:
+### Start RDCD from command-line
+When *rdcd* is started from the command line, its *capabilities* are determined by the privileges of the *user* starting *rdcd*.
 
-- Servers
+    $ cd rdc_install_prefix ## If specified in Building RDC section
 
-    sudo tree /etc/rdc
-    /etc/rdc
-    |-- server
-        |-- certs
-        |   |-- rdc_cacert.pem
-        |   |-- rdc_server_cert.pem
-        |-- private
-            |-- rdc_server_cert.key
+    ## To run with authentication. Ensure SSL keys are set up properly
+    $ ./usr/sbin/rdcd ## rdcd is started with monitor-only capabilities
+    $ sudo ./usr/sbin/rdcd ## rdcd is started with full-capabilities
+
+    ## To run without authentication. SSL keys & certificates are not required.
+    $ ./usr/sbin/rdcd -u ## rdcd is started with monitor-only capabilities
+    $ sudo ./usr/sbin/rdcd -u ## rdcd is started with full-capabilities
+
+### Start RDCD using systemd
+*rdcd* can be started by using the systemctl command. systemctl will read /lib/systemd/system/rdc.service, which is installed with rdc. This file has 2 lines that control the *capabilities* with which *rdcd* will run. If left uncommented, rdcd will run with full-capabilities.
 
-- Clients
-
-    $ sudo tree /etc/rdc
-    /etc/rdc
-    |-- client
-        |-- certs
-        |   |-- rdc_cacert.pem
-        |   |-- rdc_client_cert.pem
-        |-- private
-            |-- rdc_client_cert.key
-
-
-Machines that are both clients and servers will have both directory
-structures.
-
-RDC users would normally generate their own keys and certificates. However,
-there are scripts in the client installation file hierarchy (normally in
-``/opt/rocm/rdc/authentication``) that will generate self-signed
-certificates in RDC source tree, under the "authentication" directory.
-The scripts call the openssl command to generate the required keys and
-certificates. The openssl command will query the caller for different
-identifying information. The calls to openssl will refer to the
-openssl.cnf file for configuration information. Included in this file is
-a section where default responses to the openssl questions can be
-specified. Look for the comment line
-
-    # < ** REPLACE VALUES IN THIS SECTION WITH APPROPRIATE VALUES FOR YOUR ORG. **>
-
-to find this section. It is helpful to modify this section with values
-appropriate for your organization if this script will be called many times.
-
-Additionally, the alt_names section needs to be updated for your environment
-(instead of the dummy values that there initially).
-
-To generate the keys and certificates using these scripts, make the following
-calls:
-
-
-    $ 01gen_root_cert.sh
-    # provide answers to posed questions
-    $ 02gen_ssl_artifacts.sh
-    # provide answers to posed questions
-
-At this point, the keys and certificates will be in the newly created
-"CA/artifacts" directory. This directory should be deleted if you need to
-rerun the scripts.
-
-To install the keys and certificates, cd into the artifacts directory and run
-the install.sh script as root, specifying the install location. By default,
-RDC will expect this to be in /etc/rdc:
-
-
-    $ cd CA/artifacts
-    $ sudo install_.sh /etc/rdc
-
-These files should be copied to and and installed on all client and server
-machines that are expected to communicate with one another.
-
-##### Current Limitations
-There are a few limitations on the authentication capabilities. These
-limitations are temporary and will be eliminated when the server has a
-configuration file where user preferences can be specified.
-* The client and server are hard-coded to look for openssl certificate and key
-files in /etc/rdc.
-
-# Starting RDC Server Daemon (RDCD)
-In order for an RDC client application to monitor and/or control a remote
-system, the RDC server daemon, rdcd, must be running on the remote system.
-rdcd can be configured to run with full capabilities, or with monitoring-only
-capabilities. "Full capabilities" includes the ability to set some system
-functions exposed by the RDC APIs and tools. Changing a system's configuration
-involves writing to system files. When rdcd is configured to run with full
-capabilities, it has the ability to write to these system files. Alternatively,
-rdcd can be run with control functionality disabled. In this case, rdcd does
-not have the ability to write to the control-related system files. Calls to RDC
-APIs (or tools that invoke these APIs) will result in a permission-related
-failure when configured for limited functionality. This reduced mode can be
-used to prevent someone from inadvertently or maliciously putting a system
-into an unwanted state.
-
-Which configuration is used depends on how rdcd is started, and certain
-settings in the rdc.service systemd configuration file.
-
-###Starting rdcd with systemctl
-rdcd can be started by using the systemctl command.
-
-#####Starting rdcd with systemctl
-When starting rdcd using systemctl, like this:
-
-
-    systemctl start rdc
-
-systemctl will read /lib/systemd/system/rdc.service, which is installed with
-rdc. This file has 2 lines that control what capabilities with which rdcd
-will run. If left uncommented, rdcd will run with full capabilities, as
-shown below:
-
+    ## file: /lib/systemd/system/rdc.service
+    ## Comment the following two lines to run with monitor-only capabilities
     CapabilityBoundingSet=CAP_DAC_OVERRIDE
     AmbientCapabilities=CAP_DAC_OVERRIDE
+    systemctl start rdc ## start rdc as systemd service
 
-When these lines are commented with ``#``, rdcd will run with monitor-only
-capabilty.
+## Invoke RDC using ROCm™ Data Center Interface (RDCI)
+RDCI provides a command-line interface to all RDC features. This CLI can be run locally or remotely. Refer to the [**user guide**](docs/AMD_ROCm_Data_Center_Tool_User_Guide) for the current list of features.
+    ## sample rdci commands to test RDC functionality
+    ## discover devices in a local or remote compute node
+    ## NOTE: option -u (for unauthenticated) is required if rdcd was started in this mode
 
-###Starting rdcd Directly
-rdcd can also be started by directly invoking rdcd from the command line,
-like this:
+    $ cd rdc_install_prefix ## If specified in Building RDC section
+    ./opt/rocm/rdc/bin/rdci discovery -l <-u> ## list available GPUs in localhost
+    ./opt/rocm/rdc/bin/rdci discovery -l <-u> ## list available GPUs in host machine
 
-    # Start as user rdc
-    sudo -u rdc rdcd
+## Troubleshooting rdcd
 
-or
-
-    # Start as root
-    sudo rdcd
-
-
-Note how rdcd must be started as user "rdc" or root. Other regular user
-accounts will not normally work. This is because rdcd will need access to
-private SSL keys and certificates, owned by rdc. In order to run rdcd
-under a different account, the SSL keys would need to be accessible by
-that account.
-
-When run from the command line, the rdc.service file mentioned in the previous
-section does not come into play. What determines the level of capability is
-the level of capability of the id under which rdcd is started. If rdcd is
-directly started as root, then rdcd will have monitor and control capability.
-If rdcd is directly started with a normal user account, then it will have
-monitor-only capability.
-
-###Troubleshooting rdcd
-* When rdcd is started using systemctl, we can view messages that can help debug
-problems with either starting rdcd or with communications with a client using
-the ``journalctl`` command. To view rdcd logs, issue the following command on
-the server
+Log messages can provide useful debug information.
+    ## If rdcd was started as a systemd service, then use journalctl to view rdcd logs
     journalctl -u rdc
-* Enable the debug log:
-
-    sudo RDC_LOG=DEBUG ./server/rdcd
-
-* Check the ssl connection in rdci:
-
-    rdcd_hostname= # Set the rdcd hostname to which you want to connect
-    openssl s_client -connect $rdcd_hostname:50051 -cert /etc/rdc/client/certs/rdc_client_cert.pem -key /etc/rdc/client/private/rdc_client_cert.key -CAfile /etc/rdc/client/certs/rdc_cacert.pem
+    ## To run rdcd with debug log from command-line use
+    RDC_LOG=DEBUG ./usr/sbin/rdcd
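
A note on the CMake requirement in this change: the README warns that with CMake older than 3.15 the gRPC build appears to succeed but installs incompletely. A minimal sketch of a pre-flight check that could be run before the gRPC build steps is shown below. The function names `version_ge` and `check_cmake` are hypothetical and are not part of RDC or gRPC. The sketch assumes `sort -V` (version sort, as in GNU coreutils) is available and that `cmake --version` prints `cmake version X.Y.Z` on its first line.

```shell
#!/bin/sh
# Hypothetical helpers; not shipped with RDC or gRPC.

# Succeed when version $1 is greater than or equal to version $2.
# Uses `sort -V`: if $2 sorts first (or ties), then $1 >= $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Check that the cmake found on PATH meets the 3.15 minimum from the README.
check_cmake() {
    required="3.15"
    if ! command -v cmake >/dev/null 2>&1; then
        echo "cmake not found; version $required or greater is required" >&2
        return 1
    fi
    # Assumption: first line of output looks like "cmake version 3.16.3".
    found="$(cmake --version | head -n 1 | awk '{print $3}')"
    if version_ge "$found" "$required"; then
        echo "cmake $found satisfies the >= $required requirement"
    else
        echo "cmake $found is older than $required; the gRPC install would be incomplete" >&2
        return 1
    fi
}
```

Calling `check_cmake` before `cmake -DgRPC_INSTALL=ON ...` would turn the easily missed CMake message into an explicit failure.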