From eeb59ed0809a4e45f536417c3b92a2635dc59186 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Wed, 8 May 2024 15:14:13 -0700 Subject: [PATCH] leo update Change-Id: I34cb1cdadc1a99d0d226441f1a6b180cb8b4b258 Signed-off-by: Galantsev, Dmitrii --- docs/how-to/features.rst | 10 ++++++---- docs/how-to/integration.rst | 20 ++++++++++---------- docs/how-to/user_guide.rst | 6 +++--- docs/index.rst | 6 +++--- docs/install/handbook.rst | 10 +++++----- docs/install/install.rst | 18 +++++++++--------- docs/reference/api_intro.rst | 10 +++++----- docs/sphinx/_toc.yml.in | 6 +++--- 8 files changed, 44 insertions(+), 42 deletions(-) diff --git a/docs/how-to/features.rst b/docs/how-to/features.rst index 00e44948d6..b6b393231a 100644 --- a/docs/how-to/features.rst +++ b/docs/how-to/features.rst @@ -5,9 +5,11 @@ .. _rdc-features: ****************************************** -RDC Feature Overview +RDC tool feature overview ****************************************** +This topic provides information related to the features of the RDC tool. + .. figure:: ../data/features.png RDC components and framework for describing features @@ -139,7 +141,7 @@ You can define ``RDC_FI_ECC_CORRECT_TOTAL`` or ``RDC_FI_ECC_UNCORRECT_TOTAL`` fi Device Monitoring ================= -The RDC Tool enables you to monitor the GPU fields. +The RDC tool enables you to monitor the GPU fields. .. code-block:: shell @@ -231,7 +233,7 @@ You can display GPU statistics for any given workload. Job Stats Use Case -================== +------------------ A common use case is to record GPU statistics associated with any job or workload. The following example shows how all these features can be put together for this use case: @@ -242,7 +244,7 @@ A common use case is to record GPU statistics associated with any job or workloa rdci commands -------------- +^^^^^^^^^^^^^ .. 
code-block:: shell diff --git a/docs/how-to/integration.rst b/docs/how-to/integration.rst index e626cc7257..27f618bc26 100644 --- a/docs/how-to/integration.rst +++ b/docs/how-to/integration.rst @@ -5,10 +5,10 @@ .. _rdc-3rd-party: ****************************************** -3rd party integration +Third-party integration ****************************************** -This section lists all the third-party plugins such as Prometheus, Grafana, and Reliability, Availability and Serviceability (RAS) plugin. +This section lists available third-party plugins for the RDC tool such as Prometheus, Grafana, and Reliability, Availability and Serviceability (RAS). Python bindings =============== @@ -52,7 +52,7 @@ See the sample program to monitor the power and GPU utilization using the ``RdcR In the sample program, * Class ``SimpleRdcReader`` is derived from the ``RdcReader``. -* The field ``ip_port=None`` in ``RdcReader`` dictates that the RDC tool runs in the embedded mode. +* The field ``ip_port=None`` in ``RdcReader`` dictates that RDC runs in the embedded mode. * ``SimpleRdcReader::process()`` fetches fields specified in ``default_field_ids``. .. note:: @@ -76,7 +76,7 @@ Prometheus plugin helps to monitor events and send alerts. The Prometheus instal Prometheus plugin installation ------------------------------ -The RDC tool's Prometheus plugin ``rdc_prometheus.py`` can be found in the ``python_binding`` folder. +RDC's Prometheus plugin ``rdc_prometheus.py`` can be found in the ``python_binding`` folder. .. note:: Ensure the Prometheus client is installed before the Prometheus plugin installation process. @@ -220,7 +220,7 @@ Example: gpu_clock Grafana Plugin ============== -Grafana is a common monitoring stack used for storing and visualizing time series data. Prometheus acts as the storage backend, and Grafana is used as the interface for analysis and visualization. 
Grafana has a plethora of visualization options and can be integrated with Prometheus for the RDC tool's dashboard. +Grafana is a common monitoring stack used for storing and visualizing time series data. Prometheus acts as the storage backend, and Grafana is used as the interface for analysis and visualization. Grafana has a plethora of visualization options and can be integrated with Prometheus for RDC's dashboard. Grafana Plugin Installation @@ -283,7 +283,7 @@ Follow these steps: .. image:: ../data/integration_config5.png -5. To import the RDC tool dashboard, click ``+`` and select ``Import``. +5. To import the RDC dashboard, click ``+`` and select ``Import``. 6. Click the ``Upload.json`` file command. @@ -296,9 +296,9 @@ Follow these steps: Prometheus (Grafana) integration with automatic node detection ============================================================== -The RDC tool enables you to use Consul to discover the ``rdc_prometheus`` service automatically. Consul is “a service mesh solution providing a fully featured control plane with service discovery, configuration, and segmentation functionality.” For more information, refer to `Consul `_. +RDC enables you to use Consul to discover the ``rdc_prometheus`` service automatically. Consul is “a service mesh solution providing a fully featured control plane with service discovery, configuration, and segmentation functionality.” For more information, refer to `Consul `_. -The RDC tool uses Consul for health checks of RDC's integration with the Prometheus plug-in (``rdc_prometheus``), and these checks provide information on its efficiency. +RDC uses Consul for health checks of RDC's integration with the Prometheus plug-in (``rdc_prometheus``), and these checks provide information on its efficiency. Previously, when a new compute node was added, users had to manually change ``prometheus_targets.json`` to use Consul. Now, with the Consul agent integration, a new compute node can be discovered automatically. 
@@ -502,7 +502,7 @@ The RAS plugin helps to gather and count errors. The details of RAS integration RAS Plugin Installation ----------------------- -In this release, the RDC tool extends support to the Reliability, Availability, and Serviceability (RAS) integration. When the RAS feature is enabled in the graphic card, users can use RDC to monitor RAS errors. +In this release, RDC extends support to the Reliability, Availability, and Serviceability (RAS) integration. When the RAS feature is enabled in the graphic card, users can use RDC to monitor RAS errors. Prerequisite ^^^^^^^^^^^^ @@ -512,7 +512,7 @@ You must ensure the graphic card supports RAS. .. note:: The RAS library is installed as part of the RDC installation, and no additional configuration is required for RDC. -The RDC tool installation dynamically loads the RAS library ``librdc_ras.so``. The configuration files required by the RAS library are installed in the ``sp3`` and ``config`` folders. +RDC installation dynamically loads the RAS library ``librdc_ras.so``. The configuration files required by the RAS library are installed in the ``sp3`` and ``config`` folders. .. code-block:: shell diff --git a/docs/how-to/user_guide.rst b/docs/how-to/user_guide.rst index aab62552e8..2530f1723e 100644 --- a/docs/how-to/user_guide.rst +++ b/docs/how-to/user_guide.rst @@ -5,7 +5,7 @@ .. _rdc-use: ****************************************** -Introduction to RDC tool +Introduction to the RDC tool ****************************************** The ROCm Data Center tool (RDC) simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments. The main features are: @@ -15,7 +15,7 @@ The ROCm Data Center tool (RDC) simplifies the administration and addresses key * Integration with third-party tools * Open source -You can use the tool in standalone mode if all components are installed. 
However, the existing management tools can use the same set of features available in a library format. +You can use the RDC tool in standalone mode if all components are installed. However, the existing management tools can use the same set of features available in a library format. For details on different modes of operation, refer to *Starting RDC* in :ref:`rdc-install`. @@ -24,7 +24,7 @@ Target Audience The audience for the AMD RDC tool consists of: -* Administrators: The tool provides the cluster administrator with the capability of monitoring, validating, and configuring policies. +* Administrators: RDC provides the cluster administrator with the capability of monitoring, validating, and configuring policies. * HPC Users: Provides GPU-centric feedback for their workload submissions. * OEM: Add GPU information to their existing cluster management software. * Open source Contributors: RDC is open source and accepts contributions from the community. diff --git a/docs/index.rst b/docs/index.rst index 0e0a43b262..85b0a45381 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -5,17 +5,17 @@ .. _index: ****************************************** -ROCm Data Center tool documentation +ROCm Data Center (RDC) tool documentation ****************************************** -The ROCm Data Center tool (RDC) simplifies the administration of, and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments. The main features are of the RDC tool include: +The ROCm Data Center tool (RDC) simplifies the administration of, and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments. The main features of RDC include: * GPU telemetry * GPU statistics for jobs * Integration with third-party tools * Open source -You can access RDC on `GitHub repository `_. +You can access the RDC tool on `GitHub repository `_. 
The documentation is structured as follows: diff --git a/docs/install/handbook.rst b/docs/install/handbook.rst index 1344ae1faf..ea431cc5ca 100644 --- a/docs/install/handbook.rst +++ b/docs/install/handbook.rst @@ -5,16 +5,16 @@ .. _rdc-handbook: *************************************************** -Building and testing RDC tool: A developer handbook +Building and testing RDC *************************************************** -The RDC tool is open source and available under the MIT License. This section is helpful for open source developers. Third-party integrators may also find this information useful. +RDC is open source and available under the MIT License. This section is helpful for open source developers. Third-party integrators may also find this information useful. Prerequisites for Building RDC ============================== .. note:: - The RDC tool is tested on the following software versions. Earlier versions may not work. + RDC is tested on the following software versions. Earlier versions may not work. * CMake 3.15 * g++ (5.4.0) @@ -91,7 +91,7 @@ Test Authentication ============== -The RDC tool supports encrypted communications between clients and servers. +RDC supports encrypted communications between clients and servers. Generate Files for Authentication --------------------------------- @@ -148,7 +148,7 @@ These files must be copied to and installed on all client and server machines th Known Limitation ---------------- -The RDC tool has the following authentication limitations: +RDC has the following authentication limitations: The client and server are hardcoded to look for the ``openssl`` certificate and key files in ``/etc/rdc``. There is no workaround available currently. diff --git a/docs/install/install.rst b/docs/install/install.rst index 500b79badc..e01dc9e86a 100644 --- a/docs/install/install.rst +++ b/docs/install/install.rst @@ -5,7 +5,7 @@ .. 
_rdc-install: ****************************************** -Installing and running RDC tool +Installing and running RDC ****************************************** The ROCm Data Center tool (RDC) is part of the AMD ROCm software and available on the distributions supported by AMD ROCm. For RDC installation from prebuilt packages, follow the instructions in this section. @@ -23,12 +23,12 @@ To see the instructions for building ``gRPC`` and ``protoc``, refer to `Building Authentication keys =================== -The RDC tool can be used with or without authentication. If authentication is required you must configure proper authentication keys as described in *Authentication* in :ref:`rdc-handbook`. +RDC can be used with or without authentication. If authentication is required you must configure proper authentication keys as described in *Authentication* in :ref:`rdc-handbook`. Prebuilt packages ================= -The RDC tool is packaged as part of the ROCm software repository. You must install the AMD ROCm software before installing RDC, as described in `ROCm installation `_. +RDC is packaged as part of the ROCm software repository. You must install the AMD ROCm software before installing RDC, as described in `ROCm installation `_. To install RDC after installing the ROCm package, use the following instructions. @@ -57,7 +57,7 @@ To install RDC after installing the ROCm package, use the following instructions Components ========== -The components of RDC tool are as shown below: +The components of the RDC tool are as shown below: .. figure:: ../data/install_components.png @@ -92,7 +92,7 @@ The RDC tool can be run in the following two modes. The feature set is similar i * :ref:`standalone` * :ref:`embedded` -The capability in each mode depends on the privileges you have for starting RDC. A normal user has access only to monitor (GPU telemetry) capabilities. A privileged user can run the tool with full capability. 
In the full capability mode, GPU configuration features can be invoked. This may or may not affect all the users and processes sharing the GPU. +The capability in each mode depends on the privileges you have for starting the RDC tool. A normal user has access only to monitor (GPU telemetry) capabilities. A privileged user can run the tool with full capability. In the full capability mode, GPU configuration features can be invoked. This may or may not affect all the users and processes sharing the GPU. .. _`standalone`: @@ -101,8 +101,8 @@ Standalone mode This is the preferred mode of operation, as it does not have any external dependencies. To start RDC in standalone mode, RDC Server Daemon (``rdcd``) must run on each compute node. Refer to *Terminology* in :ref:`rdc-use` for more information. You can start ``rdcd`` as a ``systemd`` service or directly from the command-line. -Start RDC tool using ``systemd`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Start the RDC tool using ``systemd`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If multiple RDC versions are installed, copy `/opt/rocm-/rdc/lib/rdc.service`, which is installed with the desired RDC version, to the ``systemd`` folder. The capability of RDC can be configured by modifying the ``rdc.service`` system configuration file. Use the ``systemctl`` command to start ``rdcd``. @@ -139,8 +139,8 @@ If the GPU reset fails, restart the server. Note that restarting the server also $ sudo systemctl restart rdcd -Start RDC tool from the command-line -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Start the RDC tool from the command-line +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ While ``systemctl`` is the preferred way to start ``rdcd``, you can also start directly from the command-line. The installation scripts create a default user - ``rdc``. 
Users have the option to edit the profile file (``rdc.service`` installed at ``/lib/systemd/system``) and change these lines accordingly: diff --git a/docs/reference/api_intro.rst b/docs/reference/api_intro.rst index 8c9bcd37a1..62e7bc2185 100644 --- a/docs/reference/api_intro.rst +++ b/docs/reference/api_intro.rst @@ -5,7 +5,7 @@ .. _api-intro: ****************************************** -Introduction to RDC tool API +Introduction to RDC API ****************************************** .. note:: @@ -14,9 +14,9 @@ Introduction to RDC tool API RDC API =========== -The RDC tool API is the core library that provides all the RDC features. This section focuses on how RDC API can be used by third-party software. +RDC API is the core library that provides all the RDC features. This section focuses on how RDC API can be used by third-party software. -The RDC includes the following libraries: +RDC includes the following libraries: * ``librdc_bootstrap.so``: Loads during runtime one of the two libraries by detecting the mode. * ``librdc_client.so``: Exposes RDC functionality using ``gRPC`` client. @@ -38,10 +38,10 @@ Example: For more information see the :ref:`rdc-ref`. -Job Stats Use Case +Job stats use case ================== -The following pseudocode shows how RDC tool API can be directly used to record GPU statistics associated with any job or workload. Refer to the example code provided with RDC on how to build it. +The following pseudocode shows how RDC API can be directly used to record GPU statistics associated with any job or workload. Refer to the example code provided with RDC on how to build it. For more information, see *Job Stats* in :ref:`rdc-features`. 
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index cb0390280d..14e4a38557 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -7,9 +7,9 @@ subtrees: - caption: Install entries: - file: install/install - title: Installing RDC tool + title: Installing RDC - file: install/handbook - title: Building and testing RDC tool + title: Building and testing RDC - caption: How to entries: @@ -17,7 +17,7 @@ subtrees: - file: how-to/features - file: how-to/integration -- caption: API Reference +- caption: API reference entries: - file: reference/api_intro - file: reference/api_ref