Refactor RDC documentation
Change-Id: Ieaba84992a8cbd185f4c2d1dc36a175c0429b754
[ROCm/rdc commit: a865793b70]
This commit is contained in:
committed by
Galantsev, Dmitrii
parent
259b7ac57b
commit
e56a809946
@@ -9,9 +9,8 @@ The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key i
|
||||
- **Integration with Third-Party Tools** 🔗
|
||||
- **Open Source** 🛠️
|
||||
|
||||
For comprehensive documentation and to get started with RDC using pre-built packages, refer to the [**ROCm Data Center Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/).
|
||||
|
||||
---
|
||||
> [!NOTE]
|
||||
> The published documentation is available at [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/latest/index.html) in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `rdc/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see [Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
|
||||
|
||||
## 🛠️ Installation Guide
|
||||
|
||||
|
||||
@@ -0,0 +1,35 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
|
||||
.. _components:
|
||||
|
||||
***************
|
||||
RDC components
|
||||
***************
|
||||
|
||||
The components of the RDC tool are illustrated in the following figure.
|
||||
|
||||
.. figure:: ../data/install_components.png
|
||||
|
||||
High-level diagram of RDC components
|
||||
|
||||
RDC (API) library
|
||||
-----------------
|
||||
|
||||
This library is the central component: it interacts with the other modules and provides all the RDC features. The shared library exposes a C API and Python bindings so that third-party tools can use it directly if required.
|
||||
|
||||
RDC daemon (``rdcd``)
|
||||
---------------------
|
||||
|
||||
The ``rdcd`` daemon records telemetry information from GPUs. It also provides an interface to the RDC command-line tool (``rdci``), which can run locally or remotely. It relies on the RDC library described above for all core features.
|
||||
|
||||
RDC command-line tool (``rdci``)
|
||||
--------------------------------
|
||||
|
||||
A command-line tool to invoke all the features of the RDC tool. This CLI can be run locally or remotely.
|
||||
|
||||
AMDSMI library
|
||||
--------------
|
||||
|
||||
A stateless system management library that provides low-level interfaces to access GPU information.
|
||||
@@ -1,297 +0,0 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
|
||||
.. _rdc-features:
|
||||
|
||||
******************************************
|
||||
RDC tool feature overview
|
||||
******************************************
|
||||
|
||||
This topic provides information related to the features of the RDC tool.
|
||||
|
||||
.. figure:: ../data/features.png
|
||||
|
||||
RDC components and framework for describing features
|
||||
|
||||
|
||||
Discovery
|
||||
=========
|
||||
|
||||
The Discovery feature enables you to locate and display information about the GPUs present in the compute node.
|
||||
|
||||
Example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci discovery <host_name> -l
|
||||
2 GPUs found
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **GPU Index**
|
||||
- **Device Information**
|
||||
|
||||
* - 0
|
||||
- Name: AMD Radeon Instinct MI50 Accelerator
|
||||
|
||||
* - 1
|
||||
- Name: AMD Radeon Instinct MI50 Accelerator
|
||||
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci -l : list available GPUs
|
||||
$ rdci -u: No SSL authentication
|
||||
|
||||
|
||||
Groups
|
||||
======
|
||||
|
||||
This section explains the GPU and field groups features.
|
||||
|
||||
GPU Groups
|
||||
----------
|
||||
|
||||
With the GPU groups feature, you can create, delete, and list logical groups of GPUs.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -c GPU_GROUP
|
||||
Successfully created a group with a group ID 1
|
||||
|
||||
$ rdci group -g 1 -a 0,1
|
||||
Successfully added the GPU 0,1 to group 1
|
||||
|
||||
$ rdci group -l
|
||||
|
||||
1 group found
|
||||
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **Group ID**
|
||||
- **Group Name**
|
||||
- **GPU Index**
|
||||
|
||||
* - 1
|
||||
- GPU_GROUP
|
||||
- 0, 1
|
||||
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -d 1
|
||||
Successfully removed group 1
|
||||
|
||||
-c create; -g group id; -a add GPU index; -l list; -d delete group
|
||||
|
||||
|
||||
Field Groups
|
||||
------------
|
||||
|
||||
The Field Groups feature provides you the options to create, delete, and list field groups.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci fieldgroup -c <fgroup> -f 150,155
|
||||
Successfully created a field group with a group ID 1
|
||||
|
||||
$ rdci fieldgroup -l
|
||||
|
||||
1 group found
|
||||
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **Group ID**
|
||||
- **Group Name**
|
||||
- **Field IDs**
|
||||
|
||||
* - 1
|
||||
- Fgroup
|
||||
- 150, 155
|
||||
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci fieldgroup -d 1
|
||||
Successfully removed field group 1
|
||||
|
||||
rdci dmon -l
|
||||
Supported fields Ids:
|
||||
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
|
||||
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
|
||||
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
|
||||
203 RDC_FI_GPU_UTIL: GPU busy percentage.
|
||||
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
|
||||
|
||||
-c create; -f field ids; -l list; -d delete field group
|
||||
|
||||
|
||||
Monitor Errors
|
||||
--------------
|
||||
|
||||
You can define ``RDC_FI_ECC_CORRECT_TOTAL`` or ``RDC_FI_ECC_UNCORRECT_TOTAL`` field to get the RAS Error-Correcting Code (ECC) counter:
|
||||
|
||||
* 312 ``RDC_FI_ECC_CORRECT_TOTAL``: Accumulated correctable ECC errors
|
||||
* 313 ``RDC_FI_ECC_UNCORRECT_TOTAL``: Accumulated uncorrectable ECC errors
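The field IDs listed in this section can be mapped to a field group without retyping the numbers. The following is a minimal Python sketch, not part of RDC: the ``FIELD_IDS`` table is copied from the listings above, while ``fieldgroup_create_args`` and the group name ``ecc_errors`` are hypothetical names chosen for illustration. It only builds the argument list for ``rdci fieldgroup -c <name> -f <ids>`` in the form used earlier and does not run the command.

```python
# Field IDs copied from the `rdci dmon -l` listing and the ECC list above.
FIELD_IDS = {
    "RDC_FI_GPU_CLOCK": 100,
    "RDC_FI_GPU_TEMP": 150,
    "RDC_FI_POWER_USAGE": 155,
    "RDC_FI_GPU_UTIL": 203,
    "RDC_FI_ECC_CORRECT_TOTAL": 312,
    "RDC_FI_ECC_UNCORRECT_TOTAL": 313,
    "RDC_FI_GPU_MEMORY_USAGE": 525,
}


def fieldgroup_create_args(group_name, field_names):
    """Return the argv list for `rdci fieldgroup -c <name> -f <id,id,...>`.

    Hypothetical helper: unknown field names raise KeyError.
    """
    ids = ",".join(str(FIELD_IDS[name]) for name in field_names)
    return ["rdci", "fieldgroup", "-c", group_name, "-f", ids]


# "ecc_errors" is an illustrative group name, not one defined by RDC.
ecc_args = fieldgroup_create_args(
    "ecc_errors",
    ["RDC_FI_ECC_CORRECT_TOTAL", "RDC_FI_ECC_UNCORRECT_TOTAL"],
)
print(" ".join(ecc_args))
```

The resulting list can be passed to ``subprocess.run`` on a node where ``rdci`` is installed.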
|
||||
|
||||
|
||||
Device Monitoring
|
||||
=================
|
||||
|
||||
The RDC tool enables you to monitor the GPU fields.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
|
||||
|
||||
|
||||
1 group found
|
||||
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **GPU Index**
|
||||
- **TEMP (m°C)**
|
||||
- **POWER (µW)**
|
||||
|
||||
* - 0
|
||||
- 25000
|
||||
- 520500
|
||||
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rdci dmon -l
|
||||
Supported fields Ids:
|
||||
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
|
||||
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
|
||||
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
|
||||
203 RDC_FI_GPU_UTIL: GPU busy percentage.
|
||||
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
|
||||
|
||||
-e field ids; -i GPU index; -c count; -d delay; -l list; -f fieldgroup id
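The monitored values are reported in small units: temperature in milli-degrees Celsius and power in microwatts. A minimal Python sketch of the conversion to more familiar units follows. The function names are hypothetical, and the sample numbers are the ones from the table above.

```python
def milli_celsius_to_celsius(value):
    """Convert an RDC_FI_GPU_TEMP reading (milli-degrees Celsius) to degrees Celsius."""
    return value / 1000.0


def microwatts_to_watts(value):
    """Convert an RDC_FI_POWER_USAGE reading (microwatts) to watts."""
    return value / 1_000_000.0


# Sample readings for GPU 0, taken from the table above.
temp_c = milli_celsius_to_celsius(25000)
power_w = microwatts_to_watts(520500)
print(f"GPU 0: {temp_c:.1f} degC, {power_w:.4f} W")
```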
|
||||
|
||||
|
||||
Job Stats
|
||||
=========
|
||||
|
||||
You can display GPU statistics for any given workload.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci stats -s 2 -g 1
|
||||
Successfully started recording job 2 with a group ID 1
|
||||
|
||||
$ rdci stats -j 2
|
||||
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **Summary**
|
||||
- **Executive Status**
|
||||
|
||||
* - Start time
|
||||
- 1586795401
|
||||
|
||||
* - End time
|
||||
- 1586795445
|
||||
|
||||
* - Total execution time
|
||||
- 44
|
||||
|
||||
* - ==============
|
||||
- ==============
|
||||
|
||||
* - Energy Consumed (Joules)
|
||||
- 21682
|
||||
|
||||
* - Power Usage (Watts)
|
||||
- Max: 49 Min: 13 Avg: 34
|
||||
|
||||
* - GPU Clock (MHz)
|
||||
- Max: 1000 Min: 300 Avg: 903
|
||||
|
||||
* - GPU Utilization (%)
|
||||
- Max: 69 Min: 0 Avg: 2
|
||||
|
||||
* - Max GPU Memory Used (bytes)
|
||||
- 524320768
|
||||
|
||||
* - Memory Utilization (%)
|
||||
- Max: 12 Min: 11 Avg: 12
|
||||
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci stats -x 2
|
||||
Successfully stopped recording job 2
|
||||
|
||||
-s start recording on job id; -g group id; -j display job stats; -x stop recording.
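The start and end times in the sample summary look like Unix timestamps, and the total execution time is their difference. The Python sketch below checks that arithmetic with the values from the table above. Treating the timestamps as seconds since the Unix epoch is an assumption, and ``to_utc`` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Values copied from the sample `rdci stats -j 2` summary above.
start_time = 1586795401
end_time = 1586795445


def to_utc(timestamp):
    """Interpret a job timestamp as seconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


execution_time = end_time - start_time
print("start:", to_utc(start_time).isoformat())
print("end:  ", to_utc(end_time).isoformat())
print("total execution time (s):", execution_time)
```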
|
||||
|
||||
|
||||
Job Stats Use Case
|
||||
------------------
|
||||
|
||||
A common use case is to record GPU statistics associated with any job or workload. The following example shows how all these features can be put together for this use case:
|
||||
|
||||
|
||||
.. figure:: ../data/features_jobs.png
|
||||
|
||||
An example showing how job statistics can be recorded
|
||||
|
||||
|
||||
rdci commands
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -c group1
|
||||
|
||||
successfully created a group with a group ID 1
|
||||
|
||||
$ rdci group -g 1 -a 0,1
|
||||
|
||||
GPU 0,1 is added to group 1 successfully.
|
||||
|
||||
rdci stats -s 123 -g 1
|
||||
|
||||
job 123 recorded successfully with the group ID
|
||||
|
||||
rdci stats -x 123
|
||||
|
||||
job 123 stops recording successfully
|
||||
|
||||
rdci stats -j 123
|
||||
|
||||
job stats printed
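The command sequence above can also be driven from a script. The following Python sketch is one possible wrapper under stated assumptions: ``record_job`` and the ``./my_workload`` command are hypothetical, and only the ``rdci stats`` options shown in this section (``-s``, ``-g``, ``-x``, ``-j``) are used. By default it runs in dry-run mode and prints the commands instead of executing them.

```python
import shlex
import subprocess


def record_job(job_id, group_id, workload_argv, dry_run=True):
    """Record GPU statistics around a workload using the rdci commands above.

    With dry_run=True the commands are only printed, so the sketch can be
    tried on a machine without RDC installed.
    """
    job, group = str(job_id), str(group_id)
    executed = []

    def run(argv):
        executed.append(argv)
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)

    run(["rdci", "stats", "-s", job, "-g", group])  # start recording
    try:
        run(list(workload_argv))                     # the job itself
    finally:
        run(["rdci", "stats", "-x", job])            # always stop recording
    run(["rdci", "stats", "-j", job])                # display the job stats
    return executed


# "./my_workload" is a placeholder for the real job command.
commands = record_job(123, 1, ["./my_workload"])
```

The ``try``/``finally`` block stops the recording even if the workload fails, so a failed job does not leave a recording running.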
|
||||
|
||||
|
||||
Error-Correcting Code Output
|
||||
============================
|
||||
|
||||
In the job output, this feature prints out the Error-Correcting Code (ECC) errors while running the job.
|
||||
|
||||
Diagnostic
|
||||
==========
|
||||
|
||||
You can run diagnostics on a GPU group as shown below:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci diag -g <gpu_group>
|
||||
|
||||
No compute process: Pass
|
||||
Node topology check: Pass
|
||||
GPU parameters check: Pass
|
||||
Compute Queue ready: Pass
|
||||
System memory check: Pass
|
||||
=============== Diagnostic Details ==================
|
||||
No compute process: No processes running on any devices.
|
||||
Node topology check: No link detected.
|
||||
GPU parameters check: GPU 0 Critical Edge temperature in range.
|
||||
Compute Queue ready: Run binary search task on GPU 0 Pass.
|
||||
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
|
||||
|
||||
@@ -1,95 +1,129 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: RDC plugins, ROCm Data Center plugins, Integrating RDC, Integrating ROCm Data Center
|
||||
|
||||
.. _rdc-3rd-party:
|
||||
|
||||
******************************************
|
||||
**************************
|
||||
Third party integration
|
||||
******************************************
|
||||
**************************
|
||||
|
||||
This section lists available third-party plugins for the RDC tool such as Prometheus, Grafana, and Reliability, Availability and Serviceability (RAS).
|
||||
|
||||
Python bindings
|
||||
===============
|
||||
================
|
||||
|
||||
The RDC Tool provides a generic Python class RdcReader to simplify telemetry gathering. RdcReader simplifies usage by providing the following functionalities:
|
||||
The RDC tool provides a generic Python class ``RdcReader``, which simplifies telemetry gathering by providing the following functionalities:
|
||||
|
||||
* The user only needs to specify telemetry fields. RdcReader creates the necessary groups and fieldgroups, watch the fields, and fetch the fields.
|
||||
* The RdcReader can support embedded and standalone mode. The standalone mode can be with or without authentication.
|
||||
* In standalone mode, the RdcReader can automatically reconnect to rdcd if the connection is lost.
|
||||
* When rdcd is restarted, the previously created group and fieldgroup may be lost. The RdcReader can re-create them and watch the fields after reconnecting.
|
||||
* If the client is restarted, RdcReader can detect the groups and fieldgroups created before and avoid re-creating them.
|
||||
* A custom unit converter can be passed to RdcReader to override the default RDC unit.
|
||||
* ``RdcReader`` creates the necessary groups and fieldgroups, watches the fields, and fetches them for the telemetry fields specified by the user.
|
||||
* ``RdcReader`` can support embedded and standalone mode. The standalone mode can be with or without authentication.
|
||||
* In standalone mode, the ``RdcReader`` can automatically reconnect to ``rdcd`` if the connection is lost.
|
||||
* Restarting ``rdcd`` can cause previously created groups and fieldgroups to be lost. ``RdcReader`` can recreate them and resume watching the fields after reconnecting.
|
||||
* If the client is restarted, ``RdcReader`` can detect the previously created groups and fieldgroups and avoid recreating them.
|
||||
* A custom unit converter can be passed to ``RdcReader`` to override the default RDC unit.
|
||||
|
||||
See the sample program to monitor the power and GPU utilization using the ``RdcReader`` below:
|
||||
Here is a sample program to monitor the power and GPU utilization using the ``RdcReader``:
|
||||
|
||||
.. code-block:: python

    import time

    from RdcReader import RdcReader
    from RdcUtil import RdcUtil
    from rdc_bootstrap import *

    # Telemetry fields to monitor: power usage and GPU utilization
    default_field_ids = [
        rdc_field_t.RDC_FI_POWER_USAGE,
        rdc_field_t.RDC_FI_GPU_UTIL
    ]

    class SimpleRdcReader(RdcReader):
        def __init__(self):
            # ip_port=None selects the embedded mode
            RdcReader.__init__(self, ip_port=None, field_ids=default_field_ids, update_freq=1000000)

        def handle_field(self, gpu_index, value):
            # Print timestamp, GPU index, field name, and integer value
            field_name = self.rdc_util.field_id_string(value.field_id).lower()
            print("%d %d:%s %d" % (value.ts, gpu_index, field_name, value.value.l_int))

    if __name__ == '__main__':
        reader = SimpleRdcReader()
        while True:
            time.sleep(1)
            reader.process()
|
||||
|
||||
|
||||
In the sample program,
|
||||
|
||||
* Class ``SimpleRdcReader`` is derived from the ``RdcReader``.
|
||||
* The field ``ip_port=None`` in ``RdcReader`` dictates that RDC runs in the embedded mode.
|
||||
* ``SimpleRdcReader::process()`` fetches fields specified in ``default_field_ids``.
|
||||
* ``SimpleRdcReader::process()`` fetches fields specified in ``default_field_ids``.
|
||||
|
||||
.. note::
|
||||
``RdcReader.py`` can be found in the python_binding folder located at RDC install path.
|
||||
``RdcReader.py`` can be found in the ``python_binding`` folder located at RDC install path.
|
||||
|
||||
To run the example, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
# Ensure that RDC shared libraries are in the library path and
|
||||
# RdcReader.py is in PYTHONPATH
|
||||
|
||||
|
||||
$ python SimpleReader.py
|
||||
|
||||
.. _prometheus:
|
||||
|
||||
Prometheus plugin
|
||||
=================
|
||||
==================
|
||||
|
||||
Prometheus plugin helps to monitor events and send alerts. The Prometheus installation and integration details are given below.
|
||||
The Prometheus plugin helps to monitor events and send alerts. Prometheus installation and integration details are explained in the following sections.
|
||||
|
||||
Prometheus plugin installation
|
||||
------------------------------
|
||||
-------------------------------
|
||||
|
||||
RDC's Prometheus plugin ``rdc_prometheus.py`` can be found in the ``python_binding`` folder.
|
||||
|
||||
.. note::
|
||||
Ensure the Prometheus client is installed before the Prometheus plugin installation process.
|
||||
Here are the steps to install Prometheus:
|
||||
|
||||
1. Install Prometheus client:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ pip install prometheus_client
|
||||
|
||||
$ pip install prometheus_client
|
||||
|
||||
2. Run the Prometheus plugin:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ python rdc_prometheus.py
|
||||
|
||||
3. Verify plugin:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ curl localhost:5000
|
||||
|
||||
gpu_util{gpu_index="0"} 0.0
|
||||
gpu_clock{gpu_index="0"} 300.0
|
||||
gpu_memory_total{gpu_index="0"} 4294.0
|
||||
power_usage{gpu_index="0"} 9.0
|
||||
gpu_memory_usage{gpu_index="0"} 134.0
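The verification output is plain text, one sample per line, so it is easy to check from a script. The Python sketch below parses lines of the form shown above into a dictionary keyed by metric name and GPU index. ``parse_metrics`` is a hypothetical helper, and the regular expression assumes that ``gpu_index`` is the only label, as in the sample output.

```python
import re

# Sample output copied from the `curl localhost:5000` example above.
SAMPLE = """\
gpu_util{gpu_index="0"} 0.0
gpu_clock{gpu_index="0"} 300.0
gpu_memory_total{gpu_index="0"} 4294.0
power_usage{gpu_index="0"} 9.0
gpu_memory_usage{gpu_index="0"} 134.0
"""

# Assumes a single gpu_index label per sample, as in the output above.
LINE_RE = re.compile(r'^(?P<name>\w+)\{gpu_index="(?P<gpu>\d+)"\}\s+(?P<value>\S+)$')


def parse_metrics(text):
    """Return {(metric_name, gpu_index): value}, skipping comments and blanks."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            metrics[(match["name"], int(match["gpu"]))] = float(match["value"])
    return metrics


metrics = parse_metrics(SAMPLE)
print(metrics[("gpu_clock", 0)])
```

To check a live plugin, the same function can be applied to the body returned by an HTTP GET request to ``localhost:5000``.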
|
||||
|
||||
By default, the plugin runs in standalone mode and connects to ``rdcd`` at ``localhost:50051`` to fetch fields. Ensure that the plugin uses the same authentication mode as ``rdcd``. For example, if ``rdcd`` runs with the ``-u``/``--unauth`` option, the plugin must also use the ``--rdc_unauth`` option.
|
||||
|
||||
**Useful options:**
|
||||
|
||||
- To run the plugin in unauthenticated mode, use the ``--rdc_unauth`` option.
|
||||
|
||||
- To use the plugin in the embedded mode without ``rdcd``, set the ``--rdc_embedded`` option.
|
||||
|
||||
- To override the default fields that are monitored, use the ``--rdc_fields`` option to specify the list of fields.
|
||||
|
||||
- If the list of fields is long, use the ``--rdc_fields_file`` option to read it from a file.
|
||||
|
||||
- To control how the fields are cached, use the ``max_keep_age`` and ``max_keep_samples`` options.
|
||||
|
||||
- To see the metrics of the plugin itself, including the plugin process CPU, memory, file descriptor usage, native thread count, and process start time and uptime, set the ``--enable_plugin_monitoring`` option.
|
||||
|
||||
To view the options provided with the plugin, use ``--help``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
% python rdc_prometheus.py --help
|
||||
usage: rdc_prometheus.py [-h] [--listen_port LISTEN_PORT] [--rdc_embedded]
|
||||
[--rdc_ip_port RDC_IP_PORT] [--rdc_unauth]
|
||||
@@ -100,9 +134,9 @@ To view the options provided with the plugin, use ``--help``.
|
||||
[--rdc_fields_file RDC_FIELDS_FILE]
|
||||
[--rdc_gpu_indexes RDC_GPU_INDEXES [RDC_GPU_INDEXES ...]]
|
||||
[--enable_plugin_monitoring]
|
||||
|
||||
|
||||
RDC Prometheus plugin.
|
||||
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--listen_port LISTEN_PORT
|
||||
@@ -136,52 +170,21 @@ To view the options provided with the plugin, use ``--help``.
|
||||
Set this option to collect process metrics of
|
||||
the plugin itself (default: false)
|
||||
|
||||
Prometheus integration
|
||||
-----------------------
|
||||
|
||||
By default, the plugin runs in the standalone mode and connects to ``rdcd`` at ``localhost:50051`` to fetch fields. The plugin should use the same authentication mode as ``rdcd``, e.g., if ``rdcd`` is running with ``-u``/``--unauth`` flag, the plugin should use ``--rdc_unauth`` flag. You can use the plugin in the embedded mode without ``rdcd`` by setting ``--rdc_embedded`` flag.
|
||||
To integrate the Prometheus plugin with RDC, follow these steps:
|
||||
|
||||
To override the default fields that are monitored, you can use the ``--rdc_fields`` option to specify the list of fields. If the fields list is long, the ``--rdc_fields_file`` option provides a convenient way to fetch fields list from a file. You can use the ``max_keep_age`` and ``max_keep_samples`` to control how the fields are cached.
|
||||
1. `Download and install Prometheus <https://github.com/prometheus/prometheus>`_ on the management machine.
|
||||
|
||||
The plugin can provide the metrics of the plugin itself, including the plugin process CPU, memory, file descriptor usage, and native threads count, including the process start and uptimes. You can enable this using ``--enable_plugin_monitoring``.
|
||||
2. Configure Prometheus targets:
|
||||
|
||||
You can test the plugin with the default settings.
|
||||
Use the example configuration file ``rdc_prometheus_example.yml`` in the ``python_binding`` folder. This file refers to ``prometheus_targets.json``.
|
||||
Modify ``prometheus_targets.json`` to point to your compute nodes.
|
||||
Ensure that this is modified to point to the correct compute nodes.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Ensure that rdcd is running on the same machine
|
||||
$ python rdc_prometheus.py
|
||||
|
||||
# Check the plugin using curl
|
||||
$ curl localhost:5000
|
||||
# HELP gpu_util gpu_util
|
||||
# TYPE gpu_util gauge
|
||||
gpu_util{gpu_index="0"} 0.0
|
||||
# HELP gpu_clock gpu_clock
|
||||
# TYPE gpu_clock gauge
|
||||
gpu_clock{gpu_index="0"} 300.0
|
||||
# HELP gpu_memory_total gpu_memory_total
|
||||
# TYPE gpu_memory_total gauge
|
||||
gpu_memory_total{gpu_index="0"} 4294.0
|
||||
# HELP gpu_temp gpu_temp
|
||||
# TYPE gpu_temp gauge
|
||||
# HELP power_usage power_usage
|
||||
# TYPE power_usage gauge
|
||||
power_usage{gpu_index="0"} 9.0
|
||||
# HELP gpu_memory_usage gpu_memory_usage
|
||||
# TYPE gpu_memory_usage gauge
|
||||
gpu_memory_usage{gpu_index="0"} 134.0
|
||||
|
||||
|
||||
Prometheus Integration
|
||||
----------------------
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. `Download and install Prometheus <https://github.com/prometheus/prometheus>`_ in the management machine.
|
||||
|
||||
2. Use the example configuration file ``rdc_prometheus_example.yml`` in the python_binding folder. You can use this file in its original state. However, note that this file refers to ``prometheus_targets.json``. Ensure that this is modified to point to the correct compute nodes.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
// Sample file: prometheus_targets.json
|
||||
// Replace rdc_test*.amd.com to point the correct compute nodes
|
||||
// Add as many compute nodes as necessary
|
||||
@@ -194,232 +197,215 @@ Follow these steps:
|
||||
}
|
||||
]
|
||||
|
||||
.. note::
|
||||
|
||||
.. note::
|
||||
In the above example, there are two compute nodes, ``rdc_test1.amd.com`` and ``rdc_test2.amd.com``. Ensure that the Prometheus plugin is running on those compute nodes.
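When the list of compute nodes changes often, the targets file can be generated instead of edited by hand. The Python sketch below is illustrative only. It assumes that the file follows the Prometheus file-based service discovery layout (a JSON list of objects with a ``targets`` array) and that the plugin listens on port 5000, as in the ``curl`` check earlier. ``build_targets`` is a hypothetical helper, and the sample file shipped with RDC may be organized differently.

```python
import json


def build_targets(nodes, port=5000):
    """Return a file_sd-style target list for the given compute nodes.

    Assumption: every node runs the RDC Prometheus plugin on the same port.
    """
    return [{"targets": [f"{node}:{port}" for node in nodes]}]


# Host names taken from the example above.
targets = build_targets(["rdc_test1.amd.com", "rdc_test2.amd.com"])
targets_json = json.dumps(targets, indent=2)
print(targets_json)
```

Writing ``targets_json`` to ``prometheus_targets.json`` would replace the manual edit described in step 2.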
|
||||
|
||||
3. Start the Prometheus plugin.
|
||||
3. Start Prometheus with the example configuration file.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
% prometheus --config.file=<full path of the rdc_prometheus_example.yml>
|
||||
|
||||
% prometheus --config.file=<full path of the rdc_prometheus_example.yml>
|
||||
|
||||
4. From the management node, using a browser, open the URL http://localhost:9090.
|
||||
4. From the management node, open the URL http://localhost:9090 in the browser.
|
||||
|
||||
5. Select one of the available metrics.
|
||||
|
||||
Example: gpu_clock
|
||||
------------------
|
||||
5. Select one of the available metrics.
|
||||
|
||||
.. figure:: ../data/integration_gpu_clock.png
|
||||
|
||||
The Prometheus image showing the GPU clock for both rdc_test1 and rdc_test2.
|
||||
Prometheus image showing GPU clock for both rdc_test1 and rdc_test2.
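The same metric can be read without a browser through the Prometheus HTTP API. The Python sketch below builds an instant-query URL for ``gpu_clock`` and extracts one value per scraped instance from the JSON response. ``query_url``, ``values_by_instance``, and ``fetch`` are hypothetical helpers, and the ``example_response`` dictionary is a hand-written illustration of the response shape, not captured output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(metric, base="http://localhost:9090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': metric})}"


def values_by_instance(response):
    """Map each instance label to its current sample value."""
    return {
        result["metric"].get("instance", "unknown"): float(result["value"][1])
        for result in response["data"]["result"]
    }


def fetch(metric):
    """Query a running Prometheus server (requires network access)."""
    with urlopen(query_url(metric), timeout=5) as resp:
        return values_by_instance(json.load(resp))


# Illustrative response shape only; values are made up for the example.
example_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "gpu_clock", "instance": "rdc_test1.amd.com:5000"},
             "value": [1586795401, "300"]},
        ],
    },
}
print(query_url("gpu_clock"))
print(values_by_instance(example_response))
```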
|
||||
|
||||
|
||||
Grafana Plugin
|
||||
==============
|
||||
Grafana plugin
|
||||
===============
|
||||
|
||||
Grafana is a common monitoring stack used for storing and visualizing time series data. Prometheus acts as the storage backend, and Grafana is used as the interface for analysis and visualization. Grafana has a plethora of visualization options and can be integrated with Prometheus for RDC's dashboard.
|
||||
|
||||
|
||||
Grafana Plugin Installation
|
||||
---------------------------
|
||||
Grafana plugin installation
|
||||
----------------------------
|
||||
|
||||
To install Grafana plugin, follow these steps:
|
||||
|
||||
1. `Download Grafana <https://grafana.com/grafana/download>`_
|
||||
1. `Download Grafana <https://grafana.com/grafana/download>`_.
|
||||
|
||||
2. Read the `Installation instructions <https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/>`_ to install Grafana
|
||||
2. Follow the instructions to `install Grafana <https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/>`_.
|
||||
|
||||
3. To start Grafana, follow these instructions:
|
||||
3. To start Grafana, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ sudo systemctl start grafana-server
|
||||
$ sudo systemctl status grafana-server
|
||||
|
||||
4. Browse to http://localhost:3000/.
|
||||
4. Open http://localhost:3000/ in the browser.
|
||||
|
||||
5. Log in using the default username and password (``admin``/``admin``) as shown in the image below:
|
||||
5. Log in using the default username and password (``admin``/``admin``) as shown in the following image:
|
||||
|
||||
.. figure:: ../data/integration_login.png
|
||||
|
||||
|
||||
Grafana Integration
|
||||
-------------------
|
||||
Grafana integration
|
||||
--------------------
|
||||
|
||||
As a prerequisite, ensure:
|
||||
|
||||
* The RDC Prometheus plugin is running in each compute node.
|
||||
* The :ref:`RDC Prometheus plugin <prometheus>` is running in each compute node.
|
||||
* Prometheus is set up to collect metrics from the plugin.
|
||||
|
||||
For more information about installing and configuring Prometheus, see the section on `Prometheus Plugin <https://docs.amd.com/bundle/ROCm-DataCenter-Tool-User-Guide-v5.3/page/Data_Center_Tool_Third-Party_Integration.html#_Prometheus_Plugin>`_.
|
||||
|
||||
|
||||
Grafana Configuration
|
||||
Grafana configuration
|
||||
---------------------
|
||||
|
||||
Follow these steps:
|
||||
First, add Prometheus as a data source using the following steps:
|
||||
|
||||
1. Click Configuration.
|
||||
1. Go to `Configuration`.
|
||||
|
||||
.. image:: ../data/integration_config1.png
|
||||
|
||||
2. Select Data Sources, as shown in the image below:
|
||||
2. Select `Data Sources`.
|
||||
|
||||
.. image:: ../data/integration_config2.png
|
||||
|
||||
3. Click Add data source.
|
||||
3. Go to `Add data source`.
|
||||
|
||||
.. image:: ../data/integration_config3.png
|
||||
|
||||
4. Select Prometheus.
|
||||
4. Select `Prometheus`.
|
||||
|
||||
.. image:: ../data/integration_config4.png
|
||||
|
||||
.. note::
|
||||
Ensure the name of the data source is ``Prometheus``. If Prometheus and Grafana are running on the same machine, use the default URL http://localhost:9090. Otherwise, ensure the URL matches the Prometheus URL, save, and test it.
|
||||
|
||||
Ensure the name of the data source is `Prometheus`. If `Prometheus` and `Grafana` are running on the same machine, use the default URL http://localhost:9090. Otherwise, ensure the URL matches the `Prometheus` URL, save, and test it.
|
||||
|
||||
.. image:: ../data/integration_config5.png
|
||||
|
||||
5. To import RDC dashboard, click ``+`` and select ``Import``.
|
||||
Then, import the RDC dashboard using the following steps:
|
||||
|
||||
6. Click the ``Upload.json`` file command.
|
||||
1. Go to `+` and select `Import`.
|
||||
|
||||
7. Choose ``rdc_grafana_dashboard_example.json`` which is in the python_binding folder.
|
||||
2. Upload ``rdc_grafana_dashboard_example.json`` from the ``python_binding`` folder.
|
||||
|
||||
8. Import the ``rdc_grafana_dashboard_example.json`` file and select the desired compute node on the dashboard, as shown in the image below:
|
||||
3. Select the desired compute node for visualization.
|
||||
|
||||
.. image:: ../data/integration_config6.png
|
||||
|
||||
Prometheus (Grafana) integration with automatic node detection
|
||||
==============================================================
|
||||
|
||||
RDC enables you to use Consul to discover the ``rdc_prometheus`` service automatically. Consul is “a service mesh solution providing a fully featured control plane with service discovery, configuration, and segmentation functionality.” For more information, refer to `Consul <https://developer.hashicorp.com/consul/docs/intro>`_.
|
||||
RDC can use `Consul` to discover the ``rdc_prometheus`` service automatically. `Consul` is a service mesh solution providing a fully featured control plane with service discovery, configuration, and segmentation functionality. For more information, see `Consul <https://developer.hashicorp.com/consul/docs/intro>`_.
|
||||
|
||||
RDC uses Consul for health checks of RDC's integration with the Prometheus plug-in (``rdc_prometheus``), and these checks provide information on its efficiency.
|
||||
RDC uses `Consul` for health checks of RDC's integration with the `Prometheus` plugin (``rdc_prometheus``). These checks provide information on its efficiency.
|
||||
|
||||
Previously, when a new compute node was added, users had to manually change ``prometheus_targets.json`` to use Consul. Now, with the Consul agent integration, a new compute node can be discovered automatically.
|
||||
With the `Consul` agent integration, a new compute node can be discovered automatically, which saves users from manually changing ``prometheus_targets.json`` to use `Consul`.
|
||||
|
||||
Installing the Consul Agent for Compute and Management Nodes
|
||||
Installing the Consul agent for compute and management nodes
|
||||
------------------------------------------------------------
|
||||
|
||||
To install the latest Consul agent for compute and management nodes, follow the instructions below:
|
||||
To install the latest `Consul` agent for compute and management nodes, follow these steps:
|
||||
|
||||
1. Set up the apt repository to download and install the Consul agent.
|
||||
1. To download and install the ``Consul`` agent, set up the ``apt`` repository:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
|
||||
$ sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
|
||||
$ sudo apt-get update && sudo apt-get install consul
|
||||
|
||||
2. Generate a key to encrypt the communication between `Consul` agents. The same key is used by both the compute and management nodes for communication.
|
||||
|
||||
2. Generate a key to encrypt the communication between consul agents. Note that you can generate the key once, and both the compute and management nodes use the same key for communication.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ consul keygen
|
||||
|
||||
|
||||
For the purposes of this feature documentation, the following key is used in the configuration file:
|
||||
For demonstration purposes, the following key is used in the configuration file:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ consul keygen
|
||||
4lgGQXr3/R2QeTi5vEp7q5Xs1KoYBhCsk9+VgJZZHAo=
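Because the same key must be copied to every node, a quick sanity check can catch a truncated or mistyped value before the agents fail to join. The Python sketch below decodes the Base64 key and reports its length in bytes. ``decoded_key_length`` is a hypothetical helper, and comparing against the 32-byte length of the sample key is an assumption about the key size that your ``consul keygen`` version produces.

```python
import base64
import binascii


def decoded_key_length(key):
    """Return the number of raw bytes in a Base64-encoded key.

    Raises ValueError if the text is not valid Base64.
    """
    try:
        return len(base64.b64decode(key, validate=True))
    except binascii.Error as err:
        raise ValueError(f"not a valid Base64 key: {err}") from err


# Demonstration key from the example above.
sample_key = "4lgGQXr3/R2QeTi5vEp7q5Xs1KoYBhCsk9+VgJZZHAo="
key_length = decoded_key_length(sample_key)
print(key_length)
```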
|
||||
|
||||
|
||||
Setting up the Consul Server in Management Nodes
|
||||
------------------------------------------------
|
||||
Setting up the Consul server in management nodes
|
||||
-------------------------------------------------
|
||||
|
||||
While Consul can function with one server, it is recommended to use three to five servers to avoid failure scenarios, which often lead to data loss.
|
||||
While ``Consul`` can function with one server, it's recommended to use three to five servers to avoid failure scenarios leading to data loss.
|
||||
|
||||
.. note::
|
||||
For example purposes, the configuration settings documented below are for a single server.
|
||||
For demonstration purposes, the configuration settings documented below are for a single server.
|
||||
|
||||
Follow these steps:
|
||||
To set up ``Consul`` server, follow these steps:
|
||||
|
||||
1. Create a configuration file ``/etc/consul.d/server.hcl``.
|
||||
1. Create a configuration file ``/etc/consul.d/server.hcl``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
server = true
|
||||
encrypt = "<CONSUL_ENCRYPTION_KEY>"
|
||||
bootstrap_expect = 1
|
||||
ui = true
|
||||
client_addr = "0.0.0.0"
|
||||
bind_addr = "<The IP address can be reached by client>"
|
||||
|
||||
server = true
|
||||
encrypt = "<CONSUL_ENCRYPTION_KEY>"
|
||||
bootstrap_expect = 1
|
||||
ui = true
|
||||
client_addr = "0.0.0.0"
|
||||
bind_addr = "<The IP address can be reached by client>"
|
||||
|
||||
2. Run the agent in server mode, and set the encrypt to the key generated in the first step. The bootstrap_expect variable indicates the number of servers required to form the first Consul cluster.
|
||||
Here is how to use the variables in the configuration file:
|
||||
|
||||
3. Set the number of servers to 1 to allow a cluster with a single server.
|
||||
* Run the agent in server mode by setting ``server`` to ``true``.
|
||||
* Set ``encrypt`` to the key generated in the first step.
|
||||
* The ``bootstrap_expect`` variable indicates the number of servers required to form the first `Consul` cluster. Set this variable to ``1`` to allow a cluster with a single server.
|
||||
* The User Interface (``ui``) variable when set to ``true`` enables the Consul web UI.
|
||||
* The ``client_addr`` variable is used to connect the API and UI.
|
||||
* The ``bind_addr`` variable is used to connect the client to the server. If you have multiple private IP addresses, use the address that can connect to a client.
|
||||
|
||||
* The User Interface (UI) variable is used to enable the Consul Web UI.
|
||||
* The client_addr variable is used to connect the API and UI.
|
||||
* The bind_addr variable is used to connect the client to the server. If you have multiple private IP addresses, use the address that can connect to a client.
|
||||
|
||||
4. Start the agent using the following instruction:
|
||||
2. Start the agent.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo consul agent -config-dir=/etc/consul.d/
|
||||
|
||||
$ sudo consul agent -config-dir=/etc/consul.d/
|
||||
|
||||
5. Browse to http://localhost:8500/ on the management node. You will see a single instance running.
|
||||
3. Browse to http://localhost:8500/ on the management node to see a single instance running.
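When several management nodes need the same file, the ``server.hcl`` content from step 1 can be generated instead of typed by hand. The following Python sketch only assembles the text; the helper name ``render_server_hcl`` is hypothetical, and the address ``10.4.22.70`` is simply the management-node address that appears in the sample output later in this section.

```python
def render_server_hcl(encrypt_key: str, bind_addr: str, bootstrap_expect: int = 1) -> str:
    """Build the text of a Consul server configuration file."""
    lines = [
        "server = true",
        f'encrypt = "{encrypt_key}"',
        f"bootstrap_expect = {bootstrap_expect}",
        "ui = true",
        'client_addr = "0.0.0.0"',
        f'bind_addr = "{bind_addr}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; substitute your own key and address.
    print(render_server_hcl("<CONSUL_ENCRYPTION_KEY>", "10.4.22.70"), end="")
```

Redirect the output to ``/etc/consul.d/server.hcl`` on the management node, then start the agent as described in step 2.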
|
||||
|
||||
|
||||
Setting up the Consul Client in Compute Nodes
|
||||
Setting up the Consul client in compute nodes
|
||||
---------------------------------------------
|
||||
|
||||
Follow these steps:
|
||||
To set up the `Consul` client, follow these steps:
|
||||
|
||||
1. Create a configuration file ``/etc/consul.d/client.hcl``.
|
||||
1. Create a configuration file ``/etc/consul.d/client.hcl``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
server = false
|
||||
encrypt = "<CONSUL_ENCRYPTION_KEY>"
|
||||
retry_join = ["<The consul server address>"]
|
||||
client_addr = "0.0.0.0"
|
||||
bind_addr = "<The IP address can reach server>"
|
||||
|
||||
|
||||
.. note::
|
||||
Use the same CONSUL_ENCRYPTION_KEY as the servers. In the retry_join, use the IP address of the management nodes.
|
||||
.. note::
|
||||
Use the same ``CONSUL_ENCRYPTION_KEY`` as the servers. In the ``retry_join``, use the IP address of the management nodes.
|
||||
|
||||
2. Start the Consul agent.
|
||||
2. Start the Consul agent.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo consul agent -config-dir=/etc/consul.d/
|
||||
|
||||
$ sudo consul agent -config-dir=/etc/consul.d/
|
||||
|
||||
The client has now joined the Consul.
|
||||
To check whether the client has joined the `Consul` cluster, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ consul members
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
management-node 10.4.22.70:8301 alive server 1.9.3 2 dc1 <all>
|
||||
compute-node 10.4.22.112:8301 alive client 1.9.3 2 dc1 <default>
|
||||
|
||||
3. Set up the `Consul` client to monitor the health of the RDC Prometheus plugin.
|
||||
|
||||
3. Set up the Consul client to monitor the health of the RDC Prometheus plugin.
|
||||
|
||||
4. Start the RDC Prometheus plugin.
|
||||
4. Start the RDC Prometheus plugin.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ python rdc_prometheus.py --rdc_embedded
|
||||
|
||||
$ python rdc_prometheus.py --rdc_embedded
|
||||
|
||||
5. Add the configuration file /etc/consul.d/rdc_prometheus.hcl.
|
||||
5. Add the configuration file ``/etc/consul.d/rdc_prometheus.hcl``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
{
|
||||
"service": {
|
||||
"name": "rdc_prometheus",
|
||||
@@ -439,36 +425,34 @@ The client has now joined the Consul.
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
.. note::
|
||||
By default, the Prometheus plugin uses port 5000. If you do not use the default setting, ensure you change the configuration file accordingly.
|
||||
.. note::
|
||||
|
||||
After the configuration file is changed, restart the Consul client agent.
|
||||
By default, the `Prometheus` plugin uses port 5000. If you don't use the default setting, change the configuration file accordingly.
|
||||
|
||||
6. After updating the configuration file, restart the `Consul` client agent.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo consul agent -config-dir=/etc/consul.d/
|
||||
|
||||
$ sudo consul agent -config-dir=/etc/consul.d/
|
||||
|
||||
6. Enable the Prometheus integration in the Management node. For more information, refer to the Prometheus Integration section above.
|
||||
7. Enable the :ref:`Prometheus <prometheus>` integration in the management node.
|
||||
|
||||
7. In the Management node, inspect the service.
|
||||
8. In the management node, inspect the service.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ consul catalog nodes -service=rdc_prometheus
|
||||
|
||||
Node ID Address DC
|
||||
compute-node 76694ab1 10.4.22.112 dc1
|
||||
|
||||
$ consul catalog nodes -service=rdc_prometheus
|
||||
|
||||
8. Create a new Prometheus configuration rdc_prometheus_consul.yml file for the Consul integration.
|
||||
Node ID Address DC
|
||||
compute-node 76694ab1 10.4.22.112 dc1
|
||||
|
||||
9. Create a new `Prometheus` configuration ``rdc_prometheus_consul.yml`` file for the `Consul` integration.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
global:
|
||||
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
|
||||
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
|
||||
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
|
||||
scrape_configs:
|
||||
- job_name: 'consul'
|
||||
consul_sd_configs:
|
||||
@@ -480,63 +464,59 @@ After the configuration file is changed, restart the Consul client agent.
|
||||
- source_labels: [__meta_consul_service]
|
||||
target_label: job
|
||||
|
||||
|
||||
.. note::
|
||||
If you are not running the consul server and Prometheus in the same machine, change the server under consul_sd_configs to your consul server address.
|
||||
If the `Consul` server and `Prometheus` are not running on the same machine, change the server under ``consul_sd_configs`` to your `Consul` server address.
|
||||
|
||||
9. Start Prometheus.
|
||||
10. Start Prometheus.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ ./prometheus --config.file="rdc_prometheus_consul.yml"
|
||||
|
||||
$ ./prometheus --config.file="rdc_prometheus_consul.yml"
|
||||
|
||||
10. Browse the Prometheus UI at http://localhost:9090 on the Management node and query RDC Prometheus metrics. Ensure that the plugin starts before running the query.
|
||||
11. Browse the `Prometheus` UI at http://localhost:9090 on the management node and query RDC `Prometheus` metrics. Ensure that the plugin starts before running the query.
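The membership check from step 2 can be scripted when many compute nodes are involved. The following Python sketch parses text in the format of the ``consul members`` output shown above and lists the clients that report as alive. It assumes the whitespace-separated column layout of that sample; the function names are hypothetical.

```python
from typing import Dict, List

# Sample text in the same format as the `consul members` output above.
SAMPLE = """\
Node             Address           Status  Type    Build  Protocol  DC   Segment
management-node  10.4.22.70:8301   alive   server  1.9.3  2         dc1  <all>
compute-node     10.4.22.112:8301  alive   client  1.9.3  2         dc1  <default>
"""


def parse_members(text: str) -> List[Dict[str, str]]:
    """Turn the whitespace-separated table into one dict per node."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def alive_clients(text: str) -> List[str]:
    """Names of nodes whose Status is 'alive' and whose Type is 'client'."""
    return [
        member["Node"]
        for member in parse_members(text)
        if member.get("Status") == "alive" and member.get("Type") == "client"
    ]


if __name__ == "__main__":
    print(alive_clients(SAMPLE))
```

In practice, the text would come from running ``consul members`` on a node in the cluster rather than from a constant.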
|
||||
|
||||
Reliability, Availability, and Serviceability plugin
|
||||
=====================================================
|
||||
|
||||
Reliability, Availability, and Serviceability Plugin
|
||||
====================================================
|
||||
The Reliability, Availability, and Serviceability (RAS) plugin helps to monitor and count ECC (Error-Correcting Code) errors. The following sections provide information on integrating RAS with RDC.
|
||||
|
||||
The RAS plugin helps to gather and count errors. The details of RAS integration with RDC are given below.
|
||||
RAS plugin installation
|
||||
------------------------
|
||||
|
||||
RAS Plugin Installation
|
||||
-----------------------
|
||||
|
||||
In this release, RDC extends support to the Reliability, Availability, and Serviceability (RAS) integration. When the RAS feature is enabled in the graphic card, users can use RDC to monitor RAS errors.
|
||||
When the RAS feature is enabled on the GPU, you can use RDC to monitor RAS errors.
|
||||
|
||||
Prerequisite
|
||||
^^^^^^^^^^^^
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
You must ensure the graphic card supports RAS.
|
||||
- Ensure that the GPU supports RAS.
|
||||
|
||||
.. note::
|
||||
The RAS library is installed as part of the RDC installation, and no additional configuration is required for RDC.
|
||||
|
||||
RDC installation dynamically loads the RAS library ``librdc_ras.so``. The configuration files required by the RAS library are installed in the ``sp3`` and ``config`` folders.
|
||||
The RAS library is installed as part of the RDC installation. No additional configuration is required for RDC.
|
||||
|
||||
- RDC installation dynamically loads the RAS library ``librdc_ras.so``. The configuration files required by the RAS library are installed in the ``sp3`` and ``config`` folders.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
% ls /opt/rocm-4.2.0/rdc/lib
|
||||
... librdc_ras.so ...
|
||||
... sp3 ... config ...
|
||||
|
||||
|
||||
RAS Integration
|
||||
---------------
|
||||
RAS integration
|
||||
----------------
|
||||
|
||||
RAS exposes a list of Error-Correcting Code (ECC) correctable and uncorrectable errors for different IP blocks and enables users to successfully troubleshoot issues.
|
||||
RAS exposes a list of ECC correctable and uncorrectable errors for different IP blocks and helps to troubleshoot issues.
|
||||
|
||||
For example, the dmon command passes the ECC_CORRECT and ECC_UNCORRECT counters field id to the command.
|
||||
**Example:**
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ rdci dmon -i 0 -e 600,601
|
||||
|
||||
|
||||
The ``dmon`` command monitors GPU index 0, field 600, and 601, where 600 is for the ECC_CORRECT counter and 601 is for the ECC_UNCORRECT counter.
|
||||
The ``dmon`` command monitors GPU index 0 and fields 600 and 601, where 600 is the field ID for the ``ECC_CORRECT`` counter and 601 is the field ID for the ``ECC_UNCORRECT`` counter.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
% rdci dmon -l
|
||||
... ...
|
||||
600 RDC_FI_ECC_CORRECT_TOTAL : Accumulated Single Error Correction
|
||||
@@ -581,14 +561,15 @@ The ``dmon`` command monitors GPU index 0, field 600, and 601, where 600 is for
|
||||
639 RDC_FI_ECC_MPIO_UE : MPIO Uncorrectable Error
|
||||
... ...
|
||||
|
||||
To access the ECC correctable and uncorrectable error counters, use:
|
||||
|
||||
To access the ECC correctable and uncorrectable error counters, use the following command:
|
||||
.. _error-correction:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
% rdci dmon -i 0 -e 600,601
|
||||
|
||||
GPU ECC_CORRECT ECC_UNCORRECT
|
||||
0 0 0
|
||||
0 0 0
|
||||
0 0 0
|
||||
|
||||
|
||||
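For unattended monitoring, the counter output above can be scanned for nonzero values. The following Python sketch assumes the whitespace-separated layout of the sample output (a ``GPU`` column followed by ``ECC_`` counter columns); the function name ``gpus_with_ecc_errors`` is hypothetical and not part of RDC.

```python
from typing import Set

# Text in the same format as the `rdci dmon -i 0 -e 600,601` output above.
SAMPLE = """\
GPU ECC_CORRECT ECC_UNCORRECT
0 0 0
0 0 0
0 0 0
"""


def gpus_with_ecc_errors(dmon_output: str) -> Set[int]:
    """Return the GPU indices for which any ECC_ counter is greater than zero."""
    lines = [ln.split() for ln in dmon_output.splitlines() if ln.strip()]
    if not lines:
        return set()
    header, *rows = lines
    flagged = set()
    for row in rows:
        record = dict(zip(header, row))
        counters = [v for k, v in record.items() if k.startswith("ECC_")]
        # Non-numeric values such as N/A are skipped rather than treated as errors.
        if any(v.isdigit() and int(v) > 0 for v in counters):
            flagged.add(int(record["GPU"]))
    return flagged


if __name__ == "__main__":
    print(gpus_with_ecc_errors(SAMPLE))
```

An empty result means that no sampled row reported a correctable or uncorrectable error.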
@@ -1,5 +1,5 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
|
||||
.. _rdc-use:
|
||||
@@ -27,7 +27,7 @@ The audience for the AMD RDC tool consists of:
|
||||
* Administrators: RDC provides the cluster administrator with the capability of monitoring, validating, and configuring policies.
|
||||
* HPC Users: Provides GPU-centric feedback for their workload submissions.
|
||||
* OEM: Add GPU information to their existing cluster management software.
|
||||
* Open source Contributors: RDC is open source and accepts contributions from the community.
|
||||
* Open source Contributors: RDC is open source and accepts contributions from the community.
|
||||
|
||||
Objective
|
||||
=========
|
||||
@@ -47,25 +47,25 @@ Terminology
|
||||
* - **Terms**
|
||||
- **Description**
|
||||
|
||||
* - RDC
|
||||
* - RDC
|
||||
- ROCm Data Center tool
|
||||
|
||||
* - Compute node (CN)
|
||||
* - Compute node (CN)
|
||||
- One of many nodes containing one or more GPUs in the Data Center on which compute jobs are run
|
||||
|
||||
* - Management node (MN) or Main console
|
||||
* - Management node (MN) or Main console
|
||||
- A machine running system administration applications to administer and manage the Data Center
|
||||
|
||||
* - GPU Groups
|
||||
* - GPU Groups
|
||||
- Logical grouping of one or more GPUs in a compute node
|
||||
|
||||
* - Fields
|
||||
* - Fields
|
||||
- A metric that can be monitored by the RDC, such as GPU temperature, memory usage, and power usage
|
||||
|
||||
* - Field Groups
|
||||
* - Field Groups
|
||||
- Logical grouping of multiple fields
|
||||
|
||||
* - Job
|
||||
* - Job
|
||||
- A workload that is submitted to one or more compute nodes
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -0,0 +1,360 @@
|
||||
.. meta::
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: ROCm Data Center usage, RDC usage, RDC user manual, ROCm Data Center user manual, RDC tutorial, ROCm Data Center tutorial, RDC user guide, ROCm Data Center user guide
|
||||
|
||||
.. _using-RDC:
|
||||
|
||||
***********
|
||||
Using RDC
|
||||
***********
|
||||
|
||||
This topic describes how to use RDC and is intended for the following audiences:
|
||||
|
||||
* Administrators: RDC provides the cluster administrator with the capability of monitoring, validating, and configuring policies.
|
||||
* HPC users: RDC provides GPU-centric feedback for their workload submissions.
|
||||
* OEM: RDC adds GPU information to their existing cluster management software.
|
||||
* Open source contributors: RDC is open source and accepts contributions from the community.
|
||||
|
||||
Starting RDC
|
||||
============
|
||||
|
||||
You can start RDC from the command line either with the ``systemctl`` command or directly as a user. Both options are explained in the following sections. You can configure the capability of RDC by modifying the ``rdc.service`` system configuration file, which RDC reads from ``/etc/systemd/system``. If multiple RDC versions are installed, copy ``/opt/rocm-<x.y.z>/libexec/rdc/rdc.service`` from the desired RDC version to the ``/etc/systemd/system`` folder.
|
||||
|
||||
Starting RDC using systemctl
|
||||
-----------------------------
|
||||
|
||||
Here are the steps to start RDC using the ``systemctl`` command, which runs RDC in the background:
|
||||
|
||||
1. Copy the service file:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
|
||||
|
||||
2. Configure capabilities:
|
||||
|
||||
- Full capabilities: Uncomment the following lines in ``/etc/systemd/system/rdc.service``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
CapabilityBoundingSet=CAP_DAC_OVERRIDE
|
||||
AmbientCapabilities=CAP_DAC_OVERRIDE
|
||||
|
||||
- Monitor-only capabilities: Comment out the preceding lines in ``/etc/systemd/system/rdc.service``.
|
||||
|
||||
3. Start the service:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo systemctl start rdc
|
||||
sudo systemctl status rdc
|
||||
|
||||
4. Modify RDCD options:
|
||||
|
||||
Edit ``/opt/rocm/etc/rdc_options`` to append any additional RDCD parameters.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo nano /opt/rocm/etc/rdc_options
|
||||
|
||||
Example configuration:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
RDC_OPTS="-p 50051 -u -d"
|
||||
|
||||
Flags:
|
||||
|
||||
- ``-p 50051`` : Use port 50051
- ``-u`` : Unauthenticated mode
- ``-d`` : Enable debug messages
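To make the meaning of the three flags concrete, the following Python sketch reads an ``RDC_OPTS`` line and reports the settings it encodes. This is an illustration only: it models just the flags listed above, it is not how ``rdcd`` parses its own options, and the function name ``parse_rdc_opts`` is hypothetical.

```python
import argparse
import shlex


def parse_rdc_opts(line: str) -> argparse.Namespace:
    """Parse a line such as RDC_OPTS="-p 50051 -u -d" into named settings."""
    # Keep only the text to the right of '=' and drop the surrounding quotes.
    value = line.partition("=")[2].strip().strip("\"'")
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-p", dest="port", type=int, default=None)
    parser.add_argument("-u", dest="unauthenticated", action="store_true")
    parser.add_argument("-d", dest="debug", action="store_true")
    return parser.parse_args(shlex.split(value))


if __name__ == "__main__":
    print(parse_rdc_opts('RDC_OPTS="-p 50051 -u -d"'))
```

A script like this can be used to review what an existing ``rdc_options`` file requests before restarting the service.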
|
||||
|
||||
Starting RDC using command line as a user
|
||||
------------------------------------------
|
||||
|
||||
While ``systemctl`` is the preferred way to start RDC, you can also start RDC directly from the command line as a user, which runs RDC in the user's current terminal. By default, the user is defined as ``rdc`` in the ``rdc.service`` file:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
[Service]
|
||||
User=rdc
|
||||
Group=rdc
|
||||
|
||||
To change the user, edit the ``User`` setting in the ``rdc.service`` file.
To start the RDC server daemon (``rdcd``) as a user such as ``rdc`` or as ``root``, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Start as user rdc
|
||||
$ sudo -u rdc rdcd
|
||||
|
||||
# Start as root
|
||||
$ sudo rdcd
|
||||
|
||||
The RDC capability is determined by the privilege of the user starting ``rdcd``. For example, ``rdcd`` running under a normal user account has monitor-only capability and ``rdcd`` running as root has full capability.
|
||||
|
||||
.. note::
|
||||
|
||||
If a user other than ``rdc`` or ``root`` starts the ``rdcd`` daemon, the ownership of the SSL keys described in the :ref:`authentication <authentication>` section must be changed so that this user has read and write access.
|
||||
|
||||
.. _authentication:
|
||||
|
||||
Authentication
|
||||
===============
|
||||
|
||||
RDC supports encrypted communications between clients and servers.
|
||||
|
||||
You can enable or disable authentication for the communication between the client and server. By default, authentication is enabled.
|
||||
|
||||
To disable authentication, use the ``--unauth_comm`` or ``-u`` flag when starting the server. You must also use ``-u`` in ``rdci`` to access unauthenticated ``rdcd``. You can edit the ``rdc.service`` file to specify arguments to be passed while starting ``rdcd``. On the client side, the ``secure`` argument must be set to ``False`` when calling ``rdc_channel_create()``.
|
||||
The following sections provide information for setting up the ``rdcd`` server for authentication.
|
||||
|
||||
Generating keys and certificates using scripts
|
||||
------------------------------------------------
|
||||
|
||||
RDC users manage their own keys and certificates. However, the RDC source tree includes scripts in the authentication directory that generate self-signed certificates for test purposes. The following flowchart depicts how to generate the root certificates using the ``openssl`` command in ``01gen_root_cert.sh``:
|
||||
|
||||
.. figure:: ../data/handbook_openssl.png
|
||||
|
||||
Generation of root certificates using openssl command
|
||||
|
||||
You can specify the default responses to ``openssl`` questions in a section in the ``openssl.conf`` file. To locate the section in the ``openssl.conf`` file, look for the following comment:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# < ** REPLACE VALUES IN THIS SECTION WITH APPROPRIATE VALUES FOR YOUR ORG. **>
|
||||
|
||||
Modifying this section with values appropriate for your organization is helpful in cases where this script is called multiple times. Additionally, you must replace the dummy values and update the ``alt_names`` section for your environment.
|
||||
|
||||
To generate the keys and certificates using these scripts, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ 01gen_root_cert.sh
|
||||
# provide answers to posed questions
|
||||
$ 02gen_ssl_artifacts.sh
|
||||
# provide answers to posed questions
|
||||
|
||||
On running the preceding scripts, the keys and certificates are generated in the newly created ``CA/artifacts`` directory.
|
||||
|
||||
.. important::
|
||||
You must delete this directory before rerunning the scripts.
|
||||
|
||||
To install the keys and certificates, change to the artifacts directory and run the ``install_<client|server>.sh`` script as root, specifying the install location. The default install location is ``/etc/rdc``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cd CA/artifacts
|
||||
$ sudo install_<client|server>.sh /etc/rdc
|
||||
|
||||
These files must be copied and installed on all client and server machines expected to communicate with each other.
|
||||
|
||||
Known limitation
|
||||
-----------------
|
||||
|
||||
The client and server are hardcoded to look for the ``openssl`` certificate and key files in ``/etc/rdc``. No workaround is available for this.
|
||||
|
||||
Keys and certificates for authentication
|
||||
-----------------------------------------
|
||||
|
||||
Several SSL keys and certificates must be generated and installed on clients and servers for authentication to work properly. By default, the RDC server looks in the ``/etc/rdc`` folder for the following keys and certificates:
|
||||
|
||||
Client
|
||||
+++++++
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo tree /etc/rdc
|
||||
/etc/rdc
|
||||
|-- client
|
||||
|-- certs
|
||||
| |-- rdc_cacert.pem
|
||||
| |-- rdc_client_cert.pem
|
||||
|-- private
|
||||
|-- rdc_client_cert.key
|
||||
|
||||
Server
|
||||
+++++++
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo tree /etc/rdc
|
||||
/etc/rdc
|
||||
|-- server
|
||||
|-- certs
|
||||
| |-- rdc_cacert.pem
|
||||
| |-- rdc_server_cert.pem
|
||||
|-- private
|
||||
|-- rdc_server_cert.key
|
||||
|
||||
.. note::
|
||||
|
||||
Machines that act as both client and server contain both directory structures.
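Because authentication fails when any of these files is missing, it can help to check the layout before starting ``rdcd`` or ``rdci``. The following Python sketch lists the expected files that are absent. It assumes that ``certs`` and ``private`` sit directly under the ``client`` and ``server`` directories, as the trees above suggest; the names ``EXPECTED`` and ``missing_files`` are hypothetical.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the directory trees above.
EXPECTED = {
    "client": [
        "client/certs/rdc_cacert.pem",
        "client/certs/rdc_client_cert.pem",
        "client/private/rdc_client_cert.key",
    ],
    "server": [
        "server/certs/rdc_cacert.pem",
        "server/certs/rdc_server_cert.pem",
        "server/private/rdc_server_cert.key",
    ],
}


def missing_files(role: str, base: str = "/etc/rdc") -> List[str]:
    """Return the expected files for role that do not exist under base."""
    root = Path(base)
    return [rel for rel in EXPECTED[role] if not (root / rel).is_file()]


if __name__ == "__main__":
    # Reading the private directory may require elevated privileges.
    for role in ("client", "server"):
        print(role, missing_files(role))
```

An empty list for a role means that every file expected for that role is present. It does not validate the certificate contents.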
|
||||
|
||||
Modes of operation
|
||||
===================
|
||||
|
||||
RDC supports two primary modes of operation: *Standalone* and *Embedded*. The feature set is similar in both cases. Choose the mode that best fits your deployment needs.
|
||||
|
||||
The capability in each mode depends on the user privileges while starting the RDC tool. A normal user has access only to monitor (GPU telemetry) capabilities. A privileged user can run the tool with full capabilities. In the full capability mode, GPU configuration features can be invoked. The full capability mode might affect all the users and processes sharing the GPU.
|
||||
|
||||
Standalone mode
|
||||
-----------------
|
||||
|
||||
Standalone mode allows you to run RDC independently with all its components installed.
|
||||
This is the preferred mode of operation, as it does not have any external dependencies. To start RDC in standalone mode, ``rdcd`` must run on each compute node.
|
||||
|
||||
- Starting RDCD as a privileged user: A privileged user can run RDC with full capabilities.
|
||||
|
||||
- With authentication:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo /opt/rocm/bin/rdcd
|
||||
|
||||
- Without authentication:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo /opt/rocm/bin/rdcd -u
|
||||
|
||||
- Starting RDCD as a normal user: A normal user can run RDC with monitor-only capabilities.
|
||||
|
||||
- With authentication:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
/opt/rocm/bin/rdcd
|
||||
|
||||
- Without authentication:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
/opt/rocm/bin/rdcd -u
|
||||
|
||||
Embedded mode
|
||||
--------------
|
||||
|
||||
Embedded mode integrates RDC directly into your existing management tools using its library format.
|
||||
The embedded mode is especially useful for a monitoring agent running on the compute node. The monitoring agent can directly use the RDC library to achieve a fine-grain control on how and when to invoke the RDC features. For example, if the monitoring agent has a facility to synchronize across multiple nodes, it can synchronize GPU telemetry across these nodes.
|
||||
|
||||
The RDC daemon ``rdcd`` can be used as reference code for this purpose. Using the RDC library directly also eliminates the dependency on ``gRPC``.
|
||||
|
||||
- To run RDC in embedded mode, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
python your_management_tool.py --rdc_embedded
|
||||
|
||||
.. note::
|
||||
|
||||
Ensure that the ``rdcd`` daemon is not running separately when using embedded mode.
|
||||
|
||||
.. caution::
|
||||
|
||||
RDC command-line ``rdci`` doesn't function in this mode. Third-party monitoring software is responsible for providing the user interface and remote access or monitoring.
|
||||
|
||||
Troubleshooting RDC
|
||||
====================
|
||||
|
||||
The RDCD logs provide useful status and debugging information. The logs can also help debug problems like ``rdcd`` failing to start, communication issues with a client, and many more.
|
||||
|
||||
- View logs:
|
||||
|
||||
When ``rdcd`` is started using ``systemctl``, you can view the logs using:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ journalctl -u rdc
|
||||
|
||||
- Run RDCD with debug logs:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
RDC_LOG=DEBUG /opt/rocm/bin/rdcd
|
||||
|
||||
Logging levels supported: ``ERROR``, ``INFO``, ``DEBUG``.
|
||||
|
||||
- Enable additional logging messages:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export RSMI_LOGGING=3
|
||||
|
||||
If the GPU reset fails, restart the server. Note that restarting the server also initiates ``rdcd``. You might then encounter the following two scenarios:
|
||||
|
||||
- ``rdcd`` returns the correct GPU information to ``rdci``
|
||||
|
||||
- ``rdcd`` returns the ``No GPUs found on the system`` error to ``rdci``. To resolve this error, restart ``rdcd`` using:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo systemctl restart rdc
|
||||
|
||||
Known issues
|
||||
-------------
|
||||
|
||||
- dmon fields return N/A
|
||||
|
||||
**Reasons:**
|
||||
|
||||
- Missing libraries:
|
||||
|
||||
- Verify ``/opt/rocm/lib/rdc/librdc_*.so`` exists.
|
||||
- Ensure all related libraries such as ``rocprofiler``, ``rocruntime``, and others are present.
|
||||
|
||||
- Unsupported GPU:
|
||||
|
||||
- Most metrics work on MI300 and newer.
|
||||
- Limited metrics on MI200.
|
||||
- Consumer GPUs such as RX6800 have fewer supported metrics.
|
||||
|
||||
- dmon RocProfiler fields return zeros
|
||||
|
||||
**Solution:**
|
||||
|
||||
Set the ``HSA_TOOLS_LIB`` environment variable before running a compute job.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
|
||||
|
||||
**Example:**
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Terminal 1
|
||||
rdcd -u
|
||||
|
||||
# Terminal 2
|
||||
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
|
||||
gpu-burn
|
||||
|
||||
# Terminal 3
|
||||
rdci dmon -u -e 800,801 -i 0 -c 1
|
||||
|
||||
# Output:
|
||||
GPU OCCUPANCY_PERCENT ACTIVE_WAVES
|
||||
0 001.000 32640.000
|
||||
|
||||
- HSA_STATUS_ERROR_OUT_OF_RESOURCES
|
||||
|
||||
**Error message:**
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
terminate called after throwing an instance of 'std::runtime_error'
|
||||
what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
|
||||
Aborted (core dumped)
|
||||
|
||||
**Solution:**
|
||||
|
||||
Follow these steps to check for missing groups:
|
||||
|
||||
1. Ensure video and render groups exist.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo usermod -aG video,render $USER
|
||||
|
||||
2. Log out and log back in to apply the group changes.
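The group check in the preceding solution can be scripted. The following Python sketch reports which of the ``video`` and ``render`` groups are missing for the current process. It uses the Unix-only ``grp`` module, so it is intended for Linux hosts; the function names are hypothetical.

```python
import grp
import os
from typing import Iterable, List, Set


def missing_groups(current: Set[str], required: Iterable[str] = ("video", "render")) -> List[str]:
    """Return the required group names that are not in current."""
    return [name for name in required if name not in current]


def current_group_names() -> Set[str]:
    """Names of the groups that the running process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The group ID has no entry in the group database; skip it.
            pass
    return names


if __name__ == "__main__":
    print(missing_groups(current_group_names()))
```

Because group changes apply only to new login sessions, a group added with ``usermod`` continues to show as missing here until you log out and log back in.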
|
||||
@@ -0,0 +1,286 @@
|
||||
.. meta::
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: RDC features, ROCm Data Center features, RDC functionalities, ROCm Data Center functionalities
|
||||
|
||||
.. _rdc-features:
|
||||
|
||||
********************
|
||||
Using RDC features
|
||||
********************
|
||||
|
||||
This topic provides information related to the features of the RDC tool.
|
||||
|
||||
.. figure:: ../data/features.png
|
||||
|
||||
RDC components and framework for describing features
|
||||
|
||||
Discovery
|
||||
==========
|
||||
|
||||
The discovery feature locates and displays information about the GPUs present in the compute node.
|
||||
|
||||
Example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci discovery <host_name> -l
|
||||
2 GPUs found
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **GPU index**
|
||||
- **Device information**
|
||||
|
||||
* - 0
|
||||
- Name: AMD Radeon Instinct MI50 accelerator
|
||||
|
||||
* - 1
|
||||
- Name: AMD Radeon Instinct MI50 accelerator
|
||||
|
||||
To list available GPUs, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci discovery -l  # list available GPUs
|
||||
|
||||
Groups
|
||||
=======
|
||||
|
||||
This section explains the GPU and field groups features.
|
||||
|
||||
GPU groups
|
||||
-----------
|
||||
|
||||
With the GPU groups feature, you can create, delete, and list logical groups of GPUs.
|
||||
|
||||
|
||||
- To create a group, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -c GPU_GROUP
|
||||
Successfully created a group with a group ID 1
|
||||
|
||||
- To add GPUs to a group, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -g 1 -a 0,1
|
||||
Successfully added the GPU 0,1 to group 1
|
||||
|
||||
- To delete a group, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -d 1
|
||||
Successfully removed group 1
|
||||
|
||||
- To list groups, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -l
|
||||
1 group found
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **Group ID**
|
||||
- **Group name**
|
||||
- **GPU index**
|
||||
|
||||
* - 1
|
||||
- GPU_GROUP
|
||||
- 0, 1
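Later commands such as ``rdci group -g <id> -a ...`` need the group ID that is reported when the group is created. When scripting these steps, the ID can be read from the confirmation message. The following Python sketch assumes the message wording shown above; the function name ``extract_group_id`` is hypothetical.

```python
import re
from typing import Optional

# Matches the "group ID <number>" phrase in the confirmation message.
_GROUP_ID = re.compile(r"group ID (\d+)")


def extract_group_id(message: str) -> Optional[int]:
    """Return the group ID found in message, or None if there is none."""
    match = _GROUP_ID.search(message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_group_id("Successfully created a group with a group ID 1"))
```

Returning ``None`` instead of raising an error lets the calling script decide how to handle an unexpected message.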
|
||||
|
||||
Field groups
|
||||
-------------
|
||||
|
||||
The field groups feature lets you create, delete, and list field groups, which you can then use to monitor specific GPU metrics.
|
||||
|
||||
- To create a field group, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci fieldgroup -c <fgroup> -f 150,155
|
||||
Successfully created a field group with a group ID 1
|
||||
|
||||
- To list field groups, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci fieldgroup -l
|
||||
1 group found
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **Group ID**
|
||||
- **Group Name**
|
||||
- **Field IDs**
|
||||
|
||||
* - 1
|
||||
- Fgroup
|
||||
- 150, 155
|
||||
|
||||
- To delete a field group, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci fieldgroup -d 1
|
||||
Successfully removed field group 1
|
||||
|
||||
Monitor errors
|
||||
===============
|
||||
|
||||
To get the Reliability, Availability, and Serviceability (RAS) Error-Correcting Code (ECC) counter, define the following fields:
|
||||
|
||||
- Correctable ECC errors:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
600 RDC_FI_ECC_CORRECT_TOTAL
|
||||
|
||||
- Uncorrectable ECC errors:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
601 RDC_FI_ECC_UNCORRECT_TOTAL
|
||||
|
||||
Device monitoring
|
||||
==================
|
||||
|
||||
The device monitoring feature is used to monitor the GPU fields such as temperature, power usage, and utilization.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
|
||||
1 group found
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **GPU index**
|
||||
- **TEMP (m°C)**
|
||||
- **POWER (µW)**
|
||||
|
||||
* - 0
|
||||
- 25000
|
||||
- 520500
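The column headings indicate that temperature is reported in millidegrees Celsius and power in microwatts. The following Python sketch converts the sample row to more familiar units. The function names are hypothetical and the conversion factors follow only from the unit prefixes in the headings.

```python
def millidegrees_to_celsius(value: int) -> float:
    """Convert a temperature in millidegrees Celsius to degrees Celsius."""
    return value / 1000


def microwatts_to_watts(value: int) -> float:
    """Convert a power reading in microwatts to watts."""
    return value / 1_000_000


if __name__ == "__main__":
    # Values from the sample row above.
    print(millidegrees_to_celsius(25000), "degrees C")
    print(microwatts_to_watts(520500), "W")
```

Converting at display time keeps the raw integer readings intact for storage and comparison.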
|
||||
|
||||
.. _job-stats:
|
||||
|
||||
Job stats
|
||||
==========
|
||||
|
||||
The job stats feature displays GPU statistics for any given workload.
|
||||
|
||||
- To start recording stats, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci stats -s 2 -g 1
|
||||
Successfully started recording job 2 with a group ID 1
|
||||
|
||||
- To stop recording stats, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci stats -x 2
|
||||
Successfully stopped recording job 2
|
||||
|
||||
- To display job stats, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci stats -j 2
|
||||
|
||||
.. list-table::
|
||||
|
||||
* - **Summary**
|
||||
- **Executive status**
|
||||
|
||||
* - Start time
|
||||
- 1586795401
|
||||
|
||||
* - End time
|
||||
- 1586795445
|
||||
|
||||
* - Total execution time
|
||||
- 44
|
||||
|
||||
* - Energy consumed (Joules)
|
||||
- 21682
|
||||
|
||||
* - Power usage (Watts)
|
||||
- Max: 49 Min: 13 Avg: 34
|
||||
|
||||
* - GPU clock (MHz)
|
||||
- Max: 1000 Min: 300 Avg: 903
|
||||
|
||||
* - GPU utilization (%)
|
||||
- Max: 69 Min: 0 Avg: 2
|
||||
|
||||
* - Max GPU memory used (bytes)
|
||||
- 524320768
|
||||
|
||||
* - Memory utilization (%)
|
||||
- Max: 12 Min: 11 Avg: 12
|
||||
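In the sample output, the total execution time matches the difference between the end and start timestamps, which appear to be Unix epoch seconds. A small sketch of that arithmetic, with a hypothetical function name:

```shell
# Compute a job's duration in seconds from the start and end timestamps
# shown in the summary (assumed to be Unix epoch seconds).
job_duration() {
  echo $(( $2 - $1 ))
}

job_duration 1586795401 1586795445   # prints 44
```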
|
||||
Job stats use case
|
||||
-------------------
|
||||
|
||||
A common job stats use case is to record GPU statistics associated with any job or workload. The following figure illustrates how all RDC features can be put together for this use case:
|
||||
|
||||
.. figure:: ../data/features_jobs.png
|
||||
|
||||
An example showing how job statistics can be recorded
|
||||
|
||||
Here are the ``rdci`` commands for this use case:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci group -c group1
|
||||
|
||||
successfully created a group with a group ID 1
|
||||
|
||||
$ rdci group -g 1 -a 0,1
|
||||
|
||||
GPU 0,1 is added to group 1 successfully.
|
||||
|
||||
$ rdci stats -s 123 -g 1
|
||||
|
||||
job 123 recorded successfully with the group ID
|
||||
|
||||
$ rdci stats -x 123
|
||||
|
||||
job 123 stops recording successfully
|
||||
|
||||
$ rdci stats -j 123
|
||||
|
||||
job stats printed
|
||||
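The start, stop, and display steps can be wrapped around a workload in a job script. The following is a sketch under stated assumptions rather than a supported tool: `record_job` and the `RDCI` override variable are hypothetical names, and the GPU group is assumed to exist already. It uses only the `rdci stats` options shown above.

```shell
# Hypothetical helper: record RDC job stats around an arbitrary command.
# RDCI names the rdci binary; override it (for example RDCI=echo) for a dry run.
record_job() {
  job_id="$1"
  group_id="$2"
  shift 2

  # Start recording; give up if recording cannot start.
  "${RDCI:-rdci}" stats -s "$job_id" -g "$group_id" || return 1

  # Run the workload and remember its exit status.
  "$@"
  status=$?

  # Stop recording and print the collected statistics.
  "${RDCI:-rdci}" stats -x "$job_id"
  "${RDCI:-rdci}" stats -j "$job_id"
  return "$status"
}

# Example (requires a running rdcd and an existing GPU group 1):
#   record_job 123 1 ./my_workload
```

Returning the workload's exit status keeps the wrapper transparent to a scheduler that inspects the job result.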
|
||||
Error-correcting code output
|
||||
=============================
|
||||
|
||||
In the job output, this feature prints out the Error-Correcting Code (ECC) errors while running the job.
|
||||
|
||||
To see the ECC correctable and uncorrectable error counters, see this :ref:`example <error-correction>`.
|
||||
|
||||
Diagnostic
|
||||
===========
|
||||
|
||||
When run on a GPU group, the diagnostic feature reports the following details:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rdci diag -g <gpu_group>
|
||||
|
||||
No compute process: Pass
|
||||
Node topology check: Pass
|
||||
GPU parameters check: Pass
|
||||
Compute Queue ready: Pass
|
||||
System memory check: Pass
|
||||
=============== Diagnostic Details ==================
|
||||
No compute process: No processes running on any devices.
|
||||
Node topology check: No link detected.
|
||||
GPU parameters check: GPU 0 Critical Edge temperature in range.
|
||||
Compute Queue ready: Run binary search task on GPU 0 Pass.
|
||||
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
|
||||
@@ -1,23 +1,22 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: ROCm Data Center tool, RDC, Data Center
|
||||
|
||||
.. _index:
|
||||
|
||||
******************************************
|
||||
ROCm Data Center (RDC) tool documentation
|
||||
******************************************
|
||||
*************************************
|
||||
ROCm Data Center tool documentation
|
||||
*************************************
|
||||
|
||||
The ROCm Data Center tool (RDC) simplifies the administration of, and addresses key infrastructure challenges in, AMD GPUs in cluster and datacenter environments. The main features of RDC include:
|
||||
The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration.
|
||||
Here are the main RDC features:
|
||||
|
||||
* GPU telemetry
|
||||
* GPU statistics for jobs
|
||||
* Integration with third-party tools
|
||||
* Open source
|
||||
|
||||
You can access the RDC tool on `GitHub repository <https://github.com/ROCm/rdc>`_.
|
||||
|
||||
The documentation is structured as follows:
|
||||
The code is open and hosted at `<https://github.com/ROCm/rdc>`_.
|
||||
|
||||
.. grid:: 2
|
||||
:gutter: 3
|
||||
@@ -25,11 +24,10 @@ The documentation is structured as follows:
|
||||
.. grid-item-card:: Install
|
||||
|
||||
* :ref:`rdc-install`
|
||||
* :ref:`rdc-handbook`
|
||||
|
||||
.. grid-item-card:: How to
|
||||
|
||||
* :ref:`rdc-use`
|
||||
* :ref:`using-RDC`
|
||||
* :ref:`rdc-features`
|
||||
* :ref:`rdc-3rd-party`
|
||||
|
||||
@@ -38,7 +36,10 @@ The documentation is structured as follows:
|
||||
* :ref:`api-intro`
|
||||
* :ref:`rdc-ref`
|
||||
|
||||
|
||||
.. grid-item-card:: Tutorial
|
||||
|
||||
* :ref:`job-stats-sample`
|
||||
|
||||
To contribute to the documentation, refer to
|
||||
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
|
||||
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
|
||||
.. _rdc-handbook:
|
||||
@@ -10,27 +10,6 @@ Building and testing RDC
|
||||
|
||||
RDC is open source and available under the MIT License. This section is helpful for open source developers. Third-party integrators may also find this information useful.
|
||||
|
||||
Prerequisites for Building RDC
|
||||
==============================
|
||||
|
||||
.. note::
|
||||
RDC is tested on the following software versions. Earlier versions may not work.
|
||||
|
||||
* CMake 3.15
|
||||
* g++ (5.4.0)
|
||||
* AMD ROCm, which includes AMD AMDSMI Library
|
||||
* gRPC and protoc
|
||||
|
||||
The following components are required to build the latest documentation:
|
||||
|
||||
* Doxygen (1.8.11)
|
||||
* Latex (pdfTeX 3.14159265-2.6-1.40.16)
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo apt install libcap-dev
|
||||
$ sudo apt install -y doxygen
|
||||
|
||||
|
||||
Build and Install RDC
|
||||
=====================
|
||||
@@ -38,7 +17,7 @@ Build and Install RDC
|
||||
To build and install, clone the RDC source code from GitHub and use CMake.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ git clone <GitHub for RDC>
|
||||
$ cd rdc
|
||||
$ mkdir -p build; cd build
|
||||
@@ -47,14 +26,14 @@ To build and install, clone the RDC source code from GitHub and use CMake.
|
||||
# Install the library file and header; the default location is /opt/rocm
|
||||
$ make install
|
||||
|
||||
|
||||
|
||||
Build Documentation
|
||||
-------------------
|
||||
|
||||
You can generate PDF documentation after a successful build. The reference manual, refman.pdf, appears in the latex directory.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ make doc
|
||||
$ cd latex
|
||||
$ make
|
||||
@@ -64,130 +43,27 @@ Build Unit Tests for RDC Tool
|
||||
-----------------------------
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
$ cd rdc/tests/rdc_tests
|
||||
$ mkdir -p build; cd build
|
||||
$ cmake -DROCM_DIR=/opt/rocm -DGRPC_ROOT="$GRPC_PROTOC_ROOT" ..
|
||||
$ make
|
||||
|
||||
# To run the tests
|
||||
|
||||
|
||||
$ cd build/rdctst_tests
|
||||
$ ./rdctst
|
||||
|
||||
|
||||
|
||||
Test
|
||||
----
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
# Run rdcd daemon
|
||||
$ LD_LIBRARY_PATH=$PWD/rdc_libs/ ./server/rdcd -u
|
||||
|
||||
|
||||
# In another console run the RDC command-line
|
||||
$ LD_LIBRARY_PATH=$PWD/rdc_libs/ ./rdci/rdci discovery -l -u
|
||||
|
||||
|
||||
Authentication
|
||||
==============
|
||||
|
||||
RDC supports encrypted communications between clients and servers.
|
||||
|
||||
Generate Files for Authentication
|
||||
---------------------------------
|
||||
|
||||
The communication between the client and server can be configured to be authenticated or unauthenticated. By default, authentication is enabled.
|
||||
|
||||
To disable authentication, start the server with the ``--unauth_comm`` flag (or ``-u`` for short). You must also pass ``-u`` to ``rdci`` to connect to an unauthenticated ``rdcd``. To pass arguments to ``rdcd`` at startup, edit the ``/lib/systemd/system/rdc.service`` file. On the client side, when calling ``rdc_channel_create()``, the ``secure`` argument must be set to ``False``.
|
||||
|
||||
Scripts
|
||||
-------
|
||||
|
||||
RDC users manage their own keys and certificates. However, some scripts generate self-signed certificates in the RDC source tree in the authentication directory for test purposes. The following flowchart depicts how to generate the root certificates using the openssl command in 01gen_root_cert.sh:
|
||||
|
||||
.. figure:: ../data/handbook_openssl.png
|
||||
|
||||
Generation of root certificates using openssl command
|
||||
|
||||
The ``openssl.conf`` file contains a section where you can specify default responses to the ``openssl`` questions. To locate this section, look for the following comment line:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# < ** REPLACE VALUES IN THIS SECTION WITH APPROPRIATE VALUES FOR YOUR ORG. **>
|
||||
|
||||
|
||||
It is helpful to modify this section with values appropriate for your organization if you expect to call this script many times. Additionally, you must replace the dummy values and update the ``alt_names`` section for your environment.
|
||||
|
||||
To generate the keys and certificates using these scripts, make the following calls:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cd /opt/rocm/libexec/rdc/authentication
|
||||
$ 01gen_root_cert.sh
|
||||
# provide answers to posed questions
|
||||
$ 02gen_ssl_artifacts.sh
|
||||
# provide answers to posed questions
|
||||
|
||||
|
||||
At this point, the keys and certificates are in the newly created ``CA/artifacts`` directory.
|
||||
|
||||
.. important::
|
||||
You must delete this directory if you need to rerun the scripts.
|
||||
|
||||
To install the keys and certificates, change to the artifacts directory and run the ``install_<client|server>.sh`` script as root, specifying the install location. By default, RDC expects the files to be in ``/etc/rdc``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cd CA/artifacts
|
||||
$ sudo install_<client|server>.sh /etc/rdc
|
||||
|
||||
|
||||
These files must be copied to and installed on all client and server machines that are expected to communicate with one another.
|
||||
|
||||
Known Limitation
|
||||
----------------
|
||||
|
||||
RDC has the following authentication limitations:
|
||||
|
||||
The client and server are hardcoded to look for the ``openssl`` certificate and key files in ``/etc/rdc``. There is no workaround available currently.
|
||||
|
||||
|
||||
Verify Files for Authentication
|
||||
===============================
|
||||
|
||||
Several SSL keys and certificates must be generated and installed on clients and servers for authentication to work properly. By default, the RDC server will look in the ``/etc/rdc`` folder for the following keys and certificates:
|
||||
|
||||
Client
|
||||
------
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo tree /etc/rdc
|
||||
/etc/rdc
|
||||
|-- client
|
||||
|-- certs
|
||||
| |-- rdc_cacert.pem
|
||||
| |-- rdc_client_cert.pem
|
||||
|-- private
|
||||
|-- rdc_client_cert.key
|
||||
|
||||
|
||||
.. note::
|
||||
Machines that act as both client and server contain both directory structures.
|
||||
|
||||
Server
|
||||
------
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo tree /etc/rdc
|
||||
/etc/rdc
|
||||
|-- server
|
||||
|-- certs
|
||||
| |-- rdc_cacert.pem
|
||||
| |-- rdc_server_cert.pem
|
||||
|-- private
|
||||
|-- rdc_server_cert.key
|
||||
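Before starting ``rdcd`` or ``rdci`` with authentication enabled, it can help to confirm that the expected files are in place. The following sketch checks the layout shown in the two directory trees above. The function name is hypothetical, and the script is an illustration, not part of RDC.

```shell
# Hypothetical check: verify that the key and certificate files RDC expects
# exist for a given role ("client" or "server") under a root directory.
# The root defaults to /etc/rdc, the location named in this section.
check_rdc_certs() {
  role="$1"
  root="${2:-/etc/rdc}"
  missing=0
  for f in \
    "certs/rdc_cacert.pem" \
    "certs/rdc_${role}_cert.pem" \
    "private/rdc_${role}_cert.key"
  do
    if [ ! -f "$root/$role/$f" ]; then
      echo "missing: $root/$role/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example (root privileges may be needed to read the private directory):
#   check_rdc_certs server && echo "server files present"
```

The function returns a non-zero status and lists each missing path, so it can be used as a pre-flight step in a deployment script.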
|
||||
|
||||
@@ -1,37 +1,140 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: RDC installation, Install RDC, Install ROCm Data Center tool, Building ROCm Data Center, Building RDC
|
||||
|
||||
.. _rdc-install:
|
||||
|
||||
******************************************
|
||||
Installing and running RDC
|
||||
******************************************
|
||||
******************
|
||||
RDC installation
|
||||
******************
|
||||
|
||||
The ROCm Data Center tool (RDC) is part of the AMD ROCm software and available on the distributions supported by AMD ROCm. For RDC installation from prebuilt packages, follow the instructions in this section.
|
||||
RDC is part of the AMD ROCm software and available on the distributions supported by AMD ROCm. This topic provides information required to install RDC from prebuilt packages and source.
|
||||
|
||||
Prerequisites
|
||||
=============
|
||||
==============
|
||||
|
||||
The installation dependencies are described in `Dependencies in the README <https://github.com/ROCm/rdc?tab=readme-ov-file#dependencies>`_. To see the list of supported operating systems, refer to `System requirements <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html>`_.
|
||||
To install RDC from source, ensure that your system meets the following requirements:
|
||||
|
||||
Install gRPC
|
||||
============
|
||||
- **Supported platforms:** AMD ROCm-supported platform. See the `list of supported operating systems <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems>`_.
|
||||
|
||||
To see the instructions for building ``gRPC`` and ``protoc``, refer to `Building gRPC and protoc <https://github.com/ROCm/rdc#building-grpc-and-protoc>`_.
|
||||
- **Dependencies:**
|
||||
- CMake >= 3.15
|
||||
- g++ (5.4.0)
|
||||
- gRPC and protoc
|
||||
- libcap-dev
|
||||
- :doc:`AMD ROCm platform <rocm:index>` including:
|
||||
- :doc:`AMDSMI library <amdsmi:index>`
|
||||
- `ROCK kernel driver <https://github.com/ROCm/ROCK-Kernel-Driver>`_
|
||||
|
||||
Authentication keys
|
||||
===================
|
||||
To build the latest documentation, you also need:
|
||||
- Doxygen (1.8.11)
|
||||
- LaTeX (pdfTeX 3.14159265-2.6-1.40.16)
|
||||
|
||||
RDC can be used with or without authentication. If authentication is required you must configure proper authentication keys as described in *Authentication* in :ref:`rdc-handbook`.
|
||||
.. code-block:: shell
|
||||
|
||||
Prebuilt packages
|
||||
=================
|
||||
$ sudo apt install libcap-dev
|
||||
$ sudo apt install -y doxygen
|
||||
|
||||
RDC is packaged as part of the ROCm software repository. You must install the AMD ROCm software before installing RDC, as described in `ROCm installation <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/>`_.
|
||||
Build RDC from source
|
||||
======================
|
||||
|
||||
To install RDC after installing the ROCm package, use the following instructions.
|
||||
The following sections provide steps to build RDC from source.
|
||||
|
||||
Build gRPC and Protoc
|
||||
----------------------
|
||||
|
||||
gRPC and Protoc must be built from source because prebuilt packages are not available for them. Follow these steps:
|
||||
|
||||
1. Install the required tools:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo apt-get update
|
||||
sudo apt-get install automake make g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
|
||||
|
||||
2. Clone and build gRPC:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
|
||||
cd grpc
|
||||
export GRPC_ROOT=/opt/grpc
|
||||
cmake -B build \
|
||||
-DgRPC_INSTALL=ON \
|
||||
-DgRPC_BUILD_TESTS=OFF \
|
||||
-DBUILD_SHARED_LIBS=ON \
|
||||
-DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
|
||||
-DCMAKE_INSTALL_LIBDIR=lib \
|
||||
-DCMAKE_BUILD_TYPE=Release
|
||||
make -C build -j $(nproc)
|
||||
sudo make -C build install
|
||||
echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
|
||||
sudo ldconfig
|
||||
cd ..
|
||||
|
||||
Build RDC
|
||||
-----------
|
||||
|
||||
1. Clone the RDC repository:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/ROCm/rdc
|
||||
cd rdc
|
||||
|
||||
2. Configure the build:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
|
||||
|
||||
3. You can also enable the following optional features:
|
||||
|
||||
- ROCm profiler:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
cmake -B build -DBUILD_PROFILER=ON
|
||||
|
||||
- ROCm Validation Suite (RVS):
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
cmake -B build -DBUILD_RVS=ON
|
||||
|
||||
- RDC library only (without ``rdci`` and ``rdcd``):
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
cmake -B build -DBUILD_STANDALONE=OFF
|
||||
|
||||
- RDC library without ROCm runtime:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
cmake -B build -DBUILD_RUNTIME=OFF
|
||||
|
||||
4. Build and install:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
make -C build -j $(nproc)
|
||||
sudo make -C build install
|
||||
|
||||
5. Update system library path:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export RDC_LIB_DIR=/opt/rocm/lib/rdc
|
||||
export GRPC_LIB_DIR="/opt/grpc/lib"
|
||||
echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
|
||||
echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
|
||||
sudo ldconfig
|
||||
|
||||
Installing RDC using prebuilt packages
|
||||
=======================================
|
||||
|
||||
RDC is packaged as part of the ROCm software repository. To install RDC from a prebuilt package, first :doc:`install the AMD ROCm software <rocm-install-on-linux:index>`, then use the following instructions:
|
||||
|
||||
.. tab-set::
|
||||
|
||||
@@ -52,140 +155,3 @@ To install RDC after installing the ROCm package, use the following instructions
|
||||
$ sudo zypper install rdc
|
||||
# or, to install a specific version
|
||||
$ sudo zypper install rdc<x.y.z>
|
||||
|
||||
|
||||
Components
|
||||
==========
|
||||
|
||||
The components of the RDC tool are as shown below:
|
||||
|
||||
.. figure:: ../data/install_components.png
|
||||
|
||||
High-level diagram of RDC components
|
||||
|
||||
|
||||
RDC (API) library
|
||||
-----------------
|
||||
|
||||
This library is the central piece, which interacts with different modules and provides all the features described. This shared library provides C API and Python bindings so that third-party tools should be able to use it directly if required.
|
||||
|
||||
RDC daemon (``rdcd``)
|
||||
---------------------
|
||||
|
||||
The ``rdcd`` daemon records telemetry information from GPUs. It also provides an interface to RDC command-line tool (``rdci``) running locally or remotely. It relies on the above RDC Library for all the core features.
|
||||
|
||||
RDC command-line tool (``rdci``)
|
||||
--------------------------------
|
||||
|
||||
A command-line tool to invoke all the features of the RDC tool. This CLI can be run locally or remotely.
|
||||
|
||||
AMDSMI library
|
||||
--------------
|
||||
|
||||
A stateless system management library that provides low-level interfaces to access GPU information
|
||||
|
||||
Starting RDC
|
||||
============
|
||||
|
||||
The RDC tool can run in the following two modes. The feature set is similar in both cases, so you can choose the option that best fits your environment.
|
||||
|
||||
* :ref:`standalone`
|
||||
* :ref:`embedded`
|
||||
|
||||
The capability in each mode depends on the privileges you have for starting the RDC tool. A normal user has access only to monitor (GPU telemetry) capabilities. A privileged user can run the tool with full capability. In the full capability mode, GPU configuration features can be invoked. This may or may not affect all the users and processes sharing the GPU.
|
||||
|
||||
.. _`standalone`:
|
||||
|
||||
Standalone mode
|
||||
---------------
|
||||
|
||||
This is the preferred mode of operation, as it does not have any external dependencies. To start RDC in standalone mode, RDC Server Daemon (``rdcd``) must run on each compute node. Refer to *Terminology* in :ref:`rdc-use` for more information. You can start ``rdcd`` as a ``systemd`` service or directly from the command-line.
|
||||
|
||||
Start the RDC tool using ``systemd``
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If multiple RDC versions are installed, copy `/opt/rocm-<x.y.z>/libexec/rdc/rdc.service`, which is installed with the desired RDC version, to the ``systemd`` folder. The capability of RDC can be configured by modifying the ``rdc.service`` system configuration file. Use the ``systemctl`` command to start ``rdcd``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ systemctl start rdc
|
||||
|
||||
|
||||
By default, ``rdcd`` starts with full capability. To change to monitor only, comment out the following two lines:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo vi /lib/systemd/system/rdc.service
|
||||
|
||||
# CapabilityBoundingSet=CAP_DAC_OVERRIDE
|
||||
# AmbientCapabilities=CAP_DAC_OVERRIDE
|
||||
|
||||
|
||||
.. note::
|
||||
``rdcd`` can be started by using the ``systemctl`` command.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ systemctl start rdc
|
||||
|
||||
|
||||
If the GPU reset fails, restart the server. Note that restarting the server also initiates ``rdcd``. You may then encounter the following two scenarios:
|
||||
|
||||
* ``rdcd`` returns the correct GPU information to ``rdci``
|
||||
* ``rdcd`` returns the "No GPUs found on the system" error to ``rdci``. To resolve this error, restart ``rdcd`` with the following instruction:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo systemctl restart rdc
|
||||
|
||||
|
||||
Start the RDC tool from the command-line
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
While ``systemctl`` is the preferred way to start ``rdcd``, you can also start directly from the command-line. The installation scripts create a default user - ``rdc``. Users have the option to edit the profile file (``rdc.service`` installed at ``/lib/systemd/system``) and change these lines accordingly:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
[Service]
|
||||
User=rdc
|
||||
Group=rdc
|
||||
|
||||
From the command-line, start ``rdcd`` as a user such as ``rdc``, or start it as ``root``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
#Start as user rdc
|
||||
$ sudo -u rdc rdcd
|
||||
|
||||
# Start as root
|
||||
$ sudo rdcd
|
||||
|
||||
|
||||
In this use case, the ``rdc.service`` file mentioned in the previous section is not involved. Here, the capability of RDC is determined by the privilege of the user starting ``rdcd``. If ``rdcd`` is running under a normal user account it has the monitor-only capability. If ``rdcd`` is running as ``root`` then it has the full capability.
|
||||
|
||||
.. note::
|
||||
If a user other than ``rdc`` or ``root`` starts the ``rdcd`` daemon, the file ownership of the SSL keys mentioned in the Authentication section must be modified to allow read and write access.
|
||||
|
||||
Troubleshoot ``rdcd``
|
||||
---------------------
|
||||
|
||||
When ``rdcd`` is started using ``systemctl``, the logs can be viewed using the following command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ journalctl -u rdc
|
||||
|
||||
|
||||
These messages provide useful status and debugging information. The logs can also help debug problems like ``rdcd`` failing to start, communication issues with a client, and others.
|
||||
|
||||
.. _`embedded`:
|
||||
|
||||
Embedded mode
|
||||
-------------
|
||||
|
||||
The embedded mode is useful if the end user has a monitoring agent running on the compute node. The monitoring agent can use the RDC library directly, which gives it finer-grained control over how and when RDC features are invoked. For example, if the monitoring agent has a facility to synchronize across multiple nodes, it can synchronize GPU telemetry across these nodes.
|
||||
|
||||
The RDC daemon ``rdcd`` can be used as a reference code for this purpose. The dependency on ``gRPC`` is also eliminated if the RDC library is directly used.
|
||||
|
||||
.. caution::
|
||||
RDC command-line ``rdci`` will not function in this mode. Third-party monitoring software is responsible for providing the user interface and remote access/monitoring.
|
||||
|
||||
@@ -1,96 +1,34 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: ROCm Data Center tool API, RDC API
|
||||
|
||||
.. _api-intro:
|
||||
|
||||
******************************************
|
||||
*************************
|
||||
Introduction to RDC API
|
||||
******************************************
|
||||
*************************
|
||||
|
||||
.. note::
|
||||
This is the alpha version of RDC API and is subject to change without notice. The primary purpose of this API is to solicit feedback. AMD accepts no responsibility for any software breakage caused by API changes.
|
||||
|
||||
RDC API
|
||||
===========
|
||||
========
|
||||
|
||||
RDC API is the core library that provides all the RDC features. This section focuses on how RDC API can be used by third-party software.
|
||||
RDC API is the core library that provides all the RDC features.
|
||||
|
||||
RDC includes the following libraries:
|
||||
RDC API includes the following libraries:
|
||||
|
||||
* ``librdc_bootstrap.so``: Loads one of the following two libraries during runtime, depending on the mode.
|
||||
|
||||
- ``rdci`` mode: Loads ``librdc_client.so``
|
||||
- ``rdcd`` mode: Loads ``librdc.so``
|
||||
|
||||
* ``librdc_bootstrap.so``: Loads during runtime one of the two libraries by detecting the mode.
|
||||
* ``librdc_client.so``: Exposes RDC functionality using ``gRPC`` client.
|
||||
* ``librdc.so``: RDC API. This depends on ``libamd_smi.so``.
|
||||
* ``libamd_smi.so``: Stateless low overhead access to GPU data.
|
||||
|
||||
* ``librdc.so``: RDC API. This depends on ``libamd_smi.so``.
|
||||
|
||||
* ``libamd_smi.so``: Stateless low overhead access to GPU data.
|
||||
|
||||
.. figure:: ../data/api_libs.png
|
||||
|
||||
Different libraries and how they are linked.
|
||||
|
||||
.. note::
|
||||
``librdc_bootstrap.so`` loads different libraries based on the modes.
|
||||
|
||||
Example:
|
||||
|
||||
* ``rdci``: ``librdc_bootstrap.so`` loads ``librdc_client.so``
|
||||
* ``rdcd``: ``librdc_bootstrap.so`` loads ``librdc.so``
|
||||
|
||||
For more information see the :ref:`rdc-ref`.
|
||||
|
||||
Job stats use case
|
||||
==================
|
||||
|
||||
The following pseudocode shows how RDC API can be directly used to record GPU statistics associated with any job or workload. Refer to the example code provided with RDC on how to build it.
|
||||
|
||||
For more information, see *Job Stats* in :ref:`rdc-features`.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
//Initialize the RDC
|
||||
rdc_handle_t rdc_handle;
|
||||
rdc_status_t result=rdc_init(0);
|
||||
|
||||
//Dynamically choose to run in standalone or embedded mode
|
||||
bool standalone = false;
|
||||
std::cin>> standalone;
|
||||
if (standalone)
|
||||
result = rdc_connect("127.0.0.1:50051", &rdc_handle, nullptr, nullptr, nullptr); //It will connect to the daemon
|
||||
else
|
||||
result = rdc_start_embedded(RDC_OPERATION_MODE_MANUAL, &rdc_handle); //call library directly, here we run embedded in manual mode
|
||||
|
||||
//Now we can use the same API for both standalone and embedded
|
||||
//(1) create group
|
||||
rdc_gpu_group_t groupId;
|
||||
result = rdc_group_gpu_create(rdc_handle, RDC_GROUP_EMPTY, "MyGroup1", &groupId);
|
||||
|
||||
//(2) Add the GPUs to the group
|
||||
result = rdc_group_gpu_add(rdc_handle, groupId, 0); //Add GPU 0
|
||||
result = rdc_group_gpu_add(rdc_handle, groupId, 1); //Add GPU 1
|
||||
|
||||
//(3) Start recording the Slurm job 123. Set the sample frequency to once per second
|
||||
result = rdc_job_start_stats(rdc_handle, groupId,
|
||||
"123", 1000000);
|
||||
|
||||
//For standalone mode, the daemon will update and cache the samples
|
||||
//In manual mode, we must call rdc_field_update_all periodically to take samples
|
||||
if (!standalone) { //embedded manual mode
|
||||
for (int i=5; i>0; i--) { //As an example, we will take 5 samples
|
||||
result = rdc_field_update_all(rdc_handle, 0);
|
||||
usleep(1000000);
|
||||
}
|
||||
} else { //standalone mode, do nothing
|
||||
usleep(5000000); //sleep 5 seconds before fetching the stats
|
||||
}
|
||||
|
||||
//(4) stop the Slurm job 123, which will stop the watch
|
||||
// Note: we do not have to stop the job to get stats. The rdc_job_get_stats can be called at any time before stop
|
||||
result = rdc_job_stop_stats(rdc_handle, "123");
|
||||
|
||||
//(5) Get the stats
|
||||
rdc_job_info_t job_info;
|
||||
result = rdc_job_get_stats(rdc_handle, "123", &job_info);
|
||||
std::cout<<"Average Memory Utilization: " <<job_info.summary.memoryUtilization.average <<std::endl;
|
||||
|
||||
//The cleanup and shutdown ....
|
||||
|
||||
|
||||
@@ -1,11 +1,11 @@
|
||||
.. meta::
|
||||
:description: documentation of the installation, configuration, and use of the ROCm Data Center tool
|
||||
:keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support
|
||||
:description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
|
||||
:keywords: ROCm Data Center library, RDC library, RDC API, ROCm Data Center API
|
||||
|
||||
.. _rdc-ref:
|
||||
|
||||
******************************************
|
||||
API reference
|
||||
******************************************
|
||||
****************
|
||||
RDC API library
|
||||
****************
|
||||
|
||||
.. doxygenindex::
|
||||
|
||||
@@ -8,13 +8,11 @@ subtrees:
|
||||
entries:
|
||||
- file: install/install
|
||||
title: Installing RDC
|
||||
- file: install/handbook
|
||||
title: Building and testing RDC
|
||||
|
||||
- caption: How to
|
||||
entries:
|
||||
- file: how-to/user_guide
|
||||
- file: how-to/features
|
||||
- file: how-to/using_RDC
|
||||
- file: how-to/using_RDC_features
|
||||
- file: how-to/integration
|
||||
|
||||
- caption: API reference
|
||||
@@ -22,6 +20,10 @@ subtrees:
|
||||
- file: reference/api_intro
|
||||
- file: reference/api_ref
|
||||
|
||||
- caption: Tutorial
|
||||
entries:
|
||||
- file: tutorial/job_stats_sample
|
||||
|
||||
- caption: About
|
||||
entries:
|
||||
- file: license
|
||||
|
||||
@@ -0,0 +1,62 @@
.. meta::
  :description: The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and data center environments and simplifies their administration
  :keywords: Job stats use case, RDC feature example, ROCm Data Center feature sample, RDC feature sample, ROCm Data Center feature example

.. _job-stats-sample:

**********************
Job stats sample code
**********************

The following pseudocode shows how the RDC API can be used directly to record GPU statistics for any job or workload. For build instructions, see the `example code <https://github.com/AMD-ROCm-Internal/rdc/tree/amd-staging/example>`_.

For more information on job stats, see :ref:`Job stats <job-stats>`.

.. code-block:: cpp

    // Initialize RDC
    rdc_handle_t rdc_handle;
    rdc_status_t result = rdc_init(0);

    // Dynamically choose to run in standalone or embedded mode
    bool standalone = false;
    std::cin >> standalone;
    if (standalone) {
        // Standalone mode: connect to the daemon
        result = rdc_connect("127.0.0.1:50051", &rdc_handle, nullptr, nullptr, nullptr);
    } else {
        // Embedded mode: call the library directly, here in manual mode
        result = rdc_start_embedded(RDC_OPERATION_MODE_MANUAL, &rdc_handle);
    }

    // From here on, the same API works for both standalone and embedded mode

    // (1) Create a group
    rdc_gpu_group_t groupId;
    result = rdc_group_gpu_create(rdc_handle, RDC_GROUP_EMPTY, "MyGroup1", &groupId);

    // (2) Add the GPUs to the group
    result = rdc_group_gpu_add(rdc_handle, groupId, 0);  // Add GPU 0
    result = rdc_group_gpu_add(rdc_handle, groupId, 1);  // Add GPU 1

    // (3) Start recording Slurm job 123, sampling once per second
    //     (an interval of 1000000 corresponds to one second)
    result = rdc_job_start_stats(rdc_handle, groupId, "123", 1000000);

    // In standalone mode, the daemon updates and caches the samples.
    // In embedded manual mode, we must call rdc_field_update_all periodically to take samples.
    if (!standalone) {  // embedded manual mode
        for (int i = 5; i > 0; i--) {  // As an example, take 5 samples
            result = rdc_field_update_all(rdc_handle, 0);
            usleep(1000000);
        }
    } else {  // standalone mode: the daemon takes the samples, so just wait
        usleep(5000000);  // Sleep 5 seconds before fetching the stats
    }

    // (4) Stop Slurm job 123, which also stops the watch
    // Note: the job does not have to be stopped to get stats; rdc_job_get_stats can be called at any time before the stop
    result = rdc_job_stop_stats(rdc_handle, "123");

    // (5) Get the stats
    rdc_job_info_t job_info;
    result = rdc_job_get_stats(rdc_handle, "123", &job_info);
    std::cout << "Average Memory Utilization: " << job_info.summary.memoryUtilization.average << std::endl;

    // The cleanup and shutdown ....
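The sample stores every return value in `result` but never inspects it, so a failed step (for example, a daemon that is not running) would go unnoticed until a later call misbehaves. One possible way to surface failures early is a small checking helper, sketched below. To keep the sketch compilable without the RDC headers, it uses a hypothetical `Status` enum in place of `rdc_status_t`; both `Status` and `check` are illustrative names, not part of the RDC API.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for rdc_status_t so the sketch compiles without RDC.
enum class Status { Ok = 0, Failed = 1 };

// Throw with the name of the failing step if a call did not succeed.
void check(Status status, const std::string& step) {
  if (status != Status::Ok) {
    throw std::runtime_error("RDC step failed: " + step);
  }
}
```

In real code the comparison would be against the library's own success value (assumed here to be `RDC_ST_OK`; confirm against the installed `rdc.h`), with a call such as `check(...)` placed after each step of the sample.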