.. meta::
  :description: documentation of the installation, configuration, and use of the ROCm Data Center tool
  :keywords: ROCm Data Center tool, RDC, ROCm, API, reference, data type, support

.. _rdc-features:

******************************************
RDC tool feature overview
******************************************

This topic describes the features of the RDC tool.

.. figure:: ../data/features.png

    RDC components and framework for describing features

Discovery
=========

The Discovery feature enables you to locate and display information about the GPUs present in the compute node.

Example:

.. code-block:: shell

    $ rdci discovery <host_name> -l
    2 GPUs found

.. list-table::

    * - **GPU Index**
      - **Device Information**

    * - 0
      - Name: AMD Radeon Instinct MI50 Accelerator

    * - 1
      - Name: AMD Radeon Instinct MI50 Accelerator

Options: ``-l`` list available GPUs; ``-u`` no SSL authentication

Groups
======

This section explains the GPU groups and field groups features.

GPU Groups
----------

With the GPU groups feature, you can create, delete, and list logical groups of GPUs.

.. code-block:: shell

    $ rdci group -c GPU_GROUP
    Successfully created a group with a group ID 1

    $ rdci group -g 1 -a 0,1
    Successfully added the GPU 0,1 to group 1

    $ rdci group -l

    1 group found

.. list-table::

    * - **Group ID**
      - **Group Name**
      - **GPU Index**

    * - 1
      - GPU_GROUP
      - 0, 1

.. code-block:: shell

    $ rdci group -d 1
    Successfully removed group 1

Options: ``-c`` create; ``-g`` group ID; ``-a`` add GPU index; ``-l`` list; ``-d`` delete group

Field Groups
------------

With the field groups feature, you can create, delete, and list field groups.

.. code-block:: shell

    $ rdci fieldgroup -c <fgroup> -f 150,155
    Successfully created a field group with a group ID 1

    $ rdci fieldgroup -l

    1 group found

.. list-table::

    * - **Group ID**
      - **Group Name**
      - **Field IDs**

    * - 1
      - Fgroup
      - 150, 155

.. code-block:: shell

    $ rdci fieldgroup -d 1
    Successfully removed field group 1

    $ rdci dmon -l
    Supported fields Ids:
    100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
    150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
    155 RDC_FI_POWER_USAGE: Power usage in microwatts.
    203 RDC_FI_GPU_UTIL: GPU busy percentage.
    525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes

Options: ``-c`` create; ``-g`` group ID; ``-a`` add GPU index; ``-l`` list; ``-d`` delete group

Monitor Errors
--------------

You can define the ``RDC_FI_ECC_CORRECT_TOTAL`` or ``RDC_FI_ECC_UNCORRECT_TOTAL`` field to get the RAS Error-Correcting Code (ECC) counters:

* 312 ``RDC_FI_ECC_CORRECT_TOTAL``: Accumulated correctable ECC errors
* 313 ``RDC_FI_ECC_UNCORRECT_TOTAL``: Accumulated uncorrectable ECC errors

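As a sketch of how these counters might be collected, the helpers below reuse only the ``fieldgroup`` and ``dmon`` options that appear in the examples in this topic. The function names and the group name ``ecc_fields`` are hypothetical and not part of ``rdci``. The field group ID passed to ``monitor_group`` must be the ID that ``rdci`` reports when the group is created, and the GPU group is assumed to exist already.

.. code-block:: shell

    # Hypothetical helper, not part of rdci:
    # create a field group that holds both ECC counters (field IDs 312 and 313).
    create_ecc_fieldgroup() {
        rdci fieldgroup -c "$1" -f 312,313
    }

    # Hypothetical helper: monitor a field group on a GPU group.
    # $1 = field group ID, $2 = GPU group ID.
    # The count and delay values are copied from the dmon example in this topic.
    monitor_group() {
        rdci dmon -f "$1" -g "$2" -c 5 -d 1000
    }

For example, ``create_ecc_fieldgroup ecc_fields`` followed by ``monitor_group 1 1`` would sample the ECC counters, assuming both the new field group and the GPU group have ID 1.
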
Device Monitoring
=================

The RDC tool enables you to monitor the GPU fields.

.. code-block:: shell

    $ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000

    1 group found

.. list-table::

    * - **GPU Index**
      - **TEMP (m°C)**
      - **POWER (µW)**

    * - 0
      - 25000
      - 520500

.. code-block:: shell

    $ rdci dmon -l
    Supported fields Ids:
    100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
    150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
    155 RDC_FI_POWER_USAGE: Power usage in microwatts.
    203 RDC_FI_GPU_UTIL: GPU busy percentage.
    525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes

Options: ``-e`` field IDs; ``-i`` GPU index; ``-c`` count; ``-d`` delay; ``-l`` list; ``-f`` field group ID

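The monitoring output reports temperature in millidegrees Celsius and power in microwatts, which can be awkward to read at a glance. A minimal sketch of converting the raw values into degrees Celsius and watts follows. The helper names ``mc_to_c`` and ``uw_to_w`` are hypothetical and not part of ``rdci``, and ``awk`` is assumed to be available.

.. code-block:: shell

    # Hypothetical helpers, not part of rdci.
    # Convert millidegrees Celsius to degrees Celsius.
    mc_to_c() { awk -v v="$1" 'BEGIN { printf "%.1f\n", v / 1000 }'; }

    # Convert microwatts to watts.
    uw_to_w() { awk -v v="$1" 'BEGIN { printf "%.4f\n", v / 1000000 }'; }

    mc_to_c 25000     # prints 25.0
    uw_to_w 520500    # prints 0.5205

With the sample values from the table above, GPU 0 is at 25.0 °C and drawing roughly 0.52 W.
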
Job Stats
=========

You can display GPU statistics for any given workload.

.. code-block:: shell

    $ rdci stats -s 2 -g 1
    Successfully started recording job 2 with a group ID 1

    $ rdci stats -j 2

.. list-table::

    * - **Summary**
      - **Executive Status**

    * - Start time
      - 1586795401

    * - End time
      - 1586795445

    * - Total execution time
      - 44

    * - ==============
      - ==============

    * - Energy Consumed (Joules)
      - 21682

    * - Power Usage (Watts)
      - Max: 49 Min: 13 Avg: 34

    * - GPU Clock (MHz)
      - Max: 1000 Min: 300 Avg: 903

    * - GPU Utilization (%)
      - Max: 69 Min: 0 Avg: 2

    * - Max GPU Memory Used (bytes)
      - 524320768

    * - Memory Utilization (%)
      - Max: 12 Min: 11 Avg: 12

.. code-block:: shell

    $ rdci stats -x 2
    Successfully stopped recording job 2

Options: ``-s`` start recording on job ID; ``-g`` group ID; ``-j`` display job stats; ``-x`` stop recording

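In the summary above, the total execution time equals the end time minus the start time. The values look like Unix epoch timestamps in seconds, although the source does not state the unit, so treat that as an assumption. A quick check with shell arithmetic, using the sample values from the table:

.. code-block:: shell

    # Sample values copied from the job summary table.
    start=1586795401
    end=1586795445

    # Total execution time is the difference between the two timestamps.
    elapsed=$((end - start))
    echo "$elapsed"    # prints 44
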
Job Stats Use Case
------------------

A common use case is to record the GPU statistics associated with a job or workload. The following example shows how the group and job stats features can be combined for this use case:

.. figure:: ../data/features_jobs.png

    An example showing how job statistics can be recorded

rdci commands
^^^^^^^^^^^^^

.. code-block:: shell

    $ rdci group -c group1
    successfully created a group with a group ID 1

    $ rdci group -g 1 -a 0,1
    GPU 0,1 is added to group 1 successfully.

    $ rdci stats -s 123 -g 1
    job 123 recorded successfully with the group ID

    $ rdci stats -x 123
    job 123 stops recording successfully

    $ rdci stats -j 123
    job stats printed

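The sequence above can be wrapped around a workload so that recording starts just before the workload runs and stops when it finishes. The following is a minimal sketch under stated assumptions: the function name ``record_job`` and its arguments are hypothetical, only the ``rdci stats`` options shown in this topic are used, the GPU group is assumed to exist already, and ``rdci`` is assumed to exit with a non-zero status on failure.

.. code-block:: shell

    # Hypothetical wrapper, not part of rdci.
    # Usage: record_job <job_id> <group_id> <command> [args...]
    record_job() {
        job_id=$1
        group_id=$2
        shift 2

        # Start recording; give up if rdci reports a failure (assumed non-zero exit).
        rdci stats -s "$job_id" -g "$group_id" || return 1

        # Run the workload and remember its exit status.
        "$@"
        status=$?

        # Stop recording and print the statistics, even if the workload failed.
        rdci stats -x "$job_id"
        rdci stats -j "$job_id"

        return "$status"
    }

For example, ``record_job 123 1 ./my_workload`` would record job 123 on group 1 while the hypothetical ``./my_workload`` program runs, and then return that program's exit status.
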
Error-Correcting Code Output
============================

The job output includes the Error-Correcting Code (ECC) errors that occurred while the job was running.

Diagnostic
==========

You can run diagnostics on a GPU group as shown below:

.. code-block:: shell

    $ rdci diag -g <gpu_group>

    No compute process: Pass
    Node topology check: Pass
    GPU parameters check: Pass
    Compute Queue ready: Pass
    System memory check: Pass
    =============== Diagnostic Details ==================
    No compute process: No processes running on any devices.
    Node topology check: No link detected.
    GPU parameters check: GPU 0 Critical Edge temperature in range.
    Compute Queue ready: Run binary search task on GPU 0 Pass.
    System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.