# Data Center Tool: Feature Overview

Note that the RDC Tool is in active development. This section highlights the current feature set.

RDC components and framework for describing features

## Discovery

The Discovery feature enables you to locate and display information about the GPUs present in the compute node.

Example:

```
$ rdci discovery <host_name> -l
2 GPUs found
```

| GPU Index | Device Information |
| --------- | ------------------------------------------- |
| 0 | Name: AMD Radeon Instinct™ MI50 Accelerator |
| 1 | Name: AMD Radeon Instinct™ MI50 Accelerator |

```
-l list available GPUs; -u no SSL authentication
```

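If the discovery result needs to be consumed by a script, the table above can be turned into a mapping from GPU index to device name. The following is a minimal sketch that assumes the output keeps the pipe-delimited layout shown in this section; `parse_discovery_table` and `SAMPLE` are hypothetical names used only for illustration, and the real `rdci` output format may differ.

```python
# Sketch: parse a pipe-delimited discovery table into {gpu_index: device_name}.
# Assumes the two-column layout shown in this guide; not an official rdci API.

SAMPLE = """\
| GPU Index | Device Information |
| --------- | ------------------------------------------- |
| 0 | Name: AMD Radeon Instinct™ MI50 Accelerator |
| 1 | Name: AMD Radeon Instinct™ MI50 Accelerator |
"""


def parse_discovery_table(text):
    """Return a dict mapping each GPU index to its device name."""
    gpus = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header, the separator row, and any non-table line.
        if len(cells) != 2 or not cells[0].isdigit():
            continue
        name = cells[1]
        # Drop the "Name:" label so only the device name remains.
        if name.startswith("Name:"):
            name = name[len("Name:"):].strip()
        gpus[int(cells[0])] = name
    return gpus


for index, name in sorted(parse_discovery_table(SAMPLE).items()):
    print(index, name)
```
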
## Groups

This section explains the GPU groups and field groups features.

### GPU Groups

With the GPU groups feature, you can create, delete, and list logical groups of GPUs.

```
$ rdci group -c GPU_GROUP
Successfully created a group with a group ID 1

$ rdci group -g 1 -a 0,1
Successfully added the GPU 0,1 to group 1

$ rdci group -l

1 group found
```

| Group ID | Group Name | GPU Index |
| -------- | ------------ | --------- |
| 1 | GPU_GROUP | 0, 1 |

```
$ rdci group -d 1
Successfully removed group 1

-c create; -g group id; -a add GPU index; -l list; -d delete group
```

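When GPU groups are managed from automation rather than typed by hand, it can help to build the `rdci group` command lines in one place. The sketch below only assembles argument lists from the flags shown above (`-c`, `-g`, `-a`, `-l`, `-d`) and does not run anything; the function names are hypothetical helpers, not part of RDC.

```python
# Sketch: build `rdci group` argument lists from the flags shown in this section.
# The helper names are illustrative; nothing here executes rdci.


def group_create(name):
    """Command that creates a GPU group with the given name."""
    return ["rdci", "group", "-c", name]


def group_add(group_id, gpu_indexes):
    """Command that adds the given GPU indexes to an existing group."""
    gpu_list = ",".join(str(index) for index in gpu_indexes)
    return ["rdci", "group", "-g", str(group_id), "-a", gpu_list]


def group_list():
    """Command that lists the existing GPU groups."""
    return ["rdci", "group", "-l"]


def group_delete(group_id):
    """Command that deletes the group with the given ID."""
    return ["rdci", "group", "-d", str(group_id)]


for command in (group_create("GPU_GROUP"), group_add(1, [0, 1]),
                group_list(), group_delete(1)):
    print(" ".join(command))
```

Each list can be passed directly to `subprocess.run` on a node where `rdci` is installed.
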
### Field Groups

The Field Groups feature lets you create, delete, and list field groups.

```
$ rdci fieldgroup -c <fgroup> -f 150,155
Successfully created a field group with a group ID 1

$ rdci fieldgroup -l

1 group found
```

| Group ID | Group Name | Field Ids |
| -------- | ------------ | --------- |
| 1 | Fgroup | 150, 155 |

```
$ rdci fieldgroup -d 1
Successfully removed field group 1

$ rdci dmon -l
Supported fields Ids:
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
203 RDC_FI_GPU_UTIL: GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes

-c create; -f field ids; -l list; -d delete field group
```

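Because field groups are created from numeric IDs, a script may want to look those IDs up by name instead of hard-coding them. The following sketch parses a listing in the format printed by `rdci dmon -l` above and builds the comma-separated value for `-f`; the names `parse_field_listing`, `field_ids_arg`, and `LISTING` are hypothetical, and the line format is assumed to match this guide.

```python
# Sketch: parse the "Supported fields" listing and build a -f argument.
# Assumes lines of the form "<id> <RDC_FI_NAME>: <description>".
import re

LISTING = """\
Supported fields Ids:
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
203 RDC_FI_GPU_UTIL: GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
"""

FIELD_LINE = re.compile(r"^\s*(\d+)\s+(RDC_FI_\w+):\s*(.*)$")


def parse_field_listing(text):
    """Return {field_name: field_id} for every field line in the listing."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields[match.group(2)] = int(match.group(1))
    return fields


def field_ids_arg(fields, names):
    """Return the comma-separated ID string for the requested field names."""
    return ",".join(str(fields[name]) for name in names)


fields = parse_field_listing(LISTING)
print(field_ids_arg(fields, ["RDC_FI_GPU_TEMP", "RDC_FI_POWER_USAGE"]))
```
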
### Monitor Errors

To read the RAS Error-Correcting Code (ECC) counters, include the RDC_FI_ECC_CORRECT_TOTAL or RDC_FI_ECC_UNCORRECT_TOTAL field in a field group:

- 312 RDC_FI_ECC_CORRECT_TOTAL: Accumulated correctable ECC errors
- 313 RDC_FI_ECC_UNCORRECT_TOTAL: Accumulated uncorrectable ECC errors

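Since both ECC fields are accumulated totals, new errors show up as an increase between two readings rather than as the raw value. The sketch below illustrates that subtraction; the sample readings are made up for illustration, and `ecc_delta` is a hypothetical helper that assumes each reading is a dict keyed by field ID.

```python
# Sketch: detect new ECC errors by comparing two samples of accumulated counters.
# The readings below are invented example values, not real rdci output.

ECC_FIELDS = {
    312: "RDC_FI_ECC_CORRECT_TOTAL",
    313: "RDC_FI_ECC_UNCORRECT_TOTAL",
}


def ecc_delta(previous, current):
    """Return the increase of each ECC counter between two readings."""
    return {
        name: current[field_id] - previous[field_id]
        for field_id, name in ECC_FIELDS.items()
    }


before = {312: 4, 313: 0}  # hypothetical earlier reading
after = {312: 7, 313: 1}   # hypothetical later reading

for name, increase in ecc_delta(before, after).items():
    if increase > 0:
        print(f"{name} increased by {increase}")
```
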
## Device Monitoring

The RDC Tool enables you to monitor the fields of a field group on the GPUs of a GPU group.

```
$ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000

1 group found
```

| GPU Index | TEMP (m°C) | POWER (µW) |
| --------- | ------------ | ---------- |
| 0 | 25000 | 520500 |

```
$ rdci dmon -l
Supported fields Ids:
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
203 RDC_FI_GPU_UTIL: GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes

-e field ids; -i GPU index; -c count; -d delay; -l list; -f fieldgroup id; -g GPU group id
```

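The monitored values are reported in small units (milli-degrees Celsius and microwatts), so they usually need converting before they are shown to a person. A minimal sketch of that conversion, using the sample row from the table above; the function names are illustrative.

```python
# Sketch: convert dmon readings to more familiar units.
# Units follow the field descriptions: milli Celsius and microwatts.


def to_celsius(milli_celsius):
    """Convert a temperature in milli-degrees Celsius to degrees Celsius."""
    return milli_celsius / 1000.0


def to_watts(microwatts):
    """Convert a power reading in microwatts to watts."""
    return microwatts / 1_000_000.0


# Sample row from the table above: GPU 0, 25000 m°C, 520500 µW.
temperature = to_celsius(25000)
power = to_watts(520500)
print(f"GPU 0: {temperature:.1f} °C, {power:.4f} W")
```
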
## Job Stats

You can display GPU statistics for any given workload.

```
$ rdci stats -s 2 -g 1
Successfully started recording job 2 with a group ID 1

$ rdci stats -j 2
```

| Summary | Executive Status |
| --------------------------------- | ---------------------------- |
| Start time | 1586795401 |
| End time | 1586795445 |
| Total execution time | 44 |
| Energy Consumed (Joules) | 21682 |
| Power Usage (Watts) | Max: 49 Min: 13 Avg: 34 |
| GPU Clock (MHz) | Max: 1000 Min: 300 Avg: 903 |
| GPU Utilization (%) | Max: 69 Min: 0 Avg: 2 |
| Max GPU Memory Used (bytes) | 524320768 |
| Memory Utilization (%) | Max: 12 Min: 11 Avg: 12 |

```
$ rdci stats -x 2
Successfully stopped recording job 2

-s start recording on job id; -g group id; -j display job stats; -x stop recording.
```

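The start and end times in the summary look like Unix epoch timestamps in seconds; that is an assumption, but it is consistent with their difference matching the reported total execution time of 44. Under that assumption, the sketch below converts them to readable UTC times and recomputes the duration. `job_window` is a hypothetical helper name.

```python
# Sketch: interpret the job summary timestamps as Unix epoch seconds (assumed).
from datetime import datetime, timezone


def job_window(start_epoch, end_epoch):
    """Return (start, end, duration_seconds) with start and end as UTC datetimes."""
    start = datetime.fromtimestamp(start_epoch, tz=timezone.utc)
    end = datetime.fromtimestamp(end_epoch, tz=timezone.utc)
    return start, end, end_epoch - start_epoch


# Values taken from the job summary table above.
start, end, duration = job_window(1586795401, 1586795445)
print(start.isoformat(), end.isoformat(), duration)
```
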
## Job Stats Use Case

A common use case is to record GPU statistics associated with a job or workload. The following example shows how these features can be combined for this use case:

An example showing how job statistics can be recorded

rdci commands:

```
$ rdci group -c group1
successfully created a group with a group ID 1

$ rdci group -g 1 -a 0,1
GPU 0,1 is added to group 1 successfully.

$ rdci stats -s 123 -g 1
job 123 recorded successfully with the group ID

$ rdci stats -x 123
job 123 stops recording successfully

$ rdci stats -j 123
job stats printed
```

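The same sequence can be wrapped around a workload in a script, so that recording starts before the job and is stopped even if the job fails. The following is a sketch under stated assumptions: `record_job` and its `run` parameter are hypothetical, it uses only the `rdci stats` flags shown above, and the demonstration at the bottom substitutes a fake runner that prints the commands instead of executing `rdci`.

```python
# Sketch: wrap a workload with rdci job recording (start, stop, display).
# `run` defaults to subprocess.run; pass a stand-in to dry-run the sequence.
import subprocess


def record_job(job_id, group_id, workload, run=subprocess.run):
    """Record GPU stats for `workload` (a callable) on an existing GPU group."""
    run(["rdci", "stats", "-s", str(job_id), "-g", str(group_id)], check=True)
    try:
        result = workload()
    finally:
        # Stop recording even if the workload raises an exception.
        run(["rdci", "stats", "-x", str(job_id)], check=True)
    # Display the statistics collected for the job.
    run(["rdci", "stats", "-j", str(job_id)], check=True)
    return result


def print_only(command, check=True):
    """Stand-in runner for a dry run: print the command instead of running it."""
    print("$", " ".join(command))


record_job(123, 1, lambda: print("(workload runs here)"), run=print_only)
```

On a node where `rdci` is installed and group 1 already exists, dropping the `run=print_only` argument would execute the commands for real.
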
## Error-Correcting Code Output

The job output also reports the Error-Correcting Code (ECC) errors that occurred while the job was running.

## Diagnostic

You can run diagnostics on a GPU group as shown below:

```
$ rdci diag -g <gpu_group>

No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
GPU parameters check: GPU 0 Critical Edge temperature in range.
Compute Queue ready: Run binary search task on GPU 0 Pass.
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
```
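
For automated health checks, the summary part of the diagnostic output can be reduced to a pass/fail result per test. This sketch assumes the layout shown above, where the summary lines come before the `Diagnostic Details` banner; `parse_diag_summary` and `DIAG_OUTPUT` are hypothetical names, and the actual output may vary between RDC versions.

```python
# Sketch: reduce the diagnostic summary to {test_name: passed}.
# Assumes "<test name>: <status>" lines followed by a "=====" details banner.

DIAG_OUTPUT = """\
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
"""


def parse_diag_summary(text):
    """Return a dict mapping each summary test name to True if it passed."""
    results = {}
    for line in text.splitlines():
        if line.startswith("====="):
            break  # The details section begins; the summary is complete.
        name, separator, status = line.partition(":")
        if separator and status.strip():
            results[name.strip()] = status.strip() == "Pass"
    return results


summary = parse_diag_summary(DIAG_OUTPUT)
failed = [name for name, passed in summary.items() if not passed]
print("all tests passed" if not failed else "failed: " + ", ".join(failed))
```
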