# Data Center Tool: Feature Overview
The RDC Tool is in active development. This section highlights the current feature set.
![Components](../data/features.png)
RDC components and framework for describing features
## Discovery
The Discovery feature lets you locate the GPUs present in the compute node and display information about them.
Example:
```
$ rdci discovery <host_name> -l
2 GPUs found
```
| GPU Index | Device Information |
| --------- | ------------------------------------------- |
| 0 | Name: AMD Radeon Instinct™ MI50 Accelerator |
| 1 | Name: AMD Radeon Instinct™ MI50 Accelerator |
```
-l list available GPUs; -u no SSL authentication
```
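When scripting around discovery, the GPU count can be read from the summary line. The following is a minimal sketch, assuming the output contains a line of the form `N GPUs found` as in the example above. `parse_gpu_count` is a hypothetical helper and not part of RDC.

```python
import re


def parse_gpu_count(output: str) -> int:
    """Return the GPU count from a line such as '2 GPUs found'."""
    match = re.search(r"^\s*(\d+)\s+GPUs?\s+found", output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'N GPUs found' line in discovery output")
    return int(match.group(1))


# Sample text copied from the discovery example above.
sample = "2 GPUs found\n"
print(parse_gpu_count(sample))
```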
## Groups
This section explains the GPU and field groups features.
### GPU Groups
With the GPU groups feature, you can create, delete, and list logical groups of GPUs.
```
$ rdci group -c GPU_GROUP
Successfully created a group with a group ID 1
$ rdci group -g 1 -a 0,1
Successfully added the GPU 0,1 to group 1
$ rdci group -l
1 group found
```
| Group ID | Group Name | GPU Index |
| -------- | ------------ | --------- |
| 1 | GPU_GROUP | 0, 1 |
```
$ rdci group -d 1
Successfully removed group 1
-c create; -g group id; -a add GPU index; -l list; -d delete group
```
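If you drive these commands from a script, it can help to build the argument lists in one place. The sketch below only assembles the `rdci group` invocations shown above. The function names are hypothetical, and nothing is executed unless you pass a list to something like `subprocess.run` on a host where `rdci` is installed.

```python
from typing import List, Sequence


def create_group(name: str) -> List[str]:
    """argv for creating a GPU group (-c)."""
    return ["rdci", "group", "-c", name]


def add_gpus(group_id: int, gpu_indexes: Sequence[int]) -> List[str]:
    """argv for adding GPU indexes (-a) to an existing group (-g)."""
    gpus = ",".join(str(index) for index in gpu_indexes)
    return ["rdci", "group", "-g", str(group_id), "-a", gpus]


def list_groups() -> List[str]:
    """argv for listing GPU groups (-l)."""
    return ["rdci", "group", "-l"]


def delete_group(group_id: int) -> List[str]:
    """argv for deleting a GPU group (-d)."""
    return ["rdci", "group", "-d", str(group_id)]


# Print the same sequence as the console example above.
for argv in (create_group("GPU_GROUP"), add_gpus(1, [0, 1]),
             list_groups(), delete_group(1)):
    print(" ".join(argv))
```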
### Field Groups
The Field Groups feature lets you create, delete, and list field groups.
```
$ rdci fieldgroup -c <fgroup> -f 150,155
Successfully created a field group with a group ID 1
$ rdci fieldgroup -l
1 group found
```
| Group ID | Group Name | Field Ids |
| -------- | ------------ | --------- |
| 1 | Fgroup | 150, 155 |
```
$ rdci fieldgroup -d 1
Successfully removed field group 1
$ rdci dmon -l
Supported fields Ids:
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
203 RDC_FI_GPU_UTIL: GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
-c create; -g group id; -a add GPU index; -l list; -d delete group
```
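The field IDs passed with `-f` come from the supported-fields listing. As a rough sketch, that listing can be turned into a lookup table, assuming each field line has the form `<id> <RDC_FI_NAME>: <description>` as in the output above. `parse_supported_fields` is a hypothetical helper.

```python
import re
from typing import Dict, Tuple

# One field per line: numeric ID, RDC_FI_* name, colon, free-text description.
FIELD_LINE = re.compile(r"^\s*(\d+)\s+(RDC_FI_\w+):\s*(.*)$")


def parse_supported_fields(output: str) -> Dict[int, Tuple[str, str]]:
    """Map field ID -> (field name, description); other lines are skipped."""
    fields = {}
    for line in output.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields[int(match.group(1))] = (match.group(2), match.group(3).strip())
    return fields


# Sample text copied from the listing above.
sample = """Supported fields Ids:
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
"""
fields = parse_supported_fields(sample)
print(fields[150][0])
```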
### Monitor Errors
To read the RAS Error-Correcting Code (ECC) counters, use the RDC_FI_ECC_CORRECT_TOTAL or RDC_FI_ECC_UNCORRECT_TOTAL field:
- 312 RDC_FI_ECC_CORRECT_TOTAL: Accumulated correctable ECC errors
- 313 RDC_FI_ECC_UNCORRECT_TOTAL: Accumulated uncorrectable ECC errors
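
Because both counters are accumulated totals, the number of errors that occurred during an interval is the difference between two readings. The sketch below illustrates that arithmetic. The readings are made-up values, `new_ecc_errors` is a hypothetical helper, and a counter that resets between samples is not handled.

```python
from typing import Dict

ECC_CORRECT_TOTAL = 312    # RDC_FI_ECC_CORRECT_TOTAL
ECC_UNCORRECT_TOTAL = 313  # RDC_FI_ECC_UNCORRECT_TOTAL


def new_ecc_errors(previous: Dict[int, int], current: Dict[int, int]) -> Dict[int, int]:
    """Return per-field error counts accumulated between two readings."""
    return {
        field_id: current[field_id] - previous[field_id]
        for field_id in (ECC_CORRECT_TOTAL, ECC_UNCORRECT_TOTAL)
    }


# Illustrative readings only; not taken from real hardware.
before = {ECC_CORRECT_TOTAL: 4, ECC_UNCORRECT_TOTAL: 0}
after = {ECC_CORRECT_TOTAL: 7, ECC_UNCORRECT_TOTAL: 0}
print(new_ecc_errors(before, after))
```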
## Device Monitoring
The RDC Tool lets you monitor GPU fields, such as temperature and power usage, for a GPU group.
```
$ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
1 group found
```
| GPU Index | TEMP (m°C) | POWER (µW) |
| --------- | ------------ | ---------- |
| 0 | 25000 | 520500 |
```
$ rdci dmon -l
Supported fields Ids:
100 RDC_FI_GPU_CLOCK: Current GPU clock freq.
150 RDC_FI_GPU_TEMP: GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE: Power usage in microwatts.
203 RDC_FI_GPU_UTIL: GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
-e field ids; -i GPU index; -c count; -d delay; -l list; -f fieldgroup id
```
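Temperature is reported in milli-degrees Celsius and power in microwatts, so the raw values usually need scaling before they are shown to people. A small sketch of the conversion, using the sample row above (the helper names are hypothetical):

```python
def millicelsius_to_celsius(value: int) -> float:
    """Convert an RDC_FI_GPU_TEMP reading (m°C) to degrees Celsius."""
    return value / 1000.0


def microwatts_to_watts(value: int) -> float:
    """Convert an RDC_FI_POWER_USAGE reading (µW) to watts."""
    return value / 1_000_000.0


# Values from the sample row for GPU 0.
print(millicelsius_to_celsius(25000))  # 25.0
print(microwatts_to_watts(520500))     # 0.5205
```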
## Job Stats
You can display GPU statistics for any given workload.
```
$ rdci stats -s 2 -g 1
Successfully started recording job 2 with a group ID 1
$ rdci stats -j 2
```
| Summary | Executive Status |
| --------------------------------- | ---------------------------- |
| Start time | 1586795401 |
| End time | 1586795445 |
| Total execution time | 44 |
| --------------------------------- | ---------------------------- |
| Energy Consumed (Joules) | 21682 |
| Power Usage (Watts) | Max: 49 Min: 13 Avg: 34 |
| GPU Clock (MHz) | Max: 1000 Min: 300 Avg: 903 |
| GPU Utilization (%) | Max: 69 Min: 0 Avg: 2 |
| Max GPU Memory Used (bytes) | 524320768 |
| Memory Utilization (%) | Max: 12 Min: 11 Avg: 12 |
```
$ rdci stats -x 2
Successfully stopped recording job 2
-s start recording on job id; -g group id; -j display job stats; -x stop recording
```
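The summary values are easy to post-process. The sketch below checks that the total execution time in the sample equals the end time minus the start time, and splits a `Max: … Min: … Avg: …` cell into numbers. It assumes the cell format shown above. Both helper names are hypothetical.

```python
import re
from typing import Dict


def execution_seconds(start_time: int, end_time: int) -> int:
    """Total execution time as the difference between the two timestamps."""
    return end_time - start_time


def parse_max_min_avg(cell: str) -> Dict[str, int]:
    """Parse a cell such as 'Max: 49 Min: 13 Avg: 34' into a dict."""
    pairs = re.findall(r"(Max|Min|Avg):\s*(\d+)", cell)
    return {key.lower(): int(value) for key, value in pairs}


# Values from the sample job summary above.
print(execution_seconds(1586795401, 1586795445))  # 44
print(parse_max_min_avg("Max: 49 Min: 13 Avg: 34"))
```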
## Job Stats Use Case
A common use case is to record GPU statistics associated with any job or workload. The following example shows how all these features can be put together for this use case:
![Jobs](../data/features_jobs.png)
An example showing how job statistics can be recorded
rdci commands
```
$ rdci group -c group1
successfully created a group with a group ID 1
$ rdci group -g 1 -a 0,1
GPU 0,1 is added to group 1 successfully.
$ rdci stats -s 123 -g 1
job 123 recorded successfully with the group ID
$ rdci stats -x 123
job 123 stops recording successfully
$ rdci stats -j 123
job stats printed
```
```
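The same sequence can be wrapped around a workload from a script. This is a sketch under stated assumptions, not an official integration: it assumes the GPU group already exists, `record_job` and `./my_workload` are hypothetical names, and the `run` parameter is injectable so the flow can be dry-run without `rdci` installed.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def record_job(job_id: int, group_id: int, workload: Sequence[str],
               run: Optional[Callable[[List[str]], object]] = None) -> None:
    """Start recording, run the workload, stop recording, then show stats."""
    if run is None:
        # Default runner: execute each command and fail on a non-zero exit.
        run = lambda argv: subprocess.run(argv, check=True)
    job = str(job_id)
    run(["rdci", "stats", "-s", job, "-g", str(group_id)])
    try:
        run(list(workload))
    finally:
        # Stop recording even if the workload fails.
        run(["rdci", "stats", "-x", job])
    run(["rdci", "stats", "-j", job])


# Dry run: print the commands instead of executing them.
record_job(123, 1, ["./my_workload"], run=lambda argv: print(" ".join(argv)))
```

Stopping the recording in a `finally` block is a design choice of this sketch so that a failed workload does not leave job recording active.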
## Error-Correcting Code Output
The job output also reports the Error-Correcting Code (ECC) errors that occurred while the job was running.
## Diagnostic
You can run diagnostics on a GPU group as shown below:
```
$ rdci diag -g <gpu_group>
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
GPU parameters check: GPU 0 Critical Edge temperature in range.
Compute Queue ready: Run binary search task on GPU 0 Pass.
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
```
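For automated health checks, the summary section can be reduced to a pass/fail decision. The sketch below assumes the layout shown above: one `check name: result` line per check, followed by a banner line beginning with `=` that introduces the details. `parse_diag_summary` is a hypothetical helper.

```python
from typing import Dict


def parse_diag_summary(output: str) -> Dict[str, str]:
    """Collect 'check name: result' lines that precede the details banner."""
    results = {}
    for line in output.splitlines():
        if line.strip().startswith("="):
            break  # everything after the banner is per-check detail text
        name, separator, result = line.partition(":")
        if separator and result.strip():
            results[name.strip()] = result.strip()
    return results


# Sample text copied from the diagnostic output above.
sample = """No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
"""
summary = parse_diag_summary(sample)
failed = [name for name, result in summary.items() if result != "Pass"]
print(len(summary), failed)
```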