Fichiers
rocm-systems/projects/rdc/docs/user_guide/features.md
T
Sam Wu 041868928a add configs for read the docs
add handbook, user, install, and integration guides

Change-Id: I996f6909f4fdf76910981c0224f5a0266907e27a

remove old documentation steps

Change-Id: Icfad09926e67a2dfa1de0e182fc3cd534f0448f7

formatting fixes

Change-Id: I704bbbbf6ad384178f804e4a3f5e621f9c3d33b9


[ROCm/rdc commit: 1335d19020]
2023-05-05 15:44:34 -06:00

5.6 KiB

Data Center Tool: Feature Overview

Note that RDC Tool is in active development. This section highlights the current feature set.

Components

RDC components and framework for describing features

Discovery

The Discovery feature enables you to locate and display information of GPUs present in the compute node.

Example:

$ rdci discovery <host_name> -l
 2 GPUs found
GPU Index Device Information
0 Name: AMD Radeon Instinct™ MI50 Accelerator
1 Name: AMD Radeon Instinct™ MI50 Accelerator
$ rdci -l : list available GPUs
$ rdci -u: No SSL authentication

Groups

This section explains the GPU and field groups features.

GPU Groups

With the GPU groups feature, you can create, delete, and list logical groups of GPU.

$ rdci group -c GPU_GROUP
Successfully created a group with a group ID 1
 
$ rdci group -g 1 -a 0,1
Successfully added the GPU 0,1 to group 1
 
$ rdci group –l
 
1 group found
Group ID Group Name GPU Index
1 GPU_GROUP 0, 1
$ rdci group -d 1
Successfully removed group 1
 
-c create; –g group id; –a add GPU index; –l list; -d delete group 

Field Groups

The Field Groups feature provides you the options to create, delete, and list field groups.

$ rdci fieldgroup -c <fgroup> -f 150,155
Successfully created a field group with a group ID 1
 
$ rdci fieldgroup -l
 
1 group found
Group ID Group Name Field Ids
1 Fgroup 150, 155
$ rdci fieldgroup -d 1
Successfully removed field group 1
 
rdci dmon –l
Supported fields Ids: 
100 RDC_FI_GPU_CLOCK:  Current GPU clock freq.
150 RDC_FI_GPU_TEMP:  GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE:  Power usage in microwatts.
203 RDC_FI_GPU_UTIL:  GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
 
-c create; –g group id; –a add GPU index; –l list; -d delete group

Monitor Errors

You can define RDC_FI_ECC_CORRECT_TOTAL or RDC_FI_ECC_UNCORRECT_TOTAL field to get the RAS Error-Correcting Code (ECC) counter:

• 312 RDC_FI_ECC_CORRECT_TOTAL: Accumulated correctable ECC errors

• 313 RDC_FI_ECC_UNCORRECT_TOTAL: Accumulated uncorrectable ECC errors

Device Monitoring

The RDC Tool enables you to monitor the GPU fields.

$ rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
 
 
1 group found
GPU Index TEMP (m°C) POWER (µW)
0 25000 520500
rdci dmon –l
Supported fields Ids: 
100 RDC_FI_GPU_CLOCK:  Current GPU clock freq.
150 RDC_FI_GPU_TEMP:  GPU temp. in milli Celsius.
155 RDC_FI_POWER_USAGE:  Power usage in microwatts.
203 RDC_FI_GPU_UTIL:  GPU busy percentage.
525 RDC_FI_GPU_MEMORY_USAGE: VRAM Memory usage in bytes
 
-e field ids; -i GPU index; -c count; -d delay; -l list; -f fieldgroup id 

Job Stats

You can display GPU statistics for any given workload.

$ rdci stats -s 2 -g 1
Successfully started recording job 2 with a group ID 1
 
$ rdci stats -j 2
Summary Executive Status
Start time 1586795401
End time 1586795445
Total execution time 44
--------------------------------- ----------------------------
Energy Consumed (Joules) 21682
Power Usage (Watts) Max: 49 Min: 13 Avg: 34
GPU Clock (MHz) Max: 1000 Min: 300 Avg: 903
GPU Utilization (%) Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes) 524320768
Memory Utilization (%) Max: 12 Min: 11 Avg: 12
$ rdci stats -x 2
Successfully stopped recording job 2
 
-s start recording on job id; -g group id; -j display job stats; –x stop recording. 

Job Stats Use Case

A common use case is to record GPU statistics associated with any job or workload. The following example shows how all these features can be put together for this use case:

Jobs

An example showing how job statistics can be recorded

rdci commands

$ rdci group -c group1

successfully created a group with a group ID 1

$ rdci group -g 1 -a 0,1

GPU 0,1 is added to group 1 successfully.

rdci stats -s 123 -g 1

job 123 recorded successfully with the group ID 

rdci stats -x 123

job 123 stops recording successfully 

rdci stats -j 123

job stats printed 

Error-Correcting Code Output

In the job output, this feature prints out the Error-Correcting Code (ECC) errors while running the job.

Diagnostic

You can run diagnostic on a GPU group as shown below:

$ rdci diag -g <gpu_group>
 
No compute process:  Pass
Node topology check:  Pass
GPU parameters check:  Pass
Compute Queue ready:  Pass
System memory check:  Pass
 =============== Diagnostic Details ==================
No compute process:  No processes running on any devices.
Node topology check:  No link detected.
GPU parameters check:  GPU 0 Critical Edge temperature in range.
Compute Queue ready:  Run binary search task on GPU 0 Pass.
System memory check:  Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.