Introduction to ROCm Data Center Tool User Guide
The ROCm™ Data Center Tool™ (RDC) simplifies administration and addresses key infrastructure challenges for AMD GPUs in cluster and data center environments. The main features are:
• GPU telemetry
• GPU statistics for jobs
• Integration with third-party tools
• Open source
You can use the tool in standalone mode if all components are installed. Alternatively, existing management tools can access the same set of features through a library.
For details on different modes of operation, refer to Starting RDC.
Objective
This user guide is intended to:
• Provide an overview of the RDC tool features.
• Describe how system administrators and Data Center (or HPC) users can administer and configure AMD GPUs.
• Describe the components.
• Provide an overview of the open source developer handbook.
Terminology
Table 1: Terms and Abbreviations
| Term | Description |
|---|---|
| RDC | ROCm Data Center tool |
| Compute node (CN) | One of many nodes containing one or more GPUs in the Data Center on which compute jobs are run |
| Management node (MN) or Main console | A machine running system administration applications to administer and manage the Data Center |
| GPU Groups | Logical grouping of one or more GPUs in a compute node |
| Fields | Metrics that RDC can monitor, such as GPU temperature, memory usage, and power usage |
| Field Groups | Logical grouping of multiple fields |
| Job | A workload that is submitted to one or more compute nodes |
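How the terms in Table 1 relate to one another can be sketched as a small data model. This is only an illustration: the class names, attribute names, and field identifiers below are hypothetical and are not part of the RDC API, and the assumption that a job is described by the GPU groups it runs on is made here only for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Field(Enum):
    """Hypothetical metric names; RDC defines its own field identifiers."""
    GPU_TEMPERATURE = "gpu_temperature"
    MEMORY_USAGE = "memory_usage"
    POWER_USAGE = "power_usage"


@dataclass
class GpuGroup:
    """Logical grouping of one or more GPUs in a compute node."""
    name: str
    compute_node: str
    gpu_indexes: list[int]

    def __post_init__(self) -> None:
        # Table 1 requires "one or more GPUs" per group.
        if not self.gpu_indexes:
            raise ValueError("a GPU group needs at least one GPU")


@dataclass
class FieldGroup:
    """Logical grouping of multiple fields."""
    name: str
    fields: list[Field]


@dataclass
class Job:
    """A workload submitted to one or more compute nodes.

    Assumption for this sketch: the job is described by the GPU groups
    it runs on, and the compute nodes are derived from those groups.
    """
    job_id: str
    gpu_groups: list[GpuGroup]

    def compute_nodes(self) -> set[str]:
        return {group.compute_node for group in self.gpu_groups}


# Example: one job spanning GPU groups on two compute nodes.
training_gpus = GpuGroup("training", "cn01", [0, 1])
inference_gpus = GpuGroup("inference", "cn02", [0])
thermal = FieldGroup("thermal", [Field.GPU_TEMPERATURE, Field.POWER_USAGE])
job = Job("job-42", [training_gpus, inference_gpus])
```

In this sketch a field group such as `thermal` is independent of any GPU group, so the same set of metrics could be watched on different groups of GPUs.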
Target Audience
The audience for the AMD RDC tool consists of:
• Administrators: The tool gives cluster administrators monitoring, validation, and policy configuration capabilities.
• HPC users: The tool provides GPU-centric feedback on their workload submissions.
• OEMs: The tool lets OEMs add GPU information to their existing cluster management software.
• Open source contributors: RDC is open source and accepts contributions from the community.