354fe5f52c
* Show description of metrics during analysis
* Use --include-cols Description show the Description column in analyze mode (this is hidden by default)
* Remove tips field from analysis config
* Align metric names in analysis config and documentation
* Add unified config utils/unified_config.yaml
* Add python script utils/split_config.py to auto generate analysis configuration and documentation metrics description
* Add test case to ensure unified config is older than auto-generated config
* Auto generate analysis config and documentation metrics description
* Update CONTRIBUTING.md to add instructions to build documentation assets
* Add docker image and compose file to build documentation
* Update CHANGELOG and Documentation
* Use jinja template instead of hardcoding metric tables in documentation
[ROCm/rocprofiler-compute commit: bb44e90b2d]
59 lines
2.2 KiB
ReStructuredText
59 lines
2.2 KiB
ReStructuredText
.. meta::
|
||
:description: ROCm Compute Profiler performance model: Command processor (CP)
|
||
:keywords: Omniperf, ROCm Compute Profiler, ROCm, profiler, tool, Instinct, accelerator, command, processor, fetcher, packet processor, CPF, CPC
|
||
|
||
**********************
|
||
Command processor (CP)
|
||
**********************
|
||
|
||
The command processor (CP) is responsible for interacting with the AMDGPU kernel
|
||
driver -- the Linux kernel -- on the CPU and for interacting with user-space
|
||
HSA clients when they submit commands to HSA queues. Basic tasks of the CP
|
||
include reading commands (such as, corresponding to a kernel launch) out of
|
||
:hsa-runtime-pdf:`HSA queues <68>`, scheduling work to subsequent parts of the
|
||
scheduler pipeline, and marking kernels complete for synchronization events on
|
||
the host.
|
||
|
||
The command processor consists of two sub-components:
|
||
|
||
* :ref:`Fetcher <cpf-metrics>` (CPF): Fetches commands out of memory to hand
|
||
them over to the CPC for processing.
|
||
|
||
* :ref:`Packet processor <cpc-metrics>` (CPC): Micro-controller running the
|
||
command processing firmware that decodes the fetched commands and (for
|
||
kernels) passes them to the :ref:`workgroup processors <desc-spi>` for
|
||
scheduling.
|
||
|
||
Before scheduling work to the accelerator, the command processor can
|
||
first acquire a memory fence to ensure system consistency
|
||
(:hsa-runtime-pdf:`Section 2.6.4 <91>`). After the work is complete, the
|
||
command processor can apply a memory-release fence. Depending on the AMD CDNA™
|
||
accelerator under question, either of these operations *might* initiate a cache
|
||
write-back or invalidation.
|
||
|
||
Analyzing command processor performance is most interesting for kernels
|
||
that you suspect to be limited by scheduling or launch rate. The command
|
||
processor’s metrics therefore are focused on reporting, for example:
|
||
|
||
* Utilization of the fetcher
|
||
|
||
* Utilization of the packet processor, and decoding processing packets
|
||
|
||
* Stalls in fetching and processing
|
||
|
||
.. _cpf-metrics:
|
||
|
||
Command processor fetcher (CPF)
|
||
===============================
|
||
|
||
.. jinja:: cpf-metrics
|
||
:file: _templates/metrics_table.j2
|
||
|
||
.. _cpc-metrics:
|
||
|
||
Command processor packet processor (CPC)
|
||
========================================
|
||
|
||
.. jinja:: cpc-metrics
|
||
:file: _templates/metrics_table.j2
|