.. meta::
  :description: ROCm Systems Profiler feature set documentation and reference
  :keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, feature set, use cases, tracking, visualization, tool, Instinct, accelerator, AMD

********************************************
ROCm Systems Profiler features and use cases
********************************************

`ROCm Systems Profiler <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems>`_ is designed to be highly extensible.
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/ROCm/timemory>`_
to manage extensions, resources, data, and other items. It supports the following features,
modes, metrics, and APIs.

Data collection modes
========================================

* Dynamic instrumentation

  * Runtime instrumentation: Instrument executables and shared libraries at runtime
  * Binary rewriting: Generate a new executable and/or library with instrumentation built-in

* Statistical sampling: Periodic software interrupts per-thread
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
* Causal profiling: Quantifies the potential impact of optimizations in parallel code

Data analysis
========================================

* High-level summary profiles with mean, min, max, and standard deviation statistics

  * Low overhead and memory efficient
  * Ideal for running at scale

* Comprehensive traces for every individual event and measurement
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling

Parallelism API support
========================================

* HIP
* HSA
* Pthreads
* MPI
* Kokkos-Tools (KokkosP)
* OpenMP-Tools (OMPT)
* UCX (Unified Communication X)

GPU metrics
========================================

* GPU hardware counters
* HIP API tracing
* HIP kernel tracing
* HSA API tracing
* HSA operation tracing
* rocDecode API tracing
* rocJPEG API tracing
* System-level sampling (via AMD-SMI)

  * Memory usage
  * Power usage
  * Temperature
  * Utilization
  * VCN activity
  * JPEG activity
  * XGMI interconnect metrics (link width, link speed, read/write data)
  * PCIe metrics (link width, link speed, bandwidth)

.. note::

  The availability of VCN, JPEG, XGMI, and PCIe metrics depends on device support and system topology. If unsupported, values will be reported as ``N/A`` in the output of ``amd-smi metric --usage``.

CPU metrics
========================================

* CPU hardware counters sampling and profiles
* CPU frequency sampling
* Various timing metrics

  * Wall time
  * CPU time (process and thread)
  * CPU utilization (process and thread)
  * User CPU time
  * Kernel CPU time

* Various memory metrics

  * High-water mark (sampling and profiles)
  * Memory page allocation
  * Virtual memory usage

* Network statistics
* I/O metrics
* Many others

ROCm Systems Profiler use cases
========================================

When analyzing the performance of an application, do NOT
assume you know where the performance bottlenecks are
and why they are happening. ROCm Systems Profiler is a tool for analyzing the entire
application and its performance. It is
ideal for characterizing where optimization would have the greatest impact
on an end-to-end run of the application and for
viewing what else is happening on the system during a performance bottleneck.

When GPUs are involved, there is a tendency to assume that
the quickest path to performance improvement is minimizing
the runtime of the GPU kernels. This is a highly flawed assumption.
If you optimize the runtime of a kernel from 1 millisecond
to 1 microsecond (a 1000x speed-up) but the original application never
spent time waiting for kernels to complete,
there would be no statistically significant reduction in the end-to-end
runtime of your application. In other words, it does not matter
how fast or slow the code on the GPU is if the application is not
bottlenecked on waiting for the GPU.

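The point about kernel speed-ups can be checked with a toy model. The following
Python sketch is an illustration only and is not part of ROCm Systems Profiler:
the function names ``end_to_end_ms`` and ``speedup`` and all timings are
hypothetical. It assumes each iteration launches one kernel asynchronously,
performs some host-side work, and then synchronizes, so the iteration takes as
long as the slower of the two.

```python
def end_to_end_ms(cpu_ms: float, kernel_ms: float, iterations: int) -> float:
    """Toy model (assumption): host work and one asynchronous kernel overlap
    fully, so each iteration costs the longer of the two durations."""
    return iterations * max(cpu_ms, kernel_ms)


def speedup(cpu_ms: float, slow_kernel_ms: float, fast_kernel_ms: float,
            iterations: int = 1000) -> float:
    """End-to-end speed-up obtained by replacing the slow kernel with the fast one."""
    before = end_to_end_ms(cpu_ms, slow_kernel_ms, iterations)
    after = end_to_end_ms(cpu_ms, fast_kernel_ms, iterations)
    return before / after


# Kernel optimized from 1 millisecond to 1 microsecond (1000x) in both cases.
# Host-bound: 5 ms of CPU work per iteration hides the kernel completely.
host_bound = speedup(cpu_ms=5.0, slow_kernel_ms=1.0, fast_kernel_ms=0.001)
# GPU-bound: only 0.5 ms of CPU work, so the application did wait on the kernel.
gpu_bound = speedup(cpu_ms=0.5, slow_kernel_ms=1.0, fast_kernel_ms=0.001)

print(f"host-bound end-to-end speed-up: {host_bound:.1f}x")
print(f"GPU-bound end-to-end speed-up:  {gpu_bound:.1f}x")
```

Under these assumptions the host-bound case reports a 1.0x end-to-end speed-up
despite the 1000x faster kernel, and even the GPU-bound case gains only 2.0x
because the host work becomes the new limit. Real applications overlap work in
more complicated ways, which is why measuring the whole application comes first.
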
Use ROCm Systems Profiler to obtain a high-level view of the entire application. Use it
to determine where the performance bottlenecks are and
to obtain clues as to why these bottlenecks are happening. Rather than worrying about kernel
performance first, start your investigation with ROCm Systems Profiler, which characterizes the
broad picture.

.. note::

  For insight into the execution of individual kernels on the GPU,
  use `ROCm Compute Profiler <https://github.com/rocm/rocprofiler-compute>`_.

In terms of CPU analysis, ROCm Systems Profiler does not target any specific vendor.
It works just as well on AMD and non-AMD CPUs.
With regard to the GPU, ROCm Systems Profiler is currently restricted to HIP and HSA APIs
and kernels running on AMD GPUs.