b75423b173
* Updating install doc page * Removing the Quick Start page * Add documentation for rocpd output * Update links to reference rocm-systems repo * Update README.md Installation instructions references ROCm Docs link. * Updated git clone instructions Back to using https to clone the repository * Fix formatting * Update projects/rocprofiler-systems/docs/how-to/understanding-rocprof-sys-output.rst * Add reference to "rocpd" section to the "Profiling Python" section * Update CONTRIBUTING.md * For ROCPD, document minimum version of SDK. * Update CHANGELOGS Signed-off-by: David Galiffi <David.Galiffi@amd.com> * Update CHANGELOG.md Updated based on feedback from docs team * Update CONTRIBUTING.md * Update CONTRIBUTING.md. Simplify and remove setup information overlapping with the "rocm-systems" contributing documentation. * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Update CHANGELOG.md * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Apply suggestion from @prbasyal-amd Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> --------- Signed-off-by: David Galiffi <David.Galiffi@amd.com> Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
627 regels
33 KiB
ReStructuredText
627 regels
33 KiB
ReStructuredText
.. meta::
|
|
:description: ROCm Systems Profiler causal profiling documentation and reference
|
|
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, causal profiling, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
|
|
|
****************************************************
|
|
Performing causal profiling
|
|
****************************************************
|
|
|
|
The process of causal profiling can be summarized as:
|
|
|
|
*If you speed up a given block of code by X%, the application will run Y% faster*.
|
|
|
|
Causal profiling directs parallel application developers to where they should focus their optimization
|
|
efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
|
|
that *software execution speed is relative*. Speeding up a block of code by X% is mathematically equivalent
|
|
to that block of code running at its current speed if all the other code is running slower by X%.
|
|
Thus, causal profiling works by performing experiments on blocks of code during program execution which
|
|
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
|
|
are translated into calculations for the potential impact of speeding up this block of code.
|
|
|
|
Consider the following C++ code executing ``foo`` and ``bar`` concurrently in two different threads
|
|
where ``foo`` is ideally 30% faster than ``bar``:
|
|
|
|
.. code-block:: cpp
|
|
|
|
#include <cstddef>
|
|
#include <thread>
|
|
constexpr size_t FOO_N = 7 * 1000000000UL;
|
|
constexpr size_t BAR_N = 10 * 1000000000UL;
|
|
|
|
void foo()
|
|
{
|
|
for(volatile size_t i = 0; i < FOO_N; ++i) {}
|
|
}
|
|
|
|
void bar()
|
|
{
|
|
for(volatile size_t i = 0; i < BAR_N; ++i) {}
|
|
}
|
|
|
|
int main()
|
|
{
|
|
std::thread _threads[] = { std::thread{ foo },
|
|
std::thread{ bar } };
|
|
|
|
for(auto& itr : _threads)
|
|
itr.join();
|
|
}
|
|
|
|
No matter how many optimizations are applied to ``foo``, the application will always
|
|
require the same amount of time
|
|
because the end-to-end performance is limited by ``bar``. However, a 5% speed-up
|
|
in ``bar`` results in the
|
|
end-to-end performance improving by 5%. This trend continues linearly, with a 10% speed-up
|
|
in ``bar`` yielding a 10% speed-up in
|
|
end-to-end performance, and so on, up to a 30% speed-up, at which point ``bar`` runs as fast as ``foo``.
|
|
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
|
|
improvement of 30% because the application
|
|
is now limited by performance of ``foo``, as demonstrated below in the causal
|
|
profiling visualization:
|
|
|
|
.. image:: ../data/causal-foobar.png
|
|
:alt: Visualization of the performance improvements for two functions with causal profiling
|
|
|
|
The full details of the causal profiling methodology can be found in the paper
|
|
`Coz: Finding Code that Counts with Causal Profiling <http://arxiv.org/pdf/1608.03676v1.pdf>`_.
|
|
The author's implementation is publicly available on `GitHub <https://github.com/plasma-umass/coz>`_.
|
|
|
|
Getting started
|
|
========================================
|
|
|
|
To effectively use causal profiling, it is important to understand a few key
|
|
concepts, such as progress points.
|
|
|
|
Progress points
|
|
-----------------------------------
|
|
|
|
Causal profiling requires "progress points" to track progress through the code
|
|
in between samples. Progress points must be triggered in a deterministic manner via instrumentation.
|
|
This can happen in three different ways:
|
|
|
|
* `ROCm Systems Profiler <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems>`_ can leverage the callbacks from
|
|
Kokkos-Tools, OpenMP-Tools, rocprofiler-sdk, etc. and the wrappers around functions for
|
|
MPI, NUMA, RCCL, etc. to act as progress points
|
|
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
|
|
to insert progress points
|
|
* Users can leverage :doc:`User APIs <../how-to/using-rocprof-sys-api>`,
|
|
such as ``ROCPROFSYS_CAUSAL_PROGRESS``
|
|
|
|
.. note::
|
|
|
|
Binary rewrite to insert progress points is not supported. When a rewritten binary
|
|
runs, Dyninst translates the instruction pointer address in order to perform
|
|
the instrumentation. As a result, call stack samples never return instruction
|
|
pointer addresses within the valid ROCm Systems Profiler range.
|
|
|
|
Key concepts
|
|
-----------------------------------
|
|
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| Concept | Setting | Options | Description |
|
|
+==================+======================================+==================================+============================================+
|
|
| Backend | ``ROCPROFSYS_CAUSAL_BACKEND`` | ``perf``, ``timer`` | Backend for recording samples required |
|
|
| | | | to calculate the virtual speed-up |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| Mode | ``ROCPROFSYS_CAUSAL_MODE`` | ``function``, ``line`` | Select an entire function or individual |
|
|
| | | | line of code for causal experiments |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| End-to-end | ``ROCPROFSYS_CAUSAL_END_TO_END`` | Boolean | Perform a single experiment during the |
|
|
| | | | entire run (does not require |
|
|
| | | | progress points) |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| Fixed speed-up | ``ROCPROFSYS_CAUSAL_FIXED_SPEEDUP`` | one or more values from [0, 100] | Virtual speed-up or pool of virtual |
|
|
| | | | speed-ups to randomly select |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| Binary scope | ``ROCPROFSYS_CAUSAL_BINARY_SCOPE`` | regular expression(s) | Dynamic binaries containing code for |
|
|
| | | | experiments |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| Source scope | ``ROCPROFSYS_CAUSAL_SOURCE_SCOPE`` | regular expression(s) | ``<file>`` and/or ``<file>:<line>`` |
|
|
| | | | containing code to include in experiments |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
| Function scope | ``ROCPROFSYS_CAUSAL_FUNCTION_SCOPE`` | regular expression(s) | Restricts experiments to matching |
|
|
| | | | functions (function mode) or lines of |
|
|
| | | | code within matching functions (line mode) |
|
|
+------------------+--------------------------------------+----------------------------------+--------------------------------------------+
|
|
|
|
.. note::
|
|
|
|
* Binary scope defaults to ``%MAIN%`` (in the executable), but the scope can be expanded to include linked libraries.
|
|
* ``<file>`` and ``<file>:<line>`` support requires debug info (for example, the code must be compiled with ``-g`` or, preferably, with ``-g3``)
|
|
* Function mode does not require debug info but does not support stripped binaries
|
|
|
|
Backends
|
|
-----------------------------------
|
|
|
|
There are two backends to choose from: ``perf`` and ``timer``.
|
|
They are used to record the samples required to calculate the virtual speedup.
|
|
Both backends interrupt each thread 1000 times per second (of CPU-time) to apply the virtual speed-ups.
|
|
The difference between each backend is how the samples are recorded.
|
|
There are three key differences between the two backends:
|
|
|
|
* the ``perf`` backend requires Linux Perf and elevated security priviledges
|
|
* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
|
|
interrupts the application 1000 times per second of realtime
|
|
* the ``timer`` backend has less accurate call stacks due to instruction pointer skid
|
|
|
|
In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
|
|
security priviledges permit its usage.
|
|
If ``ROCPROFSYS_CAUSAL_BACKEND`` is set to ``auto``, ROCm Systems Profiler falls back
|
|
to using the ``timer`` backend only if
|
|
the ``perf`` backend fails. If ``ROCPROFSYS_CAUSAL_BACKEND`` is
|
|
set to ``perf`` and using this backend fails, ROCm Systems Profiler aborts.
|
|
|
|
Instruction pointer skid
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Instruction pointer (IP) skid measures how many instructions run after the event of interest
|
|
before the program actually stops. The IP skid is calculated by subtracting
|
|
the location of the IP at the point of interest from the location of the IP
|
|
when the kernel finally stops the application.
|
|
For the ``timer`` backend, this translates to the
|
|
difference in the IP between when the timer generated a signal and when the
|
|
signal was actually generated. Although IP skid still occurs with the ``perf`` backend,
|
|
it is much more pronounced with the ``timer`` backend due to the overhead of pausing the entire thread.
|
|
This means the ``timer`` backend tends to have a lower resolution than the ``perf`` backend,
|
|
especially in ``line`` mode.
|
|
|
|
Installing Linux Perf
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Linux Perf is built into the kernel and may already be installed
|
|
(for instance, it is included in the default kernel for OpenSUSE).
|
|
The official method of checking whether Linux Perf is installed is
|
|
checking for the existence of the file
|
|
``/proc/sys/kernel/perf_event_paranoid``. If the file exists, the kernel has Perf installed.
|
|
|
|
If this file does not exist, as with Debian-based systems like Ubuntu, run the following command as superuser:
|
|
|
|
.. code-block:: shell
|
|
|
|
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
|
|
|
|
and reboot your computer. In order to use the ``perf`` backend, the value
|
|
of ``/proc/sys/kernel/perf_event_paranoid``
|
|
should be less than or equal to 2. If the value in this file is greater than 2, you can't
|
|
use the ``perf`` backend.
|
|
|
|
To update the paranoid level temporarily until the system is rebooted, run
|
|
one of the following commands
|
|
as a superuser (where ``PARANOID_LEVEL=<N>`` has a value of ``<N>`` in the range ``[-1, 2]``):
|
|
|
|
.. code-block:: shell
|
|
|
|
echo ${PARANOID_LEVEL} | sudo tee /proc/sys/kernel/perf_event_paranoid
|
|
|
|
or
|
|
|
|
.. code-block:: shell
|
|
|
|
sysctl kernel.perf_event_paranoid=${PARANOID_LEVEL}
|
|
|
|
To make the paranoid level persistent after a reboot, add ``kernel.perf_event_paranoid=<N>``
|
|
(where ``<N>`` is the desired paranoid level) to the ``/etc/sysctl.conf`` file.
|
|
|
|
Speed-up prediction variability and the rocprof-sys-causal executable
|
|
-----------------------------------------------------------------------
|
|
|
|
Causal profiling typically requires running the application several times in
|
|
order to adequately sample all the code domains, experiment
|
|
with speed-ups and other techniques, and resolve statistical fluctuations.
|
|
The ``rocprof-sys-causal`` executable is designed to simplify this procedure:
|
|
|
|
.. code-block:: shell
|
|
|
|
$ rocprof-sys-causal --help
|
|
[rocprof-sys-causal] Usage: ./bin/rocprof-sys-causal [ --help (count: 0, dtype: bool)
|
|
--version (count: 0, dtype: bool)
|
|
--monochrome (max: 1, dtype: bool)
|
|
--debug (max: 1, dtype: bool)
|
|
--verbose (count: 1)
|
|
--config (min: 0, dtype: filepath)
|
|
--launcher (count: 1, dtype: executable)
|
|
--generate-configs (min: 0, dtype: folder)
|
|
--no-defaults (min: 0, dtype: bool)
|
|
--mode (count: 1, dtype: string)
|
|
--output-name (min: 1, dtype: filename)
|
|
--reset (max: 1, dtype: bool)
|
|
--end-to-end (max: 1, dtype: bool)
|
|
--wait (count: 1, dtype: seconds)
|
|
--duration (count: 1, dtype: seconds)
|
|
--iterations (count: 1, dtype: int)
|
|
--speedups (min: 0, dtype: integers)
|
|
--binary-scope (min: 0, dtype: integers)
|
|
--source-scope (min: 0, dtype: integers)
|
|
--function-scope (min: 0, dtype: regex-list)
|
|
--binary-exclude (min: 0, dtype: integers)
|
|
--source-exclude (min: 0, dtype: integers)
|
|
--function-exclude (min: 0, dtype: regex-list)
|
|
]
|
|
|
|
Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
|
|
This executable is designed to streamline that process.
|
|
For example (assume all commands end with \'-- <exe> <args>\'):
|
|
|
|
rocprof-sys-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
|
|
|
|
rocprof-sys-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
|
|
# - 0
|
|
# - randomly selected from 5, 10, 15, and 20
|
|
|
|
rocprof-sys-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
|
|
# 1. func_A
|
|
# 2. func_B
|
|
# 3. func_A or func_B
|
|
General tips:
|
|
- Insert progress points at hotspots in your code or use rocprof-sys\'s runtime instrumentation
|
|
- Note: binary rewrite will produce a incompatible new binary
|
|
- Run rocprof-sys-causal in "function" mode first (does not require debug info)
|
|
- Run rocprof-sys-causal in "line" mode when you are targeting one function (requires debug info)
|
|
- Preferably, use predictions from the "function" mode to determine which function to target
|
|
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
|
|
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
|
|
- Note: source scope requires debug info
|
|
|
|
|
|
Options:
|
|
-h, -?, --help Shows this page
|
|
--version Prints the version and exit
|
|
|
|
[DEBUG OPTIONS]
|
|
|
|
--monochrome Disable colorized output
|
|
--debug Debug output
|
|
-v, --verbose Verbose output
|
|
|
|
[GENERAL OPTIONS]
|
|
|
|
-c, --config Base configuration file
|
|
-l, --launcher When running MPI jobs, rocprof-sys-causal needs to be *before* the executable which launches the MPI processes (i.e.
|
|
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
|
|
target) for causal profiling, e.g., `rocprof-sys-causal -l foo -- mpirun -n 4 foo`. This ensures that the rocprof-sys
|
|
library is LD_PRELOADed on the proper target
|
|
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
|
|
will be placed in ${PWD}/rocprof-sys-causal-config folder
|
|
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
|
|
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
|
|
(ROCPROFSYS_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
|
|
Activation of OpenMP tools support is similar
|
|
|
|
[CAUSAL PROFILING OPTIONS (General)]
|
|
(These settings will be applied to all causal profiling runs)
|
|
|
|
-m, --mode [ function (func) | line ]
|
|
Causal profiling mode
|
|
-o, --output-name Output filename of causal profiling data w/o extension
|
|
-r, --reset Overwrite any existing experiment results during the first run
|
|
-e, --end-to-end Single causal experiment for the entire application runtime
|
|
-w, --wait Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
|
|
-d, --duration Set the length of time (in seconds) to perform causal experimentationafter the first experiment is started. Once this
|
|
amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
|
|
allowed to finish.
|
|
-n, --iterations Number of times to repeat the combination of run configurations
|
|
|
|
[CAUSAL PROFILING OPTIONS (Combinatorial)]
|
|
(Each individual argument to these options will multiply the number runs by the number of arguments and the number of
|
|
iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
|
|
(MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))
|
|
|
|
-s, --speedups Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
|
|
be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is \'0\' and group #2 is \'0 10 20 25 30 35 40
|
|
45 50\'
|
|
-B, --binary-scope Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
|
|
and multiple scopes can be grouped together with a semi-colon
|
|
-S, --source-scope Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
|
|
the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
|
|
semi-colon
|
|
-F, --function-scope Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
|
|
and multiple scopes can be grouped together with a semi-colon
|
|
-BE, --binary-exclude Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
|
|
designates a group and multiple excludes can be grouped together with a semi-colon
|
|
-SE, --source-exclude Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
|
|
<file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
|
|
can be grouped together with a semi-colon
|
|
-FE, --function-exclude Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
|
|
designates a group and multiple excludes can be grouped together with a semi-colon
|
|
|
|
Examples
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
.. code-block:: shell
|
|
|
|
#!/bin/bash -e
|
|
|
|
module load rocprofiler-systems
|
|
|
|
N=20
|
|
I=3
|
|
|
|
# when providing speedups to rocprof-sys-causal, speedup
|
|
# groups are separated by a space so "0,10" results in
|
|
# one speedup group where rocprof-sys samples from
|
|
# the speedup set of {0, 10}. Passing "0 10" (without
|
|
# quotes to rocprof-sys-causal multiplies the
|
|
# number of runs by 2, where the first half of the
|
|
# runs instruct rocprof-sys to only use 0 as the
|
|
# speedup and the second half of the runs instruct
|
|
# rocprof-sys to only use 10 as the speedup.
|
|
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
|
|
# thus, -s ${SPEEDUPS} only multiplies the number
|
|
# of runs by 1 whereas -S ${SPEEDUPS_E2E} multiplies
|
|
# the number of runs by 15:
|
|
# - 3 runs with speedup of 0
|
|
# - 1 run for each of the speedups 10, 20, 30, and 40
|
|
# - 2 runs with speedup of 50
|
|
# - 3 runs with speedup of 75
|
|
# - 3 runs with speedup of 90
|
|
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed \'s/,/ /g\')
|
|
|
|
|
|
# 20 iterations in function mode with 1 speedup group
|
|
# and source scope set to .cpp files
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.func.coz
|
|
# - causal/experiments.func.json
|
|
#
|
|
# total executions: 20
|
|
#
|
|
rocprof-sys-causal \
|
|
-n ${N} \
|
|
-s ${SPEEDUPS} \
|
|
-m function \
|
|
-o experiments.func \
|
|
-S ".*\\.cpp" \
|
|
-- \
|
|
./causal-rocprofsys-cpu "${@}"
|
|
|
|
|
|
# 20 iterations in line mode with 1 speedup group
|
|
# and source scope restricted to lines 100 and 110
|
|
# in the causal.cpp file.
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.line.coz
|
|
# - causal/experiments.line.json
|
|
#
|
|
# total executions: 20
|
|
#
|
|
rocprof-sys-causal \
|
|
-n ${N} \
|
|
-s ${SPEEDUPS} \
|
|
-m line \
|
|
-o experiments.line \
|
|
-S "causal\\.cpp:(100|110)" \
|
|
-- \
|
|
./causal-rocprofsys-cpu "${@}"
|
|
|
|
|
|
# 3 iterations in function mode of 15 singular speedups
|
|
# in end-to-end mode with 2 different function scopes
|
|
# where one is restricted to "cpu_slow_func" and
|
|
# another is restricted to "cpu_fast_func".
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.func.e2e.coz
|
|
# - causal/experiments.func.e2e.json
|
|
#
|
|
# total executions: 90
|
|
#
|
|
rocprof-sys-causal \
|
|
-n ${I} \
|
|
-s ${SPEEDUPS_E2E} \
|
|
-m func \
|
|
-e \
|
|
-o experiments.func.e2e \
|
|
-F "cpu_slow_func" \
|
|
"cpu_fast_func" \
|
|
-- \
|
|
./causal-rocprofsys-cpu "${@}"
|
|
|
|
# 3 iterations in line mode of 15 singular speedups
|
|
# in end-to-end mode with 2 different source scopes
|
|
# where one is restricted to line 100 in causal.cpp
|
|
# and another is restricted to line 110 in causal.cpp.
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.line.e2e.coz
|
|
# - causal/experiments.line.e2e.json
|
|
#
|
|
# total executions: 90
|
|
#
|
|
rocprof-sys-causal \
|
|
-n ${I} \
|
|
-s ${SPEEDUPS_E2E} \
|
|
-m line \
|
|
-e \
|
|
-o experiments.line.e2e \
|
|
-S "causal\\.cpp:100" \
|
|
"causal\\.cpp:110" \
|
|
-- \
|
|
./causal-rocprofsys-cpu "${@}"
|
|
|
|
|
|
export OMP_NUM_THREADS=8
|
|
export OMP_PROC_BIND=spread
|
|
export OMP_PLACES=threads
|
|
|
|
# set number of iterations to 5
|
|
N=5
|
|
|
|
# 5 iterations in function mode of 1 speedup
|
|
# group with the source scope restricted
|
|
# to files containing "lulesh" in their filename
|
|
# and exclude functions which start with "Kokkos::"
|
|
# or "std::enable_if".
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.func.coz
|
|
# - causal/experiments.func.json
|
|
#
|
|
# total executions: 5
|
|
#
|
|
# First of 5 executions overwrites any
|
|
# existing causal/experiments.func.(coz|json)
|
|
# file due to "--reset" argument
|
|
#
|
|
rocprof-sys-causal \
|
|
--reset \
|
|
-n ${N} \
|
|
-s ${SPEEDUPS} \
|
|
-m func \
|
|
-o experiments.func \
|
|
-S "lulesh.*" \
|
|
-FE "^(Kokkos::|std::enable_if)" \
|
|
-- \
|
|
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
|
|
|
|
|
# 5 iterations in line mode of 1 speedup
|
|
# group with the source scope restricted
|
|
# to files containing "lulesh" in their filename
|
|
# and exclude functions which start with "exec_range"
|
|
# or "execute" and which contain either
|
|
# "construct_shared_allocation" or "._omp_fn." in
|
|
# the function name.
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.line.coz
|
|
# - causal/experiments.line.json
|
|
#
|
|
# total executions: 5
|
|
#
|
|
# First of 5 executions overwrites any
|
|
# existing causal/experiments.line.(coz|json)
|
|
# file due to "--reset" argument
|
|
#
|
|
rocprof-sys-causal \
|
|
--reset \
|
|
-n ${N} \
|
|
-s ${SPEEDUPS} \
|
|
-m line \
|
|
-o experiments.line \
|
|
-S "lulesh.*" \
|
|
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
|
|
-- \
|
|
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
|
|
|
|
|
# 5 iterations in line mode of 1 speedup
|
|
# group with the source scope restricted
|
|
# to files whose basename is "lulesh.cc"
|
|
# for 3 different functions:
|
|
# - ApplyMaterialPropertiesForElems
|
|
# - CalcHourglassControlForElems
|
|
# - CalcVolumeForceForElems
|
|
#
|
|
# outputs to files:
|
|
# - causal/experiments.line.targeted.coz
|
|
# - causal/experiments.line.targeted.json
|
|
#
|
|
# total executions: 15
|
|
#
|
|
# First of 5 executions overwrites any
|
|
# existing causal/experiments.line.(coz|json)
|
|
# file due to "--reset" argument
|
|
#
|
|
rocprof-sys-causal \
|
|
--reset \
|
|
-n ${N} \
|
|
-s ${SPEEDUPS} \
|
|
-m line \
|
|
-o experiments.line.targeted \
|
|
-F "ApplyMaterialPropertiesForElems" \
|
|
"CalcHourglassControlForElems" \
|
|
"CalcVolumeForceForElems" \
|
|
-S "lulesh\\.cc" \
|
|
-- \
|
|
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
|
|
|
Using rocprof-sys-causal with other launchers like mpirun
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
The ``rocprof-sys-causal`` executable is intended to assist with application replay
|
|
and is designed to always be at the start of the command line as the primary process.
|
|
``rocprof-sys-causal`` typically adds a ``LD_PRELOAD`` of the ROCm Systems Profiler libraries
|
|
into the environment before launching the command to inject the functionality
|
|
required to start the causal profiling tooling. However, this is problematic
|
|
when the target application for causal profiling uses a launcher, in which case
|
|
it is listed as an argument rather than as the main application. For example,
|
|
``foo`` is the target application for profiling, but the command to run it is
|
|
``mpirun -n 2 foo``. Running the command ``rocprof-sys-causal -- mpirun -n 2 foo``
|
|
applies the causal profiling to ``mpirun`` instead of ``foo``.
|
|
|
|
``rocprof-sys-causal`` remedies this by providing a command-line option ``-l` / `--launcher``
|
|
to indicate the target application is using a launcher script/executable. The
|
|
argument to the command-line option is the name of, or regular expression for, the target application
|
|
on the command line. When ``--launcher`` is used, ``rocprof-sys-causal`` generates
|
|
all the replay configurations and runs them but delays adding the ``LD_PRELOAD``. Instead it
|
|
inserts a call to itself into the command line right before the target
|
|
application. This recursive call inherits the configuration from
|
|
the parent ``rocprof-sys-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
|
|
and calls ``execv`` to replace itself with the new process launched by the target
|
|
application.
|
|
|
|
In other words, the following command:
|
|
|
|
.. code-block:: shell
|
|
|
|
rocprof-sys-causal -l foo -n 3 -- mpirun -n 2 foo`
|
|
|
|
Effectively results in:
|
|
|
|
.. code-block:: shell
|
|
|
|
mpirun -n 2 rocprof-sys-causal -- foo
|
|
mpirun -n 2 rocprof-sys-causal -- foo
|
|
mpirun -n 2 rocprof-sys-causal -- foo
|
|
|
|
Visualizing the causal output
|
|
-------------------------------------------------------------------------
|
|
|
|
ROCm Systems Profiler generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
|
|
``${ROCPROFSYS_OUTPUT_PATH}/${ROCPROFSYS_OUTPUT_PREFIX}``. Visit
|
|
`plasma-umass.org/coz <https://plasma-umass.org/coz/>`_ to open the ``*.coz`` file.
|
|
|
|
ROCm Systems Profiler versus Coz
|
|
=======================================
|
|
|
|
This comparison is intended for readers who are familiar with the
|
|
`Coz profiler <https://github.com/plasma-umass/coz>`_.
|
|
ROCm Systems Profiler provides several additional features and utilities for causal profiling:
|
|
|
|
.. csv-table::
|
|
:header: "Feature", "Coz", "ROCm Systems Profiler", "Notes"
|
|
:widths: 20, 60, 60, 30
|
|
|
|
"Debug info", "requires debug info in DWARF v3 format (``-gdwarf-3``)", "optional, supports any DWARF format version", "See Note #1 below"
|
|
"Experiment selection", "``<file>:<line>``", "``<function>`` or ``<file>:<line>``", "See Note #2 below"
|
|
"Experiment speed-ups", "Randomly samples b/t 0..100 in increments of 5 or one fixed speed-up", "Supports specifying smaller subset", "See Note #3 below"
|
|
"Scope options", "Supports binary and source scopes", "Supports binary, source, and function scopes", "See Note #4, #5, and #6 below"
|
|
"Scope inclusion", "Uses ``%`` as a wildcard for binary and source scopes", "Full regex support for binary, source, and function scopes", ""
|
|
"Scope exclusion", "Not supported", "Supports regexes for excluding binary/source/function", "See Note #7 below"
|
|
"Call-stack sampling", "Linux Perf", "Linux Perf, libunwind", "See Note #8 below"
|
|
|
|
.. note::
|
|
|
|
#. ROCm Systems Profiler supports a "function" mode which does not require debug info.
|
|
#. ROCm Systems Profiler supports selecting an entire range of instruction pointers for a function instead
|
|
of an instruction pointer for one line. In large code bases, "function" mode
|
|
can resolve in fewer iterations. After a target function is identified, you can
|
|
switch to line mode and limit the function scope to the target function.
|
|
#. ROCm Systems Profiler supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }
|
|
where 0% is randomly selected 50% of time and 5% and 10% are randomly selected 25% of the time.
|
|
#. ROCm Systems Profiler and COZ have the same definition for binary scope, which is the binaries
|
|
loaded at runtime (the executable and linked libraries).
|
|
#. ROCm Systems Profiler "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
|
|
in contrast to the COZ "source scope" which requires ``<file>:<line>`` format.
|
|
#. ROCm Systems Profiler supports a "function" scope which narrows the function and lines
|
|
which are eligible for causal experiments to those within the matching functions.
|
|
#. ROCm Systems Profiler supports a second filter on scopes for removing binary/source/function
|
|
caught by an inclusive match. For example ``BINARY_SCOPE=.*`` and ``BINARY_EXCLUDE=libmpi.*``
|
|
initially includes all binaries but exclude regex removes MPI libraries.
|
|
#. In ROCm Systems Profiler, the Linux Perf backend is preferred over use libunwind. However,
|
|
Linux Perf usage can be restricted for security reasons.
|
|
ROCm Systems Profiler falls back to using a second POSIX timer and libunwind if
|
|
Linux Perf is not available.
|