5d22d5ac8e
* pip-compile docs/requirements.txt
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add Sphinx docs config
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add Sphinx config
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update docs build config
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* style(conf.py): Apply black formatting to docs/conf.py
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
* Update docs requirements
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update to rocm-docs-core 1.3.0
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update docs requirements
Signed-off-by: Peter Jun Park <peter.park@amd.com>
pip-compile requirements
Signed-off-by: Peter Jun Park <peter.park@amd.com>
bump rocm-docs-core to 1.5.0
bump rocm-docs-core to 1.4.1
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* Add dependabot.yml and update CODEOWNERS
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update toc and conf
Signed-off-by: Peter Jun Park <peter.park@amd.com>
update dependabot
* Port docs to rocm-docs standard
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add toc and Diataxis cards
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add basic file structure
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add glossary
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add includes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add license.rst
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add compatible hw
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix spelling and license
Signed-off-by: Peter Jun Park <peter.park@amd.com>
clean up index
Signed-off-by: Peter Jun Park <peter.park@amd.com>
clean up installation guides
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add basic usage (quickstart)
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add ref to global options
update toc
Signed-off-by: Peter Jun Park <peter.park@amd.com>
modularize modes and global options
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add profile mode
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
reorg and clean up
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add dynamic omniperf version number in installation guide
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add datatemplate
more reorg
Signed-off-by: Peter Jun Park <peter.park@amd.com>
clean up
Signed-off-by: Peter Jun Park <peter.park@amd.com>
reorg images
move profile mode
reorg
reorg
reorg more
fix formatting
fix headings
ref anchor mi2xx note
add extlinks
add extlinks
Signed-off-by: Peter Jun Park <peter.park@amd.com>
black format
fix formatting, anchors
Signed-off-by: Peter Jun Park <peter.park@amd.com>
reorg
fix words and formatting
Signed-off-by: Peter Jun Park <peter.park@amd.com>
formatting
Signed-off-by: Peter Jun Park <peter.park@amd.com>
same
reorg
format
fix formatting
fix toc
Signed-off-by: Peter Jun Park <peter.park@amd.com>
format
* impr internal linking and fix sphinx warnings
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add spellcheck/linting from rocm-docs-core
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix rst directives
satisfy spellcheck
fix more spelling
rm unused files
fix spelling and update wordlist
* bump rocm-docs-core to 1.6.0
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add fixes from @skyreflectedinmirrors and @lpaoletti
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add references to toc
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add more fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add package manager install section
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add metadata and fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add fixes
bump to 1.6.1
more fixes
fix fmt in profiling examples
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add missing mem type table
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix formatting
fmt
* add custom css
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix css fs
* make images/figs click-to-expand
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add missed image
update
fix link
* update documentation link in README
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* formatting fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
more formatting
* fix heading
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* move archived docs
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* exclude archived docs from docs build
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* update archived docs workflow
Signed-off-by: Peter Jun Park <peter.park@amd.com>
move files
update archived docs workflow
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix version number
clean up workflow
workflow test
workflow test
another workflow test
* rm docs linting
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* Apply cmake-format suggested changes
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
* Apply cmake-format
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
---------
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
[ROCm/rocprofiler-compute commit: a0dc485ceb]
273 строки
11 KiB
ReStructuredText
273 строки
11 KiB
ReStructuredText
.. _lds-examples:
|
||
|
||
LDS examples
|
||
============
|
||
|
||
For this example, consider the
|
||
:dev-sample:`LDS sample <lds.hip>` distributed as a part of Omniperf. This
|
||
code contains two kernels to explore how both :doc:`LDS </conceptual/local-data-share>` bandwidth and
|
||
bank conflicts are calculated in Omniperf.
|
||
|
||
This example was compiled and run on an MI250 accelerator using ROCm
|
||
v5.6.0, and Omniperf v2.0.0.
|
||
|
||
.. code-block:: shell-session
|
||
|
||
$ hipcc -O3 lds.hip -o lds
|
||
|
||
Finally, we generate our ``omniperf profile`` as:
|
||
|
||
.. code-block:: shell-session
|
||
|
||
$ omniperf profile -n lds --no-roof -- ./lds
|
||
|
||
.. _lds-bandwidth:
|
||
|
||
LDS bandwidth
|
||
-------------
|
||
|
||
To explore our *theoretical LDS bandwidth* metric, we use a simple
|
||
kernel:
|
||
|
||
.. code-block:: cpp
|
||
|
||
constexpr unsigned max_threads = 256;
|
||
__global__ void load(int* out, int flag) {
|
||
__shared__ int array[max_threads];
|
||
int index = threadIdx.x;
|
||
// fake a store to the LDS array to avoid unwanted behavior
|
||
if (flag)
|
||
array[max_threads - index] = index;
|
||
__syncthreads();
|
||
int x = array[index];
|
||
if (x == int(-1234567))
|
||
out[threadIdx.x] = x;
|
||
}
|
||
|
||
Here we:
|
||
|
||
* Create an array of 256 integers in :doc:`LDS </conceptual/local-data-share>`
|
||
|
||
* Fake a write to the LDS using the ``flag`` variable (always set to zero on the
|
||
host) to avoid dead-code elimination
|
||
|
||
* Read a single integer per work-item from ``threadIdx.x`` of the LDS array
|
||
|
||
* If the integer is equal to a magic number (always false), write the value out
|
||
to global memory to again, avoid dead-code elimination
|
||
|
||
Finally, we launch this kernel repeatedly, varying the number of threads
|
||
in our workgroup:
|
||
|
||
.. code-block:: cpp
|
||
|
||
void bandwidth_demo(int N) {
|
||
for (int i = 1; i <= N; ++i)
|
||
load<<<1,i>>>(nullptr, 0);
|
||
hipDeviceSynchronize();
|
||
}
|
||
|
||
Next, let’s analyze the first of our bandwidth kernel dispatches:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
|
||
<...>
|
||
12. Local Data Share (LDS)
|
||
12.2 LDS Stats
|
||
╒═════════╤═══════════════════════╤════════╤════════╤════════╤══════════════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪═══════════════════════╪════════╪════════╪════════╪══════════════════╡
|
||
│ 12.2.1 │ Theoretical Bandwidth │ 256.00 │ 256.00 │ 256.00 │ Bytes per kernel │
|
||
╘═════════╧═══════════════════════╧════════╧════════╧════════╧══════════════════╛
|
||
|
||
Here we see that our Theoretical Bandwidth metric (**12.2.1**) is reporting
|
||
256 Bytes were loaded even though we launched a single work-item
|
||
workgroup, and thus only loaded a single integer from LDS. Why is this?
|
||
|
||
Recall our definition of this metric:
|
||
|
||
Indicates the maximum amount of bytes that could have been loaded
|
||
from/stored to/atomically updated in the LDS per
|
||
:ref:`normalization unit <normalization-units>`.
|
||
|
||
Here we see that this instruction *could* have loaded up to 256 bytes of
|
||
data (4 bytes for each work-item in the wavefront), and therefore this
|
||
is the expected value for this metric in Omniperf, hence why this metric
|
||
is named the “theoretical” bandwidth.
|
||
|
||
To further illustrate this point we plot the relationship of the
|
||
theoretical bandwidth metric (**12.2.1**) as compared to the effective (or
|
||
achieved) bandwidth of this kernel, varying the number of work-items
|
||
launched from 1 to 256:
|
||
|
||
.. figure:: ../data/profiling-by-example/ldsbandwidth.png
|
||
:align: center
|
||
:alt: Comparison of effective bandwidth versus the theoretical bandwidth
|
||
metric in Omniperf for our simple example.
|
||
:width: 800
|
||
|
||
Comparison of effective bandwidth versus the theoretical bandwidth
|
||
metric in Omniperf for our simple example.
|
||
|
||
Here we see that the theoretical bandwidth metric follows a step-function. It
|
||
increases only when another wavefront issues an LDS instruction for up to 256
|
||
bytes of data. Such increases are marked in the plot using dashed lines. In
|
||
contrast, the effective bandwidth increases linearly, by 4 bytes, with the
|
||
number of work-items in the kernel, N.
|
||
|
||
.. _lds-bank-conflicts:
|
||
|
||
Bank conflicts
|
||
--------------
|
||
|
||
Next we explore bank conflicts using a slight modification of our bandwidth
|
||
kernel:
|
||
|
||
.. code-block:: cpp
|
||
|
||
constexpr unsigned nbanks = 32;
|
||
__global__ void conflicts(int* out, int flag) {
|
||
constexpr unsigned nelements = nbanks * max_threads;
|
||
__shared__ int array[nelements];
|
||
// each thread reads from the same bank
|
||
int index = threadIdx.x * nbanks;
|
||
// fake a store to the LDS array to avoid unwanted behavior
|
||
if (flag)
|
||
array[max_threads - index] = index;
|
||
__syncthreads();
|
||
int x = array[index];
|
||
if (x == int(-1234567))
|
||
out[threadIdx.x] = x;
|
||
}
|
||
|
||
Here we:
|
||
|
||
* Allocate an :doc:`LDS </conceptual/local-data-share>` array of size
|
||
:math:`32*256*4{B}=32{KiB}`
|
||
|
||
* Fake a write to the LDS using the ``flag``
|
||
variable (always set to zero on the host) to avoid dead-code elimination
|
||
|
||
* Read a single integer per work-item from index
|
||
``threadIdx.x * nbanks`` of the LDS array
|
||
|
||
* If the integer is equal to a
|
||
magic number (always false), write the value out to global memory to,
|
||
again, avoid dead-code elimination.
|
||
|
||
On the host, we again repeatedly launch this kernel, varying the number
|
||
of work-items:
|
||
|
||
.. code-block:: cpp
|
||
|
||
void conflicts_demo(int N) {
|
||
for (int i = 1; i <= N; ++i)
|
||
conflicts<<<1,i>>>(nullptr, 0);
|
||
hipDeviceSynchronize();
|
||
}
|
||
|
||
Analyzing our first ``conflicts`` kernel (i.e., a single work-item), we
|
||
see:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
|
||
<...>
|
||
--------------------------------------------------------------------------------
|
||
12. Local Data Share (LDS)
|
||
12.2 LDS Stats
|
||
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
|
||
│ 12.2.4 │ Index Accesses │ 2.00 │ 2.00 │ 2.00 │ Cycles per kernel │
|
||
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
|
||
│ 12.2.6 │ Bank Conflict │ 0.00 │ 0.00 │ 0.00 │ Cycles per kernel │
|
||
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
|
||
|
||
In our :ref:`previous example <lds-bank-conflicts>`, we showed how a load
|
||
from a single work-item is considered to have a theoretical bandwidth of
|
||
256B. Recall, the :doc:`LDS </conceptual/local-data-share>` can load up to :math:`128B` per
|
||
cycle (i.e, 32 banks x 4B / bank / cycle). Hence, we see that loading an 4B
|
||
integer spends two cycles accessing the LDS
|
||
(:math:`2\ {cycle} = (256B) / (128\ B/{cycle})`).
|
||
|
||
Looking at the next ``conflicts`` dispatch (i.e., two work-items) yields:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
|
||
<...>
|
||
--------------------------------------------------------------------------------
|
||
12. Local Data Share (LDS)
|
||
12.2 LDS Stats
|
||
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
|
||
│ 12.2.4 │ Index Accesses │ 3.00 │ 3.00 │ 3.00 │ Cycles per kernel │
|
||
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
|
||
│ 12.2.6 │ Bank Conflict │ 1.00 │ 1.00 │ 1.00 │ Cycles per kernel │
|
||
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
|
||
|
||
Here we see a bank conflict! What happened?
|
||
|
||
Recall that the index for each thread was calculated as:
|
||
|
||
.. code-block:: cpp
|
||
|
||
int index = threadIdx.x * nbanks;
|
||
|
||
Or, precisely 32 elements, and each element is 4B wide (for a standard
|
||
integer). That is, each thread strides back to the same bank in the LDS,
|
||
such that each work-item we add to the dispatch results in another bank
|
||
conflict!
|
||
|
||
Recalling our discussion of bank conflicts in our
|
||
:doc:`LDS </conceptual/local-data-share>` description:
|
||
|
||
A bank conflict occurs when two (or more) work-items in a wavefront
|
||
want to read, write, or atomically update different addresses that
|
||
map to the same bank in the same cycle. In this case, the conflict
|
||
detection hardware will determined a new schedule such that the
|
||
access is split into multiple cycles with no conflicts in any
|
||
single cycle.
|
||
|
||
Here we see the conflict resolution hardware in action! Because we have
|
||
engineered our kernel to generate conflicts, we expect our bank conflict
|
||
metric to scale linearly with the number of work-items:
|
||
|
||
.. figure:: ../data/profiling-by-example/ldsconflicts.png
|
||
:align: center
|
||
:alt: Comparison of LDS conflict cycles versus access cycles for our simple
|
||
example.
|
||
:width: 800
|
||
|
||
Comparison of LDS conflict cycles versus access cycles for our simple
|
||
example.
|
||
|
||
Here we show the comparison of the Index Accesses (**12.2.4**), to the Bank
|
||
Conflicts (**12.2.6**) for the first 20 kernel invocations. We see that each grows
|
||
linearly, and there is a constant gap of 2 cycles between them (i.e., the first
|
||
access is never considered a conflict).
|
||
|
||
Finally, we can use these two metrics to derive the Bank Conflict Rate (**12.1.4**).
|
||
Since within an Index Access we have 32 banks that may need to be updated, we
|
||
use:
|
||
|
||
$$
|
||
Bank\ Conflict\ Rate = 100 * ((Bank\ Conflicts / 32) / (Index\ Accesses - Bank\ Conflicts))
|
||
$$
|
||
|
||
Plotting this, we see:
|
||
|
||
.. figure:: ../data/profiling-by-example/ldsconflictrate.png
|
||
:align: center
|
||
:alt: LDS bank conflict rate example
|
||
:width: 800
|
||
|
||
LDS Bank Conflict rate for our simple example.
|
||
|
||
The bank conflict rate linearly increases with the number of work-items
|
||
within a wavefront that are active, *approaching* 100%, but never quite
|
||
reaching it.
|