Files
rocm-systems/docs/tutorial/includes/lds-examples.rst
T
Peter Park a0dc485ceb Docs: refactor and integrate into ROCm docs portal (#362)
* pip-compile docs/requirements.txt

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add Sphinx docs config

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add Sphinx config

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update docs build config

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* style(conf.py): Apply black formatting to docs/conf.py

Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

* Update docs requirements

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update to rocm-docs-core 1.3.0

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update docs requirements

Signed-off-by: Peter Jun Park <peter.park@amd.com>

pip-compile requirements

Signed-off-by: Peter Jun Park <peter.park@amd.com>

bump rocm-docs-core to 1.5.0

bump rocm-docs-core to 1.4.1

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* Add dependabot.yml and update CODEOWNERS

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update toc and conf

Signed-off-by: Peter Jun Park <peter.park@amd.com>

update dependabot

* Port docs to rocm-docs standard

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add toc and Diataxis cards

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add basic file structure

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add glossary

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add includes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add license.rst

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add compatible hw

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix spelling and license

Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up index

Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up installation guides

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add basic usage (quickstart)

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add ref to global options

update toc

Signed-off-by: Peter Jun Park <peter.park@amd.com>

modularize modes and global options

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add profile mode

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg and clean up

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add dynamic omniperf version number in installation guide

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add datatemplate

more reorg

Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg images

move profile mode

reorg

reorg

reorg more

fix formatting

fix headings

ref anchor mi2xx note

add extlinks

add extlinks

Signed-off-by: Peter Jun Park <peter.park@amd.com>

black format

fix formatting, anchors

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg

fix words and formatting

Signed-off-by: Peter Jun Park <peter.park@amd.com>

formatting

Signed-off-by: Peter Jun Park <peter.park@amd.com>

same

reorg

format

fix formatting

fix toc

Signed-off-by: Peter Jun Park <peter.park@amd.com>

format

* impr internal linking and fix sphinx warnings

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add spellcheck/linting from rocm-docs-core

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix rst directives

satisfy spellcheck

fix more spelling

rm unused files

fix spelling and update wordlist

* bump rocm-docs-core to 1.6.0

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add fixes from @skyreflectedinmirrors and @lpaoletti

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add references to toc

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add more fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add package manager install section

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add metadata and fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add fixes

bump to 1.6.1

more fixes

fix fmt in profiling examples

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add missing mem type table

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix formatting

fmt

* add custom css

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix css fs

* make images/figs click-to-expand

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add missed image

update

fix link

* update documentation link in README

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* formatting fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

more formatting

* fix heading

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* move archived docs

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* exclude archived docs from docs build

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* update archived docs workflow

Signed-off-by: Peter Jun Park <peter.park@amd.com>

move files

update archived docs workflow

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix version number

clean up workflow

workflow test

workflow test

another workflow test

* rm docs linting

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* Apply cmake-format suggested changes

Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

* Apply cmake-format

Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

---------

Signed-off-by: Peter Jun Park <peter.park@amd.com>
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
2024-08-09 09:46:42 -04:00

273 rader
11 KiB
ReStructuredText
Blame Historik

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
.. _lds-examples:
LDS examples
============
For this example, consider the
:dev-sample:`LDS sample <lds.hip>` distributed as a part of Omniperf. This
code contains two kernels to explore how both :doc:`LDS </conceptual/local-data-share>` bandwidth and
bank conflicts are calculated in Omniperf.
This example was compiled and run on an MI250 accelerator using ROCm
v5.6.0, and Omniperf v2.0.0.
.. code-block:: shell-session
$ hipcc -O3 lds.hip -o lds
Finally, we generate our ``omniperf profile`` as:
.. code-block:: shell-session
$ omniperf profile -n lds --no-roof -- ./lds
.. _lds-bandwidth:
LDS bandwidth
-------------
To explore our *theoretical LDS bandwidth* metric, we use a simple
kernel:
.. code-block:: cpp
constexpr unsigned max_threads = 256;
__global__ void load(int* out, int flag) {
__shared__ int array[max_threads];
int index = threadIdx.x;
// fake a store to the LDS array to avoid unwanted behavior
if (flag)
array[max_threads - index] = index;
__syncthreads();
int x = array[index];
if (x == int(-1234567))
out[threadIdx.x] = x;
}
Here we:
* Create an array of 256 integers in :doc:`LDS </conceptual/local-data-share>`
* Fake a write to the LDS using the ``flag`` variable (always set to zero on the
host) to avoid dead-code elimination
* Read a single integer per work-item from ``threadIdx.x`` of the LDS array
* If the integer is equal to a magic number (always false), write the value out
to global memory to again, avoid dead-code elimination
Finally, we launch this kernel repeatedly, varying the number of threads
in our workgroup:
.. code-block:: cpp
void bandwidth_demo(int N) {
for (int i = 1; i <= N; ++i)
load<<<1,i>>>(nullptr, 0);
hipDeviceSynchronize();
}
Next, lets analyze the first of our bandwidth kernel dispatches:
.. code-block:: shell
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤═══════════════════════╤════════╤════════╤════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪════════╪════════╪════════╪══════════════════╡
│ 12.2.1 │ Theoretical Bandwidth │ 256.00 │ 256.00 │ 256.00 │ Bytes per kernel │
╘═════════╧═══════════════════════╧════════╧════════╧════════╧══════════════════╛
Here we see that our Theoretical Bandwidth metric (**12.2.1**) is reporting
256 Bytes were loaded even though we launched a single work-item
workgroup, and thus only loaded a single integer from LDS. Why is this?
Recall our definition of this metric:
Indicates the maximum amount of bytes that could have been loaded
from/stored to/atomically updated in the LDS per
:ref:`normalization unit <normalization-units>`.
Here we see that this instruction *could* have loaded up to 256 bytes of
data (4 bytes for each work-item in the wavefront), and therefore this
is the expected value for this metric in Omniperf, hence why this metric
is named the “theoretical” bandwidth.
To further illustrate this point we plot the relationship of the
theoretical bandwidth metric (**12.2.1**) as compared to the effective (or
achieved) bandwidth of this kernel, varying the number of work-items
launched from 1 to 256:
.. figure:: ../data/profiling-by-example/ldsbandwidth.png
:align: center
:alt: Comparison of effective bandwidth versus the theoretical bandwidth
metric in Omniperf for our simple example.
:width: 800
Comparison of effective bandwidth versus the theoretical bandwidth
metric in Omniperf for our simple example.
Here we see that the theoretical bandwidth metric follows a step-function. It
increases only when another wavefront issues an LDS instruction for up to 256
bytes of data. Such increases are marked in the plot using dashed lines. In
contrast, the effective bandwidth increases linearly, by 4 bytes, with the
number of work-items in the kernel, N.
.. _lds-bank-conflicts:
Bank conflicts
--------------
Next we explore bank conflicts using a slight modification of our bandwidth
kernel:
.. code-block:: cpp
constexpr unsigned nbanks = 32;
__global__ void conflicts(int* out, int flag) {
constexpr unsigned nelements = nbanks * max_threads;
__shared__ int array[nelements];
// each thread reads from the same bank
int index = threadIdx.x * nbanks;
// fake a store to the LDS array to avoid unwanted behavior
if (flag)
array[max_threads - index] = index;
__syncthreads();
int x = array[index];
if (x == int(-1234567))
out[threadIdx.x] = x;
}
Here we:
* Allocate an :doc:`LDS </conceptual/local-data-share>` array of size
:math:`32*256*4{B}=32{KiB}`
* Fake a write to the LDS using the ``flag``
variable (always set to zero on the host) to avoid dead-code elimination
* Read a single integer per work-item from index
``threadIdx.x * nbanks`` of the LDS array
* If the integer is equal to a
magic number (always false), write the value out to global memory to,
again, avoid dead-code elimination.
On the host, we again repeatedly launch this kernel, varying the number
of work-items:
.. code-block:: cpp
void conflicts_demo(int N) {
for (int i = 1; i <= N; ++i)
conflicts<<<1,i>>>(nullptr, 0);
hipDeviceSynchronize();
}
Analyzing our first ``conflicts`` kernel (i.e., a single work-item), we
see:
.. code-block:: shell
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
│ 12.2.4 │ Index Accesses │ 2.00 │ 2.00 │ 2.00 │ Cycles per kernel │
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
│ 12.2.6 │ Bank Conflict │ 0.00 │ 0.00 │ 0.00 │ Cycles per kernel │
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
In our :ref:`previous example <lds-bank-conflicts>`, we showed how a load
from a single work-item is considered to have a theoretical bandwidth of
256B. Recall, the :doc:`LDS </conceptual/local-data-share>` can load up to :math:`128B` per
cycle (i.e, 32 banks x 4B / bank / cycle). Hence, we see that loading an 4B
integer spends two cycles accessing the LDS
(:math:`2\ {cycle} = (256B) / (128\ B/{cycle})`).
Looking at the next ``conflicts`` dispatch (i.e., two work-items) yields:
.. code-block:: shell
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
│ 12.2.4 │ Index Accesses │ 3.00 │ 3.00 │ 3.00 │ Cycles per kernel │
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
│ 12.2.6 │ Bank Conflict │ 1.00 │ 1.00 │ 1.00 │ Cycles per kernel │
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
Here we see a bank conflict! What happened?
Recall that the index for each thread was calculated as:
.. code-block:: cpp
int index = threadIdx.x * nbanks;
Or, precisely 32 elements, and each element is 4B wide (for a standard
integer). That is, each thread strides back to the same bank in the LDS,
such that each work-item we add to the dispatch results in another bank
conflict!
Recalling our discussion of bank conflicts in our
:doc:`LDS </conceptual/local-data-share>` description:
A bank conflict occurs when two (or more) work-items in a wavefront
want to read, write, or atomically update different addresses that
map to the same bank in the same cycle. In this case, the conflict
detection hardware will determined a new schedule such that the
access is split into multiple cycles with no conflicts in any
single cycle.
Here we see the conflict resolution hardware in action! Because we have
engineered our kernel to generate conflicts, we expect our bank conflict
metric to scale linearly with the number of work-items:
.. figure:: ../data/profiling-by-example/ldsconflicts.png
:align: center
:alt: Comparison of LDS conflict cycles versus access cycles for our simple
example.
:width: 800
Comparison of LDS conflict cycles versus access cycles for our simple
example.
Here we show the comparison of the Index Accesses (**12.2.4**), to the Bank
Conflicts (**12.2.6**) for the first 20 kernel invocations. We see that each grows
linearly, and there is a constant gap of 2 cycles between them (i.e., the first
access is never considered a conflict).
Finally, we can use these two metrics to derive the Bank Conflict Rate (**12.1.4**).
Since within an Index Access we have 32 banks that may need to be updated, we
use:
$$
Bank\ Conflict\ Rate = 100 * ((Bank\ Conflicts / 32) / (Index\ Accesses - Bank\ Conflicts))
$$
Plotting this, we see:
.. figure:: ../data/profiling-by-example/ldsconflictrate.png
:align: center
:alt: LDS bank conflict rate example
:width: 800
LDS Bank Conflict rate for our simple example.
The bank conflict rate linearly increases with the number of work-items
within a wavefront that are active, *approaching* 100%, but never quite
reaching it.