Add documentation describing ROCPROFSYS_USE_RCCLP (#110)

* Add documentation describing ROCPROFSYS_USE_RCCLP

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Update wordlist

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Update CHANGELOGS.md

---------

Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
This commit is contained in:
systems-assistant[bot]
2025-08-13 18:01:18 -04:00
committed by GitHub
parent 80b7e6baee
commit dd37d215fd
4 changed files with 19 additions and 0 deletions
@@ -30,6 +30,8 @@ ppc
proc
proto
Pthreads
RCCL
RCCLP
rocDecode
rocdecode
ROCprofiler
@@ -18,6 +18,7 @@ Full documentation for ROCm Systems Profiler is available at [https://rocm.docs.
- Replaced ROCm SMI backend with AMD SMI backend for collecting GPU metrics.
- ROCprofiler-SDK is now used to trace RCCL API and collect communication counters.
- Use the setting `ROCPROFSYS_USE_RCCLP = ON` to enable profiling and tracing of RCCL application data.
- Updated the Dyninst submodule to v13.0.
- Set the default value of `ROCPROFSYS_SAMPLING_CPUS` to `none`.
Binary file not shown.


@@ -225,6 +225,22 @@ and memory copy operations submitted. With the
``ROCPROFSYS_ROCM_GROUP_BY_QUEUE=ON`` setting, the trace will display HSA queues
to which these kernel and memory operations were submitted.

ROCPROFSYS_USE_RCCLP
^^^^^^^^^^^^^^^^^^^^

Use the setting ``ROCPROFSYS_USE_RCCLP = ON`` to enable profiling and tracing of
the ROCm Communication Collectives Library (RCCL, pronounced 'Rickle'). When this setting is enabled,
ROCm Systems Profiler traces the RCCL API calls and collects performance metrics related to collective operations.

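
As a minimal sketch, the setting could be placed in a ROCm Systems Profiler
configuration file. The file name ``rocprof-sys.cfg`` is hypothetical, and
loading it through the ``ROCPROFSYS_CONFIG_FILE`` environment variable is an
assumption not stated in this section; only the ``ROCPROFSYS_USE_RCCLP = ON``
line comes from the text above.

.. code-block:: shell

   # rocprof-sys.cfg (hypothetical file name)
   # Enable RCCL API tracing and collection of communication data
   ROCPROFSYS_USE_RCCLP = ON
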
The image below shows an example of a Perfetto trace with RCCL communication data and API tracing enabled:

.. image:: ../data/rccl-comm-recv.png
   :alt: Perfetto tracks with RCCL Communication Data and API tracing

.. note::

   A known issue causes the application to exit with an error when this setting is enabled.
   The trace data is still written to the output directory. This issue is being tracked internally.

Exploring GPU Metrics
---------------------