diff --git a/projects/rocprofiler-systems/CHANGELOG.md b/projects/rocprofiler-systems/CHANGELOG.md
index 2fabba4301..7cdbb61f5b 100644
--- a/projects/rocprofiler-systems/CHANGELOG.md
+++ b/projects/rocprofiler-systems/CHANGELOG.md
@@ -8,6 +8,8 @@ Full documentation for ROCm Systems Profiler is available at [https://rocm.docs.
 
 ### Added
 
+- Profiling and metric collection capabilities for XGMI and PCIe data.
+- How-to document for XGMI and PCIe sampling and monitoring.
 - Added a `ROCPROFSYS_PERFETTO_FLUSH_PERIOD_MS` configuration setting to set the flush period for Perfetto traces. The default value is 10000 ms (10 seconds).
 - Added fetching of the `rocpd` schema from rocprofiler-sdk-rocpd
 
diff --git a/projects/rocprofiler-systems/README.md b/projects/rocprofiler-systems/README.md
index 29cd6a4eee..38cfa564c1 100755
--- a/projects/rocprofiler-systems/README.md
+++ b/projects/rocprofiler-systems/README.md
@@ -64,9 +64,11 @@ The documentation source files reside in the [`/docs`](/docs) folder of this rep
   - Utilization
   - VCN Utilization
   - JPEG Utilization
+  - XGMI interconnect metrics (link width, link speed, read/write data)
+  - PCIe metrics (link width, link speed, bandwidth)
 
 > [!NOTE]
-> The availability of VCN and JPEG engine utilization depends on device support for different ASICs. If unsupported, all values for VCN_ACTIVITY and JPEG_ACTIVITY will be reported as N/A in the output of `amd-smi metric --usage`.
+> The availability of VCN, JPEG, XGMI, and PCIe metrics depends on device support, system topology, and GPU architecture. If unsupported, all values will be reported as N/A in the output of `amd-smi metric --usage`.
 
 ### CPU metrics
 
diff --git a/projects/rocprofiler-systems/docs/conceptual/rocprof-sys-feature-set.rst b/projects/rocprofiler-systems/docs/conceptual/rocprof-sys-feature-set.rst
index 6f7b2247bf..6601d89d56 100644
--- a/projects/rocprofiler-systems/docs/conceptual/rocprof-sys-feature-set.rst
+++ b/projects/rocprofiler-systems/docs/conceptual/rocprof-sys-feature-set.rst
@@ -62,7 +62,12 @@ GPU metrics
   * Utilization
   * VCN activity
   * JPEG activity
-    Note: The availability of VCN and JPEG engine activity depends on device support for different ASICs. If unsupported, all values for VCN_ACTIVITY and JPEG_ACTIVITY will be reported as N/A in the output of amd-smi metric--usage.
+  * XGMI interconnect metrics (link width, link speed, read/write data)
+  * PCIe metrics (link width, link speed, bandwidth)
+
+  .. note::
+
+     The availability of VCN, JPEG, XGMI, and PCIe metrics depends on device support and system topology. If unsupported, values will be reported as ``N/A`` in the output of ``amd-smi metric --usage``.
 
 CPU metrics
 ========================================
diff --git a/projects/rocprofiler-systems/docs/data/rocprof-sys-pcie.png b/projects/rocprofiler-systems/docs/data/rocprof-sys-pcie.png
new file mode 100644
index 0000000000..37094b601f
Binary files /dev/null and b/projects/rocprofiler-systems/docs/data/rocprof-sys-pcie.png differ
diff --git a/projects/rocprofiler-systems/docs/data/rocprof-sys-xgmi.png b/projects/rocprofiler-systems/docs/data/rocprof-sys-xgmi.png
new file mode 100644
index 0000000000..628eee43ab
Binary files /dev/null and b/projects/rocprofiler-systems/docs/data/rocprof-sys-xgmi.png differ
diff --git a/projects/rocprofiler-systems/docs/how-to/configuring-runtime-options.rst b/projects/rocprofiler-systems/docs/how-to/configuring-runtime-options.rst
index 89a0017671..8b5f53f8a7 100644
--- a/projects/rocprofiler-systems/docs/how-to/configuring-runtime-options.rst
+++ b/projects/rocprofiler-systems/docs/how-to/configuring-runtime-options.rst
@@ -252,7 +252,7 @@ For example, the following is a valid configuration:
 
    ROCPROFSYS_AMD_SMI_METRICS=busy,temp,power,vcn_activity,mem_usage
 
-Supported values for ``ROCPROFSYS_AMD_SMI_METRICS`` are: ``busy``, ``temp``, ``power``, ``vcn_activity``, ``mem_usage``, ``jpeg_activity``.
+Supported values for ``ROCPROFSYS_AMD_SMI_METRICS`` are: ``busy``, ``temp``, ``power``, ``vcn_activity``, ``mem_usage``, ``jpeg_activity``, ``xgmi``, ``pcie``.
 
 API tracing is configured with the ``ROCPROFSYS_ROCM_DOMAINS`` setting. The domains are used to filter the events that are captured during profiling. Supported values for this setting are those supported by ROCprofiler-SDK, which are returned by the API ``get_callback_tracing_names()`` and ``get_buffer_tracing_names()``. See the `ROCprofiler-SDK developer API documentation `_ to learn more about ROCprofiler-SDK APIs.
 
diff --git a/projects/rocprofiler-systems/docs/how-to/xgmi-pcie-sampling.rst b/projects/rocprofiler-systems/docs/how-to/xgmi-pcie-sampling.rst
new file mode 100644
index 0000000000..2c0033488c
--- /dev/null
+++ b/projects/rocprofiler-systems/docs/how-to/xgmi-pcie-sampling.rst
@@ -0,0 +1,173 @@
+.. meta::
+   :description: ROCm Systems Profiler XGMI and PCIe metrics sampling and monitoring
+   :keywords: rocprof-sys, rocprofiler-systems, ROCm, tips, how to, profiler, tracking, XGMI, PCIe, GPU connectivity, AMD
+
+***********************************************
+XGMI and PCIe metrics sampling and monitoring
+***********************************************
+
+`ROCm Systems Profiler `_ supports
+sampling of XGMI and PCIe interconnect metrics. It allows you to gather key performance metrics for
+GPU-to-GPU communication over XGMI links and CPU-to-GPU communication over PCIe links. You can use this information
+to optimize multi-GPU workloads, identify communication bottlenecks, and analyze data transfer efficiency
+in high-performance computing applications.
+
+Sampling support
+=================
+
+XGMI and PCIe interconnect metrics are sampled through `AMD SMI `_, which provides the interface for GPU metric collection. Follow these steps:
+
+1. Set the ``ROCPROFSYS_USE_AMD_SMI`` environment variable to enable GPU metric collection:
+
+.. code-block:: shell
+
+   export ROCPROFSYS_USE_AMD_SMI=true
+
+2. Update the ``ROCPROFSYS_AMD_SMI_METRICS`` variable to collect the XGMI and PCIe metrics. The default value is:
+
+.. code-block:: shell
+
+   ROCPROFSYS_AMD_SMI_METRICS=busy,temp,power,mem_usage
+
+To include XGMI and PCIe metrics, update it to:
+
+.. code-block:: shell
+
+   ROCPROFSYS_AMD_SMI_METRICS=busy,temp,power,mem_usage,xgmi,pcie
+
+Alternatively, you can use the following to collect all available GPU metrics:
+
+.. code-block:: shell
+
+   ROCPROFSYS_AMD_SMI_METRICS=all
+
+XGMI metrics
+------------
+
+XGMI (AMD Infinity Fabric™ XGMI) provides high-bandwidth, low-latency GPU-to-GPU interconnects in multi-GPU systems. The following XGMI metrics are collected:
+
+- **XGMI Link Width**: The number of active XGMI links between GPUs
+- **XGMI Link Speed**: The speed of XGMI links (in GT/s)
+- **XGMI Read Data**: Accumulated data read through each XGMI link (in KB)
+- **XGMI Write Data**: Accumulated data written through each XGMI link (in KB)
+
+These metrics help identify GPU-to-GPU communication patterns and bandwidth utilization in multi-GPU workloads.
+
+.. note::
+
+   XGMI metrics are only available on systems with multiple GPUs connected via XGMI links.
+   Availability depends on the system topology and GPU architecture. If the metrics are unsupported or
+   unavailable, the values are reported as N/A in the output.
+
+PCIe metrics
+------------
+
+PCIe (PCI Express) provides the connection between the CPU and GPU. The following PCIe metrics are collected:
+
+- **PCIe Link Width**: The number of PCIe lanes currently active
+- **PCIe Link Speed**: The current PCIe link generation and speed (e.g., Gen3, Gen4, Gen5)
+- **PCIe Bandwidth Accumulated**: Total bandwidth accumulated over time (in MB)
+- **PCIe Bandwidth Instantaneous**: Instantaneous bandwidth at the time of sampling (in MB/s)
+
+These metrics help analyze CPU-to-GPU data transfer efficiency and identify PCIe bottlenecks.
+
+Using TransferBench for testing
+================================
+
+For testing and benchmarking GPU connectivity, you can use `TransferBench `_.
+TransferBench is a benchmarking utility designed to measure the performance of simultaneous data transfers between user-specified devices, such as CPUs and GPUs.
+In this example, TransferBench generates the XGMI and PCIe traffic that is profiled and analyzed.
+
+1. Source the ROCm Systems Profiler environment using:
+
+.. code-block:: shell
+
+   source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
+
+Alternatively, if you are using modules, use:
+
+.. code-block:: shell
+
+   module use /opt/rocprofiler-systems/share/modulefiles
+
+2. Generate and configure the profiler config file.
+
+.. code-block:: shell
+
+   rocprof-sys-avail -G $HOME/.rocprofsys.cfg -F txt
+   export ROCPROFSYS_CONFIG_FILE=$HOME/.rocprofsys.cfg
+
+Edit ``.rocprofsys.cfg`` with the following settings:
+
+.. code-block:: shell
+
+   ROCPROFSYS_USE_AMD_SMI = true
+   ROCPROFSYS_AMD_SMI_METRICS = busy,temp,power,mem_usage,xgmi,pcie
+   ROCPROFSYS_ROCM_DOMAINS = hip_runtime_api,memory_copy,hsa_api
+
+3. Profile the TransferBench application.
+
+.. code-block:: shell
+
+   rocprof-sys-sample -PTHD -- ./TransferBench a2a
+
+.. note::
+
+   To install and build TransferBench, refer to `Install and build TransferBench `_.
+
+At the end of the run, a message similar to the following appears::
+
+   [rocprofiler-systems][964294][perfetto]> Outputting '/home/demo/rocprofsys-transferBench-output/2025-04-25_15.52/perfetto-trace-964294.proto'
+   (3124.52 KB / 3.12 MB / 0.00 GB)... Done
+
+
+To view the generated ``.proto`` file in the browser, open the
+`Perfetto UI page `_. Then, click
+``Open trace file`` and select the ``.proto`` file. In the browser, you can visualize the XGMI and PCIe metrics.
+
+.. image:: ../data/rocprof-sys-xgmi.png
+   :alt: Visualization of a performance graph in Perfetto with XGMI tracks
+
+.. image:: ../data/rocprof-sys-pcie.png
+   :alt: Visualization of a performance graph in Perfetto with PCIe tracks
+
+The visualization shows:
+
+- **XGMI Read Data** and **XGMI Write Data** tracks showing data transfer through XGMI links over time
+- **XGMI Link Width** and **XGMI Link Speed** tracks showing link configuration
+- **PCIe Bandwidth** tracks showing CPU-to-GPU data transfer rates
+- **PCIe Link Width** and **PCIe Link Speed** tracks showing PCIe link configuration
+
+
+Tips for effective profiling
+=============================
+
+1. **Multi-GPU workloads**: XGMI metrics are most useful when profiling applications that use multiple GPUs and transfer data between them.
+
+2. **Sampling frequency**: Adjust the sampling frequency using ``ROCPROFSYS_PROCESS_SAMPLING_FREQ`` (default is 50 Hz) to capture more or fewer samples based on your analysis needs.
+
+3. **Focus on specific metrics**: If you only need XGMI or PCIe metrics, you can specify just those:
+
+   .. code-block:: shell
+
+      ROCPROFSYS_AMD_SMI_METRICS=xgmi  # Only XGMI metrics
+      ROCPROFSYS_AMD_SMI_METRICS=pcie  # Only PCIe metrics
+
+4. **Combine with API tracing**: For detailed analysis, combine XGMI/PCIe metrics with HIP/HSA API tracing to correlate data transfers with application behavior:
+
+   .. code-block:: shell
+
+      ROCPROFSYS_ROCM_DOMAINS=hip_runtime_api,memory_copy,kernel_dispatch,hsa_api
+
+Exploring available metrics
+============================
+
+To explore all supported metrics and domains, use the following commands:
+
+.. code-block:: shell
+
+   rocprof-sys-avail --all                    # Show all available options
+   rocprof-sys-avail -bd -r AMD_SMI_METRICS   # Show AMD SMI metrics
+   rocprof-sys-avail -bd -r ROCM_DOMAINS      # Show ROCm tracing domains
+
+For more details on ROCm Systems Profiler configuration, refer to the `configuration guide `_.
diff --git a/projects/rocprofiler-systems/docs/index.rst b/projects/rocprofiler-systems/docs/index.rst
index 3fe02eb55b..1e9e733fb2 100644
--- a/projects/rocprofiler-systems/docs/index.rst
+++ b/projects/rocprofiler-systems/docs/index.rst
@@ -41,6 +41,7 @@ profiling, how it supports performance analysis, and how to leverage its capabil
   * :doc:`Profiling Python scripts <./how-to/profiling-python-scripts>`
   * :doc:`Network performance profiling <./how-to/nic-profiling>`
   * :doc:`VCN and JPEG sampling and tracing <./how-to/vcn-jpeg-sampling>`
+  * :doc:`XGMI and PCIe metrics monitoring <./how-to/xgmi-pcie-sampling>`
   * :doc:`Understanding the output <./how-to/understanding-rocprof-sys-output>`
   * :doc:`Using the ROCm Systems Profiler API <./how-to/using-rocprof-sys-api>`
 
diff --git a/projects/rocprofiler-systems/docs/sphinx/_toc.yml.in b/projects/rocprofiler-systems/docs/sphinx/_toc.yml.in
index c754e2d5be..bf9a02d3ad 100644
--- a/projects/rocprofiler-systems/docs/sphinx/_toc.yml.in
+++ b/projects/rocprofiler-systems/docs/sphinx/_toc.yml.in
@@ -37,6 +37,8 @@ subtrees:
       title: Network performance profiling
     - file: how-to/vcn-jpeg-sampling.rst
       title: VCN and JPEG sampling and tracing
+    - file: how-to/xgmi-pcie-sampling.rst
+      title: XGMI and PCIe metrics monitoring
     - file: how-to/understanding-rocprof-sys-output.rst
       title: Understanding the output
     - file: how-to/using-rocprof-sys-api.rst
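
The new how-to in `xgmi-pcie-sampling.rst` describes **XGMI Read Data** and **XGMI Write Data** as accumulated values in KB, so a transfer rate has to be derived from the difference between consecutive samples. The following is a minimal sketch of that arithmetic only. It assumes a hypothetical two-column text input (`timestamp_seconds accumulated_kb`), which is not a format that rocprof-sys emits, and a hypothetical helper name `xgmi_rate`. It also assumes the counter only increases, except across a reset.

```shell
# Hypothetical helper: reads "timestamp_seconds accumulated_kb" pairs on stdin
# and prints "timestamp_seconds rate_kb_per_s" for each consecutive pair.
# The input format is an assumption for illustration, not rocprof-sys output.
xgmi_rate() {
    awk '
        NR > 1 {
            dt  = $1 - prev_t     # elapsed seconds since the previous sample
            dkb = $2 - prev_kb    # KB transferred during that interval
            # Skip samples with non-increasing time or a counter reset.
            if (dt > 0 && dkb >= 0) {
                printf "%.1f %.1f\n", $1, dkb / dt
            }
        }
        { prev_t = $1; prev_kb = $2 }
    '
}

# Three made-up samples taken 0.5 s apart.
printf '%s\n' '0.0 1000' '0.5 3000' '1.0 3500' | xgmi_rate
```

For the three made-up samples, the helper reports 4000.0 KB/s for the first interval and 1000.0 KB/s for the second. The same difference-over-time calculation would apply to any accumulated counter, which may be useful when comparing the accumulated and instantaneous PCIe bandwidth tracks.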