[rocprofiler-compute] Adding --torch-trace option for SWDEV-559789 (#2089)

* Adding --torch-operator option in rocprof-compute. Creates csv file for
each operator that has gpu activity, showing operator to counter values
mapping.

* --torch-operators flag added to rocprofiler-sdk

* Adding ctest for --torch-operators.

* Adding pytest markers.

* Corrections in ctest and message logging.

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Adding a check for pytorch installation only when --torch-operators is passed.

* moving inject_roctx.py into src/utils.

* rebase

* Updating docs and changelog.

* Update projects/rocprofiler-compute/src/argparser.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/rocprofiler-compute/src/utils/inject_roctx.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Removing special characters.

* Minor corrections.

* Setting default value for torch_operators_enabled.

* Updating the number of files according to the number of passes.

* Adding rocpd support.

* Adding a warning message to be shown when profiling a non-python workload.

* copilot suggestions, rocpd+native tool fix

* Fixed the incorrect usage of dispatch_id as event_id in the function update_rocpd_pmc_events()

* ruff format fix

* ruff formatting

* Deleting torch_trace.csvs after consolidating the operator data.

* Removing checks since *torch_trace.csv files are deleted.

* Fixing file deletion.

* Update projects/rocprofiler-compute/src/utils/inject_roctx.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/rocprofiler-compute/src/utils/utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/rocprofiler-compute/tests/test_profile_general.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Using default options in the testcase.

* Adding test for overhead measurement.

* Corrections in docs.

* doc updates.

* Update projects/rocprofiler-compute/src/utils/inject_roctx.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Handling potential empty frames.

* Corrected the test cases.

* Changing the flag to --torch-trace

* Fixed helper_app path issues

* Path issues

* process_torch_trace_output() now takes csv file paths as input + allows default usage.

* Replaced pandas with sqlite3

* Adding marker_trace extraction to rocpd_data.py

* Allowing all workloads to use --torch-trace option. Assuming the workload is user verified.

* Modified help section for the flag.

* Added the difference in runtimes for the longest-running kernels in each profiling run to overhead measurements.

* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Removed the accesses to the tables.

* Ruff fixes.

* ruff

* Ruff Fixes

* Adding getattr for args.torch_trace to handle mock args.

* Fix for 'Missing guid in counter collection data - in csv mode'

* Sending output_format to process_torch_trace_output

* Warning for self contained binaries.

* Ruff

* Ruff

* Measuring longest_running_kernel_baseline instead of worst_kernel_increase, because very small kernel runtimes inflate the worst_kernel_increase metric.

* Minor fixes in input arguments

* Ruff

* Logging PyTorch version

* Fix ruff formatting for PyTorch version logging

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This commit is contained in:
ggottipa-amd
2026-01-27 19:50:25 +05:30
committed by GitHub
Parent cac67a0f32
Commit 77f7541755
11 changed files with 1018 additions and 44 deletions
@@ -15,6 +15,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* Iteration multiplexing to collect counters in single application run
* Added `--torch-trace` option to enable mapping of PyTorch operators to collected counter values during profiling.
* Runtime compilation of Roofline benchmarking:
* GPU kernels from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository are moved into the ROCm Compute Profiler and are compiled at runtime using local HIP and HIPRTC Python wrappers.
* Roofline binaries compiled from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository have been removed from the project, as Roofline runtime compilation performs the same work as the Roofline binaries.
@@ -617,11 +617,11 @@ The following example demonstrates profiling roofline data only:
INFO Kernel Selection: None
INFO Dispatch Selection: None
INFO Filtered sections: ['4']
INFO
INFO
INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO Collecting Performance Counters (Roofline Only)
INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO
INFO
INFO [Run 1/3][Approximate profiling time left: pending first measurement...]
INFO [profiling] Current input file: /app/projects/rocprofiler-compute/workloads/occupancy/MI300X_A1/perfmon/pmc_perf_0.txt
...
@@ -659,6 +659,172 @@ plot.
:alt: Sample ROCm Compute Profiler roofline output
:width: 800
.. _torch-operator-mapping:
Torch Operator Mapping
========================
To analyze performance metrics at the PyTorch operator level, ROCm Compute Profiler
offers Torch Operator Mapping functionality. This feature maps performance counters
to specific PyTorch operators, enabling detailed performance analysis of
PyTorch workloads at the operator granularity.
When enabled, this feature instruments your PyTorch application to correlate GPU
kernel executions with their originating PyTorch operators, providing insights into
which operators contribute to specific performance counter values.
.. note::
**PyTorch Operators vs GPU Kernels**: PyTorch operators (such as ``conv2d``,
``linear``, ``relu``) are high-level API functions. When executed on GPU, these
operators may dispatch one or more low-level GPU kernels (such as
``implicit_convolve_sgemm``) that perform the actual computation on the hardware.
The ``--torch-trace`` feature provides operator-level attribution by injecting
markers that map collected kernel performance counters to their originating PyTorch
operators.
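The marker injection described in the note can be pictured with a small sketch. This is not the shipped ``inject_roctx.py``: ``range_push`` and ``range_pop`` are hypothetical stand-ins that only record events in a list so the example runs without ROCm, and ``fake_relu`` is an invented operator.

```python
from functools import wraps

# Hypothetical stand-ins for ROCTX range markers; they only record events.
events = []


def range_push(label):
    events.append(("push", label))


def range_pop():
    events.append(("pop", None))


def with_marker(func, name):
    """Wrap func so every call is bracketed by a push/pop marker pair."""
    calls = 0

    @wraps(func)
    def wrapper(*args, **kwargs):
        nonlocal calls
        calls += 1
        # Label carries the operator name and a per-operator call number.
        range_push(f"{name}:#{calls}")
        try:
            return func(*args, **kwargs)
        finally:
            range_pop()  # always close the range, even if func raises

    return wrapper


def fake_relu(values):
    # Invented operator standing in for a real PyTorch call.
    return [max(0.0, v) for v in values]


fake_relu = with_marker(fake_relu, "demo.relu")
result = fake_relu([-1.0, 2.0])
```

Any GPU work launched between the push and the pop can then be attributed to the label that was pushed, which is the idea behind operator-level attribution.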
Requirements
------------
* Valid PyTorch installation in the profiling environment
* PyTorch application must be run as a Python script or Python command
Usage
-----
To enable Torch operator mapping, use the ``--torch-trace`` option when profiling
a PyTorch workload:
.. code-block:: shell-session
$ rocprof-compute profile --name mnist_torch --torch-trace -- python train.py
__ _
_ __ ___ ___ _ __ _ __ ___ / _| ___ ___ _ __ ___ _ __ _ _| |_ ___
| '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
| | | (_) | (__| |_) | | | (_) | _|_____| (_| (_) | | | | | | |_) | |_| | || __/
|_| \___/ \___| .__/|_| \___/|_| \___\___/|_| |_| |_| .__/ \__,_|\__\___|
|_| |_|
rocprofiler-compute version: 3.4.0
Profiler choice: rocprofiler-sdk
Path: /home/auser/workloads/mnist_torch/MI300X_A1
Target: MI300X_A1
Command: python train.py
Torch Trace: Enabled
Kernel Selection: None
Dispatch Selection: None
Hardware Blocks: All
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Collecting Performance Counters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
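Going by the ``profiler_base.py`` hunk in this change, the command rewrite behind ``--torch-trace`` can be sketched roughly as follows. The function name ``add_injection_wrapper`` and its parameters are made up for illustration; only the two cases (explicit ``python`` command, directly executed script) come from the diff.

```python
import sys


def add_injection_wrapper(command, inject_script, interpreter=sys.executable):
    """Return a copy of command with the injection script placed before the workload."""
    command = list(command)
    if not command:
        return command
    first = command[0]
    if first.startswith("python"):
        # Explicit interpreter: python <inject_script> <original arguments...>
        command.insert(1, inject_script)
    elif first.endswith((".py", ".pyw", ".pyc", ".pyo")):
        # Directly executed script: prepend an interpreter and the wrapper
        command = [interpreter, inject_script] + command
    # Anything else is returned unchanged (the diff only emits warnings here)
    return command


rewritten = add_injection_wrapper(["python", "train.py"], "inject_roctx.py")
```

A prefix check of this kind does not match an absolute interpreter path such as ``/usr/bin/python3``, so that form would fall through to the unchanged branch in this sketch.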
Output
------
When Torch operator mapping is enabled, profiling generates additional output files
in the workload directory that correlate PyTorch operators with GPU kernels and
their performance counters:
``<workload_name>_torch_trace.csv``
Contains the merged operator-to-kernel mapping with performance counter data. These
are temporary files that are removed after consolidation into per-operator CSV files.
Key columns include:
* ``Function`` - PyTorch operator name (e.g., ``aten::conv2d``, ``aten::linear``)
* ``Kernel_Name`` - GPU kernel name dispatched by the operator
* ``Counter_Name`` / ``Counter_Value`` - Hardware performance counter measurements
* ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator execution time
* ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel execution time
* ``Correlation_Id`` - Links operator calls to their kernel dispatches
.. table:: SQC_ICACHE_INFLIGHT_LEVEL_torch_trace.csv from profiling mnist model.
:widths: 20 80
| Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp_function | End_Timestamp_function | GPU_ID | Dispatch_ID | PID | Grid_Size | Workgroup_Size | LDS_Per_Workgroup | Scratch_Per_Workitem | Arch_VGPR | Accum_VGPR | SGPR | Kernel_Name | Start_Timestamp_kernel | End_Timestamp_kernel | Kernel_ID | Counter_Name | Counter_Value |
|:----------------------|:--------------------------------|-------------:|------------:|-----------------:|---------------------------:|-------------------------:|---------:|--------------:|--------:|------------:|-----------------:|--------------------:|-----------------------:|------------:|-------------:|-------:|:------------------------|-------------------------:|-----------------------:|------------:|:--------------------------|----------------:|
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPC_CPC_STAT_STALL | 17946 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPC_CPC_TCIU_BUSY | 714 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPF_CPF_STAT_IDLE | 0 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPF_CPF_STAT_STALL | 78 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | GRBM_SPI_BUSY | 7277 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_REQ_NO_ALLOC_CSN | 8 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_RES_STALL_CSN | 0 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_SGPR_SIMD_FULL_CSN | 0 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_TGLIM_CU_FULL_CSN | 0 |
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_TMP_STALL_CSN | 0 |
``torch_trace/`` directory
Contains individual CSV files for each PyTorch operator detected during profiling.
Each file is named after the operator (e.g., ``nn_functional_conv2d.csv``,
``nn_functional_linear.csv``, ``relu.csv``) and contains all kernel executions and
performance counters for that specific operator. Columns include:
* ``Operator_Name`` - PyTorch operator name
* ``Context_Id`` - Source location where operator was called (e.g., ``conv2d:10@conv.py:543``)
* ``Counter_Name`` / ``Counter_Value`` - Hardware counter measurements
* ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator timing
* ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel timing
This per-operator organization enables focused analysis of specific operators without
processing the entire trace.
.. table:: torch_trace/ones_like.csv from profiling mnist model.
:widths: 20 80
| Operator_Name | Context_Id | Kernel_Name | Counter_Name | Counter_Value | Start_Timestamp_function | End_Timestamp_function | Start_Timestamp_kernel | End_Timestamp_kernel |
|:----------------|:------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------|----------------:|---------------------------:|-------------------------:|-------------------------:|-----------------------:|
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_STAT_BUSY | 23004 | 6789210204040073 | 6789210223815845 | 6789210223810274 | 6789210223811914 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_STAT_IDLE | 0 | 6789210204040073 | 6789210223815845 | 6789210223810274 | 6789210223811914 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_STAT_STALL | 6715 | 6789281060081123 | 6789281079930585 | 6789281079932564 | 6789281079934204 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_TCIU_BUSY | 534 | 6789281060081123 | 6789281079930585 | 6789281079932564 | 6789281079934204 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_TCIU_IDLE | 20569 | 6789352286866085 | 6789352306292985 | 6789352306292904 | 6789352306294424 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_UTCL2IU_BUSY | 358 | 6789352286866085 | 6789352306292985 | 6789352306292904 | 6789352306294424 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_UTCL2IU_IDLE | 20046 | 6789422289668823 | 6789422308914683 | 6789422308913883 | 6789422308915403 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_ME1_BUSY_FOR_PACKET_DECODE | 16331 | 6789422289668823 | 6789422308914683 | 6789422308913883 | 6789422308915403 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_ME1_DC0_SPI_BUSY | 455 | 6789492192490428 | 6789492210892375 | 6789492210897243 | 6789492210898883 |
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_UTCL1_STALL_ON_TRANSLATION | 374 | 6789492192490428 | 6789492210892375 | 6789492210897243 | 6789492210898883 |
``pmc_perf.csv``
Standard performance counter data (same as non-torch profiling).
This data enables analysis such as:
* Identifying which PyTorch operators executed which GPU kernels
* Aggregating performance counter values by operator
* Correlating operator-level timing with kernel-level hardware metrics
* Tracing the execution flow from high-level PyTorch API to low-level GPU kernels
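As one way to aggregate counter values by operator, the sketch below sums ``Counter_Value`` per ``(Operator_Name, Counter_Name)`` pair using only the standard library. The column names follow the per-operator CSV description above; the function name ``sum_counters_by_operator`` and the inline sample rows are invented for illustration, and whether a plain sum is meaningful depends on the counter in question.

```python
import csv
import io
from collections import defaultdict


def sum_counters_by_operator(csv_file):
    """Sum Counter_Value for each (Operator_Name, Counter_Name) pair."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        key = (row["Operator_Name"], row["Counter_Name"])
        totals[key] += float(row["Counter_Value"])
    return dict(totals)


# Invented sample rows; a real run would open a file under torch_trace/ instead.
sample = io.StringIO(
    "Operator_Name,Counter_Name,Counter_Value\n"
    "demo.op,COUNTER_A,10\n"
    "demo.op,COUNTER_A,5\n"
    "demo.op,COUNTER_B,7\n"
)
totals = sum_counters_by_operator(sample)
```

The same function accepts an open file handle, for example ``open(path, newline="")`` on one of the per-operator CSV files.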
Limitations
-----------
.. note::
* The ``--torch-trace`` option requires the application to be a Python command
or Python script.
* A valid PyTorch installation must be available in the environment where profiling
is executed.
* This feature adds instrumentation overhead to track operator boundaries. For
performance-critical measurements, consider profiling without this option first.
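To put a number on the instrumentation overhead, one option is to time the workload with and without ``--torch-trace`` and compare the two. The helper names below (``best_time``, ``relative_overhead_percent``) are hypothetical and not part of ROCm Compute Profiler; the sketch times an in-process placeholder, whereas a real comparison would time two full profiling runs.

```python
import time


def best_time(func, repeats=5):
    """Return the shortest wall-clock time in seconds over several calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def relative_overhead_percent(baseline_s, traced_s):
    """Percentage slowdown of the traced run relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (traced_s - baseline_s) / baseline_s * 100.0


def workload():
    # Placeholder workload standing in for a real application run.
    sum(range(10_000))


baseline = best_time(workload)
```

Taking the minimum over several repeats reduces the influence of scheduling noise on the comparison.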
Combined with Other Options
----------------------------
Torch operator mapping can be combined with other profiling options:
.. code-block:: shell-session
# Combine with block filtering for targeted counter collection
$ rocprof-compute profile --name mnist --torch-trace -b 11 12 -- python train.py
# Combine with iteration multiplexing
$ rocprof-compute profile --name mnist --torch-trace --iteration-multiplexing kernel -- python train.py
# Combine with kernel filtering (filters by GPU kernel name)
$ rocprof-compute profile --name mnist --torch-trace -k elementwise -- python train.py
.. _iteration-multiplexing:
@@ -687,7 +853,7 @@ To enable iteration multiplexing in ROCm Compute Profiler, use the
``--iteration-multiplexing`` option in your profiling command. You can optionally specify
the policy for multiplexing. The available policies are:
* ``kernel``
* ``kernel``
The counters are divided based on the kernels being executed. Each kernel call
for a particular kernel collects a different subset of counters.
* ``kernel_launch_params``
@@ -707,10 +873,10 @@ By default, if no policy is specified, ROCm Compute Profiler uses the ``kernel_l
Iteration multiplexing is only supported when using ROCm Compute Profiler with
the native counter collection tool. Ensure that ``--attach-pid`` is not used in your profiling command.
* Ensure that your workload runs for enough iterations to cover all counter subsets.
When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy)
or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy),
specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations
* Ensure that your workload runs for enough iterations to cover all counter subsets.
When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy)
or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy),
specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations
is too low, some counters may not be collected.
* Launch parameters for ``kernel_launch_params`` policy.
@@ -736,11 +902,11 @@ The following example demonstrates how to use iteration multiplexing with the
[INFO] Kernel Selection: None
[INFO] Dispatch Selection: None
[INFO] Filtered sections: All
[INFO]
[INFO]
[INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[INFO] Collecting Performance Counters
[INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[INFO]
[INFO]
[INFO] Using native counter collection tool: /tmp/rocprofiler-compute-tool-hlz4fagh/librocprofiler-compute-tool.so
[INFO] Iteration multiplexing: kernel
[INFO] [profiling] Current input files: /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_DCACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_ICACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_IFETCH_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_LDS.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_SMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_VMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_LEVEL_WAVES.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_0.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_1.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_10.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_11.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_12.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_2.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_3.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_4.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_5.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_6.txt, 
/home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_7.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_8.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_9.txt
@@ -111,4 +111,5 @@ markers = [
"iteration_multiplexing_2",
"iteration_multiplexing_stochastic",
"noise_clamp",
"torch_ops",
]
@@ -239,6 +239,17 @@ Examples:
help=argparse.SUPPRESS,
# help="\t\t\tKokkos trace, traces Kokkos API calls.",
)
profile_group.add_argument(
"--torch-trace",
dest="torch_trace",
required=False,
default=False,
action="store_true",
help=(
"\t\t\tTorch Trace, maps PyTorch operators to performance counters.\n"
"\t\t\tShould be used only when profiling PyTorch applications."
),
)
profile_group.add_argument(
"-k",
"--kernel",
@@ -109,16 +109,62 @@ class RocProfCompute_Base:
"--attach-pid cannot be used with --iteration-multiplexing. "
"Please remove one of these options."
)
# verify correct formatting for application binary
args.remaining = args.remaining[1:]
resolved_exec_path: Optional[Path] = None
if args.remaining:
# Ensure that command points to an executable
if not shutil.which(args.remaining[0]):
exec_candidate = shutil.which(args.remaining[0])
if not exec_candidate:
console_error(
f"Your command {args.remaining[0]} doesn't point to an executable. "
"Please verify."
)
resolved_exec_path = Path(exec_candidate).resolve()
# Appending a wrapper for injecting roctx-markers
if getattr(args, "torch_trace", False):
# Find the inject_roctx.py script in src/utils
inject_script = (
Path(__file__).parent.parent / "utils" / "inject_roctx.py"
)
if not inject_script.exists():
console_error(
f"Cannot find inject_roctx.py at {inject_script}. "
"Please verify your installation."
)
# Case 1: Explicit python command (python, python3, etc.)
if args.remaining[0].startswith("python"):
# Insert inject_roctx.py after the python interpreter
args.remaining.insert(1, str(inject_script))
# Case 2: Direct Python script execution (./main.py, /path/to/script.py)
elif args.remaining[0].endswith((".py", ".pyw", ".pyc", ".pyo")):
# Use current Python interpreter
args.remaining.insert(0, str(inject_script))
args.remaining.insert(0, sys.executable)
else:
console_warning(
"Command does not look like a Python entry point, "
"skipping ROCTX auto-injection and launching workload as-is."
)
console_warning(
"Ensure the binary already initializes PyTorch/ROCTX markers, "
"otherwise --torch-trace will have no effect."
)
if (
resolved_exec_path
and (resolved_exec_path.parent / "_internal").is_dir()
):
console_warning(
"Workload appears to be a self-contained binary. "
"Such bundles typically ship private ROCm/HSA libraries, which "
"prevents --torch-trace from collecting data. "
"Rebuild without packaging libhsa/libhip or "
"adjust LD_LIBRARY_PATH to /opt/rocm before profiling."
)
args.remaining = " ".join(args.remaining)
elif not args.attach_pid:
console_error(
@@ -471,6 +517,8 @@ class RocProfCompute_Base:
f'passes. Please use "--block" or "--set" '
f"to adjust or reduce the requested performance metrics!"
)
console_debug(f"Sending profiler options to run_prof: {options}")
run_prof(
fnames=str_fnames,
profiler_options=options,
@@ -478,6 +526,7 @@ class RocProfCompute_Base:
mspec=self._soc._mspec,
loglevel=args.loglevel,
format_rocprof_output=args.format_rocprof_output,
torch_trace_enabled=getattr(args, "torch_trace", False),
retain_rocpd_output=args.retain_rocpd_output,
)
@@ -30,6 +30,7 @@ from pathlib import Path
from rocprof_compute_profile.profiler_base import RocProfCompute_Base
from rocprof_compute_soc.soc_base import OmniSoC_Base
from utils.logger import console_error, console_log, demarcate
from utils.utils import consolidate_torch_trace_output
class rocprof_v3_profiler(RocProfCompute_Base):
@@ -49,7 +50,6 @@ class rocprof_v3_profiler(RocProfCompute_Base):
def get_profiler_options(self) -> list[str]:
args = self.get_args()
app_cmd = shlex.split(args.remaining)
if args.kokkos_trace:
trace_option = "--kokkos-trace"
# NOTE: --kokkos-trace feature is incomplete and is disabled for now.
@@ -60,9 +60,10 @@ class rocprof_v3_profiler(RocProfCompute_Base):
)
elif args.hip_trace:
trace_option = "--hip-trace"
elif getattr(args, "torch_trace", False):
trace_option = "--marker-trace"
else:
trace_option = "--kernel-trace"
profiling_options = [
# v3 requires output directory argument
"-d",
@@ -134,6 +135,10 @@ class rocprof_v3_profiler(RocProfCompute_Base):
if self.ready_to_profile:
# Manually join each pmc_perf*.csv output
self.join_prof()
# Consolidate torch trace output if --torch-trace was used
if getattr(self.get_args(), "torch_trace", False):
consolidate_torch_trace_output(self.get_args().path)
# Run roofline microbenchmark
super().post_processing()
else:
@@ -31,6 +31,7 @@ from typing import Optional, Union
from rocprof_compute_profile.profiler_base import RocProfCompute_Base
from rocprof_compute_soc.soc_base import OmniSoC_Base
from utils.logger import console_error, console_log, demarcate
from utils.utils import consolidate_torch_trace_output
class rocprofiler_sdk_profiler(RocProfCompute_Base):
@@ -71,6 +72,8 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base):
"ROCPROF_OUTPUT_PATH": f"{args.path}/out/pmc_1",
})
if getattr(args, "torch_trace", False):
options["ROCPROF_MARKER_API_TRACE"] = "1"
# Create folder pointed by ROCPROF_OUTPUT_PATH
Path(options["ROCPROF_OUTPUT_PATH"]).mkdir(parents=True, exist_ok=True)
@@ -161,6 +164,9 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base):
if self.ready_to_profile:
# Manually join each pmc_perf*.csv output
self.join_prof()
if getattr(self.get_args(), "torch_trace", False):
consolidate_torch_trace_output(self.get_args().path)
# Run roofline microbenchmark
super().post_processing()
else:
@@ -0,0 +1,292 @@
# ruff: noqa
##############################################################################
# MIT License
#
# Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
##############################################################################
"""
ROCTX Injection Wrapper - Auto-discovers and intercepts ALL PyTorch operators
Usage: python inject_roctx.py main.py --epochs 1 --batch-size 4
"""
import os
import sys
from pathlib import Path
# Add parent directory to Python path for config module
script_dir = Path(__file__).resolve().parent
sys.path.insert(0, str(script_dir.parent))
from utils.logger import console_log, console_warning
rocm_root = os.environ.get("ROCM_PATH", "/opt/rocm")
python_version = f"python{sys.version_info.major}.{sys.version_info.minor}"
candidate_paths = [
f"{rocm_root}/lib/{python_version}/site-packages",
f"{rocm_root}/libexec/rocprofiler-sdk/python",
]
for candidate in candidate_paths:
if candidate not in sys.path:
sys.path.insert(0, candidate)
try:
import torch
console_log(f"PyTorch version: {torch.__version__}")
except ImportError:
console_warning(
"PyTorch is not installed or not properly configured.\n"
"The --torch-trace option requires a valid PyTorch installation.\n"
"Please install PyTorch and try again."
)
sys.exit(0)
import importlib.util
import inspect
from functools import wraps
import torch.nn.functional as F
from roctx import rangePop, rangePush
def roctx_wrapper(func, name=None):
func_name = name or func.__name__
call_counter = {"count": 0}
@wraps(func)
def wrapper(*args, **kwargs):
call_counter["count"] += 1
current_frame = inspect.currentframe()
caller_frame = current_frame.f_back if current_frame is not None else None
if caller_frame is not None:
filename = caller_frame.f_code.co_filename
location = f"{Path(filename).name}:{caller_frame.f_lineno}"
else:
location = "unknown:0"
# Unique marker: function + call_number + source_location
rangePush(f"{func_name}:#{call_counter['count']}@{location}")
try:
result = func(*args, **kwargs)
finally:
rangePop()
return result
return wrapper
def auto_discover_torch_callables(module, prefix, exclude_patterns=None):
"""Automatically discover all callable functions in a module."""
if exclude_patterns is None:
exclude_patterns = ["__", "_", "is_", "set_", "get_"]
functions = {}
for name in dir(module):
# Skip private/internal functions
if any(name.startswith(pat) for pat in exclude_patterns):
continue
try:
attr = getattr(module, name)
# Only wrap callables (functions, not classes or constants)
if callable(attr) and not isinstance(attr, type):
full_name = f"{prefix}.{name}"
functions[full_name] = (module, name, attr)
except Exception as e:
console_warning(type(e))
console_warning(f"Could not access {prefix}.{name}: {e}")
return functions
def inject_roctx_into_torch():
"""Monkey-patch PyTorch operations to add ROCTX markers."""
console_log("Auto-discovering PyTorch operations to wrap...")
# Auto-discover functions from key modules
all_operations = {}
# torch.* functions (matmul, mm, cat, etc.)
all_operations.update(auto_discover_torch_callables(torch, "torch"))
# torch.nn.functional.* functions (linear, relu, softmax, etc.)
all_operations.update(auto_discover_torch_callables(F, "torch.nn.functional"))
# torch.linalg.* functions (matrix operations)
try:
all_operations.update(
auto_discover_torch_callables(torch.linalg, "torch.linalg")
)
except Exception as e:
console_warning(type(e))
console_warning(f"Could not access torch.linalg: {e}")
# torch.fft.* functions (FFT operations)
try:
all_operations.update(auto_discover_torch_callables(torch.fft, "torch.fft"))
except Exception as e:
console_warning(type(e))
console_warning(f"Could not access torch.fft: {e}")
console_log(f"Found {len(all_operations)} operations to wrap")
console_log("Injecting ROCTX markers into PyTorch operations...")
wrapped_count = 0
failed_count = 0
for full_name, (module, attr_name, original_func) in all_operations.items():
try:
# Replace with wrapped version
wrapped_func = roctx_wrapper(original_func, full_name)
setattr(module, attr_name, wrapped_func)
wrapped_count += 1
# Print first 20 and last 5 for visibility
if wrapped_count <= 20 or wrapped_count > len(all_operations) - 5:
console_log(f"Wrapped: {full_name}")
elif wrapped_count == 21:
console_log(
f" ... (wrapping {len(all_operations) - 25} more operations)"
)
except Exception as e:
failed_count += 1
if failed_count <= 5: # Only show first few failures
console_warning(f"Failed to wrap {full_name}: {e}")
# Wrap tensor methods
original_backward = torch.Tensor.backward
backward_counter = {"count": 0}
def backward_with_roctx(self, *args, **kwargs):
backward_counter["count"] += 1
current_frame = inspect.currentframe()
caller_frame = current_frame.f_back if current_frame is not None else None
if caller_frame is not None:
filename = caller_frame.f_code.co_filename
location = f"{Path(filename).name}:{caller_frame.f_lineno}"
else:
location = "unknown:0"
rangePush(f"torch.Tensor.backward:#{backward_counter['count']}@{location}")
try:
return original_backward(self, *args, **kwargs)
finally:
rangePop()
torch.Tensor.backward = backward_with_roctx
wrapped_count += 1
console_log("Wrapped: torch.Tensor.backward")
console_log(f"Wrapped {wrapped_count} operations with ROCTX markers")
if failed_count > 0:
console_warning(
f"Failed to wrap {failed_count} operations (likely not patchable)"
)
def inject_roctx_into_optimizer():
"""Wrap optimizer step() method."""
from torch.optim import Optimizer
original_step = Optimizer.step
def step_with_roctx(self, *args, **kwargs):
rangePush(f"optimizer.{self.__class__.__name__}.step")
try:
return original_step(self, *args, **kwargs)
finally:
rangePop()
Optimizer.step = step_with_roctx
console_log("Wrapped optimizer.step() with ROCTX markers\n")
def inject_roctx_into_model():
"""Wrap nn.Module forward() method with call counter."""
from torch import nn
from typing import Any
original_call = nn.Module.__call__
# Per-instance call counters
def call_with_roctx(self, *args, **kwargs):
class_name = self.__class__.__name__
# Initialize counter for this instance if not exists
if not hasattr(self, "_roctx_call_count"):
self._roctx_call_count = 0
self._roctx_call_count += 1
# Get caller location
current_frame = inspect.currentframe()
caller_frame = current_frame.f_back if current_frame is not None else None
if caller_frame is not None:
filename = caller_frame.f_code.co_filename
location = f"{Path(filename).name}:{caller_frame.f_lineno}"
else:
location = "unknown:0"
# Create detailed marker
rangePush(
f"nn.Module.{class_name}.forward:#{self._roctx_call_count}@{location}"
)
try:
return original_call(self, *args, **kwargs)
finally:
rangePop()
nn.Module.__call__ = call_with_roctx
console_log("Wrapped nn.Module forward() with ROCTX markers\n")
if __name__ == "__main__":
if len(sys.argv) < 2:
console_log("Usage: python inject_roctx.py <script.py> [script_args...]")
sys.exit(1)
# Get target script and its arguments
target_script = sys.argv[1]
script_args = sys.argv[2:]
# Inject ROCTX markers BEFORE importing the target script
inject_roctx_into_torch()
inject_roctx_into_optimizer()
inject_roctx_into_model()
console_log("=" * 70)
console_log("Starting target script with ROCTX instrumentation...")
console_log("=" * 70)
# Modify sys.argv so the target script sees correct arguments
sys.argv = [target_script] + script_args
# Load and execute the target script
spec = importlib.util.spec_from_file_location("__main__", target_script)
module = importlib.util.module_from_spec(spec)
sys.modules["__main__"] = module
spec.loader.exec_module(module)
@@ -25,7 +25,7 @@
import csv
import sqlite3
from contextlib import closing
from contextlib import ExitStack, closing
from typing import Any
import pandas as pd
@@ -37,6 +37,8 @@ from utils.logger import console_error
COUNTERS_COLLECTION_QUERY = """
SELECT
agent_id as GPU_ID,
guid as GUID,
correlation_id as Correlation_Id,
dispatch_id as Dispatch_ID,
pid as PID,
grid_size as Grid_Size,
@@ -54,6 +56,24 @@ SELECT
value as Counter_Value
FROM counters_collection
"""
MARKER_API_TRACE_QUERY = """
SELECT
category AS Domain,
json_extract(extdata, '$.message') AS Function,
pid AS Process_Id,
tid AS Thread_Id,
corr_id AS Correlation_Id,
guid AS GUID,
start AS Start_Timestamp,
end AS End_Timestamp
FROM regions
ORDER BY start
"""
KERNEL_DISPATCH_QUERY = """
SELECT dispatch_id, event_id, guid
FROM rocpd_kernel_dispatch
WHERE guid = ?
"""
ROCPD_PMC_EVENT_TABLE_NAME_PREFIX = "rocpd_pmc_event_"
TABLE_NAME_PREFIX_QUERY = (
"SELECT name FROM sqlite_master WHERE type='table' "
@@ -64,30 +84,43 @@ INSERT_QUERY = "INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"
def convert_dbs_to_csv(
db_paths: list[str],
csv_file_path: str,
counter_collection_csv_path: str,
marker_trace_csv_path: str,
) -> None:
"""
Read rocpd databases and write to CSV file
"""
# Read counters_collection view from the databases and write to CSV
try:
with open(csv_file_path, "w", newline="") as csvfile:
writer = csv.writer(csvfile)
header_written = False
for db_path in db_paths:
with closing(sqlite3.connect(db_path)) as conn:
with closing(conn.execute(COUNTERS_COLLECTION_QUERY)) as cursor:
if not header_written:
writer.writerow([
description[0] for description in cursor.description
])
header_written = True
for row in cursor:
writer.writerow(row)
except OSError as e:
console_error(f"Database error while converting to CSV: {e}")
except Exception as e:
console_error(f"Unexpected error converting database to CSV: {e}")
queries = {
counter_collection_csv_path: COUNTERS_COLLECTION_QUERY,
marker_trace_csv_path: MARKER_API_TRACE_QUERY,
}
header_written = {path: False for path in queries}
with ExitStack() as stack:
writers = {
path: csv.writer(stack.enter_context(open(path, "w", newline="")))
for path in queries
}
for db_path in db_paths:
with closing(sqlite3.connect(db_path)) as conn:
for file_path, query in queries.items():
try:
with closing(conn.execute(query)) as cursor:
if cursor.description is None:
continue
if not header_written[file_path]:
writers[file_path].writerow([
desc[0] for desc in cursor.description
])
header_written[file_path] = True
writers[file_path].writerows(cursor)
except OSError as e:
console_error(
f"Database error while extracting {file_path} "
f"from {db_path}: {e}"
)
except Exception as e:
console_error(
f"Unexpected error while extracting {file_path} "
f"from {db_path}: {e}"
)
def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame:
@@ -134,7 +167,7 @@ def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame:
def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> None:
"""Update pmc_event table in the given rocpd database path"""
"""Updates pmc_event table in the given rocpd database path."""
try:
with closing(sqlite3.connect(rocpd_db_path)) as conn:
# Get pmc_event table name
@@ -154,13 +187,27 @@ def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> N
guid = table_name[len(ROCPD_PMC_EVENT_TABLE_NAME_PREFIX) :].replace(
"_", "-"
)
# Map dispatch_id to event_id from rocpd_kernel_dispatch
# Native counter collection CSV has dispatch_id, but schema needs event_id
# event_id may differ from dispatch_id when marker API tracing is enabled
with closing(conn.execute(KERNEL_DISPATCH_QUERY, (guid,))) as cursor:
rows = cursor.fetchall()
if not rows:
console_error("No kernel dispatch data found.")
return
dispatch_to_event = {
dispatch_id: event_id for dispatch_id, event_id, _ in rows
}
counter_info["event_id"] = counter_info["dispatch_id"].map(
dispatch_to_event
)
columns = ("guid", "event_id", "pmc_id", "value")
values = list(
zip(
# guid
[guid] * len(counter_info),
# event_id
counter_info["dispatch_id"],
counter_info["event_id"],
# pmc_id
counter_info["counter_id"],
# value
@@ -786,6 +786,7 @@ def run_prof(
mspec: Any, # noqa: ANN401
loglevel: int,
format_rocprof_output: str,
torch_trace_enabled: bool = False,
retain_rocpd_output: bool = False,
) -> None:
multiple_files = isinstance(fnames, list)
@@ -939,9 +940,12 @@ def run_prof(
# Write results_fbase.csv
rocpd_data.convert_dbs_to_csv(
glob.glob(workload_dir + "/out/pmc_1/*/*.db"),
workload_dir + f"/results_{fbase}.csv",
workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv",
workload_dir + f"/out/pmc_1/{fbase}_marker_api_trace.csv",
)
combined_df = pd.read_csv(
workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv"
)
combined_df = pd.read_csv(workload_dir + f"/results_{fbase}.csv")
# Reset Dispatch_ID based on PID, Kernel_Name, Grid_Size,
# Workgroup_Size, LDS_Per_Workgroup, Start_Timestamp, End_Timestamp
combined_df["Dispatch_ID"] = combined_df.groupby(
@@ -964,8 +968,12 @@ def run_prof(
).ngroup()
# Drop PID since it is not required
combined_df = combined_df.drop(columns=["PID"])
combined_df.to_csv(
workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv", index=False
)
combined_df.to_csv(workload_dir + f"/results_{fbase}.csv", index=False)
if torch_trace_enabled:
process_torch_trace_output(workload_dir, fbase, format_rocprof_output)
if retain_rocpd_output:
for db_path in glob.glob(workload_dir + "/out/pmc_1/*/*.db"):
pid = Path(db_path).stem.split("_")[0]
@@ -1004,7 +1012,9 @@ def run_prof(
process_kokkos_trace_output(workload_dir, fbase)
elif "--hip-trace" in options:
process_hip_trace_output(workload_dir, fbase)
# Add torch operator trace processing
if torch_trace_enabled:
process_torch_trace_output(workload_dir, fbase, format_rocprof_output)
# Combine results into single CSV file
if results_files:
combined_results = pd.concat(
@@ -1175,7 +1185,7 @@ def convert_native_counter_collection_csv(workload_dir: str) -> None:
)
rocprofv3_counter_data = pd.DataFrame({
"Correlation_Id": merged_data["dispatch_id"],
"Correlation_Id": merged_data["Correlation_Id"],
"Dispatch_Id": merged_data["dispatch_id"],
"Agent_Id": merged_data["Agent_Id"],
"Queue_Id": merged_data["Queue_Id"],
@@ -1262,6 +1272,178 @@ def process_rocprofv3_output(workload_dir: str, using_native_tool: bool) -> list
return results_files_csv
@demarcate
def process_torch_trace_output(
workload_dir: str,
fbase: str,
output_format: str = "rocpd",
) -> None:
"""
Creates PyTorch operator trace from counter_collection and marker_api_trace data.
- Performs inner join on Correlation_Id, filtering out unmatched entries
- Output file is saved to workload root, not the temporary out/ directory
"""
marker_trace_csv_file_path = f"{workload_dir}/out/pmc_1/"
# Find all marker_api_trace CSV files
marker_api_trace_csvs = list(
Path(marker_trace_csv_file_path).glob("**/*_marker_api_trace.csv")
)
counter_collection_csvs = [
markers_file.parent
/ markers_file.name.replace("_marker_api_trace.", "_counter_collection.")
for markers_file in marker_api_trace_csvs
]
existing_csv_files = [
[marker_api_trace_csvs[i], counter_collection_csvs[i]]
for i in range(len(marker_api_trace_csvs))
if counter_collection_csvs[i].is_file() and marker_api_trace_csvs[i].is_file()
]
if not existing_csv_files:
console_warning(
f"No marker files with corresponding counter files found for {fbase}"
)
return
# Join marker and counter data
def _merge_pair(
marker_path: Path,
counter_path: Path,
join_keys: tuple = ("Correlation_Id",),
) -> pd.DataFrame:
marker_df = pd.read_csv(marker_path)
counter_df = pd.read_csv(counter_path)
return pd.merge(
marker_df,
counter_df,
on=join_keys,
how="inner",
suffixes=("_function", "_kernel"),
)
if output_format == "csv":
merged_results = pd.concat(
[_merge_pair(f[0], f[1]) for f in existing_csv_files],
ignore_index=True,
)
elif output_format == "rocpd":
# There will be one pair of CSV files, extracted from the rocpd db and consolidated.
merged_results = _merge_pair(
existing_csv_files[0][0],
existing_csv_files[0][1],
("Correlation_Id", "GUID"),
)
# Save merged results
merged_results.to_csv(
f"{workload_dir}/{fbase}_torch_trace.csv",
index=False,
)
console_log("Created ", f"{workload_dir}/{fbase}_torch_trace.csv")
@demarcate
def consolidate_torch_trace_output(workload_dir: str) -> None:
# Consolidate torch operator trace CSV files from multiple processes
console_log("Consolidating torch operator trace output...")
# Find all torch trace CSV files in workload directory
torch_trace_files = glob.glob(f"{workload_dir}/*_torch_trace.csv")
if not torch_trace_files:
console_warning("No torch trace files found.")
return
# Read and concatenate all torch trace files
all_traces = []
required_columns = [
"Function",
"Kernel_Name",
"Counter_Name",
"Counter_Value",
"Start_Timestamp_function",
"End_Timestamp_function",
"Start_Timestamp_kernel",
"End_Timestamp_kernel",
]
for trace_file in torch_trace_files:
try:
df = pd.read_csv(trace_file)
except pd.errors.ParserError as e:
console_warning(f"Parser error while reading {trace_file}: {e}")
continue
except OSError as e:
console_warning(f"I/O error while reading {trace_file}: {e}")
continue
except Exception as e:
# Unexpected error; log full details for debugging
console_warning(
f"Unexpected error while reading {trace_file}: {e}\n"
f"{traceback.format_exc()}"
)
continue
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
console_warning(
f"Skipping {trace_file}: missing required columns {missing_columns}"
)
continue
all_traces.append(df[required_columns])
if not all_traces:
console_warning("No valid torch trace data to consolidate.")
return
consolidated_df = pd.concat(all_traces, ignore_index=True)
if consolidated_df.isnull().values.any():
console_warning("Consolidated torch trace contains missing values")
return
consolidated_df = consolidated_df.sort_values(by=["Function", "Counter_Name"])
split_columns = consolidated_df["Function"].str.split(":#", expand=True)
consolidated_df["Operator_Name"] = (
split_columns[0] if len(split_columns.columns) > 0 else None
)
consolidated_df["Context_Id"] = (
split_columns[1] if len(split_columns.columns) > 1 else None
)
consolidated_df.drop(columns=["Function"], inplace=True)
consolidated_df = consolidated_df[
[
"Operator_Name",
"Context_Id",
"Kernel_Name",
"Counter_Name",
"Counter_Value",
"Start_Timestamp_function",
"End_Timestamp_function",
"Start_Timestamp_kernel",
"End_Timestamp_kernel",
]
]
if consolidated_df.isnull().values.any():
console_error(
"Missing values in consolidated torch trace after splitting "
"the Function name."
)
return
grouped = consolidated_df.groupby("Operator_Name")
for operator_name, group in grouped:
sanitized_operator_name = operator_name.replace("torch.", "").replace(".", "_")
# Ensure output directory exists
Path(f"{workload_dir}/torch_trace").mkdir(parents=True, exist_ok=True)
output_file = f"{workload_dir}/torch_trace/{sanitized_operator_name}.csv"
group.to_csv(output_file, index=False)
console_log(
f"Saved consolidated trace for {sanitized_operator_name} to {output_file}"
)
for trace_file in torch_trace_files:
try:
Path(trace_file).unlink()
console_debug(f"Removed temporary torch trace file: {trace_file}")
except OSError as e:
console_warning(f"Error removing temporary file {trace_file}: {e}")
@demarcate
def process_kokkos_trace_output(workload_dir: str, fbase: str) -> None:
# marker api trace csv files are generated for each process
@@ -23,6 +23,7 @@
##############################################################################
import importlib.util
import inspect
import os
import re
@@ -2779,3 +2780,215 @@ def test_iteration_multiplexing_all_counter_accuracy(
assert are_stochastic_counters_similar(
[counters_kernel, counters_kernel_launch_params], counters_no_multiplexing
)
skip_if_no_torch = pytest.mark.skipif(
importlib.util.find_spec("torch") is None, reason="torch is required for this test"
)
@skip_if_no_torch
def test_torch_trace_profile(binary_handler_profile_rocprof_compute):
"""
Test profiling a PyTorch application with --torch-trace option.
Verifies that all required files are generated and counter values are valid.
NOTE: Not included in the test suite since this requires PyTorch installation.
"""
workload_dir = test_utils.get_output_dir(param_id="torch_ops")
Path(workload_dir).mkdir(parents=True, exist_ok=True)
torch_app_path = Path(workload_dir) / "test_torch_app.py"
torch_app_code = """
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleNet(nn.Module):
def __init__(self):
super(SimpleNet, self).__init__()
self.fc1 = nn.Linear(10, 20)
self.fc2 = nn.Linear(20, 10)
def forward(self, x):
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
return x
if __name__ == "__main__":
if not torch.cuda.is_available():
import sys
print("GPU is required for this test. Exiting.")
sys.exit(1)
model = SimpleNet()
model = model.cuda()
x = torch.randn(5, 10).cuda()
# Run a few iterations
for epoch in range(1):
output = model(x)
loss = output.sum()
loss.backward()
print("Training completed")
"""
with open(torch_app_path, "w") as f:
f.write(torch_app_code)
config["torch_test_app"] = ["python3", str(torch_app_path)]
# Profile with --torch-trace option
options = [
"--torch-trace",
]
returncode = binary_handler_profile_rocprof_compute(
config,
workload_dir,
options,
check_success=True,
app_name="torch_test_app",
)
assert returncode == 0, "Profiling the torch application failed"
# Verify files are generated
# 1. Check basic CSV files
num_devices = config.get("num_devices", 1)
file_dict = test_utils.check_csv_files(workload_dir, num_devices, 1)
assert "pmc_perf.csv" in file_dict, "pmc_perf.csv not generated"
# 2. Check torch trace directory
torch_trace_dir = Path(workload_dir) / "torch_trace"
assert torch_trace_dir.exists(), "torch_trace directory not created"
assert torch_trace_dir.is_dir(), "torch_trace is not a directory"
# 3. Check per-operator CSV files exist
operator_csv_files = list(torch_trace_dir.glob("*.csv"))
assert len(operator_csv_files) > 0, "No per-operator CSV files generated"
# 4. Verify per-operator CSV structure
for op_csv in operator_csv_files:
op_df = pd.read_csv(op_csv)
assert len(op_df) > 0, f"Per-operator CSV {op_csv.name} is empty"
test_utils.clean_output_dir(config["cleanup"], workload_dir)
@skip_if_no_torch
def test_torch_trace_overhead(binary_handler_profile_rocprof_compute):
"""
Measure overhead introduced by --torch-trace flag.
Compares execution time with and without the flag to ensure overhead is acceptable.
NOTE: Not included in the test suite since this requires PyTorch installation.
"""
helper_dir = Path(test_utils.get_output_dir(param_id="torch_helper_script"))
helper_dir.mkdir(parents=True, exist_ok=True)
torch_app_path = helper_dir / "test_torch_app.py"
torch_app_code = """
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleNet(nn.Module):
def __init__(self):
super(SimpleNet, self).__init__()
self.fc1 = nn.Linear(10, 20)
self.fc2 = nn.Linear(20, 10)
def forward(self, x):
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
return x
if __name__ == "__main__":
if not torch.cuda.is_available():
import sys
print("GPU is required for this test. Exiting.")
sys.exit(1)
model = SimpleNet()
model = model.cuda()
x = torch.randn(5, 10).cuda()
# Run a few iterations
for epoch in range(1):
output = model(x)
loss = output.sum()
loss.backward()
print("Training completed")
"""
with open(torch_app_path, "w") as f:
f.write(torch_app_code)
config["torch_test_app"] = ["python3", str(torch_app_path)]
# Run WITHOUT --torch-trace (baseline)
workload_dir_baseline = test_utils.get_output_dir(param_id="torch_baseline")
start_baseline = time.time()
returncode_baseline = binary_handler_profile_rocprof_compute(
config,
workload_dir_baseline,
[], # No torch-trace flag
check_success=True,
roof=False,
app_name="torch_test_app",
)
baseline_time = time.time() - start_baseline
assert returncode_baseline == 0, "Baseline profiling failed"
# Read baseline timestamps
baseline_df = pd.read_csv(f"{workload_dir_baseline}/pmc_perf.csv")
baseline_kernel_duration_total = (
baseline_df["End_Timestamp"].max() - baseline_df["Start_Timestamp"].min()
)
test_utils.clean_output_dir(config["cleanup"], workload_dir_baseline)
# Run WITH --torch-trace
workload_dir_with_flag = test_utils.get_output_dir(param_id="torch_with_ops")
start_with_flag = time.time()
returncode_with_flag = binary_handler_profile_rocprof_compute(
config,
workload_dir_with_flag,
["--torch-trace"],
check_success=True,
roof=False,
app_name="torch_test_app",
)
with_flag_time = time.time() - start_with_flag
assert returncode_with_flag == 0, "Profiling with torch-trace failed"
# Read with-flag timestamps
with_flag_df = pd.read_csv(f"{workload_dir_with_flag}/pmc_perf.csv")
with_flag_kernel_duration_total = (
with_flag_df["End_Timestamp"].max() - with_flag_df["Start_Timestamp"].min()
)
longest_running_kernel_baseline = (
baseline_df["End_Timestamp"] - baseline_df["Start_Timestamp"]
).max()
longest_running_kernel_with_flag = (
with_flag_df["End_Timestamp"] - with_flag_df["Start_Timestamp"]
).max()
# Calculate overheads
longest_running_kernel_overhead = (
(longest_running_kernel_with_flag - longest_running_kernel_baseline)
/ longest_running_kernel_baseline
) * 100
wall_clock_overhead = ((with_flag_time - baseline_time) / baseline_time) * 100
kernel_overhead = (
(with_flag_kernel_duration_total - baseline_kernel_duration_total)
/ baseline_kernel_duration_total
) * 100
print(f"\n{'=' * 70}")
print("Performance Overhead Analysis:")
print(f" Longest running kernel overhead: {longest_running_kernel_overhead:.1f}%")
print(f" Baseline wall-clock time: {baseline_time:.2f}s")
print(f" With --torch-trace time: {with_flag_time:.2f}s")
print(f" Wall-clock overhead: {wall_clock_overhead:.1f}%")
print(f" Baseline kernel duration: {baseline_kernel_duration_total:.0f} ns")
print(f" With flag kernel duration: {with_flag_kernel_duration_total:.0f} ns")
print(f" Kernel execution overhead: {kernel_overhead:.1f}%")
print(f"{'=' * 70}\n")
# Verify torch trace directory was created
torch_trace_dir = Path(workload_dir_with_flag) / "torch_trace"
assert torch_trace_dir.exists(), "torch_trace directory should be created"
operator_csv_files = list(torch_trace_dir.glob("*.csv"))
assert len(operator_csv_files) > 0, "Operator CSV files should be generated"
test_utils.clean_output_dir(config["cleanup"], workload_dir_with_flag)
# Assert overhead is reasonable (< 100% wall-clock, < 50% kernel)
assert wall_clock_overhead < 100, (
f"Wall-clock overhead too high: {wall_clock_overhead:.1f}%"
)
assert kernel_overhead < 50, (
f"Kernel execution overhead too high: {kernel_overhead:.1f}%"
)
assert longest_running_kernel_overhead < 50, (
f"longest running kernel increase too high: "
f"{longest_running_kernel_overhead:.1f}%"
)