[rocprofiler-compute] Adding --torch-trace option for SWDEV-559789 (#2089)

* Adding --torch-operator option in rocprof-compute. Creates csv file for each operator that has gpu activity, showing operator to counter values mapping.
* --torch-operators flag added to rocprofiler-sdk
* Adding ctest for --torch-operators.
* Adding pytest markers.
* Corrections in ctest and message logging.
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Adding a check for pytorch installation only when --torch-operators is passed.
* moving inject_roctx.py into src/utils.
* rebase
* Updating docs and changelog.
* Update projects/rocprofiler-compute/src/argparser.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Removing special characters.
* Minor corrections.
* Setting default value for torch_operators_enabled.
* Updating the number of files according to the number of passes.
* Adding rocpd support.
* Adding a warning message to be shown when profiling a non-python workload.
* copilot suggestions, rocpd+native tool fix
* Fixed the incorrect usage of dispatch_id as event_id in the function update_rocpd_pmc_events()
* ruff format fix
* ruff formatting
* Deleting torch_trace.csvs after consolidating the operator data.
* Removing checks since *torch_trace.csv files are deleted.
* Fixing file deletion.
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/utils/utils.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/tests/test_profile_general.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Using default options in the testcase.
* Adding test for overhead measurement.
* Corrections in docs.
* doc updates.
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Handling potential empty frames.
* Corrected the test cases.
* Changing the flag to --torch-trace
* Fixed helper_app path issues
* Path issues
* process_torch_trace_output() now takes csv file paths as input + allows default usage.
* Replaced pandas with sqlite3
* Adding marker_trace extraction to rocpd_data.py
* Allowing all workloads to use --torch-trace option. Assuming the workload is user verified.
* Modified help section for the flag.
* Added difference in runtimes for longest running kernels in each profiling runs to overhead measurements.
* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Removed the accesses to the tables.
* Ruff fixes.
* ruff
* Ruff Fixes
* Adding getattr for args.torch_trace to handle mock args.
* Fix for 'Missing guid in counter collection data - in csv mode'
* Sending output_format to process_torch_trace_output
* Warning for self contained binaries.
* Ruff
* Ruff
* Measuring longest_running_kernel_baseline instead of worst_kernel_increase, very small kernel runtimes are blowing up the worst_kernel_increase metric.
* Minor fixes in input arguments
* Ruff
* Logging PyTorch version
* Fix ruff formatting for PyTorch version logging

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-27 19:50:25 +05:30
@@ -15,6 +15,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.

 * Iteration multiplexing to collect counters in single application run

+* Added `--torch-trace` option to enable mapping of PyTorch operators to collected counter values during profiling.
+
 * Runtime compilation of Roofline benchmarking:
  * GPU kernels from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository are moved into the ROCm Compute Profiler and are compiled at runtime using local HIP and HIPRTC Python wrappers.
  * Roofline binaries compiled from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository have been removed from the project, as Roofline runtime compilation performs the same work as the Roofline binaries.
@@ -617,11 +617,11 @@ The following example demonstrates profiling roofline data only:
   INFO Kernel Selection: None
   INFO Dispatch Selection: None
   INFO Filtered sections: ['4']
-   INFO 
+   INFO
   INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   INFO Collecting Performance Counters (Roofline Only)
   INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-   INFO 
+   INFO
   INFO [Run 1/3][Approximate profiling time left: pending first measurement...]
   INFO [profiling] Current input file: /app/projects/rocprofiler-compute/workloads/occupancy/MI300X_A1/perfmon/pmc_perf_0.txt
   ...
@@ -659,6 +659,172 @@ plot.
   :alt: Sample ROCm Compute Profiler roofline output
   :width: 800

+.. _torch-operator-mapping:
+
+Torch Operator Mapping
+========================
+
+The Torch Operator Mapping feature of ROCm Compute Profiler maps collected
+performance counter values to the PyTorch operators that triggered them, so
+that PyTorch workloads can be analyzed at operator granularity instead of only
+per GPU kernel.
+
+When enabled, this feature instruments your PyTorch application to correlate GPU
+kernel executions with their originating PyTorch operators, providing insights into
+which operators contribute to specific performance counter values.
+
+.. note::
+
+   **PyTorch Operators vs GPU Kernels**: PyTorch operators (such as ``conv2d``,
+   ``linear``, ``relu``) are high-level API functions. When executed on GPU, these
+   operators may dispatch one or more low-level GPU kernels (such as
+   ``implicit_convolve_sgemm``) that perform the actual computation on the hardware.
+   The ``--torch-trace`` feature provides operator-level attribution by injecting
+   markers that map collected kernel performance counters to their originating PyTorch
+   operators.
+
+Requirements
+------------
+
+* A valid PyTorch installation in the profiling environment
+* A PyTorch application launched as a Python command or Python script, which is what automatic marker injection relies on
+
+Usage
+-----
+
+To enable Torch operator mapping, use the ``--torch-trace`` option when profiling
+a PyTorch workload:
+
+.. code-block:: shell-session
+
+   $ rocprof-compute profile --name mnist_torch --torch-trace -- python train.py
+
+                                    __                                       _
+    _ __ ___   ___ _ __  _ __ ___  / _|       ___ ___  _ __ ___  _ __  _   _| |_ ___
+   | '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
+   | | | (_) | (__| |_) | | | (_) |  _|_____| (_| (_) | | | | | | |_) | |_| | ||  __/
+   |_|  \___/ \___| .__/|_|  \___/|_|        \___\___/|_| |_| |_| .__/ \__,_|\__\___|
+                  |_|                                           |_|
+
+   rocprofiler-compute version: 3.4.0
+   Profiler choice: rocprofiler-sdk
+   Path: /home/auser/workloads/mnist_torch/MI300X_A1
+   Target: MI300X_A1
+   Command: python train.py
+   Torch Trace: Enabled
+   Kernel Selection: None
+   Dispatch Selection: None
+   Hardware Blocks: All
+
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   Collecting Performance Counters
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   ...
+
+Output
+------
+
+When Torch operator mapping is enabled, profiling generates additional output files
+in the workload directory that correlate PyTorch operators with GPU kernels and
+their performance counters:
+
+``<workload_name>_torch_trace.csv``
+   Contains the merged operator-to-kernel mapping with performance counter data. These
+   files are temporary: they are removed once consolidated into the per-operator CSV files.
+   Key columns include:
+
+   * ``Function`` - PyTorch operator marker name, including the call site (e.g., ``torch.manual_seed:#1@main.py:99``)
+   * ``Kernel_Name`` - GPU kernel name dispatched by the operator
+   * ``Counter_Name`` / ``Counter_Value`` - Hardware performance counter measurements
+   * ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator execution time
+   * ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel execution time
+   * ``Correlation_Id`` - Links operator calls to their kernel dispatches
+
+.. csv-table:: SQC_ICACHE_INFLIGHT_LEVEL_torch_trace.csv from profiling an MNIST model
+   :header: "Domain", "Function", "Process_Id", "Thread_Id", "Correlation_Id", "Start_Timestamp_function", "End_Timestamp_function", "GPU_ID", "Dispatch_ID", "PID", "Grid_Size", "Workgroup_Size", "LDS_Per_Workgroup", "Scratch_Per_Workitem", "Arch_VGPR", "Accum_VGPR", "SGPR", "Kernel_Name", "Start_Timestamp_kernel", "End_Timestamp_kernel", "Kernel_ID", "Counter_Name", "Counter_Value"
+   :widths: auto
+
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "CPC_CPC_STAT_STALL", 17946
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "CPC_CPC_TCIU_BUSY", 714
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "CPF_CPF_STAT_IDLE", 0
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "CPF_CPF_STAT_STALL", 78
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "GRBM_SPI_BUSY", 7277
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "SPI_RA_REQ_NO_ALLOC_CSN", 8
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "SPI_RA_RES_STALL_CSN", 0
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "SPI_RA_SGPR_SIMD_FULL_CSN", 0
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "SPI_RA_TGLIM_CU_FULL_CSN", 0
+   "MARKER_CORE_RANGE_API", "``torch.manual_seed:#1@main.py:99``", 1214226, 1214226, 0, 7072577770736616, 7072577771920451, 4, 1, 1214226, 512, 512, 0, 0, 16, 0, 32, "``__amd_rocclr_copyBuffer``", 7072577923044453, 7072577923046813, 6, "SPI_RA_TMP_STALL_CSN", 0
+
+``torch_trace/`` directory
+   Contains individual CSV files for each PyTorch operator detected during profiling.
+   Each file is named after the operator (e.g., ``nn_functional_conv2d.csv``,
+   ``nn_functional_linear.csv``, ``relu.csv``) and contains all kernel executions and
+   performance counters for that specific operator. Columns include:
+
+   * ``Operator_Name`` - PyTorch operator name
+   * ``Context_Id`` - Source location where operator was called (e.g., ``conv2d:10@conv.py:543``)
+   * ``Counter_Name`` / ``Counter_Value`` - Hardware counter measurements
+   * ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator timing
+   * ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel timing
+
+   This per-operator organization enables focused analysis of specific operators without
+   processing the entire trace.
+
+.. csv-table:: torch_trace/ones_like.csv from profiling an MNIST model
+   :header: "Operator_Name", "Context_Id", "Kernel_Name", "Counter_Name", "Counter_Value", "Start_Timestamp_function", "End_Timestamp_function", "Start_Timestamp_kernel", "End_Timestamp_kernel"
+   :widths: auto
+
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_STAT_BUSY", 23004, 6789210204040073, 6789210223815845, 6789210223810274, 6789210223811914
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_STAT_IDLE", 0, 6789210204040073, 6789210223815845, 6789210223810274, 6789210223811914
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_STAT_STALL", 6715, 6789281060081123, 6789281079930585, 6789281079932564, 6789281079934204
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_TCIU_BUSY", 534, 6789281060081123, 6789281079930585, 6789281079932564, 6789281079934204
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_TCIU_IDLE", 20569, 6789352286866085, 6789352306292985, 6789352306292904, 6789352306294424
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_UTCL2IU_BUSY", 358, 6789352286866085, 6789352306292985, 6789352306292904, 6789352306294424
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_CPC_UTCL2IU_IDLE", 20046, 6789422289668823, 6789422308914683, 6789422308913883, 6789422308915403
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_ME1_BUSY_FOR_PACKET_DECODE", 16331, 6789422289668823, 6789422308914683, 6789422308913883, 6789422308915403
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_ME1_DC0_SPI_BUSY", 455, 6789492192490428, 6789492210892375, 6789492210897243, 6789492210898883
+   "torch.ones_like", "``1@__init__.py:231``", "``void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>)``", "CPC_UTCL1_STALL_ON_TRANSLATION", 374, 6789492192490428, 6789492210892375, 6789492210897243, 6789492210898883
+
+``pmc_perf.csv``
+   Standard performance counter data, the same as produced when profiling without ``--torch-trace``.
+
+This data enables analysis such as:
+
+* Identifying which PyTorch operators executed which GPU kernels
+* Aggregating performance counter values by operator
+* Correlating operator-level timing with kernel-level hardware metrics
+* Tracing the execution flow from high-level PyTorch API to low-level GPU kernels
+
+Limitations
+-----------
+
+.. note::
+
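+   * Automatic marker injection by ``--torch-trace`` applies only when the command
+     is a Python command or Python script. Any other command is launched as-is with
+     a warning, and the option has no effect unless the workload already emits ROCTX
+     markers itself.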
+
+   * A valid PyTorch installation must be available in the environment where profiling
+     is executed.
+
+   * This feature adds instrumentation overhead to track operator boundaries. For
+     performance-critical measurements, consider profiling without this option first.
+
+Combined with Other Options
+----------------------------
+
+Torch operator mapping can be combined with other profiling options:
+
+.. code-block:: shell-session
+
+   # Combine with block filtering for targeted counter collection
+   $ rocprof-compute profile --name mnist --torch-trace -b 11 12 -- python train.py
+
+   # Combine with iteration multiplexing
+   $ rocprof-compute profile --name mnist --torch-trace --iteration-multiplexing kernel -- python train.py
+
+   # Combine with kernel filtering (filters by GPU kernel name)
+   $ rocprof-compute profile --name mnist --torch-trace -k elementwise -- python train.py

 .. _iteration-multiplexing:

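One of the analyses listed in the new documentation, aggregating performance counter values by operator, can be sketched as follows. This is a minimal illustration, assuming the per-operator CSV files carry the `Operator_Name`, `Counter_Name`, and `Counter_Value` columns described above. The `aggregate_counters` helper and the sample rows are hypothetical and are not part of rocprof-compute.

```python
import csv
import io
from collections import defaultdict
from typing import Iterable, TextIO


def aggregate_counters(csv_streams: Iterable[TextIO]) -> dict[tuple[str, str], int]:
    """Sum Counter_Value for each (Operator_Name, Counter_Name) pair."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for stream in csv_streams:
        for row in csv.DictReader(stream):
            key = (row["Operator_Name"], row["Counter_Name"])
            # Accept both "23004" and "23004.0" style values.
            totals[key] += int(float(row["Counter_Value"]))
    return dict(totals)


# Hypothetical rows shaped like a torch_trace/<operator>.csv file.
SAMPLE = """Operator_Name,Context_Id,Counter_Name,Counter_Value
torch.ones_like,1@__init__.py:231,CPC_CPC_STAT_BUSY,23004
torch.ones_like,1@__init__.py:231,CPC_CPC_STAT_BUSY,100
torch.ones_like,1@__init__.py:231,CPC_CPC_TCIU_BUSY,534
"""

totals = aggregate_counters([io.StringIO(SAMPLE)])
for (operator, counter), value in sorted(totals.items()):
    print(f"{operator:20s} {counter:24s} {value}")
```

For real output, the same function could be fed open file handles for each CSV file found in the workload's `torch_trace/` directory, assuming the layout matches the documentation in this change.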
@@ -687,7 +853,7 @@ To enable iteration multiplexing in ROCm Compute Profiler, use the
 ``--iteration-multiplexing`` option in your profiling command. You can optionally specify
 the policy for multiplexing. The available policies are:

-* ``kernel`` 
+* ``kernel``
   The counters are divided based on the kernels being executed. Each kernel call
   for a particular kernel collects a different subset of counters.
 * ``kernel_launch_params``
@@ -707,10 +873,10 @@ By default, if no policy is specified, ROCm Compute Profiler uses the ``kernel_l
     Iteration multiplexing is only supported when using ROCm Compute Profiler with
     the native counter collection tool. Ensure that ``--attach-pid`` is not used in your profiling command.

-   * Ensure that your workload runs for enough iterations to cover all counter subsets. 
-     When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy)  
-     or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy), 
-     specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations 
+   * Ensure that your workload runs for enough iterations to cover all counter subsets.
+     When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy)
+     or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy),
+     specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations
     is too low, some counters may not be collected.

   * Launch paramaters for ``kernel_launch_params`` policy.
@@ -736,11 +902,11 @@ The following example demonstrates how to use iteration multiplexing with the
   [INFO] Kernel Selection: None
   [INFO] Dispatch Selection: None
   [INFO] Filtered sections: All
-   [INFO] 
+   [INFO]
   [INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   [INFO] Collecting Performance Counters
   [INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-   [INFO] 
+   [INFO]
   [INFO] Using native counter collection tool: /tmp/rocprofiler-compute-tool-hlz4fagh/librocprofiler-compute-tool.so
   [INFO] Iteration multiplexing: kernel
   [INFO] [profiling] Current input files: /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_DCACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_ICACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_IFETCH_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_LDS.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_SMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_VMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_LEVEL_WAVES.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_0.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_1.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_10.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_11.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_12.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_2.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_3.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_4.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_5.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_6.txt, 
/home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_7.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_8.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_9.txt
@@ -111,4 +111,5 @@ markers = [
    "iteration_multiplexing_2",
    "iteration_multiplexing_stochastic",
    "noise_clamp",
+    "torch_ops",
 ]
@@ -239,6 +239,17 @@ Examples:
        help=argparse.SUPPRESS,
        # help="\t\t\tKokkos trace, traces Kokkos API calls.",
    )
+    profile_group.add_argument(
+        "--torch-trace",
+        dest="torch_trace",
+        required=False,
+        default=False,
+        action="store_true",
+        help=(
+            "\t\t\tTorch Trace, maps PyTorch operators to performance counters.\n"
+            "\t\t\tShould be used only when profiling PyTorch applications."
+        ),
+    )
    profile_group.add_argument(
        "-k",
        "--kernel",
@@ -109,16 +109,62 @@ class RocProfCompute_Base:
                "--attach-pid cannot be used with --iteration-multiplexing. "
                "Please remove one of these options."
            )
-
        # verify correct formatting for application binary
        args.remaining = args.remaining[1:]
+        resolved_exec_path: Optional[Path] = None
+
        if args.remaining:
            # Ensure that command points to an executable
-            if not shutil.which(args.remaining[0]):
+            exec_candidate = shutil.which(args.remaining[0])
+            if not exec_candidate:
                console_error(
                    f"Your command {args.remaining[0]} doesn't point to a executable. "
                    "Please verify."
                )
+            resolved_exec_path = Path(exec_candidate).resolve()
+
+            # Wrap the command with inject_roctx.py so that ROCTX markers are emitted
+            if getattr(args, "torch_trace", False):
+                # Find the inject_roctx.py script in src/utils
+                inject_script = (
+                    Path(__file__).parent.parent / "utils" / "inject_roctx.py"
+                )
+                if not inject_script.exists():
+                    console_error(
+                        f"Cannot find inject_roctx.py at {inject_script}. "
+                        "Please verify your installation."
+                    )
+
+                # Case 1: Explicit python command (python, python3, etc.)
+                if args.remaining[0].startswith("python"):
+                    # Insert inject_roctx.py after the python interpreter
+                    args.remaining.insert(1, str(inject_script))
+                # Case 2: Direct Python script execution (./main.py, /path/to/script.py)
+                elif args.remaining[0].endswith((".py", ".pyw", ".pyc", ".pyo")):
+                    # Use current Python interpreter
+                    args.remaining.insert(0, str(inject_script))
+                    args.remaining.insert(0, sys.executable)
+                else:
+                    console_warning(
+                        "Command does not look like a Python entry point, "
+                        "skipping ROCTX auto-injection and launching workload as-is."
+                    )
+                    console_warning(
+                        "Ensure the binary already initializes PyTorch/ROCTX markers, "
+                        "otherwise --torch-trace will have no effect."
+                    )
+
+                if (
+                    resolved_exec_path
+                    and (resolved_exec_path.parent / "_internal").is_dir()
+                ):
+                    console_warning(
+                        "Workload appears to be a self-contained binary. "
+                        "Such bundles typically ship private ROCm/HSA libraries, which "
+                        "prevents --torch-trace from collecting data. "
+                        "Rebuild without packaging libhsa/libhip (or "
+                        "adjust LD_LIBRARY_PATH to /opt/rocm) before profiling."
+                    )
            args.remaining = " ".join(args.remaining)
        elif not args.attach_pid:
            console_error(
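The command rewriting in the hunk above can be read as a small pure function, which may make it easier to unit test. The following is a minimal sketch, not the code in this change: `wrap_with_injector` and `PYTHON_SUFFIXES` are hypothetical names. It also makes one deliberate variation, labeled in the comments: the interpreter check compares the basename of the first token, so an absolute path such as `/usr/bin/python3` is treated as a Python command, whereas the hunk tests the raw token.

```python
import sys
from pathlib import Path

# Script suffixes treated as direct Python script execution (same tuple as the hunk).
PYTHON_SUFFIXES = (".py", ".pyw", ".pyc", ".pyo")


def wrap_with_injector(
    cmd: list[str],
    inject_script: str,
    interpreter: str = sys.executable,
) -> tuple[list[str], bool]:
    """Return (new_command, injected) for a tokenized workload command."""
    if not cmd:
        return [], False
    head = cmd[0]
    # Case 1: explicit interpreter (python, python3, ...).
    # Variation from the diff: the basename is checked, not the raw token.
    if Path(head).name.startswith("python"):
        return [head, inject_script, *cmd[1:]], True
    # Case 2: a script executed directly; run it through the given interpreter.
    if head.endswith(PYTHON_SUFFIXES):
        return [interpreter, inject_script, *cmd], True
    # Anything else is launched as-is, mirroring the warning branch.
    return list(cmd), False


new_cmd, injected = wrap_with_injector(["python", "train.py"], "inject_roctx.py")
print(injected, " ".join(new_cmd))
```

Returning the `injected` flag separately lets the caller decide whether to emit the "does not look like a Python entry point" warnings, keeping logging out of the helper.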
@@ -471,6 +517,8 @@ class RocProfCompute_Base:
                        f'passes. Please use "--block" or "--set" '
                        f"to adjust or reduce the requested performance metrics!"
                    )
+            console_debug(f"Sending profiler options to run_prof: {options}")
+
            run_prof(
                fnames=str_fnames,
                profiler_options=options,
@@ -478,6 +526,7 @@ class RocProfCompute_Base:
                mspec=self._soc._mspec,
                loglevel=args.loglevel,
                format_rocprof_output=args.format_rocprof_output,
+                torch_trace_enabled=getattr(args, "torch_trace", False),
                retain_rocpd_output=args.retain_rocpd_output,
            )
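The commit message mentions using `getattr` for `args.torch_trace` "to handle mock args". A small sketch of why that pattern helps is shown below. It assumes the argument object may be a plain namespace built by a test without the new attribute; `build_parser`, `torch_trace_enabled`, and `legacy_args` are hypothetical names used only for illustration.

```python
import argparse
from types import SimpleNamespace


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    # Same shape as the flag added in this change: a boolean switch, off by default.
    parser.add_argument(
        "--torch-trace", dest="torch_trace", default=False, action="store_true"
    )
    return parser


def torch_trace_enabled(args) -> bool:
    # getattr with a default tolerates argument objects built without the attribute.
    return bool(getattr(args, "torch_trace", False))


parsed = build_parser().parse_args(["--torch-trace"])
# Hypothetical test double that predates the new flag.
legacy_args = SimpleNamespace(path="/tmp/workload")

print(torch_trace_enabled(parsed))
print(torch_trace_enabled(legacy_args))
```

Direct attribute access (`legacy_args.torch_trace`) would raise `AttributeError` for the second object, which is the situation the `getattr` form avoids.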

@@ -30,6 +30,7 @@ from pathlib import Path
 from rocprof_compute_profile.profiler_base import RocProfCompute_Base
 from rocprof_compute_soc.soc_base import OmniSoC_Base
 from utils.logger import console_error, console_log, demarcate
+from utils.utils import consolidate_torch_trace_output


 class rocprof_v3_profiler(RocProfCompute_Base):
@@ -49,7 +50,6 @@ class rocprof_v3_profiler(RocProfCompute_Base):
    def get_profiler_options(self) -> list[str]:
        args = self.get_args()
        app_cmd = shlex.split(args.remaining)
-
        if args.kokkos_trace:
            trace_option = "--kokkos-trace"
            # NOTE: --kokkos-trace feature is incomplete and is disabled for now.
@@ -60,9 +60,10 @@ class rocprof_v3_profiler(RocProfCompute_Base):
            )
        elif args.hip_trace:
            trace_option = "--hip-trace"
+        elif getattr(args, "torch_trace", False):
+            trace_option = "--marker-trace"
        else:
            trace_option = "--kernel-trace"
-
        profiling_options = [
            # v3 requires output directory argument
            "-d",
@@ -134,6 +135,10 @@ class rocprof_v3_profiler(RocProfCompute_Base):
        if self.ready_to_profile:
            # Manually join each pmc_perf*.csv output
            self.join_prof()
+            # Consolidate torch trace output if --torch-trace was used
+            if getattr(self.get_args(), "torch_trace", False):
+                consolidate_torch_trace_output(self.get_args().path)
+
            # Run roofline microbenchmark
            super().post_processing()
        else:
@@ -31,6 +31,7 @@ from typing import Optional, Union
 from rocprof_compute_profile.profiler_base import RocProfCompute_Base
 from rocprof_compute_soc.soc_base import OmniSoC_Base
 from utils.logger import console_error, console_log, demarcate
+from utils.utils import consolidate_torch_trace_output


 class rocprofiler_sdk_profiler(RocProfCompute_Base):
@@ -71,6 +72,8 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base):
            "ROCPROF_OUTPUT_PATH": f"{args.path}/out/pmc_1",
        })

+        if getattr(args, "torch_trace", False):
+            options["ROCPROF_MARKER_API_TRACE"] = "1"
        # Create folder pointed by ROCPROF_OUTPUT_PATH
        Path(options["ROCPROF_OUTPUT_PATH"]).mkdir(parents=True, exist_ok=True)

@@ -161,6 +164,9 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base):
        if self.ready_to_profile:
            # Manually join each pmc_perf*.csv output
            self.join_prof()
+            if getattr(self.get_args(), "torch_trace", False):
+                consolidate_torch_trace_output(self.get_args().path)
+
            # Run roofline microbenchmark
            super().post_processing()
        else:
@@ -0,0 +1,292 @@
+# ruff: noqa
+##############################################################################
+# MIT License
+#
+# Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+
+##############################################################################
+
+
+"""
+ROCTX Injection Wrapper - Auto-discovers and intercepts ALL PyTorch operators
+Usage: python inject_roctx.py main.py --epochs 1 --batch-size 4
+"""
+
+import os
+import sys
+from pathlib import Path
+
+# Add parent directory to Python path so the utils package can be imported
+script_dir = Path(__file__).resolve().parent
+sys.path.insert(0, str(script_dir.parent))
+
+from utils.logger import console_log, console_warning
+
+rocm_root = os.environ.get("ROCM_PATH", "/opt/rocm")
+python_version = f"python{sys.version_info.major}.{sys.version_info.minor}"
+candidate_paths = [
+    f"{rocm_root}/lib/{python_version}/site-packages",
+    f"{rocm_root}/libexec/rocprofiler-sdk/python",
+]
+
+for candidate in candidate_paths:
+    if candidate not in sys.path:
+        sys.path.insert(0, candidate)
+
+try:
+    import torch
+
+    console_log(f"PyTorch version: {torch.__version__}")
+except ImportError:
+    console_warning(
+        "PyTorch is not installed or not properly configured.\n"
+        "The --torch-trace option requires a valid PyTorch installation.\n"
+        "Please install PyTorch and try again."
+    )
+    sys.exit(0)
+
+import importlib.util
+import inspect
+from functools import wraps
+
+import torch.nn.functional as F
+from roctx import rangePop, rangePush
+
+
+def roctx_wrapper(func, name=None):
+    func_name = name or func.__name__
+    call_counter = {"count": 0}
+
+    @wraps(func)
+    def wrapper(*args, **kwargs):
+        call_counter["count"] += 1
+        current_frame = inspect.currentframe()
+        caller_frame = current_frame.f_back if current_frame is not None else None
+        if caller_frame is not None:
+            filename = caller_frame.f_code.co_filename
+            location = f"{Path(filename).name}:{caller_frame.f_lineno}"
+        else:
+            location = "unknown:0"
+
+        # Unique marker: function + call_number + source_location
+        rangePush(f"{func_name}:#{call_counter['count']}@{location}")
+        try:
+            result = func(*args, **kwargs)
+        finally:
+            rangePop()
+        return result
+
+    return wrapper
+
+
+def auto_discover_torch_callables(module, prefix, exclude_patterns=None):
+    """Automatically discover all callable functions in a module."""
+    if exclude_patterns is None:
+        exclude_patterns = ["__", "_", "is_", "set_", "get_"]
+
+    functions = {}
+    for name in dir(module):
+        # Skip private/internal functions
+        if any(name.startswith(pat) for pat in exclude_patterns):
+            continue
+
+        try:
+            attr = getattr(module, name)
+            # Only wrap callables (functions, not classes or constants)
+            if callable(attr) and not isinstance(attr, type):
+                full_name = f"{prefix}.{name}"
+                functions[full_name] = (module, name, attr)
+        except Exception as e:
+            console_warning(type(e))
+            console_warning(f"Could not access {prefix}.{name}: {e}")
+
+    return functions
+
+
+def inject_roctx_into_torch():
+    """Monkey-patch PyTorch operations to add ROCTX markers."""
+
+    console_log("Auto-discovering PyTorch operations to wrap...")
+
+    # Auto-discover functions from key modules
+    all_operations = {}
+
+    # torch.* functions (matmul, mm, cat, etc.)
+    all_operations.update(auto_discover_torch_callables(torch, "torch"))
+
+    # torch.nn.functional.* functions (linear, relu, softmax, etc.)
+    all_operations.update(auto_discover_torch_callables(F, "torch.nn.functional"))
+
+    # torch.linalg.* functions (matrix operations)
+    try:
+        all_operations.update(
+            auto_discover_torch_callables(torch.linalg, "torch.linalg")
+        )
+    except Exception as e:
+        console_warning(type(e))
+        console_warning(f"Could not access torch.linalg: {e}")
+
+    # torch.fft.* functions (FFT operations)
+    try:
+        all_operations.update(auto_discover_torch_callables(torch.fft, "torch.fft"))
+    except Exception as e:
+        console_warning(type(e))
+        console_warning(f"Could not access torch.fft: {e}")
+    console_log(f"Found {len(all_operations)} operations to wrap")
+    console_log("Injecting ROCTX markers into PyTorch operations...")
+
+    wrapped_count = 0
+    failed_count = 0
+
+    for full_name, (module, attr_name, original_func) in all_operations.items():
+        try:
+            # Replace with wrapped version
+            wrapped_func = roctx_wrapper(original_func, full_name)
+            setattr(module, attr_name, wrapped_func)
+            wrapped_count += 1
+
+            # Print first 20 and last 5 for visibility
+            if wrapped_count <= 20 or wrapped_count > len(all_operations) - 5:
+                console_log(f"Wrapped: {full_name}")
+            elif wrapped_count == 21:
+                console_log(
+                    f"  ... (wrapping {len(all_operations) - 25} more operations)"
+                )
+
+        except Exception as e:
+            failed_count += 1
+            if failed_count <= 5:  # Only show first few failures
+                console_warning(f"Failed to wrap {full_name}: {e}")
+
+    # Wrap tensor methods
+    original_backward = torch.Tensor.backward
+    backward_counter = {"count": 0}
+
+    def backward_with_roctx(self, *args, **kwargs):
+        backward_counter["count"] += 1
+        current_frame = inspect.currentframe()
+        caller_frame = current_frame.f_back if current_frame is not None else None
+        if caller_frame is not None:
+            filename = caller_frame.f_code.co_filename
+            location = f"{Path(filename).name}:{caller_frame.f_lineno}"
+        else:
+            location = "unknown:0"
+
+        rangePush(f"torch.Tensor.backward:#{backward_counter['count']}@{location}")
+        try:
+            return original_backward(self, *args, **kwargs)
+        finally:
+            rangePop()
+
+    torch.Tensor.backward = backward_with_roctx
+
+    wrapped_count += 1
+    console_log("Wrapped: torch.Tensor.backward")
+
+    console_log(f"Wrapped {wrapped_count} operations with ROCTX markers")
+    if failed_count > 0:
+        console_warning(
+            f"Failed to wrap {failed_count} operations (likely not patchable)"
+        )
+
+
+def inject_roctx_into_optimizer():
+    """Wrap optimizer step() method."""
+    from torch.optim import Optimizer
+
+    original_step = Optimizer.step
+
+    def step_with_roctx(self, *args, **kwargs):
+        rangePush(f"optimizer.{self.__class__.__name__}.step")
+        try:
+            return original_step(self, *args, **kwargs)
+        finally:
+            rangePop()
+
+    Optimizer.step = step_with_roctx
+    console_log("Wrapped optimizer.step() with ROCTX markers\n")
+
+
+def inject_roctx_into_model():
+    """Wrap nn.Module.__call__ (the forward() entry point) with a call counter."""
+
+    from torch import nn
+    from typing import Any
+
+    original_call = nn.Module.__call__
+
+    # Per-instance call counters
+    def call_with_roctx(self, *args, **kwargs):
+        class_name = self.__class__.__name__
+
+        # Initialize counter for this instance if not exists
+        if not hasattr(self, "_roctx_call_count"):
+            self._roctx_call_count = 0
+        self._roctx_call_count += 1
+
+        # Get caller location
+        current_frame = inspect.currentframe()
+        caller_frame = current_frame.f_back if current_frame is not None else None
+        if caller_frame is not None:
+            filename = caller_frame.f_code.co_filename
+            location = f"{Path(filename).name}:{caller_frame.f_lineno}"
+        else:
+            location = "unknown:0"
+
+        # Create detailed marker
+        rangePush(
+            f"nn.Module.{class_name}.forward:#{self._roctx_call_count}@{location}"
+        )
+        try:
+            return original_call(self, *args, **kwargs)
+        finally:
+            rangePop()
+
+    nn.Module.__call__ = call_with_roctx
+    console_log("Wrapped nn.Module forward() with ROCTX markers\n")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        console_log("Usage: python inject_roctx.py <script.py> [script_args...]")
+        sys.exit(1)
+
+    # Get target script and its arguments
+    target_script = sys.argv[1]
+    script_args = sys.argv[2:]
+
+    # Inject ROCTX markers BEFORE importing the target script
+    inject_roctx_into_torch()
+    inject_roctx_into_optimizer()
+    inject_roctx_into_model()
+
+    console_log("=" * 70)
+    console_log("Starting target script with ROCTX instrumentation...")
+    console_log("=" * 70)
+
+    # Modify sys.argv so the target script sees correct arguments
+    sys.argv = [target_script] + script_args
+
+    # Load and execute the target script
+    spec = importlib.util.spec_from_file_location("__main__", target_script)
+    module = importlib.util.module_from_spec(spec)
+    sys.modules["__main__"] = module
+    spec.loader.exec_module(module)
@@ -25,7 +25,7 @@

 import csv
 import sqlite3
-from contextlib import closing
+from contextlib import ExitStack, closing
 from typing import Any

 import pandas as pd
@@ -37,6 +37,8 @@ from utils.logger import console_error
 COUNTERS_COLLECTION_QUERY = """
 SELECT
    agent_id as GPU_ID,
+    guid as GUID,
+    correlation_id as Correlation_Id,
    dispatch_id as Dispatch_ID,
    pid as PID,
    grid_size as Grid_Size,
@@ -54,6 +56,24 @@ SELECT
    value as Counter_Value
 FROM counters_collection
 """
+MARKER_API_TRACE_QUERY = """
+SELECT
+    category AS Domain,
+    json_extract(extdata, '$.message') AS Function,
+    pid AS Process_Id,
+    tid AS Thread_Id,
+    corr_id AS Correlation_Id,
+    guid AS GUID,
+    start AS Start_Timestamp,
+    end AS End_Timestamp
+FROM regions
+ORDER BY start
+"""
+KERNEL_DISPATCH_QUERY = """
+SELECT dispatch_id, event_id, guid
+FROM rocpd_kernel_dispatch
+WHERE guid = ?
+"""
 ROCPD_PMC_EVENT_TABLE_NAME_PREFIX = "rocpd_pmc_event_"
 TABLE_NAME_PREFIX_QUERY = (
    "SELECT name FROM sqlite_master WHERE type='table' "
@@ -64,30 +84,43 @@ INSERT_QUERY = "INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"

 def convert_dbs_to_csv(
    db_paths: list[str],
-    csv_file_path: str,
+    counter_collection_csv_path: str,
+    marker_trace_csv_path: str,
 ) -> None:
-    """
-    Read rocpd databases and write to CSV file
-    """
-    # Read counters_collection view from the databases and write to CSV
-    try:
-        with open(csv_file_path, "w", newline="") as csvfile:
-            writer = csv.writer(csvfile)
-            header_written = False
-            for db_path in db_paths:
-                with closing(sqlite3.connect(db_path)) as conn:
-                    with closing(conn.execute(COUNTERS_COLLECTION_QUERY)) as cursor:
-                        if not header_written:
-                            writer.writerow([
-                                description[0] for description in cursor.description
-                            ])
-                            header_written = True
-                        for row in cursor:
-                            writer.writerow(row)
-    except OSError as e:
-        console_error(f"Database error while converting to CSV: {e}")
-    except Exception as e:
-        console_error(f"Unexpected error converting database to CSV: {e}")
+    queries = {
+        counter_collection_csv_path: COUNTERS_COLLECTION_QUERY,
+        marker_trace_csv_path: MARKER_API_TRACE_QUERY,
+    }
+    header_written = {path: False for path in queries}
+
+    with ExitStack() as stack:
+        writers = {
+            path: csv.writer(stack.enter_context(open(path, "w", newline="")))
+            for path in queries
+        }
+        for db_path in db_paths:
+            with closing(sqlite3.connect(db_path)) as conn:
+                for file_path, query in queries.items():
+                    try:
+                        with closing(conn.execute(query)) as cursor:
+                            if cursor.description is None:
+                                continue
+                            if not header_written[file_path]:
+                                writers[file_path].writerow([
+                                    desc[0] for desc in cursor.description
+                                ])
+                                header_written[file_path] = True
+                            writers[file_path].writerows(cursor)
+                    except sqlite3.Error as e:
+                        console_error(
+                            f"Database error while extracting {file_path} "
+                            f"from {db_path}: {e}"
+                        )
+                    except Exception as e:
+                        console_error(
+                            f"Unexpected error while extracting {file_path} "
+                            f"from {db_path}: {e}"
+                        )


 def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame:
@@ -134,7 +167,7 @@ def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame:


 def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> None:
-    """Update pmc_event table in the given rocpd database path"""
+    """Updates pmc_event table in the given rocpd database path."""
    try:
        with closing(sqlite3.connect(rocpd_db_path)) as conn:
            # Get pmc_event table name
@@ -154,13 +187,27 @@ def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> N
            guid = table_name[len(ROCPD_PMC_EVENT_TABLE_NAME_PREFIX) :].replace(
                "_", "-"
            )
+            # Map dispatch_id to event_id from rocpd_kernel_dispatch
+            # Native counter collection CSV has dispatch_id, but schema needs event_id
+            # event_id may differ from dispatch_id when marker API tracing is enabled
+            with closing(conn.execute(KERNEL_DISPATCH_QUERY, (guid,))) as cursor:
+                rows = cursor.fetchall()
+            if not rows:
+                console_error("No kernel dispatch data found.")
+                return
+            dispatch_to_event = {
+                dispatch_id: event_id for dispatch_id, event_id, _ in rows
+            }
+            counter_info["event_id"] = counter_info["dispatch_id"].map(
+                dispatch_to_event
+            )
            columns = ("guid", "event_id", "pmc_id", "value")
            values = list(
                zip(
                    # guid
                    [guid] * len(counter_info),
                    # event_id
-                    counter_info["dispatch_id"],
+                    counter_info["event_id"],
                    # pmc_id
                    counter_info["counter_id"],
                    # value
@@ -786,6 +786,7 @@ def run_prof(
    mspec: Any,  # noqa: ANN401
    loglevel: int,
    format_rocprof_output: str,
+    torch_trace_enabled: bool = False,
    retain_rocpd_output: bool = False,
 ) -> None:
    multiple_files = isinstance(fnames, list)
@@ -939,9 +940,12 @@ def run_prof(
        # Write results_fbase.csv
        rocpd_data.convert_dbs_to_csv(
            glob.glob(workload_dir + "/out/pmc_1/*/*.db"),
-            workload_dir + f"/results_{fbase}.csv",
+            workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv",
+            workload_dir + f"/out/pmc_1/{fbase}_marker_api_trace.csv",
+        )
+        combined_df = pd.read_csv(
+            workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv"
        )
-        combined_df = pd.read_csv(workload_dir + f"/results_{fbase}.csv")
        # Reset Dispatch_ID based on PID, Kernel_Name, Grid_Size,
        # Workgroup_Size, LDS_Per_Workgroup, Start_Timestamp, End_Timestamp
        combined_df["Dispatch_ID"] = combined_df.groupby(
@@ -964,8 +968,12 @@ def run_prof(
        ).ngroup()
        # Drop PID since its not required
        combined_df = combined_df.drop(columns=["PID"])
+        combined_df.to_csv(
+            workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv", index=False
+        )
        combined_df.to_csv(workload_dir + f"/results_{fbase}.csv", index=False)
-
+        if torch_trace_enabled:
+            process_torch_trace_output(workload_dir, fbase, format_rocprof_output)
        if retain_rocpd_output:
            for db_path in glob.glob(workload_dir + "/out/pmc_1/*/*.db"):
                pid = Path(db_path).stem.split("_")[0]
@@ -1004,7 +1012,9 @@ def run_prof(
                process_kokkos_trace_output(workload_dir, fbase)
            elif "--hip-trace" in options:
                process_hip_trace_output(workload_dir, fbase)
-
+        # Add torch operator trace processing
+        if torch_trace_enabled:
+            process_torch_trace_output(workload_dir, fbase, format_rocprof_output)
        # Combine results into single CSV file
        if results_files:
            combined_results = pd.concat(
@@ -1175,7 +1185,7 @@ def convert_native_counter_collection_csv(workload_dir: str) -> None:
        )

        rocprofv3_counter_data = pd.DataFrame({
-            "Correlation_Id": merged_data["dispatch_id"],
+            "Correlation_Id": merged_data["Correlation_Id"],
            "Dispatch_Id": merged_data["dispatch_id"],
            "Agent_Id": merged_data["Agent_Id"],
            "Queue_Id": merged_data["Queue_Id"],
@@ -1262,6 +1272,178 @@ def process_rocprofv3_output(workload_dir: str, using_native_tool: bool) -> list
    return results_files_csv


+@demarcate
+def process_torch_trace_output(
+    workload_dir: str,
+    fbase: str,
+    output_format: str = "rocpd",
+) -> None:
+    """
+    Creates PyTorch operator trace from counter_collection and marker_api_trace data.
+        - Performs inner join on Correlation_Id, filtering out unmatched entries
+        - Output file is saved to workload root, not the temporary out/ directory
+    """
+    marker_trace_csv_file_path = f"{workload_dir}/out/pmc_1/"
+    # Find all marker_api_trace CSV files
+    marker_api_trace_csvs = list(
+        Path(marker_trace_csv_file_path).glob("**/*_marker_api_trace.csv")
+    )
+    counter_collection_csvs = [
+        markers_file.parent
+        / markers_file.name.replace("_marker_api_trace.", "_counter_collection.")
+        for markers_file in marker_api_trace_csvs
+    ]
+    existing_csv_files = [
+        [marker_api_trace_csvs[i], counter_collection_csvs[i]]
+        for i in range(len(marker_api_trace_csvs))
+        if counter_collection_csvs[i].is_file() and marker_api_trace_csvs[i].is_file()
+    ]
+    if not existing_csv_files:
+        console_warning(
+            f"No marker files with corresponding counter files found for {fbase}"
+        )
+        return
+
+    # Join marker and counter data
+    def _merge_pair(
+        marker_path: Path,
+        counter_path: Path,
+        join_keys: tuple[str, ...] = ("Correlation_Id",),
+    ) -> pd.DataFrame:
+        marker_df = pd.read_csv(marker_path)
+        counter_df = pd.read_csv(counter_path)
+        return pd.merge(
+            marker_df,
+            counter_df,
+            on=join_keys,
+            how="inner",
+            suffixes=("_function", "_kernel"),
+        )
+
+    if output_format == "csv":
+        merged_results = pd.concat(
+            [_merge_pair(f[0], f[1]) for f in existing_csv_files],
+            ignore_index=True,
+        )
+    elif output_format == "rocpd":
+        # There will be one pair of CSV files extracted from the rocpd db and consolidated.
+        merged_results = _merge_pair(
+            existing_csv_files[0][0],
+            existing_csv_files[0][1],
+            ("Correlation_Id", "GUID"),
+        )
+    # Save merged results
+    merged_results.to_csv(
+        f"{workload_dir}/{fbase}_torch_trace.csv",
+        index=False,
+    )
+    console_log(f"Created {workload_dir}/{fbase}_torch_trace.csv")
+
+
+@demarcate
+def consolidate_torch_trace_output(workload_dir: str) -> None:
+    # Consolidate torch operator trace CSV files from multiple processes
+    console_log("Consolidating torch operator trace output...")
+    # Find all torch trace CSV files in workload directory
+    torch_trace_files = glob.glob(f"{workload_dir}/*_torch_trace.csv")
+    if not torch_trace_files:
+        console_warning("No torch trace files found.")
+        return
+    # Read and concatenate all torch trace files
+    all_traces = []
+    required_columns = [
+        "Function",
+        "Kernel_Name",
+        "Counter_Name",
+        "Counter_Value",
+        "Start_Timestamp_function",
+        "End_Timestamp_function",
+        "Start_Timestamp_kernel",
+        "End_Timestamp_kernel",
+    ]
+    for trace_file in torch_trace_files:
+        try:
+            df = pd.read_csv(trace_file)
+        except pd.errors.ParserError as e:
+            console_warning(f"Parser error while reading {trace_file}: {e}")
+            continue
+        except OSError as e:
+            console_warning(f"I/O error while reading {trace_file}: {e}")
+            continue
+        except Exception as e:
+            # Unexpected error; log full details for debugging
+            console_warning(
+                f"Unexpected error while reading {trace_file}: {e}\n"
+                f"{traceback.format_exc()}"
+            )
+            continue
+
+        missing_columns = [col for col in required_columns if col not in df.columns]
+        if missing_columns:
+            console_warning(
+                f"Skipping {trace_file}: missing required columns {missing_columns}"
+            )
+            continue
+
+        all_traces.append(df[required_columns])
+    if not all_traces:
+        console_warning("No valid torch trace data to consolidate.")
+        return
+
+    consolidated_df = pd.concat(all_traces, ignore_index=True)
+    if consolidated_df.isnull().values.any():
+        console_warning("Consolidated torch trace contains missing values")
+        return
+    consolidated_df = consolidated_df.sort_values(by=["Function", "Counter_Name"])
+
+    split_columns = consolidated_df["Function"].str.split(":#", expand=True)
+    consolidated_df["Operator_Name"] = (
+        split_columns[0] if len(split_columns.columns) > 0 else None
+    )
+    consolidated_df["Context_Id"] = (
+        split_columns[1] if len(split_columns.columns) > 1 else None
+    )
+    consolidated_df.drop(columns=["Function"], inplace=True)
+    consolidated_df = consolidated_df[
+        [
+            "Operator_Name",
+            "Context_Id",
+            "Kernel_Name",
+            "Counter_Name",
+            "Counter_Value",
+            "Start_Timestamp_function",
+            "End_Timestamp_function",
+            "Start_Timestamp_kernel",
+            "End_Timestamp_kernel",
+        ]
+    ]
+
+    if consolidated_df.isnull().values.any():
+        console_error(
+            "Missing values in consolidated torch trace after splitting "
+            "the Function name."
+        )
+        return
+
+    grouped = consolidated_df.groupby("Operator_Name")
+    for operator_name, group in grouped:
+        sanitized_operator_name = operator_name.replace("torch.", "").replace(".", "_")
+        # Ensure output directory exists
+        Path(f"{workload_dir}/torch_trace").mkdir(parents=True, exist_ok=True)
+        output_file = f"{workload_dir}/torch_trace/{sanitized_operator_name}.csv"
+        group.to_csv(output_file, index=False)
+        console_log(
+            f"Saved consolidated trace for {sanitized_operator_name} to {output_file}"
+        )
+
+    for trace_file in torch_trace_files:
+        try:
+            Path(trace_file).unlink()
+            console_debug(f"Removed temporary torch trace file: {trace_file}")
+        except OSError as e:
+            console_warning(f"Error removing temporary file {trace_file}: {e}")
+
+
@demarcate
 def process_kokkos_trace_output(workload_dir: str, fbase: str) -> None:
    # marker api trace csv files are generated for each process
@@ -23,6 +23,7 @@

 ##############################################################################

+import importlib.util
 import inspect
 import os
 import re
@@ -2779,3 +2780,215 @@ def test_iteration_multiplexing_all_counter_accuracy(
    assert are_stochastic_counters_similar(
        [counters_kernel, counters_kernel_launch_params], counters_no_multiplexing
    )
+
+
+skip_if_no_torch = pytest.mark.skipif(
+    importlib.util.find_spec("torch") is None, reason="torch is required for this test"
+)
+
+
+@skip_if_no_torch
+def test_torch_trace_profile(binary_handler_profile_rocprof_compute):
+    """
+    Test profiling a PyTorch application with --torch-trace option.
+    Verifies that all required files are generated and counter values are valid.
+    NOTE: Not included in the test suite since this requires PyTorch installation.
+    """
+    workload_dir = test_utils.get_output_dir(param_id="torch_ops")
+    Path(workload_dir).mkdir(parents=True, exist_ok=True)
+    torch_app_path = Path(workload_dir) / "test_torch_app.py"
+
+    torch_app_code = """
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class SimpleNet(nn.Module):
+    def __init__(self):
+        super(SimpleNet, self).__init__()
+        self.fc1 = nn.Linear(10, 20)
+        self.fc2 = nn.Linear(20, 10)
+    def forward(self, x):
+        x = self.fc1(x)
+        x = F.relu(x)
+        x = self.fc2(x)
+        return x
+
+if __name__ == "__main__":
+    if not torch.cuda.is_available():
+        import sys
+        print("GPU is required for this test. Exiting.")
+        sys.exit(1)
+    model = SimpleNet()
+    model = model.cuda()
+    x = torch.randn(5, 10).cuda()
+    # Run a single iteration
+    for epoch in range(1):
+        output = model(x)
+        loss = output.sum()
+        loss.backward()
+        print("Training completed")
+"""
+
+    with open(torch_app_path, "w") as f:
+        f.write(torch_app_code)
+
+    config["torch_test_app"] = ["python3", str(torch_app_path)]
+
+    # Profile with --torch-trace option
+    options = [
+        "--torch-trace",
+    ]
+
+    returncode = binary_handler_profile_rocprof_compute(
+        config,
+        workload_dir,
+        options,
+        check_success=True,
+        app_name="torch_test_app",
+    )
+    assert returncode == 0, "Profiling the torch application failed"
+    # Verify files are generated
+    # 1. Check basic CSV files
+    num_devices = config.get("num_devices", 1)
+    file_dict = test_utils.check_csv_files(workload_dir, num_devices, 1)
+    assert "pmc_perf.csv" in file_dict, "pmc_perf.csv not generated"
+    # 2. Check torch trace directory
+    torch_trace_dir = Path(workload_dir) / "torch_trace"
+    assert torch_trace_dir.exists(), "torch_trace directory not created"
+    assert torch_trace_dir.is_dir(), "torch_trace is not a directory"
+    # 3. Check per-operator CSV files exist
+    operator_csv_files = list(torch_trace_dir.glob("*.csv"))
+    assert len(operator_csv_files) > 0, "No per-operator CSV files generated"
+    # 4. Verify per-operator CSV structure
+    for op_csv in operator_csv_files:
+        op_df = pd.read_csv(op_csv)
+        assert len(op_df) > 0, f"Per-operator CSV {op_csv.name} is empty"
+    test_utils.clean_output_dir(config["cleanup"], workload_dir)
+
+
+@skip_if_no_torch
+def test_torch_trace_overhead(binary_handler_profile_rocprof_compute):
+    """
+    Measure overhead introduced by --torch-trace flag.
+    Compares execution time with and without the flag to ensure overhead is acceptable.
+    NOTE: Not included in the test suite since this requires PyTorch installation.
+    """
+    helper_dir = Path(test_utils.get_output_dir(param_id="torch_helper_script"))
+    helper_dir.mkdir(parents=True, exist_ok=True)
+    torch_app_path = helper_dir / "test_torch_app.py"
+    torch_app_code = """
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class SimpleNet(nn.Module):
+    def __init__(self):
+        super(SimpleNet, self).__init__()
+        self.fc1 = nn.Linear(10, 20)
+        self.fc2 = nn.Linear(20, 10)
+    def forward(self, x):
+        x = self.fc1(x)
+        x = F.relu(x)
+        x = self.fc2(x)
+        return x
+
+if __name__ == "__main__":
+    if not torch.cuda.is_available():
+        import sys
+        print("GPU is required for this test. Exiting.")
+        sys.exit(1)
+    model = SimpleNet()
+    model = model.cuda()
+    x = torch.randn(5, 10).cuda()
+    # Run a single iteration
+    for epoch in range(1):
+        output = model(x)
+        loss = output.sum()
+        loss.backward()
+    print("Training completed")
+"""
+    with open(torch_app_path, "w") as f:
+        f.write(torch_app_code)
+    config["torch_test_app"] = ["python3", str(torch_app_path)]
+    # Run WITHOUT --torch-trace (baseline)
+    workload_dir_baseline = test_utils.get_output_dir(param_id="torch_baseline")
+    start_baseline = time.perf_counter()
+    returncode_baseline = binary_handler_profile_rocprof_compute(
+        config,
+        workload_dir_baseline,
+        [],  # No --torch-trace flag
+        check_success=True,
+        roof=False,
+        app_name="torch_test_app",
+    )
+    baseline_time = time.perf_counter() - start_baseline
+    assert returncode_baseline == 0, "Baseline profiling failed"
+
+    # Read baseline timestamps
+    baseline_df = pd.read_csv(f"{workload_dir_baseline}/pmc_perf.csv")
+    baseline_kernel_duration_total = (
+        baseline_df["End_Timestamp"].max() - baseline_df["Start_Timestamp"].min()
+    )
+    test_utils.clean_output_dir(config["cleanup"], workload_dir_baseline)
+    # Run WITH --torch-trace
+    workload_dir_with_flag = test_utils.get_output_dir(param_id="torch_with_ops")
+    start_with_flag = time.perf_counter()
+    returncode_with_flag = binary_handler_profile_rocprof_compute(
+        config,
+        workload_dir_with_flag,
+        ["--torch-trace"],
+        check_success=True,
+        roof=False,
+        app_name="torch_test_app",
+    )
+    with_flag_time = time.perf_counter() - start_with_flag
+    assert returncode_with_flag == 0, "Profiling with torch-trace failed"
+    # Read with-flag timestamps
+    with_flag_df = pd.read_csv(f"{workload_dir_with_flag}/pmc_perf.csv")
+    with_flag_kernel_duration_total = (
+        with_flag_df["End_Timestamp"].max() - with_flag_df["Start_Timestamp"].min()
+    )
+    longest_running_kernel_baseline = (
+        baseline_df["End_Timestamp"] - baseline_df["Start_Timestamp"]
+    ).max()
+    longest_running_kernel_with_flag = (
+        with_flag_df["End_Timestamp"] - with_flag_df["Start_Timestamp"]
+    ).max()
+    # Calculate overheads
+    longest_running_kernel_overhead = (
+        (longest_running_kernel_with_flag - longest_running_kernel_baseline)
+        / longest_running_kernel_baseline
+    ) * 100
+    wall_clock_overhead = ((with_flag_time - baseline_time) / baseline_time) * 100
+    kernel_overhead = (
+        (with_flag_kernel_duration_total - baseline_kernel_duration_total)
+        / baseline_kernel_duration_total
+    ) * 100
+    print(f"\n{'=' * 70}")
+    print("Performance Overhead Analysis:")
+    print(f"  Longest kernel overhead:      {longest_running_kernel_overhead:.1f}%")
+    print(f"  Baseline wall-clock time:     {baseline_time:.2f}s")
+    print(f"  With --torch-trace time:      {with_flag_time:.2f}s")
+    print(f"  Wall-clock overhead:          {wall_clock_overhead:.1f}%")
+    print(f"  Baseline kernel duration:     {baseline_kernel_duration_total:.0f} ns")
+    print(f"  With flag kernel duration:    {with_flag_kernel_duration_total:.0f} ns")
+    print(f"  Kernel execution overhead:    {kernel_overhead:.1f}%")
+    print(f"{'=' * 70}\n")
+    # Verify torch trace directory was created
+    torch_trace_dir = Path(workload_dir_with_flag) / "torch_trace"
+    assert torch_trace_dir.exists(), "torch_trace directory should be created"
+    operator_csv_files = list(torch_trace_dir.glob("*.csv"))
+    assert len(operator_csv_files) > 0, "Operator CSV files should be generated"
+    test_utils.clean_output_dir(config["cleanup"], workload_dir_with_flag)
+    # Assert overhead is reasonable (< 100% wall-clock, < 50% total and longest kernel)
+    assert wall_clock_overhead < 100, (
+        f"Wall-clock overhead too high: {wall_clock_overhead:.1f}%"
+    )
+    assert kernel_overhead < 50, (
+        f"Kernel execution overhead too high: {kernel_overhead:.1f}%"
+    )
+    assert longest_running_kernel_overhead < 50, (
+        "Longest-running kernel overhead too high: "
+        f"{longest_running_kernel_overhead:.1f}%"
+    )