[rocprofiler-compute] Adding --torch-trace option for SWDEV-559789 (#2089)
* Adding --torch-operator option in rocprof-compute. Creates csv file for each operator that has gpu activity, showing operator to counter values mapping. * --torch-operators flag added to rocprofiler-sdk * Adding ctest for --torch-operators. * Adding pytest markers. * Corrections in ctest and message logging. * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Adding a check for pytorch installation only when --torch-operators is passed. * moving inject_roctx.py into src/utils. * rebase * Updating docs and changelog. * Update projects/rocprofiler-compute/src/argparser.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update projects/rocprofiler-compute/src/utils/inject_roctx.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Removing special characters. * Minor corrections. * Setting default value for torch_operators_enabled. * Updating the number of files according to the number of passes. * Adding rocpd support. * Adding a warning message to be shown when profiling a non-python workload. * copilot suggestions, rocpd+native tool fix * Fixed the incorrect usage of dispatch_id as event_id in the function update_rocpd_pmc_events() * ruff format fix * ruff formating * Deleting torch_trace.csvs after consolidating the operator data. * Removing checks since *torch_trace.csv files are deleted. * Fixing file deletion. 
* Update projects/rocprofiler-compute/src/utils/inject_roctx.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update projects/rocprofiler-compute/src/utils/utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update projects/rocprofiler-compute/tests/test_profile_general.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Using default options in the testcase. * Adding test for overhead measurement. * Corrections in docs. * doc updates. * Update projects/rocprofiler-compute/src/utils/inject_roctx.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Handling potential empty frames. * Corrected the test cases. * Changing the flag to --torch-trace * Fixed helper_app path issues * Path issues * process_torch_trace_output() now takes csv file paths as input + allows default usage. * Replaced pandas with sqlite3 * Adding marker_trace extraction to rocpd_data.py * Allowing all workloads to use --torch-trace option. Assuming the workload is user verified. * Modified help section for the flag. * Added difference in runtimes for longest running kernels in each profiling runs to overhead measurements. * Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Removed the accesses to the tables. * Ruff fixes. * ruff * Ruff Fixes * Adding getattr for args.torch_trace to handle mock args. * Fix for 'Missing guid in counter collection data - in csv mode' * Sending output_format to process_torch_trace_output * Warning for self contained binaries. 
* Ruff * Ruff * Measuring longest_running_kernel_baseline instead of worst_kernel_increase, because very small kernel runtimes blow up the worst_kernel_increase metric. * Minor fixes in input arguments * Ruff * Logging PyTorch version * Fix ruff formatting for PyTorch version logging --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@@ -15,6 +15,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
|
||||
|
||||
* Iteration multiplexing to collect counters in single application run
|
||||
|
||||
* Added `--torch-trace` option to enable mapping of PyTorch operators to collected counter values during profiling.
|
||||
|
||||
* Runtime compilation of Roofline benchmarking:
|
||||
* GPU kernels from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository are moved into the ROCm Compute Profiler and are compiled at runtime using local HIP and HIPRTC Python wrappers.
|
||||
* Roofline binaries compiled from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository have been removed from the project, as Roofline runtime compilation performs the same work as the Roofline binaries.
|
||||
|
||||
@@ -617,11 +617,11 @@ The following example demonstrates profiling roofline data only:
|
||||
INFO Kernel Selection: None
|
||||
INFO Dispatch Selection: None
|
||||
INFO Filtered sections: ['4']
|
||||
INFO
|
||||
INFO
|
||||
INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
INFO Collecting Performance Counters (Roofline Only)
|
||||
INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
INFO
|
||||
INFO
|
||||
INFO [Run 1/3][Approximate profiling time left: pending first measurement...]
|
||||
INFO [profiling] Current input file: /app/projects/rocprofiler-compute/workloads/occupancy/MI300X_A1/perfmon/pmc_perf_0.txt
|
||||
...
|
||||
@@ -659,6 +659,172 @@ plot.
|
||||
:alt: Sample ROCm Compute Profiler roofline output
|
||||
:width: 800
|
||||
|
||||
.. _torch-operator-mapping:
|
||||
|
||||
Torch Operator Mapping
|
||||
========================
|
||||
|
||||
To analyze performance metrics at the PyTorch operator level, ROCm Compute Profiler
offers Torch Operator Mapping functionality. This feature maps performance counters
to specific PyTorch operators, enabling detailed performance analysis of
PyTorch workloads at the operator granularity.

When enabled, this feature instruments your PyTorch application to correlate GPU
kernel executions with their originating PyTorch operators, providing insights into
which operators contribute to specific performance counter values.
|
||||
|
||||
.. note::
|
||||
|
||||
**PyTorch Operators vs GPU Kernels**: PyTorch operators (such as ``conv2d``,
|
||||
``linear``, ``relu``) are high-level API functions. When executed on GPU, these
|
||||
operators may dispatch one or more low-level GPU kernels (such as
|
||||
``implicit_convolve_sgemm``) that perform the actual computation on the hardware.
|
||||
The ``--torch-trace`` feature provides operator-level attribution by injecting
|
||||
markers that map collected kernel performance counters to their originating PyTorch
|
||||
operators.
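The marker injection described in this note can be pictured as a decorator that opens a named range before an operator runs and closes it afterwards. The following is a minimal sketch under stated assumptions, not the code in ``inject_roctx.py``: ``range_push`` and ``range_pop`` are hypothetical stand-ins for the real ROCTX bindings (here they only record events in a list), ``with_marker`` is an illustrative name, and the label only imitates the ``name:#count@file:line`` style that appears in the sample output.

```python
import inspect
import os
from functools import wraps

# Hypothetical stand-ins for ROCTX range push/pop; they only record events.
events = []


def range_push(label):
    events.append(("push", label))


def range_pop():
    events.append(("pop", None))


def with_marker(name):
    """Wrap a callable so that each call is bracketed by a named range."""

    def decorator(func):
        call_count = 0

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal call_count
            call_count += 1
            # Look up the caller's source location; tolerate a missing frame.
            frame = inspect.currentframe()
            caller = frame.f_back if frame is not None else None
            if caller is not None:
                filename = os.path.basename(caller.f_code.co_filename)
                location = f"{filename}:{caller.f_lineno}"
            else:
                location = "unknown"
            range_push(f"{name}:#{call_count}@{location}")
            try:
                return func(*args, **kwargs)
            finally:
                # Close the range even if the wrapped operator raises.
                range_pop()

        return wrapper

    return decorator


@with_marker("demo.add")
def add(a, b):
    return a + b


result = add(2, 3)
```

Closing the range in a ``finally`` block keeps push and pop balanced when the wrapped call raises, which matters for any tool that later pairs range boundaries with kernel timestamps.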
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* Valid PyTorch installation in the profiling environment
|
||||
* PyTorch application must be run as a Python script or Python command
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
To enable Torch operator mapping, use the ``--torch-trace`` option when profiling
|
||||
a PyTorch workload:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ rocprof-compute profile --name mnist_torch --torch-trace -- python train.py
|
||||
|
||||
__ _
|
||||
_ __ ___ ___ _ __ _ __ ___ / _| ___ ___ _ __ ___ _ __ _ _| |_ ___
|
||||
| '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
|
||||
| | | (_) | (__| |_) | | | (_) | _|_____| (_| (_) | | | | | | |_) | |_| | || __/
|
||||
|_| \___/ \___| .__/|_| \___/|_| \___\___/|_| |_| |_| .__/ \__,_|\__\___|
|
||||
|_| |_|
|
||||
|
||||
rocprofiler-compute version: 3.4.0
|
||||
Profiler choice: rocprofiler-sdk
|
||||
Path: /home/auser/workloads/mnist_torch/MI300X_A1
|
||||
Target: MI300X_A1
|
||||
Command: python train.py
|
||||
Torch Trace: Enabled
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: None
|
||||
Hardware Blocks: All
|
||||
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Collecting Performance Counters
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
...
|
||||
|
||||
Output
|
||||
------
|
||||
|
||||
When Torch operator mapping is enabled, profiling generates additional output files
|
||||
in the workload directory that correlate PyTorch operators with GPU kernels and
|
||||
their performance counters:
|
||||
|
||||
``<workload_name>_torch_trace.csv``
|
||||
Contains the merged operator-to-kernel mapping with performance counter data. These
are temporary files that are removed after consolidation into per-operator CSV files.
Key columns include:
|
||||
|
||||
* ``Function`` - PyTorch operator name (e.g., ``aten::conv2d``, ``aten::linear``)
|
||||
* ``Kernel_Name`` - GPU kernel name dispatched by the operator
|
||||
* ``Counter_Name`` / ``Counter_Value`` - Hardware performance counter measurements
|
||||
* ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator execution time
|
||||
* ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel execution time
|
||||
* ``Correlation_Id`` - Links operator calls to their kernel dispatches
|
||||
|
||||
.. table:: SQC_ICACHE_INFLIGHT_LEVEL_torch_trace.csv from profiling mnist model.
|
||||
:widths: 20 80
|
||||
| Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp_function | End_Timestamp_function | GPU_ID | Dispatch_ID | PID | Grid_Size | Workgroup_Size | LDS_Per_Workgroup | Scratch_Per_Workitem | Arch_VGPR | Accum_VGPR | SGPR | Kernel_Name | Start_Timestamp_kernel | End_Timestamp_kernel | Kernel_ID | Counter_Name | Counter_Value |
|
||||
|:----------------------|:--------------------------------|-------------:|------------:|-----------------:|---------------------------:|-------------------------:|---------:|--------------:|--------:|------------:|-----------------:|--------------------:|-----------------------:|------------:|-------------:|-------:|:------------------------|-------------------------:|-----------------------:|------------:|:--------------------------|----------------:|
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPC_CPC_STAT_STALL | 17946 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPC_CPC_TCIU_BUSY | 714 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPF_CPF_STAT_IDLE | 0 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPF_CPF_STAT_STALL | 78 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | GRBM_SPI_BUSY | 7277 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_REQ_NO_ALLOC_CSN | 8 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_RES_STALL_CSN | 0 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_SGPR_SIMD_FULL_CSN | 0 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_TGLIM_CU_FULL_CSN | 0 |
|
||||
| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_TMP_STALL_CSN | 0 |
|
||||
|
||||
``torch_trace/`` directory
|
||||
Contains individual CSV files for each PyTorch operator detected during profiling.
|
||||
Each file is named after the operator (e.g., ``nn_functional_conv2d.csv``,
|
||||
``nn_functional_linear.csv``, ``relu.csv``) and contains all kernel executions and
|
||||
performance counters for that specific operator. Columns include:
|
||||
|
||||
* ``Operator_Name`` - PyTorch operator name
|
||||
* ``Context_Id`` - Source location where the operator was called (e.g., ``conv2d:10@conv.py:543``)
|
||||
* ``Counter_Name`` / ``Counter_Value`` - Hardware counter measurements
|
||||
* ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator timing
|
||||
* ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel timing
|
||||
|
||||
This per-operator organization enables focused analysis of specific operators without
|
||||
processing the entire trace.
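As an illustration of this focused analysis, the sketch below sums ``Counter_Value`` per ``Counter_Name`` for one operator file. It assumes the per-operator files are ordinary comma-delimited CSV with a header row that uses the column names listed above; ``sum_counters`` is a hypothetical helper, and the sample rows are made up for the example rather than taken from a real run.

```python
import csv
import io
from collections import defaultdict


def sum_counters(csv_file):
    """Sum Counter_Value per Counter_Name across all rows of one operator file."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["Counter_Name"]] += float(row["Counter_Value"])
    return dict(totals)


# Made-up rows in the documented column layout (values are illustrative only).
sample = io.StringIO(
    "Operator_Name,Context_Id,Kernel_Name,Counter_Name,Counter_Value\n"
    "torch.relu,1@model.py:10,kernel_a,GRBM_SPI_BUSY,100\n"
    "torch.relu,2@model.py:10,kernel_a,GRBM_SPI_BUSY,50\n"
    "torch.relu,1@model.py:10,kernel_a,CPC_CPC_STAT_BUSY,7\n"
)

totals = sum_counters(sample)
```

For a real workload, the same function could be applied to a file opened with ``open(path, newline="")``, where ``path`` points at one of the CSV files in the ``torch_trace/`` directory.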
|
||||
|
||||
.. table:: torch_trace/ones_like.csv from profiling mnist model.
|
||||
:widths: 20 80
|
||||
|
||||
| Operator_Name | Context_Id | Kernel_Name | Counter_Name | Counter_Value | Start_Timestamp_function | End_Timestamp_function | Start_Timestamp_kernel | End_Timestamp_kernel |
|
||||
|:----------------|:------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------|----------------:|---------------------------:|-------------------------:|-------------------------:|-----------------------:|
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_STAT_BUSY | 23004 | 6789210204040073 | 6789210223815845 | 6789210223810274 | 6789210223811914 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_STAT_IDLE | 0 | 6789210204040073 | 6789210223815845 | 6789210223810274 | 6789210223811914 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_STAT_STALL | 6715 | 6789281060081123 | 6789281079930585 | 6789281079932564 | 6789281079934204 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_TCIU_BUSY | 534 | 6789281060081123 | 6789281079930585 | 6789281079932564 | 6789281079934204 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_TCIU_IDLE | 20569 | 6789352286866085 | 6789352306292985 | 6789352306292904 | 6789352306294424 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_UTCL2IU_BUSY | 358 | 6789352286866085 | 6789352306292985 | 6789352306292904 | 6789352306294424 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_CPC_UTCL2IU_IDLE | 20046 | 6789422289668823 | 6789422308914683 | 6789422308913883 | 6789422308915403 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_ME1_BUSY_FOR_PACKET_DECODE | 16331 | 6789422289668823 | 6789422308914683 | 6789422308913883 | 6789422308915403 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_ME1_DC0_SPI_BUSY | 455 | 6789492192490428 | 6789492210892375 | 6789492210897243 | 6789492210898883 |
|
||||
| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, std::array<char*, 1ul> >(int, at::native::FillFunctor<float>, std::array<char*, 1ul>) | CPC_UTCL1_STALL_ON_TRANSLATION | 374 | 6789492192490428 | 6789492210892375 | 6789492210897243 | 6789492210898883 |
|
||||
|
||||
``pmc_perf.csv``
|
||||
Standard performance counter data (same as non-torch profiling)
|
||||
|
||||
This data enables analysis such as:
|
||||
|
||||
* Identifying which PyTorch operators executed which GPU kernels
|
||||
* Aggregating performance counter values by operator
|
||||
* Correlating operator-level timing with kernel-level hardware metrics
|
||||
* Tracing the execution flow from high-level PyTorch API to low-level GPU kernels
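The first and third kinds of analysis in this list can be sketched as follows. This is an illustration under assumptions, not part of the tool: it assumes comma-delimited CSV input with the documented per-operator columns, treats rows that share a kernel name and kernel timestamps as one dispatch (the sample tables repeat a dispatch once per counter), and reports durations in raw timestamp units. ``kernels_by_operator`` is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def kernels_by_operator(csv_file):
    """Map each operator to its distinct kernel dispatches and their total duration."""
    dispatches = {}
    for row in csv.DictReader(csv_file):
        key = (
            row["Kernel_Name"],
            int(row["Start_Timestamp_kernel"]),
            int(row["End_Timestamp_kernel"]),
        )
        # A set removes the per-counter repetition of the same dispatch.
        dispatches.setdefault(row["Operator_Name"], set()).add(key)

    summary = {}
    for operator, seen in dispatches.items():
        summary[operator] = {
            "kernels": sorted({name for name, _, _ in seen}),
            "dispatch_count": len(seen),
            "total_kernel_time": sum(end - start for _, start, end in seen),
        }
    return summary


# Invented rows: one dispatch reported for two counters, plus a second dispatch.
sample = io.StringIO(
    "Operator_Name,Kernel_Name,Counter_Name,Counter_Value,"
    "Start_Timestamp_kernel,End_Timestamp_kernel\n"
    "torch.ones_like,fill_kernel,CPC_CPC_STAT_BUSY,10,1000,1600\n"
    "torch.ones_like,fill_kernel,CPC_CPC_STAT_IDLE,0,1000,1600\n"
    "torch.ones_like,fill_kernel,CPC_CPC_STAT_STALL,3,2000,2500\n"
)

summary = kernels_by_operator(sample)
```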
|
||||
|
||||
Limitations
|
||||
-----------
|
||||
|
||||
.. note::
|
||||
|
||||
* The ``--torch-trace`` option requires the application to be a Python command
|
||||
or Python script.
|
||||
|
||||
* A valid PyTorch installation must be available in the environment where profiling
|
||||
is executed.
|
||||
|
||||
* This feature adds instrumentation overhead to track operator boundaries. For
|
||||
performance-critical measurements, consider profiling without this option first.
|
||||
|
||||
Combined with Other Options
|
||||
----------------------------
|
||||
|
||||
Torch operator mapping can be combined with other profiling options:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
# Combine with block filtering for targeted counter collection
|
||||
$ rocprof-compute profile --name mnist --torch-trace -b 11 12 -- python train.py
|
||||
|
||||
# Combine with iteration multiplexing
|
||||
$ rocprof-compute profile --name mnist --torch-trace --iteration-multiplexing kernel -- python train.py
|
||||
|
||||
# Combine with kernel filtering (filters by GPU kernel name)
|
||||
$ rocprof-compute profile --name mnist --torch-trace -k elementwise -- python train.py
|
||||
|
||||
.. _iteration-multiplexing:
|
||||
|
||||
@@ -687,7 +853,7 @@ To enable iteration multiplexing in ROCm Compute Profiler, use the
|
||||
``--iteration-multiplexing`` option in your profiling command. You can optionally specify
|
||||
the policy for multiplexing. The available policies are:
|
||||
|
||||
* ``kernel``
|
||||
* ``kernel``
|
||||
The counters are divided based on the kernels being executed. Each kernel call
|
||||
for a particular kernel collects a different subset of counters.
|
||||
* ``kernel_launch_params``
|
||||
@@ -707,10 +873,10 @@ By default, if no policy is specified, ROCm Compute Profiler uses the ``kernel_l
|
||||
Iteration multiplexing is only supported when using ROCm Compute Profiler with
|
||||
the native counter collection tool. Ensure that ``--attach-pid`` is not used in your profiling command.
|
||||
|
||||
* Ensure that your workload runs for enough iterations to cover all counter subsets.
|
||||
When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy)
|
||||
or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy),
|
||||
specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations
|
||||
* Ensure that your workload runs for enough iterations to cover all counter subsets.
|
||||
When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy)
|
||||
or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy),
|
||||
specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations
|
||||
is too low, some counters may not be collected.
|
||||
|
||||
* Launch parameters for ``kernel_launch_params`` policy.
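The requirement that a workload run enough iterations to cover all counter subsets can be illustrated with a small simulation of the ``kernel`` policy. The sketch assumes that subsets are handed out round-robin per kernel name, which is an assumption made for illustration; the scheduling in the native counter collection tool may differ. ``simulate_kernel_policy`` and the kernel names are hypothetical.

```python
from collections import defaultdict


def simulate_kernel_policy(kernel_calls, num_subsets):
    """Assign a counter subset index to each call, cycling per kernel name.

    Returns, for each kernel, the subset indices that were never collected.
    """
    calls_seen = defaultdict(int)
    covered = defaultdict(set)
    for kernel in kernel_calls:
        subset = calls_seen[kernel] % num_subsets
        covered[kernel].add(subset)
        calls_seen[kernel] += 1
    all_subsets = set(range(num_subsets))
    return {kernel: sorted(all_subsets - seen) for kernel, seen in covered.items()}


# Five calls of one kernel and two of another, with three counter subsets.
missing = simulate_kernel_policy(["vcopy"] * 5 + ["reduce"] * 2, num_subsets=3)
```

Under this assumption, a kernel that is called fewer times than there are counter subsets leaves some subsets uncollected, which is the situation the note warns about.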
|
||||
@@ -736,11 +902,11 @@ The following example demonstrates how to use iteration multiplexing with the
|
||||
[INFO] Kernel Selection: None
|
||||
[INFO] Dispatch Selection: None
|
||||
[INFO] Filtered sections: All
|
||||
[INFO]
|
||||
[INFO]
|
||||
[INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
[INFO] Collecting Performance Counters
|
||||
[INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
[INFO]
|
||||
[INFO]
|
||||
[INFO] Using native counter collection tool: /tmp/rocprofiler-compute-tool-hlz4fagh/librocprofiler-compute-tool.so
|
||||
[INFO] Iteration multiplexing: kernel
|
||||
[INFO] [profiling] Current input files: /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_DCACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_ICACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_IFETCH_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_LDS.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_SMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_VMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_LEVEL_WAVES.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_0.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_1.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_10.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_11.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_12.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_2.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_3.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_4.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_5.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_6.txt, 
/home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_7.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_8.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_9.txt
|
||||
|
||||
@@ -111,4 +111,5 @@ markers = [
|
||||
"iteration_multiplexing_2",
|
||||
"iteration_multiplexing_stochastic",
|
||||
"noise_clamp",
|
||||
"torch_ops",
|
||||
]
|
||||
|
||||
@@ -239,6 +239,17 @@ Examples:
|
||||
help=argparse.SUPPRESS,
|
||||
# help="\t\t\tKokkos trace, traces Kokkos API calls.",
|
||||
)
|
||||
profile_group.add_argument(
|
||||
"--torch-trace",
|
||||
dest="torch_trace",
|
||||
required=False,
|
||||
default=False,
|
||||
action="store_true",
|
||||
help=(
|
||||
"\t\t\tTorch Trace, maps PyTorch operators to performance counters.\n"
|
||||
"\t\t\tShould be used only when profiling PyTorch applications."
|
||||
),
|
||||
)
|
||||
profile_group.add_argument(
|
||||
"-k",
|
||||
"--kernel",
|
||||
|
||||
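The flag definition in the hunk above, together with the ``getattr(args, "torch_trace", False)`` reads used elsewhere in this commit, can be exercised in isolation. The sketch below is a stand-alone illustration with a throwaway parser, not the project's ``argparser.py``; the parser name and the mock object are assumptions made for the example.

```python
import argparse
from types import SimpleNamespace

# Throwaway parser that mirrors only the boolean flag definition.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument(
    "--torch-trace",
    dest="torch_trace",
    required=False,
    default=False,
    action="store_true",
)

enabled = parser.parse_args(["--torch-trace"]).torch_trace
disabled = parser.parse_args([]).torch_trace

# A mock args object that lacks the attribute, as a test double might.
mock_args = SimpleNamespace()
mock_enabled = getattr(mock_args, "torch_trace", False)
```

Reading the flag through ``getattr`` with a default keeps callers working when they receive an args object that was built without this attribute, which is the case the commit message describes for mock args.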
@@ -109,16 +109,62 @@ class RocProfCompute_Base:
|
||||
"--attach-pid cannot be used with --iteration-multiplexing. "
|
||||
"Please remove one of these options."
|
||||
)
|
||||
|
||||
# verify correct formatting for application binary
|
||||
args.remaining = args.remaining[1:]
|
||||
resolved_exec_path: Optional[Path] = None
|
||||
|
||||
if args.remaining:
|
||||
# Ensure that command points to an executable
|
||||
if not shutil.which(args.remaining[0]):
|
||||
exec_candidate = shutil.which(args.remaining[0])
|
||||
if not exec_candidate:
|
||||
console_error(
|
||||
f"Your command {args.remaining[0]} doesn't point to an executable. "
|
||||
"Please verify."
|
||||
)
|
||||
resolved_exec_path = Path(exec_candidate).resolve()
|
||||
|
||||
# Appending a wrapper for injecting roctx-markers
|
||||
if getattr(args, "torch_trace", False):
|
||||
# Find the inject_roctx.py script in src/utils
|
||||
inject_script = (
|
||||
Path(__file__).parent.parent / "utils" / "inject_roctx.py"
|
||||
)
|
||||
if not inject_script.exists():
|
||||
console_error(
|
||||
f"Cannot find inject_roctx.py at {inject_script}. "
|
||||
"Please verify your installation."
|
||||
)
|
||||
|
||||
# Case 1: Explicit python command (python, python3, etc.)
|
||||
if args.remaining[0].startswith("python"):
|
||||
# Insert inject_roctx.py after the python interpreter
|
||||
args.remaining.insert(1, str(inject_script))
|
||||
# Case 2: Direct Python script execution (./main.py, /path/to/script.py)
|
||||
elif args.remaining[0].endswith((".py", ".pyw", ".pyc", ".pyo")):
|
||||
# Use current Python interpreter
|
||||
args.remaining.insert(0, str(inject_script))
|
||||
args.remaining.insert(0, sys.executable)
|
||||
else:
|
||||
console_warning(
|
||||
"Command does not look like a Python entry point, "
|
||||
"skipping ROCTX auto-injection and launching workload as-is."
|
||||
)
|
||||
console_warning(
|
||||
"Ensure the binary already initializes PyTorch/ROCTX markers, "
|
||||
"otherwise --torch-trace will have no effect."
|
||||
)
|
||||
|
||||
if (
|
||||
resolved_exec_path
|
||||
and (resolved_exec_path.parent / "_internal").is_dir()
|
||||
):
|
||||
console_warning(
|
||||
"Workload appears to be a self-contained binary. "
"Such bundles typically ship private ROCm/HSA libraries, which "
"prevents --torch-trace from collecting data. "
"Rebuild without packaging libhsa/libhip (or "
"adjust LD_LIBRARY_PATH to /opt/rocm) before profiling."
|
||||
)
|
||||
args.remaining = " ".join(args.remaining)
|
||||
elif not args.attach_pid:
|
||||
console_error(
|
||||
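As the hunk above reads, the interpreter check compares the first command token literally with ``startswith("python")``, so an interpreter given by absolute path (for example ``/usr/bin/python3``) would reach the warning branch instead of the injection branch. A possible variation, offered only as a sketch and not as the project's behavior, is to classify the token by its base name. ``classify_command`` and ``PY_SUFFIXES`` are hypothetical names.

```python
import os

# Same script suffixes that the hunk checks for direct script execution.
PY_SUFFIXES = (".py", ".pyw", ".pyc", ".pyo")


def classify_command(first_token):
    """Classify the first command token as interpreter, script, or other."""
    name = os.path.basename(first_token)
    if name.startswith("python"):
        return "interpreter"
    if name.endswith(PY_SUFFIXES):
        return "script"
    return "other"


kinds = [
    classify_command(token)
    for token in ("python", "/usr/bin/python3", "./train.py", "/opt/app/run")
]
```

A prefix test on the base name still accepts any executable whose name merely begins with ``python``, so this remains a heuristic rather than a reliable interpreter check.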
@@ -471,6 +517,8 @@ class RocProfCompute_Base:
|
||||
f'passes. Please use "--block" or "--set" '
|
||||
f"to adjust or reduce the requested performance metrics!"
|
||||
)
|
||||
console_debug(f"Sending profiler options to run_prof: {options}")
|
||||
|
||||
run_prof(
|
||||
fnames=str_fnames,
|
||||
profiler_options=options,
|
||||
@@ -478,6 +526,7 @@ class RocProfCompute_Base:
|
||||
mspec=self._soc._mspec,
|
||||
loglevel=args.loglevel,
|
||||
format_rocprof_output=args.format_rocprof_output,
|
||||
torch_trace_enabled=getattr(args, "torch_trace", False),
|
||||
retain_rocpd_output=args.retain_rocpd_output,
|
||||
)
|
||||
|
||||
|
||||
@@ -30,6 +30,7 @@ from pathlib import Path
|
||||
from rocprof_compute_profile.profiler_base import RocProfCompute_Base
|
||||
from rocprof_compute_soc.soc_base import OmniSoC_Base
|
||||
from utils.logger import console_error, console_log, demarcate
|
||||
from utils.utils import consolidate_torch_trace_output
|
||||
|
||||
|
||||
class rocprof_v3_profiler(RocProfCompute_Base):
|
||||
@@ -49,7 +50,6 @@ class rocprof_v3_profiler(RocProfCompute_Base):
|
||||
def get_profiler_options(self) -> list[str]:
|
||||
args = self.get_args()
|
||||
app_cmd = shlex.split(args.remaining)
|
||||
|
||||
if args.kokkos_trace:
|
||||
trace_option = "--kokkos-trace"
|
||||
# NOTE: --kokkos-trace feature is incomplete and is disabled for now.
|
||||
@@ -60,9 +60,10 @@ class rocprof_v3_profiler(RocProfCompute_Base):
|
||||
)
|
||||
elif args.hip_trace:
|
||||
trace_option = "--hip-trace"
|
||||
elif getattr(args, "torch_trace", False):
|
||||
trace_option = "--marker-trace"
|
||||
else:
|
||||
trace_option = "--kernel-trace"
|
||||
|
||||
profiling_options = [
|
||||
# v3 requires output directory argument
|
||||
"-d",
|
||||
@@ -134,6 +135,10 @@ class rocprof_v3_profiler(RocProfCompute_Base):
|
||||
if self.ready_to_profile:
|
||||
# Manually join each pmc_perf*.csv output
|
||||
self.join_prof()
|
||||
# Consolidate torch trace output if --torch-trace was used
|
||||
if self.get_args().torch_trace:
|
||||
consolidate_torch_trace_output(self.get_args().path)
|
||||
|
||||
# Run roofline microbenchmark
|
||||
super().post_processing()
|
||||
else:
|
||||
|
||||
@@ -31,6 +31,7 @@ from typing import Optional, Union
|
||||
from rocprof_compute_profile.profiler_base import RocProfCompute_Base
|
||||
from rocprof_compute_soc.soc_base import OmniSoC_Base
|
||||
from utils.logger import console_error, console_log, demarcate
|
||||
from utils.utils import consolidate_torch_trace_output
|
||||
|
||||
|
||||
class rocprofiler_sdk_profiler(RocProfCompute_Base):
|
||||
@@ -71,6 +72,8 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base):
|
||||
"ROCPROF_OUTPUT_PATH": f"{args.path}/out/pmc_1",
|
||||
})
|
||||
|
||||
if getattr(args, "torch_trace", False):
|
||||
options["ROCPROF_MARKER_API_TRACE"] = "1"
|
||||
# Create the folder pointed to by ROCPROF_OUTPUT_PATH
|
||||
Path(options["ROCPROF_OUTPUT_PATH"]).mkdir(parents=True, exist_ok=True)
|
||||
|
||||
@@ -161,6 +164,9 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base):
|
||||
if self.ready_to_profile:
|
||||
# Manually join each pmc_perf*.csv output
|
||||
self.join_prof()
|
||||
if self.get_args().torch_trace:
|
||||
consolidate_torch_trace_output(self.get_args().path)
|
||||
|
||||
# Run roofline microbenchmark
|
||||
super().post_processing()
|
||||
else:
|
||||
|
||||
@@ -0,0 +1,292 @@
|
||||
# ruff: noqa
|
||||
##############################################################################
|
||||
# MIT License
|
||||
#
|
||||
# Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
#
|
||||
# The above copyright notice and this permission notice shall be included in
|
||||
# all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
||||
# THE SOFTWARE.
|
||||
|
||||
##############################################################################
|
||||
|
||||
|
||||
"""
|
||||
ROCTX Injection Wrapper - Auto-discovers and intercepts ALL PyTorch operators
|
||||
Usage: python inject_roctx.py main.py --epochs 1 --batch-size 4
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Add parent directory to Python path for config module
|
||||
script_dir = Path(__file__).resolve().parent
|
||||
sys.path.insert(0, str(script_dir.parent))
|
||||
|
||||
from utils.logger import console_log, console_warning
|
||||
|
||||
rocm_root = os.environ.get("ROCM_PATH", "/opt/rocm")
|
||||
python_version = f"python{sys.version_info.major}.{sys.version_info.minor}"
|
||||
candidate_paths = [
|
||||
f"{rocm_root}/lib/{python_version}/site-packages",
|
||||
f"{rocm_root}/libexec/rocprofiler-sdk/python",
|
||||
]
|
||||
|
||||
for candidate in candidate_paths:
|
||||
if candidate not in sys.path:
|
||||
sys.path.insert(0, candidate)
|
||||
|
||||
try:
|
||||
import torch
|
||||
|
||||
console_log(f"PyTorch version: {torch.__version__}")
|
||||
except ImportError:
|
||||
console_warning(
|
||||
"PyTorch is not installed or not properly configured.\n"
|
||||
"The --torch-trace option requires a valid PyTorch installation.\n"
|
||||
"Please install PyTorch and try again."
|
||||
)
|
||||
sys.exit(0)
|
||||
|
||||
import importlib.util
|
||||
import inspect
|
||||
from functools import wraps
|
||||
|
||||
import torch.nn.functional as F
|
||||
from roctx import rangePop, rangePush
|
||||
|
||||
|
||||
def roctx_wrapper(func, name=None):
|
||||
func_name = name or func.__name__
|
||||
call_counter = {"count": 0}
|
||||
|
||||
@wraps(func)
|
||||
def wrapper(*args, **kwargs):
|
||||
call_counter["count"] += 1
|
||||
current_frame = inspect.currentframe()
|
||||
caller_frame = current_frame.f_back if current_frame is not None else None
|
||||
if caller_frame is not None:
|
||||
filename = caller_frame.f_code.co_filename
|
||||
location = f"{Path(filename).name}:{caller_frame.f_lineno}"
|
||||
else:
|
||||
location = "unknown:0"
|
||||
|
||||
# Unique marker: function + call_number + source_location
|
||||
rangePush(f"{func_name}:#{call_counter['count']}@{location}")
|
||||
try:
|
||||
result = func(*args, **kwargs)
|
||||
finally:
|
||||
rangePop()
|
||||
return result
|
||||
|
||||
return wrapper
|
||||
|
||||
|
||||
def auto_discover_torch_callables(module, prefix, exclude_patterns=None):
|
||||
"""Automatically discover all callable functions in a module."""
|
||||
if exclude_patterns is None:
|
||||
exclude_patterns = ["__", "_", "is_", "set_", "get_"]
|
||||
|
||||
functions = {}
|
||||
for name in dir(module):
|
||||
# Skip private/internal functions
|
||||
if any(name.startswith(pat) for pat in exclude_patterns):
|
||||
continue
|
||||
|
||||
try:
|
||||
attr = getattr(module, name)
|
||||
# Only wrap callables (functions, not classes or constants)
|
||||
if callable(attr) and not isinstance(attr, type):
|
||||
full_name = f"{prefix}.{name}"
|
||||
functions[full_name] = (module, name, attr)
|
||||
except Exception as e:
|
||||
console_warning(type(e))
|
||||
console_warning(f"Could not access {prefix}.{name}: {e}")
|
||||
|
||||
return functions
|
||||
|
||||
|
||||
def inject_roctx_into_torch():
|
||||
"""Monkey-patch PyTorch operations to add ROCTX markers."""
|
||||
|
||||
console_log("Auto-discovering PyTorch operations to wrap...")
|
||||
|
||||
# Auto-discover functions from key modules
|
||||
all_operations = {}
|
||||
|
||||
# torch.* functions (matmul, mm, cat, etc.)
|
||||
all_operations.update(auto_discover_torch_callables(torch, "torch"))
|
||||
|
||||
# torch.nn.functional.* functions (linear, relu, softmax, etc.)
|
||||
all_operations.update(auto_discover_torch_callables(F, "torch.nn.functional"))
|
||||
|
||||
# torch.linalg.* functions (matrix operations)
|
||||
try:
|
||||
all_operations.update(
|
||||
auto_discover_torch_callables(torch.linalg, "torch.linalg")
|
||||
)
|
||||
except Exception as e:
|
||||
console_warning(type(e))
|
||||
console_warning(f"Could not access torch.linalg: {e}")
|
||||
|
||||
# torch.fft.* functions (FFT operations)
|
||||
try:
|
||||
all_operations.update(auto_discover_torch_callables(torch.fft, "torch.fft"))
|
||||
except Exception as e:
|
||||
console_warning(type(e))
|
||||
console_warning(f"Could not access torch.fft: {e}")
|
||||
console_log(f"Found {len(all_operations)} operations to wrap")
|
||||
console_log("Injecting ROCTX markers into PyTorch operations...")
|
||||
|
||||
wrapped_count = 0
|
||||
failed_count = 0
|
||||
|
||||
for full_name, (module, attr_name, original_func) in all_operations.items():
|
||||
try:
|
||||
# Replace with wrapped version
|
||||
wrapped_func = roctx_wrapper(original_func, full_name)
|
||||
setattr(module, attr_name, wrapped_func)
|
||||
wrapped_count += 1
|
||||
|
||||
# Print first 20 and last 5 for visibility
|
||||
if wrapped_count <= 20 or wrapped_count > len(all_operations) - 5:
|
||||
console_log(f"Wrapped: {full_name}")
|
||||
elif wrapped_count == 21:
|
||||
console_log(
|
||||
f" ... (wrapping {len(all_operations) - 25} more operations)"
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
failed_count += 1
|
||||
if failed_count <= 5: # Only show first few failures
|
||||
console_warning(f"Failed to wrap {full_name}: {e}")
|
||||
|
||||
# Wrap tensor methods
|
||||
original_backward = torch.Tensor.backward
|
||||
backward_counter = {"count": 0}
|
||||
|
||||
def backward_with_roctx(self, *args, **kwargs):
|
||||
backward_counter["count"] += 1
|
||||
current_frame = inspect.currentframe()
|
||||
caller_frame = current_frame.f_back if current_frame is not None else None
|
||||
if caller_frame is not None:
|
||||
filename = caller_frame.f_code.co_filename
|
||||
location = f"{Path(filename).name}:{caller_frame.f_lineno}"
|
||||
else:
|
||||
location = "unknown:0"
|
||||
|
||||
rangePush(f"torch.Tensor.backward:#{backward_counter['count']}@{location}")
|
||||
try:
|
||||
return original_backward(self, *args, **kwargs)
|
||||
finally:
|
||||
rangePop()
|
||||
|
||||
torch.Tensor.backward = backward_with_roctx
|
||||
|
||||
wrapped_count += 1
|
||||
console_log("Wrapped: torch.Tensor.backward")
|
||||
|
||||
console_log(f"Wrapped {wrapped_count} operations with ROCTX markers")
|
||||
if failed_count > 0:
|
||||
console_warning(
|
||||
f"Failed to wrap {failed_count} operations (likely not patchable)"
|
||||
)
|
||||
|
||||
|
||||
def inject_roctx_into_optimizer():
|
||||
"""Wrap optimizer step() method."""
|
||||
from torch.optim import Optimizer
|
||||
|
||||
original_step = Optimizer.step
|
||||
|
||||
def step_with_roctx(self, *args, **kwargs):
|
||||
rangePush(f"optimizer.{self.__class__.__name__}.step")
|
||||
try:
|
||||
return original_step(self, *args, **kwargs)
|
||||
finally:
|
||||
rangePop()
|
||||
|
||||
Optimizer.step = step_with_roctx
|
||||
console_log("Wrapped optimizer.step() with ROCTX markers\n")
|
||||
|
||||
|
||||
def inject_roctx_into_model():
|
||||
"""Wrap nn.Module forward() method with call counter."""
|
||||
|
||||
from torch import nn
|
||||
from typing import Any
|
||||
|
||||
original_call = nn.Module.__call__
|
||||
|
||||
# Per-instance call counters
|
||||
def call_with_roctx(self, *args, **kwargs):
|
||||
class_name = self.__class__.__name__
|
||||
|
||||
# Initialize the counter for this instance if it does not exist yet
|
||||
if not hasattr(self, "_roctx_call_count"):
|
||||
self._roctx_call_count = 0
|
||||
self._roctx_call_count += 1
|
||||
|
||||
# Get caller location
|
||||
current_frame = inspect.currentframe()
|
||||
caller_frame = current_frame.f_back if current_frame is not None else None
|
||||
if caller_frame is not None:
|
||||
filename = caller_frame.f_code.co_filename
|
||||
location = f"{Path(filename).name}:{caller_frame.f_lineno}"
|
||||
else:
|
||||
location = "unknown:0"
|
||||
|
||||
# Create detailed marker
|
||||
rangePush(
|
||||
f"nn.Module.{class_name}.forward:#{self._roctx_call_count}@{location}"
|
||||
)
|
||||
try:
|
||||
return original_call(self, *args, **kwargs)
|
||||
finally:
|
||||
rangePop()
|
||||
|
||||
nn.Module.__call__ = call_with_roctx
|
||||
console_log("Wrapped nn.Module forward() with ROCTX markers\n")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) < 2:
|
||||
console_log("Usage: python inject_roctx.py <script.py> [script_args...]")
|
||||
sys.exit(1)
|
||||
|
||||
# Get target script and its arguments
|
||||
target_script = sys.argv[1]
|
||||
script_args = sys.argv[2:]
|
||||
|
||||
# Inject ROCTX markers BEFORE importing the target script
|
||||
inject_roctx_into_torch()
|
||||
inject_roctx_into_optimizer()
|
||||
inject_roctx_into_model()
|
||||
|
||||
console_log("=" * 70)
|
||||
console_log("Starting target script with ROCTX instrumentation...")
|
||||
console_log("=" * 70)
|
||||
|
||||
# Modify sys.argv so the target script sees correct arguments
|
||||
sys.argv = [target_script] + script_args
|
||||
|
||||
# Load and execute the target script
|
||||
spec = importlib.util.spec_from_file_location("__main__", target_script)
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
sys.modules["__main__"] = module
|
||||
spec.loader.exec_module(module)
|
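The wrapper pattern used throughout inject_roctx.py can be tried without ROCm installed. The following is a minimal sketch, not the script's actual code: `range_push`, `range_pop`, `traced`, `add`, and `fail` are hypothetical stand-ins (the real script imports `rangePush` and `rangePop` from `roctx`), and the caller-location suffix is left out. It shows why the pop sits in a `finally` block: the range is closed even when the wrapped function raises.

```python
from functools import wraps

# Hypothetical stand-ins for roctx.rangePush / roctx.rangePop so the
# sketch runs on a machine without ROCm. They only record what happened.
events = []


def range_push(label):
    events.append(("push", label))


def range_pop():
    events.append(("pop", None))


def traced(func, name=None):
    """Wrap func so that every call is bracketed by a push/pop pair."""
    label = name or func.__name__
    calls = {"count": 0}

    @wraps(func)
    def wrapper(*args, **kwargs):
        calls["count"] += 1
        range_push(f"{label}:#{calls['count']}")
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exception, so ranges stay balanced.
            range_pop()

    return wrapper


def add(a, b):
    return a + b


def fail():
    raise ValueError("boom")


traced_add = traced(add, "demo.add")
traced_fail = traced(fail, "demo.fail")

result = traced_add(1, 2)
try:
    traced_fail()
except ValueError:
    pass
```

After running the sketch, `events` holds one push and one pop for each call, including the call that raised.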
||||
@@ -25,7 +25,7 @@
|
||||
|
||||
import csv
|
||||
import sqlite3
|
||||
from contextlib import closing
|
||||
from contextlib import ExitStack, closing
|
||||
from typing import Any
|
||||
|
||||
import pandas as pd
|
||||
@@ -37,6 +37,8 @@ from utils.logger import console_error
|
||||
COUNTERS_COLLECTION_QUERY = """
|
||||
SELECT
|
||||
agent_id as GPU_ID,
|
||||
guid as GUID,
|
||||
correlation_id as Correlation_Id,
|
||||
dispatch_id as Dispatch_ID,
|
||||
pid as PID,
|
||||
grid_size as Grid_Size,
|
||||
@@ -54,6 +56,24 @@ SELECT
|
||||
value as Counter_Value
|
||||
FROM counters_collection
|
||||
"""
|
||||
MARKER_API_TRACE_QUERY = """
|
||||
SELECT
|
||||
category AS Domain,
|
||||
json_extract(extdata, '$.message') AS Function,
|
||||
pid AS Process_Id,
|
||||
tid AS Thread_Id,
|
||||
corr_id AS Correlation_Id,
|
||||
guid AS GUID,
|
||||
start AS Start_Timestamp,
|
||||
end AS End_Timestamp
|
||||
FROM regions
|
||||
ORDER BY start
|
||||
"""
|
||||
KERNEL_DISPATCH_QUERY = """
|
||||
SELECT dispatch_id, event_id, guid
|
||||
FROM rocpd_kernel_dispatch
|
||||
WHERE guid = ?
|
||||
"""
|
||||
ROCPD_PMC_EVENT_TABLE_NAME_PREFIX = "rocpd_pmc_event_"
|
||||
TABLE_NAME_PREFIX_QUERY = (
|
||||
"SELECT name FROM sqlite_master WHERE type='table' "
|
||||
@@ -64,30 +84,43 @@ INSERT_QUERY = "INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"
|
||||
|
||||
def convert_dbs_to_csv(
|
||||
db_paths: list[str],
|
||||
csv_file_path: str,
|
||||
counter_collection_csv_path: str,
|
||||
marker_trace_csv_path: str,
|
||||
) -> None:
|
||||
"""
|
||||
Read rocpd databases and write to CSV file
|
||||
"""
|
||||
# Read counters_collection view from the databases and write to CSV
|
||||
try:
|
||||
with open(csv_file_path, "w", newline="") as csvfile:
|
||||
writer = csv.writer(csvfile)
|
||||
header_written = False
|
||||
for db_path in db_paths:
|
||||
with closing(sqlite3.connect(db_path)) as conn:
|
||||
with closing(conn.execute(COUNTERS_COLLECTION_QUERY)) as cursor:
|
||||
if not header_written:
|
||||
writer.writerow([
|
||||
description[0] for description in cursor.description
|
||||
])
|
||||
header_written = True
|
||||
for row in cursor:
|
||||
writer.writerow(row)
|
||||
except OSError as e:
|
||||
console_error(f"Database error while converting to CSV: {e}")
|
||||
except Exception as e:
|
||||
console_error(f"Unexpected error converting database to CSV: {e}")
|
||||
queries = {
|
||||
counter_collection_csv_path: COUNTERS_COLLECTION_QUERY,
|
||||
marker_trace_csv_path: MARKER_API_TRACE_QUERY,
|
||||
}
|
||||
header_written = {path: False for path in queries}
|
||||
|
||||
with ExitStack() as stack:
|
||||
writers = {
|
||||
path: csv.writer(stack.enter_context(open(path, "w", newline="")))
|
||||
for path in queries
|
||||
}
|
||||
for db_path in db_paths:
|
||||
with closing(sqlite3.connect(db_path)) as conn:
|
||||
for file_path, query in queries.items():
|
||||
try:
|
||||
with closing(conn.execute(query)) as cursor:
|
||||
if cursor.description is None:
|
||||
continue
|
||||
if not header_written[file_path]:
|
||||
writers[file_path].writerow([
|
||||
desc[0] for desc in cursor.description
|
||||
])
|
||||
header_written[file_path] = True
|
||||
writers[file_path].writerows(cursor)
|
||||
except OSError as e:
|
||||
console_error(
|
||||
f"Database error while extracting {file_path} "
|
||||
f"from {db_path}: {e}"
|
||||
)
|
||||
except Exception as e:
|
||||
console_error(
|
||||
f"Unexpected error while extracting {file_path} "
|
||||
f"from {db_path}: {e}"
|
||||
)
|
||||
|
||||
|
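The reworked convert_dbs_to_csv keeps several CSV files open at once through `contextlib.ExitStack` and runs one query per output file. The following is a minimal sketch of that idea under stated assumptions: `export_queries`, the `counters` and `markers` tables, and the temporary file names are hypothetical and exist only for illustration; the real function iterates over several rocpd databases and writes each header once.

```python
import csv
import sqlite3
import tempfile
from contextlib import ExitStack, closing
from pathlib import Path


def export_queries(conn, queries):
    """Write each query's result to its own CSV file.

    `queries` maps an output path to the SQL text that fills it.
    """
    with ExitStack() as stack:
        # ExitStack closes every file opened here when the block exits.
        writers = {
            path: csv.writer(stack.enter_context(open(path, "w", newline="")))
            for path in queries
        }
        for path, sql in queries.items():
            with closing(conn.execute(sql)) as cursor:
                writers[path].writerow([d[0] for d in cursor.description])
                writers[path].writerows(cursor)


# Hypothetical in-memory database standing in for a rocpd .db file.
tmp = Path(tempfile.mkdtemp())
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE counters (name TEXT, value INTEGER)")
    conn.execute("CREATE TABLE markers (message TEXT)")
    conn.execute("INSERT INTO counters VALUES ('SQ_WAVES', 4)")
    conn.execute("INSERT INTO markers VALUES ('torch.mm:#1@app.py:3')")
    export_queries(
        conn,
        {
            str(tmp / "counters.csv"): "SELECT name, value FROM counters",
            str(tmp / "markers.csv"): "SELECT message FROM markers",
        },
    )

counters_text = (tmp / "counters.csv").read_text()
markers_text = (tmp / "markers.csv").read_text()
```

Keeping all writers inside one `ExitStack` means a failure on one query still closes every file that was already opened.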
||||
def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame:
|
||||
@@ -134,7 +167,7 @@ def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame:
|
||||
|
||||
|
||||
def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> None:
|
||||
"""Update pmc_event table in the given rocpd database path"""
|
||||
"""Updates pmc_event table in the given rocpd database path."""
|
||||
try:
|
||||
with closing(sqlite3.connect(rocpd_db_path)) as conn:
|
||||
# Get pmc_event table name
|
||||
@@ -154,13 +187,27 @@ def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> N
|
||||
guid = table_name[len(ROCPD_PMC_EVENT_TABLE_NAME_PREFIX) :].replace(
|
||||
"_", "-"
|
||||
)
|
||||
# Map dispatch_id to event_id from rocpd_kernel_dispatch
|
||||
# Native counter collection CSV has dispatch_id, but schema needs event_id
|
||||
# event_id may differ from dispatch_id when marker API tracing is enabled
|
||||
with closing(conn.execute(KERNEL_DISPATCH_QUERY, (guid,))) as cursor:
|
||||
rows = cursor.fetchall()
|
||||
if not rows:
|
||||
console_error("No kernel dispatch data found.")
|
||||
return
|
||||
dispatch_to_event = {
|
||||
dispatch_id: event_id for dispatch_id, event_id, _ in rows
|
||||
}
|
||||
counter_info["event_id"] = counter_info["dispatch_id"].map(
|
||||
dispatch_to_event
|
||||
)
|
||||
columns = ("guid", "event_id", "pmc_id", "value")
|
||||
values = list(
|
||||
zip(
|
||||
# guid
|
||||
[guid] * len(counter_info),
|
||||
# event_id
|
||||
counter_info["dispatch_id"],
|
||||
counter_info["event_id"],
|
||||
# pmc_id
|
||||
counter_info["counter_id"],
|
||||
# value
|
||||
|
||||
@@ -786,6 +786,7 @@ def run_prof(
|
||||
mspec: Any, # noqa: ANN401
|
||||
loglevel: int,
|
||||
format_rocprof_output: str,
|
||||
torch_trace_enabled: bool = False,
|
||||
retain_rocpd_output: bool = False,
|
||||
) -> None:
|
||||
multiple_files = isinstance(fnames, list)
|
||||
@@ -939,9 +940,12 @@ def run_prof(
|
||||
# Write results_fbase.csv
|
||||
rocpd_data.convert_dbs_to_csv(
|
||||
glob.glob(workload_dir + "/out/pmc_1/*/*.db"),
|
||||
workload_dir + f"/results_{fbase}.csv",
|
||||
workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv",
|
||||
workload_dir + f"/out/pmc_1/{fbase}_marker_api_trace.csv",
|
||||
)
|
||||
combined_df = pd.read_csv(
|
||||
workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv"
|
||||
)
|
||||
combined_df = pd.read_csv(workload_dir + f"/results_{fbase}.csv")
|
||||
# Reset Dispatch_ID based on PID, Kernel_Name, Grid_Size,
|
||||
# Workgroup_Size, LDS_Per_Workgroup, Start_Timestamp, End_Timestamp
|
||||
combined_df["Dispatch_ID"] = combined_df.groupby(
|
||||
@@ -964,8 +968,12 @@ def run_prof(
|
||||
).ngroup()
|
||||
# Drop PID since it's not required
|
||||
combined_df = combined_df.drop(columns=["PID"])
|
||||
combined_df.to_csv(
|
||||
workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv", index=False
|
||||
)
|
||||
combined_df.to_csv(workload_dir + f"/results_{fbase}.csv", index=False)
|
||||
|
||||
if torch_trace_enabled:
|
||||
process_torch_trace_output(workload_dir, fbase, format_rocprof_output)
|
||||
if retain_rocpd_output:
|
||||
for db_path in glob.glob(workload_dir + "/out/pmc_1/*/*.db"):
|
||||
pid = Path(db_path).stem.split("_")[0]
|
||||
@@ -1004,7 +1012,9 @@ def run_prof(
|
||||
process_kokkos_trace_output(workload_dir, fbase)
|
||||
elif "--hip-trace" in options:
|
||||
process_hip_trace_output(workload_dir, fbase)
|
||||
|
||||
# Add torch operator trace processing
|
||||
if torch_trace_enabled:
|
||||
process_torch_trace_output(workload_dir, fbase, format_rocprof_output)
|
||||
# Combine results into single CSV file
|
||||
if results_files:
|
||||
combined_results = pd.concat(
|
||||
@@ -1175,7 +1185,7 @@ def convert_native_counter_collection_csv(workload_dir: str) -> None:
|
||||
)
|
||||
|
||||
rocprofv3_counter_data = pd.DataFrame({
|
||||
"Correlation_Id": merged_data["dispatch_id"],
|
||||
"Correlation_Id": merged_data["Correlation_Id"],
|
||||
"Dispatch_Id": merged_data["dispatch_id"],
|
||||
"Agent_Id": merged_data["Agent_Id"],
|
||||
"Queue_Id": merged_data["Queue_Id"],
|
||||
@@ -1262,6 +1272,178 @@ def process_rocprofv3_output(workload_dir: str, using_native_tool: bool) -> list
|
||||
return results_files_csv
|
||||
|
||||
|
||||
@demarcate
|
||||
def process_torch_trace_output(
|
||||
workload_dir: str,
|
||||
fbase: str,
|
||||
output_format: str = "rocpd",
|
||||
) -> None:
|
||||
"""
|
||||
Creates PyTorch operator trace from counter_collection and marker_api_trace data.
|
||||
- Performs inner join on Correlation_Id, filtering out unmatched entries
|
||||
- Output file is saved to workload root, not the temporary out/ directory
|
||||
"""
|
||||
marker_trace_csv_file_path = f"{workload_dir}/out/pmc_1/"
|
||||
# Find all marker_api_trace CSV files
|
||||
marker_api_trace_csvs = list(
|
||||
Path(marker_trace_csv_file_path).glob("**/*_marker_api_trace.csv")
|
||||
)
|
||||
counter_collection_csvs = [
|
||||
markers_file.parent
|
||||
/ markers_file.name.replace("_marker_api_trace.", "_counter_collection.")
|
||||
for markers_file in marker_api_trace_csvs
|
||||
]
|
||||
existing_csv_files = [
|
||||
[marker_api_trace_csvs[i], counter_collection_csvs[i]]
|
||||
for i in range(len(marker_api_trace_csvs))
|
||||
if counter_collection_csvs[i].is_file() and marker_api_trace_csvs[i].is_file()
|
||||
]
|
||||
if not existing_csv_files:
|
||||
console_warning(
|
||||
f"No marker files with corresponding counter files found for {fbase}"
|
||||
)
|
||||
return
|
||||
|
||||
# Join marker and counter data
|
||||
def _merge_pair(
|
||||
marker_path: Path,
|
||||
counter_path: Path,
|
||||
join_keys: tuple = ("Correlation_Id",),
|
||||
) -> pd.DataFrame:
|
||||
marker_df = pd.read_csv(marker_path)
|
||||
counter_df = pd.read_csv(counter_path)
|
||||
return pd.merge(
|
||||
marker_df,
|
||||
counter_df,
|
||||
on=join_keys,
|
||||
how="inner",
|
||||
suffixes=("_function", "_kernel"),
|
||||
)
|
||||
|
||||
if output_format == "csv":
|
||||
merged_results = pd.concat(
|
||||
[_merge_pair(f[0], f[1]) for f in existing_csv_files],
|
||||
ignore_index=True,
|
||||
)
|
||||
elif output_format == "rocpd":
|
||||
# There will be one pair of CSV files, extracted from the rocpd db and consolidated.
|
||||
merged_results = _merge_pair(
|
||||
existing_csv_files[0][0],
|
||||
existing_csv_files[0][1],
|
||||
("Correlation_Id", "GUID"),
|
||||
)
|
||||
# Save merged results
|
||||
merged_results.to_csv(
|
||||
f"{workload_dir}/{fbase}_torch_trace.csv",
|
||||
index=False,
|
||||
)
|
||||
console_log("Created ", f"{workload_dir}/{fbase}_torch_trace.csv")
|
||||
|
||||
|
||||
@demarcate
|
||||
def consolidate_torch_trace_output(workload_dir: str) -> None:
|
||||
# Consolidate torch operator trace CSV files from multiple processes
|
||||
console_log("Consolidating torch operator trace output...")
|
||||
# Find all torch trace CSV files in workload directory
|
||||
torch_trace_files = glob.glob(f"{workload_dir}/*_torch_trace.csv")
|
||||
if not torch_trace_files:
|
||||
console_warning("No torch trace files found.")
|
||||
return
|
||||
# Read and concatenate all torch trace files
|
||||
all_traces = []
|
||||
required_columns = [
|
||||
"Function",
|
||||
"Kernel_Name",
|
||||
"Counter_Name",
|
||||
"Counter_Value",
|
||||
"Start_Timestamp_function",
|
||||
"End_Timestamp_function",
|
||||
"Start_Timestamp_kernel",
|
||||
"End_Timestamp_kernel",
|
||||
]
|
||||
for trace_file in torch_trace_files:
|
||||
try:
|
||||
df = pd.read_csv(trace_file)
|
||||
except pd.errors.ParserError as e:
|
||||
console_warning(f"Parser error while reading {trace_file}: {e}")
|
||||
continue
|
||||
except OSError as e:
|
||||
console_warning(f"I/O error while reading {trace_file}: {e}")
|
||||
continue
|
||||
except Exception as e:
|
||||
# Unexpected error; log full details for debugging
|
||||
console_warning(
|
||||
f"Unexpected error while reading {trace_file}: {e}\n"
|
||||
f"{traceback.format_exc()}"
|
||||
)
|
||||
continue
|
||||
|
||||
missing_columns = [col for col in required_columns if col not in df.columns]
|
||||
if missing_columns:
|
||||
console_warning(
|
||||
f"Skipping {trace_file}: missing required columns {missing_columns}"
|
||||
)
|
||||
continue
|
||||
|
||||
all_traces.append(df[required_columns])
|
||||
if not all_traces:
|
||||
console_warning("No valid torch trace data to consolidate.")
|
||||
return
|
||||
|
||||
consolidated_df = pd.concat(all_traces, ignore_index=True)
|
||||
if consolidated_df.isnull().values.any():
|
||||
console_warning("Consolidated torch trace contains missing values")
|
||||
return
|
||||
consolidated_df = consolidated_df.sort_values(by=["Function", "Counter_Name"])
|
||||
|
||||
split_columns = consolidated_df["Function"].str.split(":#", expand=True)
|
||||
consolidated_df["Operator_Name"] = (
|
||||
split_columns[0] if len(split_columns.columns) > 0 else None
|
||||
)
|
||||
consolidated_df["Context_Id"] = (
|
||||
split_columns[1] if len(split_columns.columns) > 1 else None
|
||||
)
|
||||
consolidated_df.drop(columns=["Function"], inplace=True)
|
||||
consolidated_df = consolidated_df[
|
||||
[
|
||||
"Operator_Name",
|
||||
"Context_Id",
|
||||
"Kernel_Name",
|
||||
"Counter_Name",
|
||||
"Counter_Value",
|
||||
"Start_Timestamp_function",
|
||||
"End_Timestamp_function",
|
||||
"Start_Timestamp_kernel",
|
||||
"End_Timestamp_kernel",
|
||||
]
|
||||
]
|
||||
|
||||
if consolidated_df.isnull().values.any():
|
||||
console_error(
|
||||
"Missing values in consolidated torch trace after splitting ",
|
||||
"the Function name.",
|
||||
)
|
||||
return
|
||||
|
||||
grouped = consolidated_df.groupby("Operator_Name")
|
||||
for operator_name, group in grouped:
|
||||
sanitized_operator_name = operator_name.replace("torch.", "").replace(".", "_")
|
||||
# Ensure output directory exists
|
||||
Path(f"{workload_dir}/torch_trace").mkdir(parents=True, exist_ok=True)
|
||||
output_file = f"{workload_dir}/torch_trace/{sanitized_operator_name}.csv"
|
||||
group.to_csv(output_file, index=False)
|
||||
console_log(
|
||||
f"Saved consolidated trace for {sanitized_operator_name} to {output_file}"
|
||||
)
|
||||
|
||||
for trace_file in torch_trace_files:
|
||||
try:
|
||||
Path(trace_file).unlink()
|
||||
console_debug(f"Removed temporary torch trace file: {trace_file}")
|
||||
except OSError as e:
|
||||
console_warning(f"Error removing temporary file {trace_file}: {e}")
|
||||
|
||||
|
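The consolidation step depends on the marker label format produced by inject_roctx.py, `<operator>:#<call number>@<file>:<line>`, and on a simple rule for turning operator names into file names. The following is a minimal sketch of both steps in plain Python. `split_marker` and `operator_filename` are hypothetical helper names; the diff itself performs the split with pandas `str.split(":#", expand=True)`, whereas this sketch uses `str.partition`, which splits only at the first `:#`.

```python
def split_marker(label):
    """Split 'operator:#N@file:line' into (operator, context).

    Returns None for the context when the label carries no ':#' part.
    """
    operator, sep, context = label.partition(":#")
    return operator, (context if sep else None)


def operator_filename(operator):
    """Mirror the sanitizing rule from the diff: drop 'torch.', dots to '_'."""
    return operator.replace("torch.", "").replace(".", "_") + ".csv"


operator, context = split_marker("torch.nn.functional.relu:#3@model.py:17")
filename = operator_filename(operator)

# A label without a call counter yields no context.
bare_operator, bare_context = split_marker("optimizer.SGD.step")
```

Here `operator` is the part used for grouping, and `context` keeps the call number and source location together.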
||||
@demarcate
|
||||
def process_kokkos_trace_output(workload_dir: str, fbase: str) -> None:
|
||||
# marker api trace csv files are generated for each process
|
||||
|
||||
@@ -23,6 +23,7 @@
|
||||
|
||||
##############################################################################
|
||||
|
||||
import importlib.util
|
||||
import inspect
|
||||
import os
|
||||
import re
|
||||
@@ -2779,3 +2780,215 @@ def test_iteration_multiplexing_all_counter_accuracy(
|
||||
assert are_stochastic_counters_similar(
|
||||
[counters_kernel, counters_kernel_launch_params], counters_no_multiplexing
|
||||
)
|
||||
|
||||
|
||||
skip_if_no_torch = pytest.mark.skipif(
|
||||
importlib.util.find_spec("torch") is None, reason="torch is required for this test"
|
||||
)
|
||||
|
||||
|
||||
@skip_if_no_torch
|
||||
def test_torch_trace_profile(binary_handler_profile_rocprof_compute):
|
||||
"""
|
||||
Test profiling a PyTorch application with --torch-trace option.
|
||||
Verifies that all required files are generated and counter values are valid.
|
||||
NOTE: Not included in the test suite since this requires PyTorch installation.
|
||||
"""
|
||||
workload_dir = test_utils.get_output_dir(param_id="torch_ops")
|
||||
Path(workload_dir).mkdir(parents=True, exist_ok=True)
|
||||
torch_app_path = Path(workload_dir) / "test_torch_app.py"
|
||||
|
||||
torch_app_code = """
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
class SimpleNet(nn.Module):
|
||||
def __init__(self):
|
||||
super(SimpleNet, self).__init__()
|
||||
self.fc1 = nn.Linear(10, 20)
|
||||
self.fc2 = nn.Linear(20, 10)
|
||||
def forward(self, x):
|
||||
x = self.fc1(x)
|
||||
x = F.relu(x)
|
||||
x = self.fc2(x)
|
||||
return x
|
||||
|
||||
if __name__ == "__main__":
|
||||
if not torch.cuda.is_available():
|
||||
import sys
|
||||
print("GPU is required for this test. Exiting.")
|
||||
sys.exit(1)
|
||||
model = SimpleNet()
|
||||
model = model.cuda()
|
||||
x = torch.randn(5, 10).cuda()
|
||||
# Run a single iteration
|
||||
for epoch in range(1):
|
||||
output = model(x)
|
||||
loss = output.sum()
|
||||
loss.backward()
|
||||
print("Training completed")
|
||||
"""
|
||||
|
||||
with open(torch_app_path, "w") as f:
|
||||
f.write(torch_app_code)
|
||||
|
||||
config["torch_test_app"] = ["python3", str(torch_app_path)]
|
||||
|
||||
# Profile with --torch-trace option
|
||||
options = [
|
||||
"--torch-trace",
|
||||
]
|
||||
|
||||
returncode = binary_handler_profile_rocprof_compute(
|
||||
config,
|
||||
workload_dir,
|
||||
options,
|
||||
check_success=True,
|
||||
app_name="torch_test_app",
|
||||
)
|
||||
assert returncode == 0, "Profiling the torch application failed"
|
||||
# Verify files are generated
|
||||
# 1. Check basic CSV files
|
||||
num_devices = config.get("num_devices", 1)
|
||||
file_dict = test_utils.check_csv_files(workload_dir, num_devices, 1)
|
||||
assert "pmc_perf.csv" in file_dict, "pmc_perf.csv not generated"
|
||||
# 2. Check torch trace directory
|
||||
torch_trace_dir = Path(workload_dir) / "torch_trace"
|
||||
assert torch_trace_dir.exists(), "torch_trace directory not created"
|
||||
assert torch_trace_dir.is_dir(), "torch_trace is not a directory"
|
||||
# 3. Check per-operator CSV files exist
|
||||
operator_csv_files = list(torch_trace_dir.glob("*.csv"))
|
||||
assert len(operator_csv_files) > 0, "No per-operator CSV files generated"
|
||||
# 4. Verify per-operator CSV structure
|
||||
for op_csv in operator_csv_files:
|
||||
op_df = pd.read_csv(op_csv)
|
||||
assert len(op_df) > 0, f"Per-operator CSV {op_csv.name} is empty"
|
||||
test_utils.clean_output_dir(config["cleanup"], workload_dir)
|
||||
|
||||
|
||||
@skip_if_no_torch
|
||||
def test_torch_trace_overhead(binary_handler_profile_rocprof_compute):
|
||||
"""
|
||||
Measure overhead introduced by --torch-trace flag.
|
||||
Compares execution time with and without the flag to ensure overhead is acceptable.
|
||||
NOTE: Not included in the test suite since this requires PyTorch installation.
|
||||
"""
|
||||
helper_dir = Path(test_utils.get_output_dir(param_id="torch_helper_script"))
|
||||
helper_dir.mkdir(parents=True, exist_ok=True)
|
||||
torch_app_path = helper_dir / "test_torch_app.py"
|
||||
torch_app_code = """
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
class SimpleNet(nn.Module):
|
||||
def __init__(self):
|
||||
super(SimpleNet, self).__init__()
|
||||
self.fc1 = nn.Linear(10, 20)
|
||||
self.fc2 = nn.Linear(20, 10)
|
||||
def forward(self, x):
|
||||
x = self.fc1(x)
|
||||
x = F.relu(x)
|
||||
x = self.fc2(x)
|
||||
return x
|
||||
|
||||
if __name__ == "__main__":
|
||||
if not torch.cuda.is_available():
|
||||
import sys
|
||||
print("GPU is required for this test. Exiting.")
|
||||
sys.exit(1)
|
||||
model = SimpleNet()
|
||||
model = model.cuda()
|
||||
x = torch.randn(5, 10).cuda()
|
||||
# Run a single iteration
|
||||
for epoch in range(1):
|
||||
output = model(x)
|
||||
loss = output.sum()
|
||||
loss.backward()
|
||||
print("Training completed")
|
||||
"""
|
||||
with open(torch_app_path, "w") as f:
|
||||
f.write(torch_app_code)
|
||||
config["torch_test_app"] = ["python3", str(torch_app_path)]
|
||||
# Run WITHOUT --torch-trace (baseline)
|
||||
workload_dir_baseline = test_utils.get_output_dir(param_id="torch_baseline")
|
||||
start_baseline = time.time()
|
||||
returncode_baseline = binary_handler_profile_rocprof_compute(
|
||||
config,
|
||||
workload_dir_baseline,
|
||||
[], # No torch-trace flag
|
||||
check_success=True,
|
||||
roof=False,
|
||||
app_name="torch_test_app",
|
||||
)
|
||||
baseline_time = time.time() - start_baseline
|
||||
assert returncode_baseline == 0, "Baseline profiling failed"
|
||||
|
||||
# Read baseline timestamps
|
||||
baseline_df = pd.read_csv(f"{workload_dir_baseline}/pmc_perf.csv")
|
||||
baseline_kernel_duration_total = (
|
||||
baseline_df["End_Timestamp"].max() - baseline_df["Start_Timestamp"].min()
|
||||
)
|
||||
test_utils.clean_output_dir(config["cleanup"], workload_dir_baseline)
|
||||
# Run WITH --torch-trace
|
||||
workload_dir_with_flag = test_utils.get_output_dir(param_id="torch_with_ops")
|
||||
start_with_flag = time.time()
|
||||
returncode_with_flag = binary_handler_profile_rocprof_compute(
|
||||
config,
|
||||
workload_dir_with_flag,
|
||||
["--torch-trace"],
|
||||
check_success=True,
|
||||
roof=False,
|
||||
app_name="torch_test_app",
|
||||
)
|
||||
with_flag_time = time.time() - start_with_flag
|
||||
assert returncode_with_flag == 0, "Profiling with torch-trace failed"
|
||||
# Read with-flag timestamps
|
||||
with_flag_df = pd.read_csv(f"{workload_dir_with_flag}/pmc_perf.csv")
|
||||
with_flag_kernel_duration_total = (
|
||||
with_flag_df["End_Timestamp"].max() - with_flag_df["Start_Timestamp"].min()
|
||||
)
|
||||
longest_running_kernel_baseline = (
|
||||
baseline_df["End_Timestamp"] - baseline_df["Start_Timestamp"]
|
||||
).max()
|
||||
longest_running_kernel_with_flag = (
|
||||
with_flag_df["End_Timestamp"] - with_flag_df["Start_Timestamp"]
|
||||
).max()
|
||||
# Calculate overheads
|
||||
longest_running_kernel_overhead = (
|
||||
(longest_running_kernel_with_flag - longest_running_kernel_baseline)
|
||||
/ longest_running_kernel_baseline
|
||||
) * 100
|
||||
wall_clock_overhead = ((with_flag_time - baseline_time) / baseline_time) * 100
|
||||
kernel_overhead = (
|
||||
(with_flag_kernel_duration_total - baseline_kernel_duration_total)
|
||||
/ baseline_kernel_duration_total
|
||||
) * 100
|
||||
print(f"\n{'=' * 70}")
|
||||
print("Performance Overhead Analysis:")
|
||||
print(f" Longest running kernel overhead: {longest_running_kernel_overhead:.1f}%")
|
||||
print(f" Baseline wall-clock time: {baseline_time:.2f}s")
|
||||
print(f" With --torch-trace time: {with_flag_time:.2f}s")
|
||||
print(f" Wall-clock overhead: {wall_clock_overhead:.1f}%")
|
||||
print(f" Baseline kernel duration: {baseline_kernel_duration_total:.0f} ns")
|
||||
print(f" With flag kernel duration: {with_flag_kernel_duration_total:.0f} ns")
|
||||
print(f" Kernel execution overhead: {kernel_overhead:.1f}%")
|
||||
print(f"{'=' * 70}\n")
|
||||
# Verify torch trace directory was created
|
||||
torch_trace_dir = Path(workload_dir_with_flag) / "torch_trace"
|
||||
assert torch_trace_dir.exists(), "torch_trace directory should be created"
|
||||
operator_csv_files = list(torch_trace_dir.glob("*.csv"))
|
||||
assert len(operator_csv_files) > 0, "Operator CSV files should be generated"
|
||||
test_utils.clean_output_dir(config["cleanup"], workload_dir_with_flag)
|
||||
# Assert overhead is reasonable (< 100% wall-clock, < 50% kernel)
|
||||
assert wall_clock_overhead < 100, (
|
||||
f"Wall-clock overhead too high: {wall_clock_overhead:.1f}%"
|
||||
)
|
||||
assert kernel_overhead < 50, (
|
||||
f"Kernel execution overhead too high: {kernel_overhead:.1f}%"
|
||||
)
|
||||
assert longest_running_kernel_overhead < 50, (
|
||||
f"longest running kernel increase too high: "
|
||||
f"{longest_running_kernel_overhead:.1f}%"
|
||||
)
|
||||
|
||||