diff --git a/projects/rocprofiler-compute/CHANGELOG.md b/projects/rocprofiler-compute/CHANGELOG.md index cc8d1cf725..0b28ffca84 100644 --- a/projects/rocprofiler-compute/CHANGELOG.md +++ b/projects/rocprofiler-compute/CHANGELOG.md @@ -15,6 +15,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs. * Iteration multiplexing to collect counters in single application run +* `--torch-trace` option to map PyTorch operators to collected counter values during profiling. + * Runtime compilation of Roofline benchmarking: * GPU kernels from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository are moved into the ROCm Compute Profiler and are compiled at runtime using local HIP and HIPRTC Python wrappers. * Roofline binaries compiled from [rocm-amdgpu-bench](https://github.com/ROCm/rocm-amdgpu-bench) repository have been removed from the project, as Roofline runtime compilation performs the same work as the Roofline binaries. diff --git a/projects/rocprofiler-compute/docs/how-to/profile/mode.rst b/projects/rocprofiler-compute/docs/how-to/profile/mode.rst index 0eb02b47ef..849cc91e31 100644 --- a/projects/rocprofiler-compute/docs/how-to/profile/mode.rst +++ b/projects/rocprofiler-compute/docs/how-to/profile/mode.rst @@ -617,11 +617,11 @@ The following example demonstrates profiling roofline data only: INFO Kernel Selection: None INFO Dispatch Selection: None INFO Filtered sections: ['4'] - INFO + INFO INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ INFO Collecting Performance Counters (Roofline Only) INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - INFO + INFO INFO [Run 1/3][Approximate profiling time left: pending first measurement...] INFO [profiling] Current input file: /app/projects/rocprofiler-compute/workloads/occupancy/MI300X_A1/perfmon/pmc_perf_0.txt ... @@ -659,6 +659,172 @@ plot. :alt: Sample ROCm Compute Profiler roofline output :width: 800 +.. 
_torch-operator-mapping: + +Torch Operator Mapping +======================== + +To analyze performance metrics at the PyTorch operator level, ROCm Compute Profiler +offers Torch Operator Mapping functionality. This feature maps performance counters +to specific PyTorch operators, enabling detailed performance analysis of +PyTorch workloads at operator granularity. + +When enabled, this feature instruments your PyTorch application to correlate GPU +kernel executions with their originating PyTorch operators, showing +which operators contribute to specific performance counter values. + +.. note:: + + **PyTorch Operators vs GPU Kernels**: PyTorch operators (such as ``conv2d``, + ``linear``, ``relu``) are high-level API functions. When executed on a GPU, these + operators may dispatch one or more low-level GPU kernels (such as + ``implicit_convolve_sgemm``) that perform the actual computation on the hardware. + The ``--torch-trace`` feature provides operator-level attribution by injecting + markers that map collected kernel performance counters to their originating PyTorch + operators. + +Requirements +------------ + +* A valid PyTorch installation in the profiling environment +* The PyTorch application must be run as a Python script or Python command + +Usage +----- + +To enable Torch operator mapping, use the ``--torch-trace`` option when profiling +a PyTorch workload: + +.. 
code-block:: shell-session + + $ rocprof-compute profile --name mnist_torch --torch-trace -- python train.py + + __ _ + _ __ ___ ___ _ __ _ __ ___ / _| ___ ___ _ __ ___ _ __ _ _| |_ ___ + | '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \ + | | | (_) | (__| |_) | | | (_) | _|_____| (_| (_) | | | | | | |_) | |_| | || __/ + |_| \___/ \___| .__/|_| \___/|_| \___\___/|_| |_| |_| .__/ \__,_|\__\___| + |_| |_| + + rocprofiler-compute version: 3.4.0 + Profiler choice: rocprofiler-sdk + Path: /home/auser/workloads/mnist_torch/MI300X_A1 + Target: MI300X_A1 + Command: python train.py + Torch Trace: Enabled + Kernel Selection: None + Dispatch Selection: None + Hardware Blocks: All + + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Collecting Performance Counters + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ... + +Output +------ + +When Torch operator mapping is enabled, profiling generates additional output files +in the workload directory that correlate PyTorch operators with GPU kernels and +their performance counters: + +``_torch_trace.csv`` + Contains the merged operator-to-kernel mapping with performance counter data. These + are temporary files that are removed after consolidation into per-operator CSV files. + Key columns include: + + * ``Function`` - PyTorch operator name (e.g., ``aten::conv2d``, ``aten::linear``) + * ``Kernel_Name`` - GPU kernel name dispatched by the operator + * ``Counter_Name`` / ``Counter_Value`` - Hardware performance counter measurements + * ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator execution time + * ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel execution time + * ``Correlation_Id`` - Links operator calls to their kernel dispatches + +.. table:: SQC_ICACHE_INFLIGHT_LEVEL_torch_trace.csv from profiling mnist model. 
+ :widths: 20 80 +| Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp_function | End_Timestamp_function | GPU_ID | Dispatch_ID | PID | Grid_Size | Workgroup_Size | LDS_Per_Workgroup | Scratch_Per_Workitem | Arch_VGPR | Accum_VGPR | SGPR | Kernel_Name | Start_Timestamp_kernel | End_Timestamp_kernel | Kernel_ID | Counter_Name | Counter_Value | +|:----------------------|:--------------------------------|-------------:|------------:|-----------------:|---------------------------:|-------------------------:|---------:|--------------:|--------:|------------:|-----------------:|--------------------:|-----------------------:|------------:|-------------:|-------:|:------------------------|-------------------------:|-----------------------:|------------:|:--------------------------|----------------:| +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPC_CPC_STAT_STALL | 17946 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPC_CPC_TCIU_BUSY | 714 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPF_CPF_STAT_IDLE | 0 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | CPF_CPF_STAT_STALL | 78 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 
7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | GRBM_SPI_BUSY | 7277 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_REQ_NO_ALLOC_CSN | 8 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_RES_STALL_CSN | 0 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_SGPR_SIMD_FULL_CSN | 0 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_TGLIM_CU_FULL_CSN | 0 | +| MARKER_CORE_RANGE_API | torch.manual_seed:#1@main.py:99 | 1214226 | 1214226 | 0 | 7072577770736616 | 7072577771920451 | 4 | 1 | 1214226 | 512 | 512 | 0 | 0 | 16 | 0 | 32 | __amd_rocclr_copyBuffer | 7072577923044453 | 7072577923046813 | 6 | SPI_RA_TMP_STALL_CSN | 0 | + +``torch_trace/`` directory + Contains individual CSV files for each PyTorch operator detected during profiling. + Each file is named after the operator (e.g., ``nn_functional_conv2d.csv``, + ``nn_functional_linear.csv``, ``relu.csv``) and contains all kernel executions and + performance counters for that specific operator. 
Columns include: + + * ``Operator_Name`` - PyTorch operator name + * ``Context_Id`` - Source location where operator was called (e.g., ``conv2d:10@conv.py:543``) + * ``Counter_Name`` / ``Counter_Value`` - Hardware counter measurements + * ``Start_Timestamp_function`` / ``End_Timestamp_function`` - Operator timing + * ``Start_Timestamp_kernel`` / ``End_Timestamp_kernel`` - Kernel timing + + This per-operator organization enables focused analysis of specific operators without + processing the entire trace. + +.. table:: torch_trace/ones_like.csv from profiling mnist model. + :widths: 20 80 + +| Operator_Name | Context_Id | Kernel_Name | Counter_Name | Counter_Value | Start_Timestamp_function | End_Timestamp_function | Start_Timestamp_kernel | End_Timestamp_kernel | +|:----------------|:------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------|----------------:|---------------------------:|-------------------------:|-------------------------:|-----------------------:| +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_STAT_BUSY | 23004 | 6789210204040073 | 6789210223815845 | 6789210223810274 | 6789210223811914 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_STAT_IDLE | 0 | 6789210204040073 | 6789210223815845 | 6789210223810274 | 6789210223811914 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_STAT_STALL | 6715 | 6789281060081123 | 6789281079930585 | 6789281079932564 | 6789281079934204 | +| torch.ones_like | 
1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_TCIU_BUSY | 534 | 6789281060081123 | 6789281079930585 | 6789281079932564 | 6789281079934204 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_TCIU_IDLE | 20569 | 6789352286866085 | 6789352306292985 | 6789352306292904 | 6789352306294424 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_UTCL2IU_BUSY | 358 | 6789352286866085 | 6789352306292985 | 6789352306292904 | 6789352306294424 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_CPC_UTCL2IU_IDLE | 20046 | 6789422289668823 | 6789422308914683 | 6789422308913883 | 6789422308915403 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_ME1_BUSY_FOR_PACKET_DECODE | 16331 | 6789422289668823 | 6789422308914683 | 6789422308913883 | 6789422308915403 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_ME1_DC0_SPI_BUSY | 455 | 6789492192490428 | 6789492210892375 | 6789492210897243 | 6789492210898883 | +| torch.ones_like | 1@__init__.py:231 | void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, std::array >(int, at::native::FillFunctor, std::array) | CPC_UTCL1_STALL_ON_TRANSLATION | 374 | 6789492192490428 | 6789492210892375 | 6789492210897243 | 6789492210898883 | + +``pmc_perf.csv`` + Standard performance counter 
data (same as non-torch profiling) + +This data enables analysis such as: + +* Identifying which PyTorch operators executed which GPU kernels +* Aggregating performance counter values by operator +* Correlating operator-level timing with kernel-level hardware metrics +* Tracing the execution flow from high-level PyTorch API to low-level GPU kernels + +Limitations +----------- + +.. note:: + + * The ``--torch-trace`` option requires the application to be a Python command + or Python script. + + * A valid PyTorch installation must be available in the environment where profiling + is executed. + + * This feature adds instrumentation overhead to track operator boundaries. For + performance-critical measurements, consider profiling without this option first. + +Combined with Other Options +---------------------------- + +Torch operator mapping can be combined with other profiling options: + +.. code-block:: shell-session + + # Combine with block filtering for targeted counter collection + $ rocprof-compute profile --name mnist --torch-trace -b 11 12 -- python train.py + + # Combine with iteration multiplexing + $ rocprof-compute profile --name mnist --torch-trace --iteration-multiplexing kernel -- python train.py + + # Combine with kernel filtering (filters by GPU kernel name) + $ rocprof-compute profile --name mnist --torch-trace -k elementwise -- python train.py .. _iteration-multiplexing: @@ -687,7 +853,7 @@ To enable iteration multiplexing in ROCm Compute Profiler, use the ``--iteration-multiplexing`` option in your profiling command. You can optionally specify the policy for multiplexing. The available policies are: -* ``kernel`` +* ``kernel`` The counters are divided based on the kernels being executed. Each kernel call for a particular kernel collects a different subset of counters. 
* ``kernel_launch_params`` @@ -707,10 +873,10 @@ By default, if no policy is specified, ROCm Compute Profiler uses the ``kernel_l Iteration multiplexing is only supported when using ROCm Compute Profiler with the native counter collection tool. Ensure that ``--attach-pid`` is not used in your profiling command. - * Ensure that your workload runs for enough iterations to cover all counter subsets. - When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy) - or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy), - specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations + * Ensure that your workload runs for enough iterations to cover all counter subsets. + When using iteration multiplexing, the total number of iterations, for each kernel (for ``kernel`` policy) + or for each unique kernel and launch parameters combination (for ``kernel_launch_params`` policy), + specified in the workload should be sufficient to cover all subsets of counters. If the number of iterations is too low, some counters may not be collected. * Launch paramaters for ``kernel_launch_params`` policy. 
@@ -736,11 +902,11 @@ The following example demonstrates how to use iteration multiplexing with the [INFO] Kernel Selection: None [INFO] Dispatch Selection: None [INFO] Filtered sections: All - [INFO] + [INFO] [INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [INFO] Collecting Performance Counters [INFO] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - [INFO] + [INFO] [INFO] Using native counter collection tool: /tmp/rocprofiler-compute-tool-hlz4fagh/librocprofiler-compute-tool.so [INFO] Iteration multiplexing: kernel [INFO] [profiling] Current input files: /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_DCACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQC_ICACHE_INFLIGHT_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_IFETCH_LEVEL.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_LDS.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_SMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_INST_LEVEL_VMEM.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/SQ_LEVEL_WAVES.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_0.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_1.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_10.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_11.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_12.txt, 
/home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_2.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_3.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_4.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_5.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_6.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_7.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_8.txt, /home/rocm-systems/projects/rocprofiler-compute/sample/workloads/vcopy_kernel/MI200/perfmon/pmc_perf_9.txt diff --git a/projects/rocprofiler-compute/pyproject.toml b/projects/rocprofiler-compute/pyproject.toml index be99250be4..f11dc6ae4e 100644 --- a/projects/rocprofiler-compute/pyproject.toml +++ b/projects/rocprofiler-compute/pyproject.toml @@ -111,4 +111,5 @@ markers = [ "iteration_multiplexing_2", "iteration_multiplexing_stochastic", "noise_clamp", + "torch_ops", ] diff --git a/projects/rocprofiler-compute/src/argparser.py b/projects/rocprofiler-compute/src/argparser.py index 39abcd8bd4..a99a023a1b 100644 --- a/projects/rocprofiler-compute/src/argparser.py +++ b/projects/rocprofiler-compute/src/argparser.py @@ -239,6 +239,17 @@ Examples: help=argparse.SUPPRESS, # help="\t\t\tKokkos trace, traces Kokkos API calls.", ) + profile_group.add_argument( + "--torch-trace", + dest="torch_trace", + required=False, + default=False, + action="store_true", + help=( + "\t\t\tTorch Trace, maps PyTorch operators to performance counters.\n" + "\t\t\tShould be used only when profiling PyTorch applications." 
+ ), + ) profile_group.add_argument( "-k", "--kernel", diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py index a79d188c75..4dcd9dfa6b 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py @@ -109,16 +109,62 @@ class RocProfCompute_Base: "--attach-pid cannot be used with --iteration-multiplexing. " "Please remove one of these options." ) - # verify correct formatting for application binary args.remaining = args.remaining[1:] + resolved_exec_path: Optional[Path] = None + if args.remaining: # Ensure that command points to an executable - if not shutil.which(args.remaining[0]): + exec_candidate = shutil.which(args.remaining[0]) + if not exec_candidate: console_error( f"Your command {args.remaining[0]} doesn't point to a executable. " "Please verify." ) + resolved_exec_path = Path(exec_candidate).resolve() + + # Appending a wrapper for injecting roctx-markers + if getattr(args, "torch_trace", False): + # Find the inject_roctx.py script in src/utils + inject_script = ( + Path(__file__).parent.parent / "utils" / "inject_roctx.py" + ) + if not inject_script.exists(): + console_error( + f"Cannot find inject_roctx.py at {inject_script}. " + "Please verify your installation." + ) + + # Case 1: Explicit python command (python, python3, etc.) 
+ if args.remaining[0].startswith("python"): + # Insert inject_roctx.py after the python interpreter + args.remaining.insert(1, str(inject_script)) + # Case 2: Direct Python script execution (./main.py, /path/to/script.py) + elif args.remaining[0].endswith((".py", ".pyw", ".pyc", ".pyo")): + # Use current Python interpreter + args.remaining.insert(0, str(inject_script)) + args.remaining.insert(0, sys.executable) + else: + console_warning( + "Command does not look like a Python entry point, " + "skipping ROCTX auto-injection and launching workload as-is." + ) + console_warning( + "Ensure the binary already initializes PyTorch/ROCTX markers, " + "otherwise --torch-trace will have no effect." + ) + + if ( + resolved_exec_path + and (resolved_exec_path.parent / "_internal").is_dir() + ): + console_warning( + "Workload appears to be a self-contained binary. " + "Such bundles typically ship private ROCm/HSA libraries, which " + "prevents --torch-trace from collecting data. " + "Rebuild without packaging libhsa/libhip (or " + "adjust LD_LIBRARY_PATH to /opt/rocm) before profiling." + ) args.remaining = " ".join(args.remaining) elif not args.attach_pid: console_error( @@ -471,6 +517,8 @@ class RocProfCompute_Base: f'passes. Please use "--block" or "--set" ' f"to adjust or reduce the requested performance metrics!" 
) + console_debug(f"Sending profiler options to run_prof: {options}") + run_prof( fnames=str_fnames, profiler_options=options, @@ -478,6 +526,7 @@ class RocProfCompute_Base: mspec=self._soc._mspec, loglevel=args.loglevel, format_rocprof_output=args.format_rocprof_output, + torch_trace_enabled=getattr(args, "torch_trace", False), retain_rocpd_output=args.retain_rocpd_output, ) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py index 661732ebb1..1a2aaa9ac4 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py @@ -30,6 +30,7 @@ from pathlib import Path from rocprof_compute_profile.profiler_base import RocProfCompute_Base from rocprof_compute_soc.soc_base import OmniSoC_Base from utils.logger import console_error, console_log, demarcate +from utils.utils import consolidate_torch_trace_output class rocprof_v3_profiler(RocProfCompute_Base): @@ -49,7 +50,6 @@ class rocprof_v3_profiler(RocProfCompute_Base): def get_profiler_options(self) -> list[str]: args = self.get_args() app_cmd = shlex.split(args.remaining) - if args.kokkos_trace: trace_option = "--kokkos-trace" # NOTE: --kokkos-trace feature is incomplete and is disabled for now. 
@@ -60,9 +60,10 @@ class rocprof_v3_profiler(RocProfCompute_Base): ) elif args.hip_trace: trace_option = "--hip-trace" + elif getattr(args, "torch_trace", False): + trace_option = "--marker-trace" else: trace_option = "--kernel-trace" - profiling_options = [ # v3 requires output directory argument "-d", @@ -134,6 +135,10 @@ class rocprof_v3_profiler(RocProfCompute_Base): if self.ready_to_profile: # Manually join each pmc_perf*.csv output self.join_prof() + # Consolidate torch trace output if --torch-trace was used + if self.get_args().torch_trace: + consolidate_torch_trace_output(self.get_args().path) + # Run roofline microbenchmark super().post_processing() else: diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py index 542ec04619..483c2f71c2 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py @@ -31,6 +31,7 @@ from typing import Optional, Union from rocprof_compute_profile.profiler_base import RocProfCompute_Base from rocprof_compute_soc.soc_base import OmniSoC_Base from utils.logger import console_error, console_log, demarcate +from utils.utils import consolidate_torch_trace_output class rocprofiler_sdk_profiler(RocProfCompute_Base): @@ -71,6 +72,8 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base): "ROCPROF_OUTPUT_PATH": f"{args.path}/out/pmc_1", }) + if getattr(args, "torch_trace", False): + options["ROCPROF_MARKER_API_TRACE"] = "1" # Create folder pointed by ROCPROF_OUTPUT_PATH Path(options["ROCPROF_OUTPUT_PATH"]).mkdir(parents=True, exist_ok=True) @@ -161,6 +164,9 @@ class rocprofiler_sdk_profiler(RocProfCompute_Base): if self.ready_to_profile: # Manually join each pmc_perf*.csv output self.join_prof() + if self.get_args().torch_trace: + consolidate_torch_trace_output(self.get_args().path) 
+ # Run roofline microbenchmark super().post_processing() else: diff --git a/projects/rocprofiler-compute/src/utils/inject_roctx.py b/projects/rocprofiler-compute/src/utils/inject_roctx.py new file mode 100644 index 0000000000..429e8adb8a --- /dev/null +++ b/projects/rocprofiler-compute/src/utils/inject_roctx.py @@ -0,0 +1,292 @@ +# ruff: noqa +############################################################################## +# MIT License +# +# Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + + +""" +ROCTX Injection Wrapper - Auto-discovers and intercepts ALL PyTorch operators +Usage: python inject_roctx.py main.py --epochs 1 --batch-size 4 +""" + +import os +import sys +from pathlib import Path + +# Add parent directory to Python path for config module +script_dir = Path(__file__).resolve().parent +sys.path.insert(0, str(script_dir.parent)) + +from utils.logger import console_log, console_warning + +rocm_root = os.environ.get("ROCM_PATH", "/opt/rocm") +python_version = f"python{sys.version_info.major}.{sys.version_info.minor}" +candidate_paths = [ + f"{rocm_root}/lib/{python_version}/site-packages", + f"{rocm_root}/libexec/rocprofiler-sdk/python", +] + +for candidate in candidate_paths: + if candidate not in sys.path: + sys.path.insert(0, candidate) + +try: + import torch + + console_log(f"PyTorch version: {torch.__version__}") +except ImportError: + console_warning( + "PyTorch is not installed or not properly configured.\n" + "The --torch-trace option requires a valid PyTorch installation.\n" + "Please install PyTorch and try again." 
+ ) + sys.exit(0) + +import importlib.util +import inspect +from functools import wraps + +import torch.nn.functional as F +from roctx import rangePop, rangePush + + +def roctx_wrapper(func, name=None): + func_name = name or func.__name__ + call_counter = {"count": 0} + + @wraps(func) + def wrapper(*args, **kwargs): + call_counter["count"] += 1 + current_frame = inspect.currentframe() + caller_frame = current_frame.f_back if current_frame is not None else None + if caller_frame is not None: + filename = caller_frame.f_code.co_filename + location = f"{Path(filename).name}:{caller_frame.f_lineno}" + else: + location = "unknown:0" + + # Unique marker: function + call_number + source_location + rangePush(f"{func_name}:#{call_counter['count']}@{location}") + try: + result = func(*args, **kwargs) + finally: + rangePop() + return result + + return wrapper + + +def auto_discover_torch_callables(module, prefix, exclude_patterns=None): + """Automatically discover all callable functions in a module.""" + if exclude_patterns is None: + exclude_patterns = ["__", "_", "is_", "set_", "get_"] + + functions = {} + for name in dir(module): + # Skip private/internal functions + if any(name.startswith(pat) for pat in exclude_patterns): + continue + + try: + attr = getattr(module, name) + # Only wrap callables (functions, not classes or constants) + if callable(attr) and not isinstance(attr, type): + full_name = f"{prefix}.{name}" + functions[full_name] = (module, name, attr) + except Exception as e: + console_warning(type(e)) + console_warning(f"Could not access {prefix}.{name}: {e}") + + return functions + + +def inject_roctx_into_torch(): + """Monkey-patch PyTorch operations to add ROCTX markers.""" + + console_log("Auto-discovering PyTorch operations to wrap...") + + # Auto-discover functions from key modules + all_operations = {} + + # torch.* functions (matmul, mm, cat, etc.) 
+ all_operations.update(auto_discover_torch_callables(torch, "torch")) + + # torch.nn.functional.* functions (linear, relu, softmax, etc.) + all_operations.update(auto_discover_torch_callables(F, "torch.nn.functional")) + + # torch.linalg.* functions (matrix operations) + try: + all_operations.update( + auto_discover_torch_callables(torch.linalg, "torch.linalg") + ) + except Exception as e: + console_warning(type(e)) + console_warning(f"Could not access torch.linalg: {e}") + + # torch.fft.* functions (FFT operations) + try: + all_operations.update(auto_discover_torch_callables(torch.fft, "torch.fft")) + except Exception as e: + console_warning(type(e)) + console_warning(f"Could not access torch.fft: {e}") + console_log(f"Found {len(all_operations)} operations to wrap") + console_log("Injecting ROCTX markers into PyTorch operations...") + + wrapped_count = 0 + failed_count = 0 + + for full_name, (module, attr_name, original_func) in all_operations.items(): + try: + # Replace with wrapped version + wrapped_func = roctx_wrapper(original_func, full_name) + setattr(module, attr_name, wrapped_func) + wrapped_count += 1 + + # Print first 20 and last 5 for visibility + if wrapped_count <= 20 or wrapped_count > len(all_operations) - 5: + console_log(f"Wrapped: {full_name}") + elif wrapped_count == 21: + console_log( + f" ... 
(wrapping {len(all_operations) - 25} more operations)" + ) + + except Exception as e: + failed_count += 1 + if failed_count <= 5: # Only show first few failures + console_warning(f"Failed to wrap {full_name}: {e}") + + # Wrap tensor methods + original_backward = torch.Tensor.backward + backward_counter = {"count": 0} + + def backward_with_roctx(self, *args, **kwargs): + backward_counter["count"] += 1 + current_frame = inspect.currentframe() + caller_frame = current_frame.f_back if current_frame is not None else None + if caller_frame is not None: + filename = caller_frame.f_code.co_filename + location = f"{Path(filename).name}:{caller_frame.f_lineno}" + else: + location = "unknown:0" + + rangePush(f"torch.Tensor.backward:#{backward_counter['count']}@{location}") + try: + return original_backward(self, *args, **kwargs) + finally: + rangePop() + + torch.Tensor.backward = backward_with_roctx + + wrapped_count += 1 + console_log("Wrapped: torch.Tensor.backward") + + console_log(f"Wrapped {wrapped_count} operations with ROCTX markers") + if failed_count > 0: + console_warning( + f"Failed to wrap {failed_count} operations (likely not patchable)" + ) + + +def inject_roctx_into_optimizer(): + """Wrap optimizer step() method.""" + from torch.optim import Optimizer + + original_step = Optimizer.step + + def step_with_roctx(self, *args, **kwargs): + rangePush(f"optimizer.{self.__class__.__name__}.step") + try: + return original_step(self, *args, **kwargs) + finally: + rangePop() + + Optimizer.step = step_with_roctx + console_log("Wrapped optimizer.step() with ROCTX markers\n") + + +def inject_roctx_into_model(): + """Wrap nn.Module forward() method with call counter.""" + + from torch import nn + from typing import Any + + original_call = nn.Module.__call__ + + # Per-instance call counters + def call_with_roctx(self, *args, **kwargs): + class_name = self.__class__.__name__ + + # Initialize counter for this instance if not exists + if not hasattr(self, "_roctx_call_count"): + 
self._roctx_call_count = 0 + self._roctx_call_count += 1 + + # Get caller location + current_frame = inspect.currentframe() + caller_frame = current_frame.f_back if current_frame is not None else None + if caller_frame is not None: + filename = caller_frame.f_code.co_filename + location = f"{Path(filename).name}:{caller_frame.f_lineno}" + else: + location = "unknown:0" + + # Create detailed marker + rangePush( + f"nn.Module.{class_name}.forward:#{self._roctx_call_count}@{location}" + ) + try: + return original_call(self, *args, **kwargs) + finally: + rangePop() + + nn.Module.__call__ = call_with_roctx + console_log("Wrapped nn.Module forward() with ROCTX markers\n") + + +if __name__ == "__main__": + if len(sys.argv) < 2: + console_log("Usage: python inject_roctx.py <target_script> [script_args...]") + sys.exit(1) + + # Get target script and its arguments + target_script = sys.argv[1] + script_args = sys.argv[2:] + + # Inject ROCTX markers BEFORE importing the target script + inject_roctx_into_torch() + inject_roctx_into_optimizer() + inject_roctx_into_model() + + console_log("=" * 70) + console_log("Starting target script with ROCTX instrumentation...") + console_log("=" * 70) + + # Modify sys.argv so the target script sees correct arguments + sys.argv = [target_script] + script_args + + # Load and execute the target script + spec = importlib.util.spec_from_file_location("__main__", target_script) + module = importlib.util.module_from_spec(spec) + sys.modules["__main__"] = module + spec.loader.exec_module(module) diff --git a/projects/rocprofiler-compute/src/utils/rocpd_data.py b/projects/rocprofiler-compute/src/utils/rocpd_data.py index 9fa4bfb5ac..284f0031f4 100644 --- a/projects/rocprofiler-compute/src/utils/rocpd_data.py +++ b/projects/rocprofiler-compute/src/utils/rocpd_data.py @@ -25,7 +25,7 @@ import csv import sqlite3 -from contextlib import closing +from contextlib import ExitStack, closing from typing import Any import pandas as pd @@ -37,6 +37,8 @@ from utils.logger 
import console_error COUNTERS_COLLECTION_QUERY = """ SELECT agent_id as GPU_ID, + guid as GUID, + correlation_id as Correlation_Id, dispatch_id as Dispatch_ID, pid as PID, grid_size as Grid_Size, @@ -54,6 +56,24 @@ SELECT value as Counter_Value FROM counters_collection """ +MARKER_API_TRACE_QUERY = """ +SELECT + category AS Domain, + json_extract(extdata, '$.message') AS Function, + pid AS Process_Id, + tid AS Thread_Id, + corr_id AS Correlation_Id, + guid AS GUID, + start AS Start_Timestamp, + end AS End_Timestamp +FROM regions +ORDER BY start +""" +KERNEL_DISPATCH_QUERY = """ +SELECT dispatch_id, event_id, guid +FROM rocpd_kernel_dispatch +WHERE guid = ? +""" ROCPD_PMC_EVENT_TABLE_NAME_PREFIX = "rocpd_pmc_event_" TABLE_NAME_PREFIX_QUERY = ( "SELECT name FROM sqlite_master WHERE type='table' " @@ -64,30 +84,43 @@ INSERT_QUERY = "INSERT INTO {table_name} ({columns}) VALUES ({placeholders})" def convert_dbs_to_csv( db_paths: list[str], - csv_file_path: str, + counter_collection_csv_path: str, + marker_trace_csv_path: str, ) -> None: - """ - Read rocpd databases and write to CSV file - """ - # Read counters_collection view from the databases and write to CSV - try: - with open(csv_file_path, "w", newline="") as csvfile: - writer = csv.writer(csvfile) - header_written = False - for db_path in db_paths: - with closing(sqlite3.connect(db_path)) as conn: - with closing(conn.execute(COUNTERS_COLLECTION_QUERY)) as cursor: - if not header_written: - writer.writerow([ - description[0] for description in cursor.description - ]) - header_written = True - for row in cursor: - writer.writerow(row) - except OSError as e: - console_error(f"Database error while converting to CSV: {e}") - except Exception as e: - console_error(f"Unexpected error converting database to CSV: {e}") + queries = { + counter_collection_csv_path: COUNTERS_COLLECTION_QUERY, + marker_trace_csv_path: MARKER_API_TRACE_QUERY, + } + header_written = {path: False for path in queries} + + with ExitStack() as 
stack: + writers = { + path: csv.writer(stack.enter_context(open(path, "w", newline=""))) + for path in queries + } + for db_path in db_paths: + with closing(sqlite3.connect(db_path)) as conn: + for file_path, query in queries.items(): + try: + with closing(conn.execute(query)) as cursor: + if cursor.description is None: + continue + if not header_written[file_path]: + writers[file_path].writerow([ + desc[0] for desc in cursor.description + ]) + header_written[file_path] = True + writers[file_path].writerows(cursor) + except OSError as e: + console_error( + f"Database error while extracting {file_path} " + f"from {db_path}: {e}" + ) + except Exception as e: + console_error( + f"Unexpected error while extracting {file_path} " + f"from {db_path}: {e}" + ) def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame: @@ -134,7 +167,7 @@ def process_rocpd_csv(df: pd.DataFrame) -> pd.DataFrame: def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> None: - """Update pmc_event table in the given rocpd database path""" + """Updates pmc_event table in the given rocpd database path.""" try: with closing(sqlite3.connect(rocpd_db_path)) as conn: # Get pmc_event table name @@ -154,13 +187,27 @@ def update_rocpd_pmc_events(counter_info: pd.DataFrame, rocpd_db_path: str) -> N guid = table_name[len(ROCPD_PMC_EVENT_TABLE_NAME_PREFIX) :].replace( "_", "-" ) + # Map dispatch_id to event_id from rocpd_kernel_dispatch + # Native counter collection CSV has dispatch_id, but schema needs event_id + # event_id may differ from dispatch_id when marker API tracing is enabled + with closing(conn.execute(KERNEL_DISPATCH_QUERY, (guid,))) as cursor: + rows = cursor.fetchall() + if not rows: + console_error("No kernel dispatch data found.") + return + dispatch_to_event = { + dispatch_id: event_id for dispatch_id, event_id, _ in rows + } + counter_info["event_id"] = counter_info["dispatch_id"].map( + dispatch_to_event + ) columns = ("guid", "event_id", "pmc_id", "value") values 
= list( zip( # guid [guid] * len(counter_info), # event_id - counter_info["dispatch_id"], + counter_info["event_id"], # pmc_id counter_info["counter_id"], # value diff --git a/projects/rocprofiler-compute/src/utils/utils.py b/projects/rocprofiler-compute/src/utils/utils.py index e6805bba08..b78abeb1ac 100644 --- a/projects/rocprofiler-compute/src/utils/utils.py +++ b/projects/rocprofiler-compute/src/utils/utils.py @@ -786,6 +786,7 @@ def run_prof( mspec: Any, # noqa: ANN401 loglevel: int, format_rocprof_output: str, + torch_trace_enabled: bool = False, retain_rocpd_output: bool = False, ) -> None: multiple_files = isinstance(fnames, list) @@ -939,9 +940,12 @@ def run_prof( # Write results_fbase.csv rocpd_data.convert_dbs_to_csv( glob.glob(workload_dir + "/out/pmc_1/*/*.db"), - workload_dir + f"/results_{fbase}.csv", + workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv", + workload_dir + f"/out/pmc_1/{fbase}_marker_api_trace.csv", + ) + combined_df = pd.read_csv( + workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv" ) - combined_df = pd.read_csv(workload_dir + f"/results_{fbase}.csv") # Reset Dispatch_ID based on PID, Kernel_Name, Grid_Size, # Workgroup_Size, LDS_Per_Workgroup, Start_Timestamp, End_Timestamp combined_df["Dispatch_ID"] = combined_df.groupby( @@ -964,8 +968,12 @@ def run_prof( ).ngroup() # Drop PID since its not required combined_df = combined_df.drop(columns=["PID"]) + combined_df.to_csv( + workload_dir + f"/out/pmc_1/{fbase}_counter_collection.csv", index=False + ) combined_df.to_csv(workload_dir + f"/results_{fbase}.csv", index=False) - + if torch_trace_enabled: + process_torch_trace_output(workload_dir, fbase, format_rocprof_output) if retain_rocpd_output: for db_path in glob.glob(workload_dir + "/out/pmc_1/*/*.db"): pid = Path(db_path).stem.split("_")[0] @@ -1004,7 +1012,9 @@ def run_prof( process_kokkos_trace_output(workload_dir, fbase) elif "--hip-trace" in options: process_hip_trace_output(workload_dir, fbase) - + # Add torch 
operator trace processing + if torch_trace_enabled: + process_torch_trace_output(workload_dir, fbase, format_rocprof_output) # Combine results into single CSV file if results_files: combined_results = pd.concat( @@ -1175,7 +1185,7 @@ def convert_native_counter_collection_csv(workload_dir: str) -> None: ) rocprofv3_counter_data = pd.DataFrame({ - "Correlation_Id": merged_data["dispatch_id"], + "Correlation_Id": merged_data["Correlation_Id"], "Dispatch_Id": merged_data["dispatch_id"], "Agent_Id": merged_data["Agent_Id"], "Queue_Id": merged_data["Queue_Id"], @@ -1262,6 +1272,178 @@ def process_rocprofv3_output(workload_dir: str, using_native_tool: bool) -> list return results_files_csv +@demarcate +def process_torch_trace_output( + workload_dir: str, + fbase: str, + output_format: str = "rocpd", +) -> None: + """ + Creates PyTorch operator trace from counter_collection and marker_api_trace data. + - Performs inner join on Correlation_Id, filtering out unmatched entries + - Output file is saved to workload root, not the temporary out/ directory + """ + marker_trace_csv_file_path = f"{workload_dir}/out/pmc_1/" + # Find all marker_api_trace CSV files + marker_api_trace_csvs = list( + Path(marker_trace_csv_file_path).glob("**/*_marker_api_trace.csv") + ) + counter_collection_csvs = [ + markers_file.parent + / markers_file.name.replace("_marker_api_trace.", "_counter_collection.") + for markers_file in marker_api_trace_csvs + ] + existing_csv_files = [ + [marker_api_trace_csvs[i], counter_collection_csvs[i]] + for i in range(len(marker_api_trace_csvs)) + if counter_collection_csvs[i].is_file() and marker_api_trace_csvs[i].is_file() + ] + if not existing_csv_files: + console_warning( + f"No marker files with corresponding counter files found for {fbase}" + ) + return + + # Join marker and counter data + def _merge_pair( + marker_path: Path, + counter_path: Path, + join_keys: list = ("Correlation_Id"), + ) -> pd.DataFrame: + marker_df = pd.read_csv(marker_path) + counter_df 
= pd.read_csv(counter_path) + return pd.merge( + marker_df, + counter_df, + on=join_keys, + how="inner", + suffixes=("_function", "_kernel"), + ) + + if output_format == "csv": + merged_results = pd.concat( + [_merge_pair(f[0], f[1]) for f in existing_csv_files], + ignore_index=True, + ) + elif output_format == "rocpd": + # There will be one pair of csv files extracted from rocpd db and consolidated. + merged_results = _merge_pair( + existing_csv_files[0][0], + existing_csv_files[0][1], + ("Correlation_Id", "GUID"), + ) + # Save merged results + merged_results.to_csv( + f"{workload_dir}/{fbase}_torch_trace.csv", + index=False, + ) + console_log(f"Created {workload_dir}/{fbase}_torch_trace.csv") + + +@demarcate +def consolidate_torch_trace_output(workload_dir: str) -> None: + # Consolidate torch operator trace CSV files from multiple processes + console_log("Consolidating torch operator trace output...") + # Find all torch trace CSV files in workload directory + torch_trace_files = glob.glob(f"{workload_dir}/*_torch_trace.csv") + if not torch_trace_files: + console_warning("No torch trace files found.") + return + # Read and concatenate all torch trace files + all_traces = [] + required_columns = [ + "Function", + "Kernel_Name", + "Counter_Name", + "Counter_Value", + "Start_Timestamp_function", + "End_Timestamp_function", + "Start_Timestamp_kernel", + "End_Timestamp_kernel", + ] + for trace_file in torch_trace_files: + try: + df = pd.read_csv(trace_file) + except pd.errors.ParserError as e: + console_warning(f"Parser error while reading {trace_file}: {e}") + continue + except OSError as e: + console_warning(f"I/O error while reading {trace_file}: {e}") + continue + except Exception as e: + # Unexpected error; log full details for debugging + console_warning( + f"Unexpected error while reading {trace_file}: {e}\n" + f"{traceback.format_exc()}" + ) + continue + + missing_columns = [col for col in required_columns if col not in df.columns] + if missing_columns: + 
console_warning( + f"Skipping {trace_file}: missing required columns {missing_columns}" + ) + continue + + all_traces.append(df[required_columns]) + if not all_traces: + console_warning("No valid torch trace data to consolidate.") + return + + consolidated_df = pd.concat(all_traces, ignore_index=True) + if consolidated_df.isnull().values.any(): + console_warning("Consolidated torch trace contains missing values") + return + consolidated_df = consolidated_df.sort_values(by=["Function", "Counter_Name"]) + + split_columns = consolidated_df["Function"].str.split(":#", expand=True) + consolidated_df["Operator_Name"] = ( + split_columns[0] if len(split_columns.columns) > 0 else None + ) + consolidated_df["Context_Id"] = ( + split_columns[1] if len(split_columns.columns) > 1 else None + ) + consolidated_df.drop(columns=["Function"], inplace=True) + consolidated_df = consolidated_df[ + [ + "Operator_Name", + "Context_Id", + "Kernel_Name", + "Counter_Name", + "Counter_Value", + "Start_Timestamp_function", + "End_Timestamp_function", + "Start_Timestamp_kernel", + "End_Timestamp_kernel", + ] + ] + + if consolidated_df.isnull().values.any(): + console_error( + "Missing values in consolidated torch trace after splitting " + "the Function name." + ) + return + + grouped = consolidated_df.groupby("Operator_Name") + for operator_name, group in grouped: + sanitized_operator_name = operator_name.replace("torch.", "").replace(".", "_") + # Ensure output directory exists + Path(f"{workload_dir}/torch_trace").mkdir(parents=True, exist_ok=True) + output_file = f"{workload_dir}/torch_trace/{sanitized_operator_name}.csv" + group.to_csv(output_file, index=False) + console_log( + f"Saved consolidated trace for {sanitized_operator_name} to {output_file}" + ) + + for trace_file in torch_trace_files: + try: + Path(trace_file).unlink() + console_debug(f"Removed temporary torch trace file: {trace_file}") + except OSError as e: + console_warning(f"Error removing temporary file {trace_file}: 
{e}") + + @demarcate def process_kokkos_trace_output(workload_dir: str, fbase: str) -> None: # marker api trace csv files are generated for each process diff --git a/projects/rocprofiler-compute/tests/test_profile_general.py b/projects/rocprofiler-compute/tests/test_profile_general.py index 4c4a397718..11fffc85e7 100644 --- a/projects/rocprofiler-compute/tests/test_profile_general.py +++ b/projects/rocprofiler-compute/tests/test_profile_general.py @@ -23,6 +23,7 @@ ############################################################################## +import importlib.util import inspect import os import re @@ -2779,3 +2780,215 @@ def test_iteration_multiplexing_all_counter_accuracy( assert are_stochastic_counters_similar( [counters_kernel, counters_kernel_launch_params], counters_no_multiplexing ) + + +skip_if_no_torch = pytest.mark.skipif( + importlib.util.find_spec("torch") is None, reason="torch is required for this test" +) + + +@skip_if_no_torch +def test_torch_trace_profile(binary_handler_profile_rocprof_compute): + """ + Test profiling a PyTorch application with --torch-trace option. + Verifies that the expected output files are generated and non-empty. + NOTE: Skipped automatically when PyTorch is not installed. + """ + workload_dir = test_utils.get_output_dir(param_id="torch_ops") + Path(workload_dir).mkdir(parents=True, exist_ok=True) + torch_app_path = Path(workload_dir) / "test_torch_app.py" + + torch_app_code = """ +import torch +import torch.nn as nn +import torch.nn.functional as F + +class SimpleNet(nn.Module): + def __init__(self): + super(SimpleNet, self).__init__() + self.fc1 = nn.Linear(10, 20) + self.fc2 = nn.Linear(20, 10) + def forward(self, x): + x = self.fc1(x) + x = F.relu(x) + x = self.fc2(x) + return x + +if __name__ == "__main__": + if not torch.cuda.is_available(): + import sys + print("GPU is required for this test. 
Exiting.") + sys.exit(1) + model = SimpleNet() + model = model.cuda() + x = torch.randn(5, 10).cuda() + # Run a few iterations + for epoch in range(1): + output = model(x) + loss = output.sum() + loss.backward() + print("Training completed") +""" + + with open(torch_app_path, "w") as f: + f.write(torch_app_code) + + config["torch_test_app"] = ["python3", str(torch_app_path)] + + # Profile with --torch-trace option + options = [ + "--torch-trace", + ] + + returncode = binary_handler_profile_rocprof_compute( + config, + workload_dir, + options, + check_success=True, + app_name="torch_test_app", + ) + assert returncode == 0, "Profiling the torch application failed" + # Verify files are generated + # 1. Check basic CSV files + num_devices = config.get("num_devices", 1) + file_dict = test_utils.check_csv_files(workload_dir, num_devices, 1) + assert "pmc_perf.csv" in file_dict, "pmc_perf.csv not generated" + # 2. Check torch trace directory + torch_trace_dir = Path(workload_dir) / "torch_trace" + assert torch_trace_dir.exists(), "torch_trace directory not created" + assert torch_trace_dir.is_dir(), "torch_trace is not a directory" + # 3. Check per-operator CSV files exist + operator_csv_files = list(torch_trace_dir.glob("*.csv")) + assert len(operator_csv_files) > 0, "No per-operator CSV files generated" + # 4. Verify per-operator CSV files are non-empty + for op_csv in operator_csv_files: + op_df = pd.read_csv(op_csv) + assert len(op_df) > 0, f"Per-operator CSV {op_csv.name} is empty" + test_utils.clean_output_dir(config["cleanup"], workload_dir) + + +@skip_if_no_torch +def test_torch_trace_overhead(binary_handler_profile_rocprof_compute): + """ + Measure overhead introduced by --torch-trace flag. + Compares execution time with and without the flag to ensure overhead is acceptable. + NOTE: Skipped automatically when PyTorch is not installed. 
+ """ + helper_dir = Path(test_utils.get_output_dir(param_id="torch_helper_script")) + helper_dir.mkdir(parents=True, exist_ok=True) + torch_app_path = helper_dir / "test_torch_app.py" + torch_app_code = """ +import torch +import torch.nn as nn +import torch.nn.functional as F + +class SimpleNet(nn.Module): + def __init__(self): + super(SimpleNet, self).__init__() + self.fc1 = nn.Linear(10, 20) + self.fc2 = nn.Linear(20, 10) + def forward(self, x): + x = self.fc1(x) + x = F.relu(x) + x = self.fc2(x) + return x + +if __name__ == "__main__": + if not torch.cuda.is_available(): + import sys + print("GPU is required for this test. Exiting.") + sys.exit(1) + model = SimpleNet() + model = model.cuda() + x = torch.randn(5, 10).cuda() + # Run a few iterations + for epoch in range(1): + output = model(x) + loss = output.sum() + loss.backward() + print("Training completed") +""" + with open(torch_app_path, "w") as f: + f.write(torch_app_code) + config["torch_test_app"] = ["python3", str(torch_app_path)] + # Run WITHOUT --torch-trace (baseline) + workload_dir_baseline = test_utils.get_output_dir(param_id="torch_baseline") + start_baseline = time.time() + returncode_baseline = binary_handler_profile_rocprof_compute( + config, + workload_dir_baseline, + [], # No torch-trace flag + check_success=True, + roof=False, + app_name="torch_test_app", + ) + baseline_time = time.time() - start_baseline + assert returncode_baseline == 0, "Baseline profiling failed" + + # Read baseline timestamps + baseline_df = pd.read_csv(f"{workload_dir_baseline}/pmc_perf.csv") + baseline_kernel_duration_total = ( + baseline_df["End_Timestamp"].max() - baseline_df["Start_Timestamp"].min() + ) + test_utils.clean_output_dir(config["cleanup"], workload_dir_baseline) + # Run WITH --torch-trace + workload_dir_with_flag = test_utils.get_output_dir(param_id="torch_with_ops") + start_with_flag = time.time() + returncode_with_flag = binary_handler_profile_rocprof_compute( + config, + workload_dir_with_flag, + 
["--torch-trace"], + check_success=True, + roof=False, + app_name="torch_test_app", + ) + with_flag_time = time.time() - start_with_flag + assert returncode_with_flag == 0, "Profiling with torch-trace failed" + # Read with-flag timestamps + with_flag_df = pd.read_csv(f"{workload_dir_with_flag}/pmc_perf.csv") + with_flag_kernel_duration_total = ( + with_flag_df["End_Timestamp"].max() - with_flag_df["Start_Timestamp"].min() + ) + longest_running_kernel_baseline = ( + baseline_df["End_Timestamp"] - baseline_df["Start_Timestamp"] + ).max() + longest_running_kernel_with_flag = ( + with_flag_df["End_Timestamp"] - with_flag_df["Start_Timestamp"] + ).max() + # Calculate overheads + longest_running_kernel_overhead = ( + (longest_running_kernel_with_flag - longest_running_kernel_baseline) + / longest_running_kernel_baseline + ) * 100 + wall_clock_overhead = ((with_flag_time - baseline_time) / baseline_time) * 100 + kernel_overhead = ( + (with_flag_kernel_duration_total - baseline_kernel_duration_total) + / baseline_kernel_duration_total + ) * 100 + print(f"\n{'=' * 70}") + print("Performance Overhead Analysis:") + print(f" Longest running kernel overhead: {longest_running_kernel_overhead:.1f}%") + print(f" Baseline wall-clock time: {baseline_time:.2f}s") + print(f" With --torch-trace time: {with_flag_time:.2f}s") + print(f" Wall-clock overhead: {wall_clock_overhead:.1f}%") + print(f" Baseline kernel duration: {baseline_kernel_duration_total:.0f} ns") + print(f" With flag kernel duration: {with_flag_kernel_duration_total:.0f} ns") + print(f" Kernel execution overhead: {kernel_overhead:.1f}%") + print(f"{'=' * 70}\n") + # Verify torch trace directory was created + torch_trace_dir = Path(workload_dir_with_flag) / "torch_trace" + assert torch_trace_dir.exists(), "torch_trace directory should be created" + operator_csv_files = list(torch_trace_dir.glob("*.csv")) + assert len(operator_csv_files) > 0, "Operator CSV files should be generated" + 
test_utils.clean_output_dir(config["cleanup"], workload_dir_with_flag) + # Assert overhead is reasonable (< 100% wall-clock, < 50% kernel) + assert wall_clock_overhead < 100, ( + f"Wall-clock overhead too high: {wall_clock_overhead:.1f}%" + ) + assert kernel_overhead < 50, ( + f"Kernel execution overhead too high: {kernel_overhead:.1f}%" + ) + assert longest_running_kernel_overhead < 50, ( + "Longest-running kernel overhead too high: " + f"{longest_running_kernel_overhead:.1f}%" + )