diff --git a/projects/rocprofiler-compute/CHANGELOG.md b/projects/rocprofiler-compute/CHANGELOG.md index 0ba4e8dbab..9f33653aa6 100644 --- a/projects/rocprofiler-compute/CHANGELOG.md +++ b/projects/rocprofiler-compute/CHANGELOG.md @@ -8,6 +8,9 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs. * Add `rocpd` choice for `--format-rocprof-output` option in profile mode * Add `--retain-rocpd-output` option in profile mode to save large raw rocpd databases in workload directory +* Show description of metrics during analysis + * Use `--include-cols Description` to show the Description column, which is excluded by default from the + ROCm Compute Profiler CLI output. ### Changed @@ -16,43 +19,39 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs. * When `--format-rocprof-output rocpd` is used, only pmc_perf.csv will be written to workload directory instead of mulitple csv files. +* Improve analysis block based filtering to accept metric id level filtering + * This can be used to collect individual metrics from various sections of analysis config + +* CLI analysis mode baseline comparison will now only compare common metrics across workloads and will not show Metric ID + * Remove metrics from analysis configuration files which are explicitly marked as empty or None + +* Change the basic view of TUI from aggregated analysis data to individual kernel analysis data + ### Resolved issues +* Fixed not detecting memory clock issue when using amd-smi +* Fixed standalone GUI crashing +* Fixed L2 read/write/atomic bandwidths on MI350 +* Update metric names for better alignment between analysis configuration and documentation + ### Known issues +### Optimized + +* Improved `--time-unit` option in analyze mode to apply time unit conversion across all analysis sections, not just kernel top stats. 
+ ### Removed -## ROCm Compute Profiler 3.2.0 for ROCm 7.0.0 +* Usage of rocm-smi +* Hardware IP block based filtering has been removed in favor of analysis report block based filtering +* Remove aggregated analysis view from TUI mode + + +## ROCm Compute Profiler 3.2.1 for ROCm 7.0.0 ### Added -* Support Roofline plot on CLI (single run) - -* Stochastic (hardware-based) PC sampling has been enabled for AMD Instinct MI300X series and later accelerators. - -* Sorting of PC sampling by type: offset or count. - -* Add rocprof-compute Text User Interface (TUI) support for analyze mode (beta version) - * A command line based user interface to support interactive single-run analysis - * launch with `--tui` option in analyze mode. i.e., `rocprof-compute analyze --tui` - -* Add support to be able to acquire from rocprofv3 every single channle on each XCD of TCC counters - -* Add Docker files to package the application and dependencies into a single portable and executable standalone binary file - -* Analysis report based filtering - * -b option in profile mode now additionally accepts metric id(s) for analysis report based filtering - * -b option in profile mode also accept hardware IP block for filtering, however, this support will be deprecated soon - * --list-metrics option added in profile mode to list possible metric id(s), similar to analyze mode - -* Data type selection option for roofline profiling - * --roofline-data-type / -R option added to specify which data types the user wants to capture in the roofline PDF plot outputs - * Default is FP32, but user can specify as many types as desired to overlay on the same plot output - -* Additional data types for roofline profiling - * Now supports FP4, FP6, FP8, FP16, BF16, FP32, FP64, I8, I32, I64 (dependent on gpu architecture) - -* Support host-trap PC Sampling on CLI (beta version) +#### CDNA4 (AMD Instinct MI350/MI355) support * Support for AMD Instinct MI350 series GPUs with the addition of the following counters: 
* VALU co-issue (Two VALUs are issued instructions) efficiency @@ -73,82 +72,130 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs. * L2 to EA stalls * L2 to EA stalls per channel -* Roofline support for RHEL 10 +* Roofline support for AMD Instinct MI350 series architecture. -* Roofline support for MI350 series architecture +#### Textual User Interface (TUI) (beta version) -* Interface to rocprofiler-sdk - * Setting ROCPROF=rocprofiler-sdk environment variable will use rocprofiler-sdk C++ library instead of rocprofv3 python script +* Text User Interface (TUI) support for analyze mode + * A command line based user interface to support interactive single-run analysis + * To launch, use `--tui` option in analyze mode. For example, ``rocprof-compute analyze --tui``. + +#### PC Sampling (beta version) + +* Stochastic (hardware-based) PC sampling has been enabled for AMD Instinct MI300X series and later accelerators. + +* Host-trap PC Sampling has been enabled for AMD Instinct MI200 series and later accelerators. + +* Support for sorting of PC sampling by type: offset or count. + +* PC Sampling Support on CLI and TUI analysis. + +#### Roofline + +* Support for Roofline plot on CLI (single run) analysis. + +* Roofline support for RHEL 10 OS. + +* FP4 and FP6 data types have been added for roofline profiling on AMD Instinct MI350 series. + +#### rocprofv3 support + +* ``rocprofv3`` is supported as the default backend for profiling. +* Support to obtain performance information for all channels for TCC counters. +* Support for profiling on AMD Instinct MI 100 using ``rocprofv3``. +* Deprecation warning for ``rocprofv3`` interface in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. + +#### Others + +* Docker files to package the application and dependencies into a single portable and executable standalone binary file. 
+ +* Analysis report based filtering + * ``-b`` option in profile mode now also accepts metric id(s) for analysis report based filtering. + * ``-b`` option in profile mode also accepts hardware IP block for filtering; however, this filter support will be deprecated soon. + * ``--list-metrics`` option added in profile mode to list possible metric id(s), similar to analyze mode. + +* Interface to ROCprofiler-SDK. + * Setting the environment variable ``ROCPROF=rocprofiler-sdk`` will use ROCprofiler-SDK C++ library instead of ``rocprofv3`` python script. * Add --rocprofiler-sdk-library-path runtime option to choose the path to rocprofiler-sdk library to be used * Using rocprof v1 / v2 / v3 interfaces will trigger a deprecation warning to use rocprofiler-sdk interface * Support MEM chart on CLI (single run) -* Add deprecation warning for database update mode. +* Deprecation warning for MongoDB database update mode. -* Show description of metrics during analysis - * Use `--include-cols Description` to show `Description` column which is excluded by default from cli output +* Deprecation warning for ``rocm-smi`` + +* ``--specs-correction`` option to provide missing system specifications for analysis. ### Changed -* Change the default rocprof version to rocprofv3, this is used when environment variable "ROCPROF" is not set -* Change the rocprof version for unit tests to rocprofv3 on all SoCs except MI100 -* Change normal_unit default to per_kernel -* Change dependency from rocm-smi to amd-smi -* Decrease profiling time by not collecting counters not used in post analysis -* Update definition of following metrics for MI 350: - * VGPR Writes - * Total FLOPs (consider fp6 and fp4 ops) -* Update Dash to >=3.0.0 (for web UI) -* Change when Roofline PDFs are generated- during general profiling and --roof-only profiling (skip only when --no-roof option is present) -* Update Roofline binaries +* Changed the default ``rocprof`` version to ``rocprofv3``. 
This is used when the environment variable ``ROCPROF`` is not set. +* Changed ``normal_unit`` default to ``per_kernel``. +* Decreased profiling time by not collecting counters that are unused in post-analysis. +* Updated Dash to >=3.0.0 (for web UI). +* Changed when Roofline PDFs are generated: during general profiling and ``--roof-only`` profiling (skipped only when the ``--no-roof`` option is present). +* Updated Roofline binaries: * Rebuild using latest ROCm stack - * OS distribution support minimum for roofline feature is now Ubuntu22.04, RHEL9, and SLES15SP6 -* Improve analysis block based filtering to accept metric id level filtering - * This can be used to collect individual metrics from various sections of analysis config -* CLI analysis mode baseline comparison will now only compare common metrics across workloads and will not show Metric ID - * Remove metrics from analysis configuration files which are explicitly marked as empty or None + * Minimum OS distribution support for the roofline feature is now Ubuntu 22.04, RHEL 9, and SLES 15 SP6. ### Optimized * ROCm Compute Profiler CLI has been improved to better display the GPU architecture analytics -* Improved `--time-unit` option in analyze mode to apply time unit conversion across all analysis sections, not just kernel top stats. ### Resolved issues -* Fixed MI 100 counters not being collected when rocprofv3 is used -* Fixed option specs-correction -* Fixed kernel name and kernel dispatch filtering when using rocprof v3 -* Fixed not collecting TCC channel counters in rocprof v3 -* Fixed peak FLOPS of F8 I8 F16 and BF16 on MI300 -* Fixed not detecting memory clock issue when using amd-smi -* Fixed standalone GUI crashing -* Fixed L2 read/write/atomic bandwidths on MI350 -* Update metric names for better alignment between analysis configuration and documentation +* Fixed kernel name and kernel dispatch filtering when using ``rocprofv3``. +* Fixed TCC channel counters not being collected when using ``rocprofv3``. 
+* Fixed peak FLOPS of F8, I8, F16, and BF16 on AMD Instinct MI300. ### Known issues -* On MI 100, accumulation counters will not be collected and the following metrics will not show up in analysis: Instruction Fetch Latency, Wavefront Occupancy, LDS Latency - * As a workaround, use ROCPROF=rocprof environement variable, to use rocprofv1 for profiling on MI 100 +* On AMD Instinct MI100, accumulation counters are not collected, so the following metrics do not show up in the analysis: Instruction Fetch Latency, Wavefront Occupancy, LDS Latency + * As a workaround, set the environment variable ``ROCPROF=rocprof`` to use ``rocprof v1`` for profiling on AMD Instinct MI100. -* GPU id filtering is not supported when using rocprof v3 +* GPU id filtering is not supported when using ``rocprofv3``. -* Analysis of previously collected workload data will not work due to sysinfo.csv schema change - * As a workaround, run the profiling operation again for the workload and interrupt the process after ten seconds. - Followed by copying the `sysinfo.csv` file from the new data folder to the old one. - This assumes your system specification hasn't changed since the creation of the previous workload data. +* Analysis of previously collected workload data will not work due to sysinfo.csv schema change. + * As a workaround, re-run the profiling operation for the workload and interrupt the process after 10 seconds. + Then copy the ``sysinfo.csv`` file from the new data folder to the old one. + This assumes your system specification hasn't changed since the creation of the previous workload data. * Analysis of new workloads might require providing shader/memory clock speed using ---specs-correction operation if `amd-smi` or `rocminfo` does not provide clock speeds. +``--specs-correction`` option if amd-smi or rocminfo does not provide clock speeds. 
-* Memory chart on CLI might look corrupted if CLI width is too narrow +* Memory chart on ROCm Compute Profiler CLI might look corrupted if the CLI width is too narrow. ### Removed * Roofline support for Ubuntu 20.04 and SLES below 15.6 -* Usage of rocm-smi -* Remove support for MI50/MI60 in accordance with the documentation -* Hardware IP block based filtering has been removed in favor of analysis report block based filtering +* Removed support for AMD Instinct MI50 and MI60. + +### Upcoming changes + +* ``rocprof v1/v2/v3`` interfaces will be removed in favor of the ROCprofiler-SDK interface, which directly accesses the ``rocprofv3`` C++ tool. + * To use the ROCprofiler-SDK interface, set the environment variable ``ROCPROF=rocprofiler-sdk`` and optionally provide the profile mode option ``--rocprofiler-sdk-library-path /path/to/librocprofiler-sdk.so`` +* Hardware IP block based filtering using the ``-b`` option in profile mode will be removed in favor of analysis report block based filtering using the ``-b`` option in profile mode. +* Using rocprof v1 / v2 / v3 interfaces will trigger a deprecation warning to use rocprofiler-sdk interface +* MongoDB database support will be removed. +* Usage of ``rocm-smi`` will be removed in favor of ``amd-smi``. + + +## ROCm Compute Profiler 3.1.1 for ROCm 6.4.2 + +### Added + +* 8-bit floating point (FP8) metrics support for AMD Instinct MI300 GPUs. +* Additional data types for roofline: FP8, FP16, BF16, FP32, FP64, I8, I32, I64 (dependent on the GPU architecture). +* Data type selection option ``--roofline-data-type / -R`` for roofline profiling. The default data type is FP32. + +### Changed + +* Changed dependency from `rocm-smi` to `amd-smi`. + +### Resolved issues + +* Fixed a crash related to Agent ID caused by the new format of the `rocprofv3` output CSV file. 
+ ## ROCm Compute Profiler 3.1.0 for ROCm 6.4.0 diff --git a/projects/rocprofiler-compute/docs/conf.py b/projects/rocprofiler-compute/docs/conf.py index aef5591810..2edd1c9b4f 100644 --- a/projects/rocprofiler-compute/docs/conf.py +++ b/projects/rocprofiler-compute/docs/conf.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + # Configuration file for the Sphinx documentation builder. 
# @@ -212,4 +214,4 @@ extlinks = { } # Uncomment if facing rate limit exceed issue with local build -external_projects_remote_repository = "" \ No newline at end of file +external_projects_remote_repository = "" diff --git a/projects/rocprofiler-compute/docs/data/analyze/tui_home.png b/projects/rocprofiler-compute/docs/data/analyze/tui_home.png new file mode 100644 index 0000000000..24fde654f0 Binary files /dev/null and b/projects/rocprofiler-compute/docs/data/analyze/tui_home.png differ diff --git a/projects/rocprofiler-compute/docs/data/analyze/tui_kernel_selection.png b/projects/rocprofiler-compute/docs/data/analyze/tui_kernel_selection.png new file mode 100644 index 0000000000..0ea6204b97 Binary files /dev/null and b/projects/rocprofiler-compute/docs/data/analyze/tui_kernel_selection.png differ diff --git a/projects/rocprofiler-compute/docs/data/metrics_description.yaml b/projects/rocprofiler-compute/docs/data/metrics_description.yaml index 7184df52e1..512518ab65 100644 --- a/projects/rocprofiler-compute/docs/data/metrics_description.yaml +++ b/projects/rocprofiler-compute/docs/data/metrics_description.yaml @@ -1,173 +1,65 @@ # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py Wavefront launch stats: - Grid Size: - rst: The total number of work-items (or, threads) launched as a part of the kernel - dispatch. In HIP, this is equivalent to the total grid size multiplied by the - total workgroup (or, block) size. - unit: Work-Items - Workgroup Size: - rst: The total number of work-items (or, threads) in each workgroup (or, block) - launched as part of the kernel dispatch. In HIP, this is equivalent to the total - block size. - unit: Work-Items - Total Wavefronts: - rst: "The total number of wavefronts launched as part of the kernel dispatch.\ - \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\ - \ size is always 64 work-items. 
Thus, the total number of wavefronts should\ - \ be equivalent to the ceiling of grid size divided by 64." - unit: Wavefronts - Saved Wavefronts: - rst: The total number of wavefronts saved at a context-save. See `cwsr_enable - `_. - unit: Wavefronts - Restored Wavefronts: - rst: The total number of wavefronts restored from a context-save. See `cwsr_enable - `_. - unit: Wavefronts - VGPRs: - rst: 'The number of architected vector general-purpose registers allocated for the - kernel, see :ref:`VALU `. Note: this may not exactly match the - number of VGPRs requested by the compiler due to allocation granularity.' - unit: VGPRs AGPRs: rst: 'The number of accumulation vector general-purpose registers allocated for the kernel, see :ref:`AGPRs `. Note: this may not exactly match the number of AGPRs requested by the compiler due to allocation granularity.' unit: AGPRs - SGPRs: - rst: 'The number of scalar general-purpose registers allocated for the kernel, see - :ref:`SALU `. Note: this may not exactly match the number of SGPRs - requested by the compiler due to allocation granularity. plain' - unit: SGPRs + Grid Size: + rst: The total number of work-items (or, threads) launched as a part of the kernel + dispatch. In HIP, this is equivalent to the total grid size multiplied by the + total workgroup (or, block) size. + unit: Work-Items LDS Allocation: rst: 'The number of bytes of :doc:`LDS ` memory (or, shared memory) allocated for this kernel. Note: This may also be larger than what was requested at compile time due to both allocation granularity and dynamic per-dispatch LDS allocations.' unit: Bytes per workgroup + Restored Wavefronts: + rst: The total number of wavefronts restored from a context-save. See `cwsr_enable + `_. + unit: Wavefronts + SGPRs: + rst: 'The number of scalar general-purpose registers allocated for the kernel, see + :ref:`SALU `. Note: this may not exactly match the number of SGPRs + requested by the compiler due to allocation granularity. 
plain' + unit: SGPRs + Saved Wavefronts: + rst: The total number of wavefronts saved at a context-save. See `cwsr_enable + `_. + unit: Wavefronts Scratch Allocation: rst: The number of bytes of :ref:`scratch memory ` requested per work-item for this kernel. Scratch memory is used for stack memory on the accelerator, as well as for register spills and restores. unit: Bytes per work-item - Kernel Time: - rst: The total duration of the executed kernel. - unit: Nanoseconds - Kernel Time (Cycles): - rst: The total duration of the executed kernel in cycles. - unit: Cycles - Instructions per wavefront: - rst: The average number of instructions (of all types) executed per wavefront. - This is averaged over all wavefronts in a kernel dispatch. - unit: Instructions per wavefront - Wave Cycles: - rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a - compute unit per :ref:`normalization unit `. This is averaged - over all wavefronts in a kernel dispatch. Note: this should not be directly - compared to the kernel cycles above.' - unit: Cycles per normalization unit - Dependency Wait Cycles: - rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on - memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) - per :ref:`normalization unit `. This counter is incremented - at every cycle by *all* wavefronts on a CU stalled at a memory operation. As - such, it is most useful to get a sense of how waves were spending their time, - rather than identification of a precise limiter because another wave could - be actively executing while a wave is stalled. The sum of this metric, Issue - Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. - unit: Cycles per normalization unit - Issue Wait Cycles: - rst: The number of cycles a wavefront in the kernel dispatch was unable to issue - an instruction for any reason (e.g., execution pipe back-pressure, arbitration - loss, etc.) 
per :ref:`normalization unit `. This counter - is incremented at every cycle by *all* wavefronts on a CU unable to issue an instruction. As - such, it is most useful to get a sense of how waves were spending their time, - rather than identification of a precise limiter because another wave could - be actively executing while a wave is issue stalled. The sum of this metric, - Dependency Wait Cycles and Active Cycles should be equal to the total Wave - Cycles metric. - unit: Cycles per normalization unit - Active Cycles: - rst: The average number of cycles a wavefront in the kernel dispatch was actively - executing instructions per :ref:`normalization unit `. - This measurement is made on a per-wavefront basis, and may include cycles that - another wavefront spent actively executing (on another execution unit, for - example) or was stalled. As such, it is most useful to get a sense of how - waves were spending their time, rather than identification of a precise limiter. - The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal - to the total Wave Cycles metric. - unit: Cycles per normalization unit - Wavefront Occupancy: - rst: 'The time-averaged number of wavefronts resident on the accelerator over the - lifetime of the kernel. Note: this metric may be inaccurate for short-running - kernels (less than 1ms).' + Total Wavefronts: + rst: "The total number of wavefronts launched as part of the kernel dispatch.\ + \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\ + \ size is always 64 work-items. Thus, the total number of wavefronts should\ + \ be equivalent to the ceiling of grid size divided by 64." unit: Wavefronts + VGPRs: + rst: 'The number of architected vector general-purpose registers allocated for the + kernel, see :ref:`VALU `. Note: this may not exactly match the + number of VGPRs requested by the compiler due to allocation granularity.' 
+ unit: VGPRs + Workgroup Size: + rst: The total number of work-items (or, threads) in each workgroup (or, block) + launched as part of the kernel dispatch. In HIP, this is equivalent to the total + block size. + unit: Work-Items Wavefront runtime stats: - Grid Size: - rst: The total number of work-items (or, threads) launched as a part of the kernel - dispatch. In HIP, this is equivalent to the total grid size multiplied by the - total workgroup (or, block) size. - unit: Work-Items - Workgroup Size: - rst: The total number of work-items (or, threads) in each workgroup (or, block) - launched as part of the kernel dispatch. In HIP, this is equivalent to the total - block size. - unit: Work-Items - Total Wavefronts: - rst: "The total number of wavefronts launched as part of the kernel dispatch.\ - \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\ - \ size is always 64 work-items. Thus, the total number of wavefronts should\ - \ be equivalent to the ceiling of grid size divided by 64." - unit: Wavefronts - Saved Wavefronts: - rst: The total number of wavefronts saved at a context-save. See `cwsr_enable - `_. - unit: Wavefronts - Restored Wavefronts: - rst: The total number of wavefronts restored from a context-save. See `cwsr_enable - `_. - unit: Wavefronts - VGPRs: - rst: 'The number of architected vector general-purpose registers allocated for the - kernel, see :ref:`VALU `. Note: this may not exactly match the - number of VGPRs requested by the compiler due to allocation granularity.' - unit: VGPRs - AGPRs: - rst: 'The number of accumulation vector general-purpose registers allocated for the - kernel, see :ref:`AGPRs `. Note: this may not exactly match the - number of AGPRs requested by the compiler due to allocation granularity.' - unit: AGPRs - SGPRs: - rst: 'The number of scalar general-purpose registers allocated for the kernel, see - :ref:`SALU `. 
Note: this may not exactly match the number of SGPRs - requested by the compiler due to allocation granularity. plain' - unit: SGPRs - LDS Allocation: - rst: 'The number of bytes of :doc:`LDS ` memory (or, shared memory) - allocated for this kernel. Note: This may also be larger than what was requested - at compile time due to both allocation granularity and dynamic per-dispatch - LDS allocations.' - unit: Bytes per workgroup - Scratch Allocation: - rst: The number of bytes of :ref:`scratch memory ` requested per - work-item for this kernel. Scratch memory is used for stack memory on the accelerator, - as well as for register spills and restores. - unit: Bytes per work-item - Kernel Time: - rst: The total duration of the executed kernel. - unit: Nanoseconds - Kernel Time (Cycles): - rst: The total duration of the executed kernel in cycles. - unit: Cycles - Instructions per wavefront: - rst: The average number of instructions (of all types) executed per wavefront. - This is averaged over all wavefronts in a kernel dispatch. - unit: Instructions per wavefront - Wave Cycles: - rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a - compute unit per :ref:`normalization unit `. This is averaged - over all wavefronts in a kernel dispatch. Note: this should not be directly - compared to the kernel cycles above.' + Active Cycles: + rst: The average number of cycles a wavefront in the kernel dispatch was actively + executing instructions per :ref:`normalization unit `. + This measurement is made on a per-wavefront basis, and may include cycles that + another wavefront spent actively executing (on another execution unit, for + example) or was stalled. As such, it is most useful to get a sense of how + waves were spending their time, rather than identification of a precise limiter. + The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal + to the total Wave Cycles metric. 
unit: Cycles per normalization unit Dependency Wait Cycles: rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on @@ -179,6 +71,10 @@ Wavefront runtime stats: be actively executing while a wave is stalled. The sum of this metric, Issue Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. unit: Cycles per normalization unit + Instructions per wavefront: + rst: The average number of instructions (of all types) executed per wavefront. + This is averaged over all wavefronts in a kernel dispatch. + unit: Instructions per wavefront Issue Wait Cycles: rst: The number of cycles a wavefront in the kernel dispatch was unable to issue an instruction for any reason (e.g., execution pipe back-pressure, arbitration @@ -190,15 +86,17 @@ Wavefront runtime stats: Dependency Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. unit: Cycles per normalization unit - Active Cycles: - rst: The average number of cycles a wavefront in the kernel dispatch was actively - executing instructions per :ref:`normalization unit `. - This measurement is made on a per-wavefront basis, and may include cycles that - another wavefront spent actively executing (on another execution unit, for - example) or was stalled. As such, it is most useful to get a sense of how - waves were spending their time, rather than identification of a precise limiter. - The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal - to the total Wave Cycles metric. + Kernel Time: + rst: The total duration of the executed kernel. + unit: Nanoseconds + Kernel Time (Cycles): + rst: The total duration of the executed kernel in cycles. + unit: Cycles + Wave Cycles: + rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a + compute unit per :ref:`normalization unit `. This is averaged + over all wavefronts in a kernel dispatch. Note: this should not be directly + compared to the kernel cycles above.' 
unit: Cycles per normalization unit Wavefront Occupancy: rst: 'The time-averaged number of wavefronts resident on the accelerator over the @@ -206,17 +104,9 @@ Wavefront runtime stats: kernels (less than 1ms).' unit: Wavefronts Overall instruction mix: - VALU: - rst: The total number of vector arithmetic logic unit (VALU) operations issued. - These are the workhorses of the :doc:`compute unit `, and are - used to execute a wide range of instruction types including floating point - operations, non-uniform address calculations, transcendental operations, integer - operations, shifts, conditional evaluation, etc. - unit: Instructions - VMEM: - rst: The total number of vector memory operations issued. These include most loads, - stores and atomic operations and all accesses to :ref:`generic, global, private - and texture ` memory. + Branch: + rst: The total number of branch operations issued. These typically consist of jump + or branch operations and are used to implement control flow. unit: Instructions LDS: rst: The total number of LDS (also known as shared memory) operations issued. These @@ -237,192 +127,36 @@ Overall instruction mix: used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__`` memory. unit: Instructions - Branch: - rst: The total number of branch operations issued. These typically consist of jump - or branch operations and are used to implement control flow. + VALU: + rst: The total number of vector arithmetic logic unit (VALU) operations issued. + These are the workhorses of the :doc:`compute unit `, and are + used to execute a wide range of instruction types including floating point + operations, non-uniform address calculations, transcendental operations, integer + operations, shifts, conditional evaluation, etc. + unit: Instructions + VMEM: + rst: The total number of vector memory operations issued. 
These include most loads, + stores and atomic operations and all accesses to :ref:`generic, global, private + and texture ` memory. unit: Instructions - INT32: - rst: The total number of instructions operating on 32-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - INT64: - rst: The total number of instructions operating on 64-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-ADD: - rst: The total number of addition instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-MUL: - rst: The total number of multiplication instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-FMA: - rst: The total number of fused multiply-add instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-Trans: - rst: The total number of transcendental instructions (e.g., `sqrt`) operating on - 16-bit floating-point operands issued to the VALU per :ref:`normalization unit - `. - unit: Instructions per normalization unit - F32-ADD: - rst: The total number of addition instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-MUL: - rst: The total number of multiplication instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-FMA: - rst: The total number of fused multiply-add instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - F32-Trans: - rst: The total number of transcendental instructions (such as ``sqrt``) operating - on 32-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - F64-ADD: - rst: The total number of addition instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-MUL: - rst: The total number of multiplication instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-FMA: - rst: The total number of fused multiply-add instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-Trans: - rst: The total number of transcendental instructions (such as `sqrt`) operating - on 64-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Conversion: - rst: "The total number of type conversion instructions (such as converting data\ - \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ - \ `." - unit: Instructions per normalization unit - Global/Generic Instr: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - Global/Generic Atomic: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instr: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - MFMA-I8: - rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F8: - rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. This is supported in AMD - Instinct MI300 series and later only. - unit: Instructions per normalization unit - MFMA-F16: - rst: The total number of 16-bit floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - MFMA-BF16: - rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F32: - rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F64: - rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit VALU arithmetic instruction mix: - VALU: - rst: The total number of vector arithmetic logic unit (VALU) operations issued. - These are the workhorses of the :doc:`compute unit `, and are - used to execute a wide range of instruction types including floating point - operations, non-uniform address calculations, transcendental operations, integer - operations, shifts, conditional evaluation, etc. - unit: Instructions - VMEM: - rst: The total number of vector memory operations issued. These include most loads, - stores and atomic operations and all accesses to :ref:`generic, global, private - and texture ` memory. - unit: Instructions - LDS: - rst: The total number of LDS (also known as shared memory) operations issued. These - include loads, stores, atomics, and HIP's ``__shfl`` operations. - unit: Instructions - MFMA: - rst: The total number of matrix fused multiply-add instructions issued. - unit: Instructions - SALU: - rst: The total number of scalar arithmetic logic unit (SALU) operations issued. - Typically these are used for address calculations, literal constants, and other - operations that are provably uniform across a wavefront. Although scalar memory - (SMEM) operations are issued by the SALU, they are counted separately in this - section. - unit: Instructions - SMEM: - rst: The total number of scalar memory (SMEM) operations issued. 
These are typically - used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__`` - memory. - unit: Instructions - Branch: - rst: The total number of branch operations issued. These typically consist of jump - or branch operations and are used to implement control flow. - unit: Instructions - INT32: - rst: The total number of instructions operating on 32-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - INT64: - rst: The total number of instructions operating on 64-bit integer operands issued - to the VALU per :ref:`normalization unit `. + Conversion: + rst: "The total number of type conversion instructions (such as converting data\ + \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ + \ `." unit: Instructions per normalization unit F16-ADD: rst: The total number of addition instructions operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - F16-MUL: - rst: The total number of multiplication instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit F16-FMA: rst: The total number of fused multiply-add instructions operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit + F16-MUL: + rst: The total number of multiplication instructions operating on 16-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. 
+ unit: Instructions per normalization unit F16-Trans: rst: The total number of transcendental instructions (e.g., `sqrt`) operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit @@ -432,14 +166,14 @@ VALU arithmetic instruction mix: rst: The total number of addition instructions operating on 32-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - F32-MUL: - rst: The total number of multiplication instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit F32-FMA: rst: The total number of fused multiply-add instructions operating on 32-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit + F32-MUL: + rst: The total number of multiplication instructions operating on 32-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. + unit: Instructions per normalization unit F32-Trans: rst: The total number of transcendental instructions (such as ``sqrt``) operating on 32-bit floating-point operands issued to the VALU per :ref:`normalization @@ -449,240 +183,36 @@ VALU arithmetic instruction mix: rst: The total number of addition instructions operating on 64-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - F64-MUL: - rst: The total number of multiplication instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit F64-FMA: rst: The total number of fused multiply-add instructions operating on 64-bit floating-point operands issued to the VALU per :ref:`normalization unit `. 
unit: Instructions per normalization unit + F64-MUL: + rst: The total number of multiplication instructions operating on 64-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. + unit: Instructions per normalization unit F64-Trans: rst: The total number of transcendental instructions (such as `sqrt`) operating on 64-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - Conversion: - rst: "The total number of type conversion instructions (such as converting data\ - \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ - \ `." + INT32: + rst: The total number of instructions operating on 32-bit integer operands issued + to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - Global/Generic Instr: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instr: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - Spill/Stack Read: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - MFMA-I8: - rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F8: - rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. This is supported in AMD - Instinct MI300 series and later only. - unit: Instructions per normalization unit - MFMA-F16: - rst: The total number of 16-bit floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-BF16: - rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F32: - rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F64: - rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. 
+ INT64: + rst: The total number of instructions operating on 64-bit integer operands issued + to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit MFMA instruction mix: - VALU: - rst: The total number of vector arithmetic logic unit (VALU) operations issued. - These are the workhorses of the :doc:`compute unit `, and are - used to execute a wide range of instruction types including floating point - operations, non-uniform address calculations, transcendental operations, integer - operations, shifts, conditional evaluation, etc. - unit: Instructions - VMEM: - rst: The total number of vector memory operations issued. These include most loads, - stores and atomic operations and all accesses to :ref:`generic, global, private - and texture ` memory. - unit: Instructions - LDS: - rst: The total number of LDS (also known as shared memory) operations issued. These - include loads, stores, atomics, and HIP's ``__shfl`` operations. - unit: Instructions - MFMA: - rst: The total number of matrix fused multiply-add instructions issued. - unit: Instructions - SALU: - rst: The total number of scalar arithmetic logic unit (SALU) operations issued. - Typically these are used for address calculations, literal constants, and other - operations that are provably uniform across a wavefront. Although scalar memory - (SMEM) operations are issued by the SALU, they are counted separately in this - section. - unit: Instructions - SMEM: - rst: The total number of scalar memory (SMEM) operations issued. These are typically - used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__`` - memory. - unit: Instructions - Branch: - rst: The total number of branch operations issued. These typically consist of jump - or branch operations and are used to implement control flow. - unit: Instructions - INT32: - rst: The total number of instructions operating on 32-bit integer operands issued - to the VALU per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - INT64: - rst: The total number of instructions operating on 64-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-ADD: - rst: The total number of addition instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-MUL: - rst: The total number of multiplication instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-FMA: - rst: The total number of fused multiply-add instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-Trans: - rst: The total number of transcendental instructions (e.g., `sqrt`) operating on - 16-bit floating-point operands issued to the VALU per :ref:`normalization unit - `. - unit: Instructions per normalization unit - F32-ADD: - rst: The total number of addition instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-MUL: - rst: The total number of multiplication instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-FMA: - rst: The total number of fused multiply-add instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-Trans: - rst: The total number of transcendental instructions (such as ``sqrt``) operating - on 32-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - F64-ADD: - rst: The total number of addition instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-MUL: - rst: The total number of multiplication instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-FMA: - rst: The total number of fused multiply-add instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-Trans: - rst: The total number of transcendental instructions (such as `sqrt`) operating - on 64-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Conversion: - rst: "The total number of type conversion instructions (such as converting data\ - \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ - \ `." - unit: Instructions per normalization unit - Global/Generic Instr: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - Global/Generic Atomic: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instr: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - MFMA-I8: - rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F8: - rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. This is supported in AMD - Instinct MI300 series and later only. + MFMA-BF16: + rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions + issued per :ref:`normalization unit `. unit: Instructions per normalization unit MFMA-F16: rst: The total number of 16-bit floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. 
unit: Instructions per normalization unit - MFMA-BF16: - rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit MFMA-F32: rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. @@ -691,19 +221,16 @@ MFMA instruction mix: rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. unit: Instructions per normalization unit + MFMA-F8: + rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued + per :ref:`normalization unit `. This is supported in AMD + Instinct MI300 series and later only. + unit: Instructions per normalization unit + MFMA-I8: + rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued + per :ref:`normalization unit `. + unit: Instructions per normalization unit Compute Speed-of-Light: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' - unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GIOPs MFMA FLOPs (BF16): rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations executed per second. Note: this does not include any 16-bit brain floating @@ -742,157 +269,25 @@ Compute Speed-of-Light: ` instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.' 
unit: GFLOPs - IPC: - rst: The ratio of the total number of instructions executed on the :doc:`CU ` - over the :ref:`total active CU cycles `. - unit: Instructions per cycle - IPC (Issued): - rst: The ratio of the total number of (non-:ref:`internal `) - instructions issued over the number of cycles where the :ref:`scheduler ` - was actively working on issuing instructions. Refer to the :ref:`Issued IPC - ` example for further detail. - unit: Instructions per cycle - SALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`SALU ` - was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM - ` instructions over the :ref:`total CU cycles `. - unit: Percent - VALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VALU ` - was busy executing instructions. Does not include :ref:`VMEM ` operations. - Computed as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing VALU instructions over the :ref:`total CU cycles - `. - unit: Percent - VMEM Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` - unit was busy executing instructions, including both global/generic and spill/scratch - operations (see the :ref:`VMEM instruction count metrics ` - for more detail). Does not include :ref:`VALU ` operations. Computed as - the ratio of the total number of cycles spent by the :ref:`scheduler ` - issuing VMEM instructions over the :ref:`total CU cycles `. - unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. 
- unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within a - wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. - unit: Percent - MFMA Instruction Cycles: - rst: The average duration of :ref:`MFMA ` instructions in this kernel - in cycles. Computed as the ratio of the total number of cycles the MFMA unit - was busy over the total number of MFMA instructions. Compare to, for example, - the `AMD Matrix Instruction Calculator `_. - unit: Cycles per instruction - VMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a VMEM instruction to complete. - unit: Cycles - SMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a SMEM instruction to complete. - unit: Cycles - FLOPs (Total): - rst: The total number of floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - IOPs (Total): - rst: The total number of integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: IOP per normalization unit - F16 OPs: - rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. 
- unit: FLOP per normalization unit - BF16 OPs: - rst: 'The total number of 16-bit brain floating-point operations executed on either - the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. Note: on current CDNA accelerators, the VALU has - no native BF16 instructions.' - unit: FLOP per normalization unit - F32 OPs: - rst: The total number of 32-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - F64 OPs: - rst: The total number of 64-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - INT8 OPs: - rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. Note: on current CDNA accelerators, the VALU has no - native INT8 instructions.' - unit: IOP per normalization unit + VALU FLOPs: + rst: 'The total floating-point operations executed per second on the :ref:`VALU + `. This is also presented as a percent of the peak theoretical FLOPs + achievable on the specific accelerator. Note: this does not include any floating-point + operations from :ref:`MFMA ` instructions.' + unit: GFLOPs + VALU IOPs: + rst: 'The total integer operations executed per second on the :ref:`VALU `. + This is also presented as a percent of the peak theoretical IOPs achievable + on the specific accelerator. Note: this does not include any integer operations + from :ref:`MFMA ` instructions.' + unit: GIOPs Pipeline statistics: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' 
- unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GIOPs - MFMA FLOPs (BF16): - rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit brain floating - point operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical BF16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F16): - rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F32): - rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 32-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F32 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F64): - rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 64-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F64 MFMA operations achievable on the - specific accelerator. The total number of 64-bit floating point :ref:`MFMA - ` operations executed per second. Note: this does not include any - 64-bit floating point operations from :ref:`VALU ` instructions. 
- This is also presented as a percent of the peak theoretical F64 MFMA operations - achievable on the specific accelerator.' - unit: GFLOPs - MFMA IOPs (INT8): - rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed - per second. Note: this does not include any 8-bit integer operations from :ref:`VALU - ` instructions. This is also presented as a percent of the peak - theoretical INT8 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs + Branch Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`branch ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`scheduler ` issuing branch instructions + over the :ref:`total CU cycles `. + unit: Percent IPC: rst: The ratio of the total number of instructions executed on the :doc:`CU ` over the :ref:`total active CU cycles `. @@ -903,12 +298,34 @@ Pipeline statistics: was actively working on issuing instructions. Refer to the :ref:`Issued IPC ` example for further detail. unit: Instructions per cycle + MFMA Instruction Cycles: + rst: The average duration of :ref:`MFMA ` instructions in this kernel + in cycles. Computed as the ratio of the total number of cycles the MFMA unit + was busy over the total number of MFMA instructions. Compare to, for example, + the `AMD Matrix Instruction Calculator `_. + unit: Cycles per instruction + MFMA Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total + CU cycles `. + unit: Percent SALU Utilization: rst: Indicates what percent of the kernel's duration the :ref:`SALU ` was busy executing instructions. Computed as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM ` instructions over the :ref:`total CU cycles `. 
unit: Percent + SMEM Latency: + rst: The average number of round-trip cycles (that is, from issue to data return + / acknowledgment) required for a SMEM instruction to complete. + unit: Cycles + VALU Active Threads: + rst: Indicates the average level of :ref:`divergence ` within a + wavefront over the lifetime of the kernel. The number of work-items that were + active in a wavefront during execution of each :ref:`VALU ` instruction, + time-averaged over all VALU instructions run on all wavefronts in the kernel. + unit: Work-items VALU Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VALU ` was busy executing instructions. Does not include :ref:`VMEM ` operations. @@ -916,6 +333,10 @@ Pipeline statistics: ` issuing VALU instructions over the :ref:`total CU cycles `. unit: Percent + VMEM Latency: + rst: The average number of round-trip cycles (that is, from issue to data return + / acknowledgment) required for a VMEM instruction to complete. + unit: Cycles VMEM Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` unit was busy executing instructions, including both global/generic and spill/scratch @@ -924,210 +345,18 @@ Pipeline statistics: the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing VMEM instructions over the :ref:`total CU cycles `. unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. - unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within a - wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. 
- unit: Work-items - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. - unit: Percent - MFMA Instruction Cycles: - rst: The average duration of :ref:`MFMA ` instructions in this kernel - in cycles. Computed as the ratio of the total number of cycles the MFMA unit - was busy over the total number of MFMA instructions. Compare to, for example, - the `AMD Matrix Instruction Calculator `_. - unit: Cycles per instruction - VMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a VMEM instruction to complete. - unit: Cycles - SMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a SMEM instruction to complete. - unit: Cycles - FLOPs (Total): - rst: The total number of floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - IOPs (Total): - rst: The total number of integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: IOP per normalization unit - F16 OPs: - rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - BF16 OPs: - rst: 'The total number of 16-bit brain floating-point operations executed on either - the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. Note: on current CDNA accelerators, the VALU has - no native BF16 instructions.' 
- unit: FLOP per normalization unit - F32 OPs: - rst: The total number of 32-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - F64 OPs: - rst: The total number of 64-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - INT8 OPs: - rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. Note: on current CDNA accelerators, the VALU has no - native INT8 instructions.' - unit: IOP per normalization unit Arithmetic operations: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' - unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GIOPs - MFMA FLOPs (BF16): - rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit brain floating - point operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical BF16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F16): - rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit floating point - operations from :ref:`VALU ` instructions. 
This is also presented - as a percent of the peak theoretical F16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F32): - rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 32-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F32 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F64): - rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 64-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F64 MFMA operations achievable on the - specific accelerator. The total number of 64-bit floating point :ref:`MFMA - ` operations executed per second. Note: this does not include any - 64-bit floating point operations from :ref:`VALU ` instructions. - This is also presented as a percent of the peak theoretical F64 MFMA operations - achievable on the specific accelerator.' - unit: GFLOPs - MFMA IOPs (INT8): - rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed - per second. Note: this does not include any 8-bit integer operations from :ref:`VALU - ` instructions. This is also presented as a percent of the peak - theoretical INT8 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs - IPC: - rst: The ratio of the total number of instructions executed on the :doc:`CU ` - over the :ref:`total active CU cycles `. - unit: Instructions per cycle - IPC (Issued): - rst: The ratio of the total number of (non-:ref:`internal `) - instructions issued over the number of cycles where the :ref:`scheduler ` - was actively working on issuing instructions. Refer to the :ref:`Issued IPC - ` example for further detail. 
- unit: Instructions per cycle - SALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`SALU ` - was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM - ` instructions over the :ref:`total CU cycles `. - unit: Percent - VALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VALU ` - was busy executing instructions. Does not include :ref:`VMEM ` operations. - Computed as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing VALU instructions over the :ref:`total CU cycles - `. - unit: Percent - VMEM Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` - unit was busy executing instructions, including both global/generic and spill/scratch - operations (see the :ref:`VMEM instruction count metrics ` - for more detail). Does not include :ref:`VALU ` operations. Computed as - the ratio of the total number of cycles spent by the :ref:`scheduler ` - issuing VMEM instructions over the :ref:`total CU cycles `. - unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. - unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within a - wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. 
Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. - unit: Percent - MFMA Instruction Cycles: - rst: The average duration of :ref:`MFMA ` instructions in this kernel - in cycles. Computed as the ratio of the total number of cycles the MFMA unit - was busy over the total number of MFMA instructions. Compare to, for example, - the `AMD Matrix Instruction Calculator `_. - unit: Cycles per instruction - VMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a VMEM instruction to complete. - unit: Cycles - SMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a SMEM instruction to complete. - unit: Cycles - FLOPs (Total): - rst: The total number of floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - IOPs (Total): - rst: The total number of integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: IOP per normalization unit - F16 OPs: - rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit BF16 OPs: rst: 'The total number of 16-bit brain floating-point operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native BF16 instructions.' unit: FLOP per normalization unit + F16 OPs: + rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU + ` or :ref:`MFMA ` units, per :ref:`normalization unit + `. 
+ unit: FLOP per normalization unit F32 OPs: rst: The total number of 32-bit floating-point operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization @@ -1138,19 +367,23 @@ Arithmetic operations: :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. unit: FLOP per normalization unit + FLOPs (Total): + rst: The total number of floating-point operations executed on either the :ref:`VALU + ` or :ref:`MFMA ` units, per :ref:`normalization unit + `. + unit: FLOP per normalization unit INT8 OPs: rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native INT8 instructions.' unit: IOP per normalization unit + IOPs (Total): + rst: The total number of integer operations executed on either the :ref:`VALU + ` or :ref:`MFMA ` units, per :ref:`normalization unit + `. + unit: IOP per normalization unit LDS Speed-of-Light: - Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was - actively executing instructions (including, but not limited to, load, store, - atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total - number of cycles LDS was active over the :ref:`total CU cycles `. - unit: Percent Access Rate: rst: Indicates the percentage of SIMDs in the :ref:`VALU ` [#lds-workload]_ actively issuing LDS instructions, averaged over the lifetime of the kernel. @@ -1158,6 +391,12 @@ LDS Speed-of-Light: ` issuing :ref:`LDS ` instructions over the :ref:`total CU cycles `. unit: Percent + Bank Conflict Rate: + rst: Indicates the percentage of active LDS cycles that were spent servicing bank + conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts + over the number of LDS cycles that would have been required to move the same + amount of data in an uncontended access. 
[#lds-bank-conflict]_ + unit: Percent Theoretical Bandwidth: rst: Indicates the maximum amount of bytes that could have been loaded from, stored to, or atomically updated in the LDS per :ref:`normalization unit `. @@ -1165,97 +404,17 @@ LDS Speed-of-Light: was executed. See the :ref:`LDS bandwidth example ` for more detail. unit: Bytes per normalization unit - Bank Conflict Rate: - rst: Indicates the percentage of active LDS cycles that were spent servicing bank - conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts - over the number of LDS cycles that would have been required to move the same - amount of data in an uncontended access. [#lds-bank-conflict]_ + Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was + actively executing instructions (including, but not limited to, load, store, + atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total + number of cycles LDS was active over the :ref:`total CU cycles `. unit: Percent - LDS Instructions: - rst: The total number of LDS instructions (including, but not limited to, read/write/atomics - and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. - unit: Instructions per normalization unit - LDS Latency: - rst: The average number of round-trip cycles (i.e., from issue to data-return / - acknowledgment) required for an LDS instruction to complete. - unit: Cycles - Bank Conflicts/Access: - rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler ` - due to bank conflicts (as determined by the conflict resolution hardware) to - the base number of cycles that would be spent in the LDS scheduler in a completely - uncontended case. This is the unnormalized form of the Bank Conflict Rate. - unit: Conflicts per Access - Index Accesses: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` over - all operations per :ref:`normalization unit `. 
- unit: Cycles per normalization unit - Atomic Return Cycles: - rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization - unit `. - unit: Cycles per normalization unit - Bank Conflict: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization - unit `. - unit: Cycles per normalization unit - Addr Conflict: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to address conflicts (as determined by the conflict resolution hardware) per - :ref:`normalization unit `. - unit: Cycles per normalization unit - Unaligned Stall: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to stalls from non-dword aligned addresses per :ref:`normalization unit `. - unit: Cycles per normalization unit - Mem Violations: - rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\ - \ unit `. This is unused and expected to be zero in most\ - \ configurations for modern CDNA\u2122 accelerators." - unit: Accesses per normalization unit LDS Statistics: - Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was - actively executing instructions (including, but not limited to, load, store, - atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total - number of cycles LDS was active over the :ref:`total CU cycles `. - unit: Percent - Access Rate: - rst: Indicates the percentage of SIMDs in the :ref:`VALU ` [#lds-workload]_ - actively issuing LDS instructions, averaged over the lifetime of the kernel. - Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing :ref:`LDS ` instructions over the :ref:`total - CU cycles `. 
- unit: Percent - Theoretical Bandwidth: - rst: Indicates the maximum amount of bytes that could have been loaded from, stored - to, or atomically updated in the LDS per :ref:`normalization unit `. - Does *not* take into account the execution mask of the wavefront when the instruction - was executed. See the :ref:`LDS bandwidth example ` for more - detail. - unit: Bytes per normalization unit - Bank Conflict Rate: - rst: Indicates the percentage of active LDS cycles that were spent servicing bank - conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts - over the number of LDS cycles that would have been required to move the same - amount of data in an uncontended access. [#lds-bank-conflict]_ - unit: Percent - LDS Instructions: - rst: The total number of LDS instructions (including, but not limited to, read/write/atomics - and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. - unit: Instructions per normalization unit - LDS Latency: - rst: The average number of round-trip cycles (i.e., from issue to data-return / - acknowledgment) required for an LDS instruction to complete. - unit: Cycles - Bank Conflicts/Access: - rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler ` - due to bank conflicts (as determined by the conflict resolution hardware) to - the base number of cycles that would be spent in the LDS scheduler in a completely - uncontended case. This is the unnormalized form of the Bank Conflict Rate. - unit: Conflicts per Access - Index Accesses: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` over - all operations per :ref:`normalization unit `. + Addr Conflict: + rst: The total number of cycles spent in the :ref:`LDS scheduler ` due + to address conflicts (as determined by the conflict resolution hardware) per + :ref:`normalization unit `. 
unit: Cycles per normalization unit Atomic Return Cycles: rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization @@ -1266,26 +425,41 @@ LDS Statistics: to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. unit: Cycles per normalization unit - Addr Conflict: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to address conflicts (as determined by the conflict resolution hardware) per - :ref:`normalization unit `. - unit: Cycles per normalization unit - Unaligned Stall: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to stalls from non-dword aligned addresses per :ref:`normalization unit `. + Bank Conflicts/Access: + rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler ` + due to bank conflicts (as determined by the conflict resolution hardware) to + the base number of cycles that would be spent in the LDS scheduler in a completely + uncontended case. This is the unnormalized form of the Bank Conflict Rate. + unit: Conflicts per Access + Index Accesses: + rst: The total number of cycles spent in the :ref:`LDS scheduler ` over + all operations per :ref:`normalization unit `. unit: Cycles per normalization unit + LDS Instructions: + rst: The total number of LDS instructions (including, but not limited to, read/write/atomics + and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. + unit: Instructions per normalization unit + LDS Latency: + rst: The average number of round-trip cycles (i.e., from issue to data-return / + acknowledgment) required for an LDS instruction to complete. + unit: Cycles Mem Violations: rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\ \ unit `. This is unused and expected to be zero in most\ \ configurations for modern CDNA\u2122 accelerators." 
unit: Accesses per normalization unit + Theoretical Bandwidth: + rst: Indicates the maximum amount of bytes that could have been loaded from, stored + to, or atomically updated in the LDS per :ref:`normalization unit `. + Does *not* take into account the execution mask of the wavefront when the instruction + was executed. See the :ref:`LDS bandwidth example ` for more + detail. + unit: Bytes per normalization unit + Unaligned Stall: + rst: The total number of cycles spent in the :ref:`LDS scheduler ` due + to stalls from non-dword aligned addresses per :ref:`normalization unit `. + unit: Cycles per normalization unit vL1D Speed-of-Light: - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. - unit: Percent Bandwidth: rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM ` instructions, as a percent of the peak theoretical bandwidth achievable @@ -1294,11 +468,6 @@ vL1D Speed-of-Light: not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent Coalescing: rst: Indicates how well memory instructions were coalesced by the :ref:`address processing unit `, ranging from uncoalesced (25%) to fully coalesced @@ -1306,176 +475,16 @@ vL1D Speed-of-Light: generated per instruction divided by the ideal number of thread-requests per instruction. 
unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. + Hit rate: + rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in + vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache + RAM `. unit: Percent - Stalled on L2 Req: - rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue - a request for data to the :doc:`L2 cache ` divided by the number - of cycles where the vL1D is active [#vl1d-activity]_. + Utilization: + rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel + execution. The number of cycles where the vL1D Cache RAM is actively processing + any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. unit: Percent - Tag RAM Stall (Read): - rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests - with conflicting tags being looked up concurrently, divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. 
- unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit - Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D Cache RAM `. - unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. 
- unit: Invalidations per normalization unit - L1-L2 BW: - rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, - if only a single value is requested in a cache line, the data movement will - still be counted as a full cache line. - unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. - unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles - L1-L2 Read Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive read requests from the :doc:`L2 Cache `. This number - also includes requests for atomics with return values. - unit: Cycles - L1-L2 Write Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive acknowledgement of a write request to the :doc:`L2 Cache `. - This number also includes requests for atomics without return values. 
- unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit - Hit Ratio: - rst: The ratio of the number of translation requests that hit in the UTCL1 divided - by the total number of translation requests made to the UTCL1. - unit: Percent - Hits: - rst: The number of translation requests that hit in the UTCL1, and could be reused, - per normalization unit. - unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. - unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." - unit: Requests per normalization unit Busy / stall metrics: Address Processing Unit Busy: rst: Percent of the :ref:`total CU cycles ` the address processor @@ -1493,118 +502,11 @@ Busy / stall metrics: rst: Percent of :ref:`total CU cycles ` the address processor was stalled waiting to send command data to the :ref:`data processor ` unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. - unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read Instructions: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read Instructions: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Read: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack read instructions, per :ref:`normalization unit `. 
- unit: Cycles per normalization unit - Spill/Stack Coalesced Write: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack write instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. - unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. - unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent - Coalescable Instructions: - rst: The number of instructions submitted to the :ref:`data-return unit ` - by the :ref:`address processor ` that were found to be coalescable, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Read Instructions: - rst: The number of read instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address - processor `. - unit: Instructions per normalization unit - Write Instructions: - rst: The number of store instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack stores counted - by the :ref:`vL1D cache-front-end `. 
- unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. - unit: Instructions per normalization unit Instruction counts: - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent - Data Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending write/atomic data further into the vL1D pipeline - unit: Percent - "Data-Processor \u2192 Address Stall": - rst: Percent of :ref:`total CU cycles ` the address processor was - stalled waiting to send command data to the :ref:`data processor ` - unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. + Global/Generic Atomic Instructions: + rst: The total number of global & generic memory atomic (with and without return) + instructions executed on all :doc:`compute units ` on the accelerator, + per :ref:`normalization unit `. unit: Instructions per normalization unit Global/Generic Instructions: rst: The total number of global & generic memory instructions executed on all :doc:`compute @@ -1620,10 +522,11 @@ Instruction counts: all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. 
unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. + Spill/Stack Atomic Instructions: + rst: The total number of spill/stack memory atomic (with and without return) instructions + executed on all :doc:`compute units ` on the accelerator, per + :ref:`normalization unit `. Typically unused as these + memory operations are typically used to implement thread-local storage. unit: Instructions per normalization unit Spill/Stack Instructions: rst: The total number of spill/stack memory instructions executed on all :doc:`compute @@ -1637,125 +540,11 @@ Instruction counts: rst: The total number of spill/stack memory write instructions executed on all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Read: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack read instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Write: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack write instructions, per :ref:`normalization unit `. 
- unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. - unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. - unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent - Coalescable Instructions: - rst: The number of instructions submitted to the :ref:`data-return unit ` - by the :ref:`address processor ` that were found to be coalescable, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Read Instructions: - rst: The number of read instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address - processor `. - unit: Instructions per normalization unit - Write Instructions: - rst: The number of store instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack stores counted - by the :ref:`vL1D cache-front-end `. - unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. 
- This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. + Total Instructions: + rst: The total number of memory instructions executed by the address processer + over all compute units on the accelerator, per normalization unit. unit: Instructions per normalization unit Spill / stack metrics: - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent - Data Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending write/atomic data further into the vL1D pipeline - unit: Percent - "Data-Processor \u2192 Address Stall": - rst: Percent of :ref:`total CU cycles ` the address processor was - stalled waiting to send command data to the :ref:`data processor ` - unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. - unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read Instructions: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read Instructions: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit Spill/Stack Coalesced Read: rst: The number of cycles the address processing unit spent working on coalesced spill/stack read instructions, per :ref:`normalization unit `. @@ -1764,223 +553,11 @@ Spill / stack metrics: rst: The number of cycles the address processing unit spent working on coalesced spill/stack write instructions, per :ref:`normalization unit `. 
unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. - unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. - unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent - Coalescable Instructions: - rst: The number of instructions submitted to the :ref:`data-return unit ` - by the :ref:`address processor ` that were found to be coalescable, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Read Instructions: - rst: The number of read instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address - processor `. - unit: Instructions per normalization unit - Write Instructions: - rst: The number of store instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack stores counted - by the :ref:`vL1D cache-front-end `. - unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. 
- This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. - unit: Instructions per normalization unit + Spill/Stack Total Cycles: + rst: The number of cycles the address processing unit spent working on spill/stack + instructions, per :ref:`normalization unit `. + unit: Cycles per normalization unit L1 Unified Translation Cache (UTCL1): - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. - unit: Percent - Bandwidth: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions, as a percent of the peak theoretical bandwidth achievable - on the specific accelerator. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. This value does - not consider partial requests, so for instance, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Coalescing: - rst: Indicates how well memory instructions were coalesced by the :ref:`address - processing unit `, ranging from uncoalesced (25%) to fully coalesced - (100%). Calculated as the average number of :ref:`thread-requests ` - generated per instruction divided by the ideal number of thread-requests per - instruction. - unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. 
- unit: Percent - Stalled on L2 Req: - rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue - a request for data to the :doc:`L2 cache ` divided by the number - of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Read): - rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests - with conflicting tags being looked up concurrently, divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. - unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. 
The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit - Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D Cache RAM `. - unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. - unit: Invalidations per normalization unit - L1-L2 BW: - rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, - if only a single value is requested in a cache line, the data movement will - still be counted as a full cache line. - unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. 
- unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles - L1-L2 Read Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive read requests from the :doc:`L2 Cache `. This number - also includes requests for atomics with return values. - unit: Cycles - L1-L2 Write Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive acknowledgement of a write request to the :doc:`L2 Cache `. - This number also includes requests for atomics without return values. - unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit Hit Ratio: rst: The ratio of the number of translation requests that hit in the UTCL1 divided by the total number of translation requests made to the UTCL1. @@ -1989,42 +566,20 @@ L1 Unified Translation Cache (UTCL1): rst: The number of translation requests that hit in the UTCL1, and could be reused, per normalization unit. unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. 
- unit: unit Permission Misses: rst: "The total number of translation requests that missed in the UTCL1 due to\ \ a permission error, per :ref:`normalization unit `.\ \ This is unused and expected to be zero in most configurations for modern\ \ CDNA\u2122 accelerators." unit: Requests per normalization unit + Req: + rst: The number of translation requests made to the UTCL1 per normalization unit. + unit: Requests per normalization unit + Translation Misses: + rst: The total number of translation requests that missed in the UTCL1 due to translation + not being present in the cache, per :ref:`normalization unit `. + unit: unit vL1D cache stall metrics: - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. - unit: Percent - Bandwidth: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions, as a percent of the peak theoretical bandwidth achievable - on the specific accelerator. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. This value does - not consider partial requests, so for instance, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Coalescing: - rst: Indicates how well memory instructions were coalesced by the :ref:`address - processing unit `, ranging from uncoalesced (25%) to fully coalesced - (100%). Calculated as the average number of :ref:`thread-requests ` - generated per instruction divided by the ideal number of thread-requests per - instruction. 
- unit: Percent Stalled on L2 Data: rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested data to return from the :doc:`L2 cache ` divided by the number of @@ -2035,6 +590,11 @@ vL1D cache stall metrics: a request for data to the :doc:`L2 cache ` divided by the number of cycles where the vL1D is active [#vl1d-activity]_. unit: Percent + Tag RAM Stall (Atomic): + rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic + requests with conflicting tags being looked up concurrently, divided by the + number of cycles where the vL1D is active [#vl1d-activity]_. + unit: Percent Tag RAM Stall (Read): rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests with conflicting tags being looked up concurrently, divided by the number of @@ -2045,223 +605,14 @@ vL1D cache stall metrics: requests with conflicting tags being looked up concurrently, divided by the number of cycles where the vL1D is active [#vl1d-activity]_. unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. 
- unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit - Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D Cache RAM `. - unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. 
- unit: Invalidations per normalization unit - L1-L2 BW: - rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, - if only a single value is requested in a cache line, the data movement will - still be counted as a full cache line. - unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. - unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles - L1-L2 Read Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive read requests from the :doc:`L2 Cache `. This number - also includes requests for atomics with return values. - unit: Cycles - L1-L2 Write Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive acknowledgement of a write request to the :doc:`L2 Cache `. - This number also includes requests for atomics without return values. 
- unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit - Hit Ratio: - rst: The ratio of the number of translation requests that hit in the UTCL1 divided - by the total number of translation requests made to the UTCL1. - unit: Percent - Hits: - rst: The number of translation requests that hit in the UTCL1, and could be reused, - per normalization unit. - unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. - unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." - unit: Requests per normalization unit vL1D cache access metrics: - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. - unit: Percent - Bandwidth: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions, as a percent of the peak theoretical bandwidth achievable - on the specific accelerator. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. This value does - not consider partial requests, so for instance, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. 
The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Coalescing: - rst: Indicates how well memory instructions were coalesced by the :ref:`address - processing unit `, ranging from uncoalesced (25%) to fully coalesced - (100%). Calculated as the average number of :ref:`thread-requests ` - generated per instruction divided by the ideal number of thread-requests per - instruction. - unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Stalled on L2 Req: - rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue - a request for data to the :doc:`L2 cache ` divided by the number - of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Read): - rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests - with conflicting tags being looked up concurrently, divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. 
- unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit Atomic Req: rst: The total number of incoming atomic requests from the :ref:`address processing unit ` after coalescing per :ref:`normalization unit ` unit: Requests per normalization unit + Cache Accesses: + rst: The total number of cache line lookups in the vL1D. + unit: Cache lines Cache BW: rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM ` instructions per :ref:`normalization unit `. The @@ -2274,9 +625,6 @@ vL1D cache access metrics: rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache RAM `. unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines Cache Hits: rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 cache `, that is, the number of cache line requests serviced by the @@ -2287,6 +635,15 @@ vL1D cache access metrics: the kernel's execution per :ref:`normalization unit `. This may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. unit: Invalidations per normalization unit + L1 Access Latency: + rst: Calculated as the average number of cycles that a vL1D cache line request + spent in the vL1D cache pipeline. + unit: Cycles + L1-L2 Atomic: + rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 + cache `, per :ref:`normalization unit `. This + includes requests for atomics with, and without return. 
+ unit: Requests per normalization unit L1-L2 BW: rst: The number of bytes transferred across the vL1D-L2 interface as a result of :ref:`VMEM ` instructions, per :ref:`normalization unit `. @@ -2300,185 +657,53 @@ vL1D cache access metrics: the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization unit `. unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles L1-L2 Read Latency: rst: Calculated as the average number of cycles that the vL1D cache took to issue and receive read requests from the :doc:`L2 Cache `. This number also includes requests for atomics with return values. unit: Cycles + L1-L2 Write: + rst: The number of write requests to a vL1D cache line that were sent through the + vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. + unit: Requests per normalization unit L1-L2 Write Latency: rst: Calculated as the average number of cycles that the vL1D cache took to issue and receive acknowledgement of a write request to the :doc:`L2 Cache `. This number also includes requests for atomics without return values. unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
+ Read Req: + rst: The total number of incoming read requests from the :ref:`address processing + unit ` after coalescing per :ref:`normalization unit ` unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit - Hit Ratio: - rst: The ratio of the number of translation requests that hit in the UTCL1 divided - by the total number of translation requests made to the UTCL1. - unit: Percent - Hits: - rst: The number of translation requests that hit in the UTCL1, and could be reused, - per normalization unit. - unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. - unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." + Total Req: + rst: The total number of incoming requests from the :ref:`address processing + unit ` after coalescing. 
+ unit: Requests + Write Req: + rst: The total number of incoming write requests from the :ref:`address processing + unit ` after coalescing per :ref:`normalization unit ` unit: Requests per normalization unit Vector L1 data-return path or Texture Data (TD): - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent - Data Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending write/atomic data further into the vL1D pipeline - unit: Percent - "Data-Processor \u2192 Address Stall": - rst: Percent of :ref:`total CU cycles ` the address processor was - stalled waiting to send command data to the :ref:`data processor ` - unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. - unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute + Atomic Instructions: + rst: The number of atomic instructions submitted to the :ref:`data-return unit + ` by the :ref:`address processor ` summed over all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. + This is expected to be the sum of global/generic and spill/stack atomics in + the :ref:`address processor `. unit: Instructions per normalization unit - Global/Generic Read Instructions: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read Instructions: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Read: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack read instructions, per :ref:`normalization unit `. 
- unit: Cycles per normalization unit - Spill/Stack Coalesced Write: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack write instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. - unit: Percent "Cache RAM \u2192 Data-Return Stall": rst: Percent of the :ref:`total CU cycles ` the data-return unit was stalled on data to be returned from the :ref:`vL1D Cache RAM `. unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent Coalescable Instructions: rst: The number of instructions submitted to the :ref:`data-return unit ` by the :ref:`address processor ` that were found to be coalescable, per :ref:`normalization unit `. unit: Instructions per normalization unit + Data-Return Busy: + rst: Percent of the :ref:`total CU cycles ` the data-return unit + was busy processing or waiting on data to return to the :doc:`CU `. + unit: Percent Read Instructions: rst: The number of read instructions submitted to the :ref:`data-return unit ` by the :ref:`address processor ` summed over all :doc:`compute @@ -2486,6 +711,16 @@ Vector L1 data-return path or Texture Data (TD): This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address processor `. unit: Instructions per normalization unit + "Workgroup manager \u2192 Data-Return Stall": + rst: Percent of the :ref:`total CU cycles ` the data-return unit + was stalled by the :ref:`workgroup manager ` due to initialization + of registers as a part of launching new workgroups. 
+ unit: Percent + Write Ack Instructions: + rst: The total number of write acknowledgements submitted by the :ref:`data-return + unit ` to SQ, summed over all compute units on the accelerator, per + normalization unit. + unit: Instructions per normalization unit Write Instructions: rst: The number of store instructions submitted to the :ref:`data-return unit ` by the :ref:`address processor ` summed over all :doc:`compute @@ -2493,27 +728,12 @@ Vector L1 data-return path or Texture Data (TD): This is expected to be the sum of global/generic and spill/stack stores counted by the :ref:`vL1D cache-front-end `. unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. - unit: Instructions per normalization unit L2 Speed-of-Light: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent + HBM Bandwidth: + rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory + (HBM) per unit time. This value is calculated as the number of HBM channels + multiplied by the HBM channel width multiplied by the HBM clock frequency.
+ unit: GB/s Hit Rate: rst: The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache. @@ -2526,418 +746,28 @@ L2 Speed-of-Light: rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface ` by write and atomic operations per unit time. unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. + Peak Bandwidth: + rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical + bandwidth achievable on the specific accelerator. The number of bytes is calculated + as the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so e.g., if only a single value is + requested in a cache line, the data movement will still be counted as a full + cache line. unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. 
- Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. - unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. 
- The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. 
Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
- unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. 
- unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. - unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. + Utilization: + rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed + over all L2 channels on the accelerator ` over the + :ref:`total L2 cycles `. unit: Percent L2 cache accesses: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. 
The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. + Atomic Bandwidth: + rst: Total number of bytes looked up in the L2 cache for atomic requests, per + :ref:`normalization unit `. unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. 
- unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. - Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. 
- unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles + Atomic Req: + rst: The total number of atomic requests (with and without return) to the L2 from + all clients. + unit: Requests per normalization unit Bandwidth: rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit `. The number of bytes is calculated as the number of @@ -2945,38 +775,23 @@ L2 cache accesses: consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. 
- unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. - The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. + CC Req: + rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory + allocations. See the :ref:`memory-type` for more information. unit: Requests per normalization unit Cache Hit: rst: The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache. unit: Percent + Evict (Internal): + rst: The total number of L2 cache lines evicted from the cache due to capacity limits, + per :ref:`normalization unit `. + unit: Cache lines per normalization unit + Evict (vL1D Req): + rst: The total number of L2 cache lines evicted from the cache due to invalidation + requests initiated by the :doc:`vL1D cache `, per :ref:`normalization + unit `. + unit: Cache lines per normalization unit Hits: rst: The total number of requests to the L2 from all clients that hit in the cache. As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss @@ -2987,6 +802,51 @@ L2 cache accesses: As noted in the :ref:`Speed-of-Light ` section, these do not include hit-on-miss requests. 
unit: Requests per normalization unit + NC Req: + rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory + allocations, per :ref:`normalization unit `. See the :ref:`memory-type` + for more information. + unit: Requests per normalization unit + Probe Req: + rst: The number of coherence probe requests made to the L2 cache from outside the + accelerator. On an :ref:`MI2XX `, probe requests may be generated + by, for example, writes to :ref:`fine-grained device ` memory + or by writes to :ref:`coarse-grained ` device memory. + unit: Requests per normalization unit + RW Req: + rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) + allocations. See the :ref:`memory-type` for more information. + unit: Requests per normalization unit + Read Bandwidth: + rst: Total number of bytes looked up in the L2 cache for read requests, per :ref:`normalization + unit `. + unit: Bytes per normalization unit + Read Req: + rst: 'The total number of read requests to the L2 from all clients. ' + unit: Requests per normalization unit + Req: + rst: The total number of incoming requests to the L2 from all clients for all request + types, per :ref:`normalization unit `. + unit: Requests per normalization unit + Streaming Req: + rst: The total number of incoming requests to the L2 that are marked as *streaming*. + The exact meaning of this may differ depending on the targeted accelerator, + however on an :ref:`MI2XX ` this corresponds to `non-temporal + load or stores `_. The + L2 cache attempts to evict *streaming* requests before normal requests when + the L2 is at capacity. + unit: Requests per normalization unit + UC Req: + rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. + See the :ref:`memory-type` for more information. + unit: Requests per normalization unit + Write Bandwidth: + rst: Total number of bytes looked up in the L2 cache for write requests, per :ref:`normalization + unit `. 
+ unit: Bytes per normalization unit + Write Req: + rst: The total number of write requests to the L2 from all clients. + unit: Requests per normalization unit Writeback: rst: The total number of L2 cache lines written back to memory for any reason. Write-backs may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` @@ -3001,174 +861,22 @@ L2 cache accesses: rst: The total number of L2 cache lines written back to memory due to requests initiated by the :doc:`vL1D cache `, per :ref:`normalization unit `. unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. 
- unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
- unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. 
- unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. - unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. 
- unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. - unit: Percent L2-Fabric interface metrics: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. + Atomic Latency: + rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric + before a completion acknowledgement (atomic without return value) or data (atomic + with return value) was returned to the L2. + unit: Cycles + Atomic Traffic: + rst: The percent of write requests generated by the L2 cache that are atomic requests + to *any* memory location. This breakdown does not consider the *size* of the + request (meaning that 32B and 64B requests are both counted as a single request), + so this metric only *approximates* the percent of the L2-Fabric Write and Atomic + bandwidth made up of atomic requests. Note that on current CDNA accelerators, such + as the :ref:`MI2XX `, requests are only considered *atomic* by + Infinity Fabric if they are targeted at :ref:`fine-grained memory ` + allocations or :ref:`uncached memory ` allocations. unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache.
- unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit HBM Read Traffic: rst: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider @@ -3176,33 +884,6 @@ L2-Fabric interface metrics: as a single request), so this metric only *approximates* the percent of the L2-Fabric Read bandwidth directed to the local HBM. unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. 
So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit HBM Write and Atomic Traffic: rst: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown @@ -3214,6 +895,28 @@ L2-Fabric interface metrics: at :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations. unit: Percent + Read BW: + rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization + unit `. + unit: Bytes per normalization unit + Read Latency: + rst: The time-averaged number of cycles read requests spent in Infinity Fabric before + data was returned to the L2. + unit: Cycles + Read Stall: + rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ + \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ + \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ + \ or CPU) over the :ref:`total active L2 cycles `." 
+ unit: Percent + Remote Read Traffic: + rst: The percent of read requests generated by the L2 cache that are routed to any + memory location other than the accelerator's local high-bandwidth memory (HBM) + -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown + does not consider the *size* of the request (meaning that 32B and 64B requests + are both counted as a single request), so this metric only *approximates* the + percent of the L2-Fabric Read bandwidth directed to a remote location. + unit: Percent Remote Write and Atomic Traffic: rst: The percent of read requests generated by the L2 cache that are routed to any memory location other than the accelerator's local high-bandwidth memory (HBM) @@ -3225,15 +928,16 @@ L2-Fabric interface metrics: are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations. unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. + Uncached Read Traffic: + rst: The percent of read requests generated by the L2 cache that are reading from + an :ref:`uncached memory allocation `. Note, as described in the + :ref:`request flow ` section, a single 64B read request is + typically counted as two uncached read requests. So, it is possible for the + Uncached Read Traffic to reach up to 200% of the total number of read requests. 
+ This breakdown does not consider the *size* of the request (i.e., 32B and 64B + requests are both counted as a single request), so this metric only *approximates* + the percent of the L2-Fabric read bandwidth directed to an uncached memory + location. unit: Percent Uncached Write and Atomic Traffic: rst: The percent of write and atomic requests generated by the L2 cache that are @@ -3242,430 +946,56 @@ L2-Fabric interface metrics: are both counted as a single request), so this metric only *approximates* the percent of the L2-Fabric read bandwidth directed to uncached memory allocations. unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. 
' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. - The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. 
Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. 
- unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
- unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. 
- unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent Write Stall: rst: The ratio of the total number of cycles the L2-Fabric interface was stalled on a write or atomic request to any destination (local HBM, remote accelerator or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. 
- unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. - unit: Percent + Write and Atomic BW: + rst: The total number of bytes written by the L2 over Infinity Fabric by write and + atomic operations per :ref:`normalization unit `. Note + that on current CDNA accelerators, such as the :ref:`MI2XX `, requests + are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable + memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached + memory ` allocations on the MI2XX. + unit: Bytes per normalization unit + Write and Atomic Latency: + rst: The time-averaged number of cycles write requests spent in Infinity Fabric + before a completion acknowledgement was returned to the L2. + unit: Cycles L2 - Fabric interface detailed metrics: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. 
- unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. + Atomic: + rst: The total number of L2 requests to Infinity Fabric to atomically update 32B + or 64B of data in any memory location, per :ref:`normalization unit `. + See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, + such as the :ref:`MI2XX `, requests are only considered *atomic* + by Infinity Fabric if they are targeted at non-write-cacheable memory, such + as :ref:`fine-grained memory ` allocations or :ref:`uncached + memory ` allocations on the MI2XX. + unit: Requests per normalization unit + Atomic Bandwidth - HBM: + rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization + unit. unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. - unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. + "Atomic Bandwidth - Infinity Fabric\u2122": + rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, + per normalization unit. unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. - Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. 
- unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. + Atomic Bandwidth - PCIe: + rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per + normalization unit. unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. + HBM Read: + rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data + from the accelerator's local HBM, per :ref:`normalization unit `. + See :ref:`l2-request-flow` for more detail. 
unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' + HBM Write and Atomic: + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization + unit `. See :ref:`l2-request-flow` for more detail. plain unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. - The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. 
- As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. 
- unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent Read (32B): rst: The total number of L2 requests to Infinity Fabric to read 32B of data from any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` @@ -3682,408 +1012,90 @@ L2 - Fabric interface detailed metrics: `. 64B requests for uncached data are counted as two 32B uncached data requests. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit + Read Bandwidth - HBM: + rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization + unit. + unit: Bytes per normalization unit + "Read Bandwidth - Infinity Fabric\u2122": + rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, + per normalization unit. + unit: Bytes per normalization unit + Read Bandwidth - PCIe: + rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization + unit. + unit: Bytes per normalization unit Remote Read: rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. 
unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. plain - unit: Requests per normalization unit Remote Write and Atomic: rst: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. 
+ Write Bandwidth - HBM: + rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization + unit. + unit: Bytes per normalization unit + "Write Bandwidth - Infinity Fabric\u2122": + rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, + per normalization unit. + unit: Bytes per normalization unit + Write Bandwidth - PCIe: + rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization + unit. + unit: Bytes per normalization unit + Write and Atomic (32B): + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 32B of data to any memory location, per :ref:`normalization unit `. + See :ref:`l2-request-flow` for more detail. + unit: Requests per normalization unit + Write and Atomic (64B): + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 64B of data in any memory location, per :ref:`normalization unit `. + See :ref:`l2-request-flow` for more detail. + unit: Requests per normalization unit + Write and Atomic (Uncached): + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit + `. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. 
- unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. - unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. - unit: Percent L2 - Fabric Interface stalls: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. 
This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. - unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. 
- Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. - unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. 
- The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. 
Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
- unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `.
- unit: Percent - Read - PCIe Stall: + Read - HBM Stall: rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. + to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles + `. unit: Percent Read - Infinity Fabric Stall: rst: The number of cycles the L2-Fabric interface was stalled on read requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total active L2 cycles `. unit: Percent - Read - HBM Stall: + Read - PCIe Stall: rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. + to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total + active L2 cycles `. unit: Percent - Write - PCIe Stall: + Write - Credit Starvation: + rst: The number of cycles the L2-Fabric interface was stalled on write or atomic + requests to any memory location because too many write/atomic requests were + currently in flight, as a percent of the :ref:`total active L2 cycles `. + unit: Percent + Write - HBM Stall: rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. + requests to accelerator's local HBM as a percent of the total active L2 cycles. unit: Percent Write - Infinity Fabric Stall: rst: The number of cycles the L2-Fabric interface was stalled on write or atomic requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total active L2 cycles `. unit: Percent - Write - HBM Stall: + Write - PCIe Stall: rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. 
+ requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent + of the :ref:`total active L2 cycles `. unit: Percent Scalar L1D Speed-of-Light: Bandwidth: @@ -4103,88 +1115,17 @@ Scalar L1D Speed-of-Light: \ unused on current CDNA accelerators, so in the majority of cases this can\ \ be interpreted as an sL1D\u2192L2 read bandwidth." unit: Bytes per normalization unit - Req: - rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization - unit `. - unit: Requests per normalization unit - Hits: - rst: The total number of sL1D requests that hit on a previously loaded cache line, - per :ref:`normalization unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was not* - already pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Misses- Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was* already - pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Read Req (Total): - rst: The total number of sL1D read requests of any size, per :ref:`normalization - unit `. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests from sL1D to the :doc:`L2 `, - per :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Read Req (1 DWord): - rst: The total number of sL1D read requests made for a single dword of data (4B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (2 DWord): - rst: The total number of sL1D read requests made for a two dwords of data (8B), - per :ref:`normalization unit `. 
- unit: Requests per normalization unit - Read Req (4 DWord): - rst: The total number of sL1D read requests made for a four dwords of data (16B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (8 DWord): - rst: The total number of sL1D read requests made for a eight dwords of data (32B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (16 DWord): - rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: The total number of read requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Stall Cycles: - rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ - \ was stalled, per :ref:`normalization unit `." - unit: Cycles per normalization unit Scalar L1D cache accesses: - Bandwidth: - rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D - cycles `. - unit: Percent + Atomic Req: + rst: The total number of atomic requests from sL1D to the :doc:`L2 `, + per :ref:`normalization unit `. Typically unused on current + CDNA accelerators. + unit: Requests per normalization unit Cache Hit Rate: rst: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ over the number of all sL1D requests. unit: Percent - sL1D-L2 BW: - rst: "The total number of bytes read from, written to, or atomically updated \ - \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ - \ unit `. 
Note that sL1D writes and atomics are typically\ - \ unused on current CDNA accelerators, so in the majority of cases this can\ - \ be interpreted as an sL1D\u2192L2 read bandwidth." - unit: Bytes per normalization unit - Req: - rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization - unit `. - unit: Requests per normalization unit Hits: rst: The total number of sL1D requests that hit on a previously loaded cache line, per :ref:`normalization unit `. @@ -4199,19 +1140,14 @@ Scalar L1D cache accesses: pending due to another request, per :ref:`normalization unit `. See :ref:`desc-sl1d-sol` for more detail. unit: Requests per normalization unit - Read Req (Total): - rst: The total number of sL1D read requests of any size, per :ref:`normalization - unit `. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests from sL1D to the :doc:`L2 `, - per :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit Read Req (1 DWord): rst: The total number of sL1D read requests made for a single dword of data (4B), per :ref:`normalization unit `. unit: Requests per normalization unit + Read Req (16 DWord): + rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), + per :ref:`normalization unit `. + unit: Requests per normalization unit Read Req (2 DWord): rst: The total number of sL1D read requests made for a two dwords of data (8B), per :ref:`normalization unit `. @@ -4224,34 +1160,33 @@ Scalar L1D cache accesses: rst: The total number of sL1D read requests made for a eight dwords of data (32B), per :ref:`normalization unit `. unit: Requests per normalization unit - Read Req (16 DWord): - rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), - per :ref:`normalization unit `. 
+ Read Req (Total): + rst: The total number of sL1D read requests of any size, per :ref:`normalization + unit `. unit: Requests per normalization unit - Read Req: - rst: The total number of read requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. + Req: + rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization + unit `. unit: Requests per normalization unit - Write Req: - rst: The total number of write requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Stall Cycles: - rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ - \ was stalled, per :ref:`normalization unit `." - unit: Cycles per normalization unit Scalar L1D Cache - L2 Interface: - Bandwidth: - rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D - cycles `. - unit: Percent - Cache Hit Rate: - rst: Indicates the percent of sL1D requests that hit on a previously loaded line - the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ - over the number of all sL1D requests. - unit: Percent + Atomic Req: + rst: The total number of atomic requests from sL1D to the :doc:`L2 `, + per :ref:`normalization unit `. Typically unused on current + CDNA accelerators. + unit: Requests per normalization unit + Read Req: + rst: The total number of read requests from sL1D to the :doc:`L2 `, per + :ref:`normalization unit `. + unit: Requests per normalization unit + Stall Cycles: + rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ + \ was stalled, per :ref:`normalization unit `." + unit: Cycles per normalization unit + Write Req: + rst: The total number of write requests from sL1D to the :doc:`L2 `, per + :ref:`normalization unit `. Typically unused on current + CDNA accelerators. 
+ unit: Requests per normalization unit sL1D-L2 BW: rst: "The total number of bytes read from, written to, or atomically updated \ \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ @@ -4259,66 +1194,6 @@ Scalar L1D Cache - L2 Interface: \ unused on current CDNA accelerators, so in the majority of cases this can\ \ be interpreted as an sL1D\u2192L2 read bandwidth." unit: Bytes per normalization unit - Req: - rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization - unit `. - unit: Requests per normalization unit - Hits: - rst: The total number of sL1D requests that hit on a previously loaded cache line, - per :ref:`normalization unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was not* - already pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Misses- Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was* already - pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Read Req (Total): - rst: The total number of sL1D read requests of any size, per :ref:`normalization - unit `. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests from sL1D to the :doc:`L2 `, - per :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Read Req (1 DWord): - rst: The total number of sL1D read requests made for a single dword of data (4B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (2 DWord): - rst: The total number of sL1D read requests made for a two dwords of data (8B), - per :ref:`normalization unit `. 
- unit: Requests per normalization unit - Read Req (4 DWord): - rst: The total number of sL1D read requests made for a four dwords of data (16B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (8 DWord): - rst: The total number of sL1D read requests made for a eight dwords of data (32B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (16 DWord): - rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: The total number of read requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Stall Cycles: - rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ - \ was stalled, per :ref:`normalization unit `." - unit: Cycles per normalization unit L1I Speed-of-Light: Bandwidth: rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical @@ -4335,252 +1210,104 @@ L1I Speed-of-Light: \ achieved. Calculated as the ratio of the total number of requests from the\ \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." unit: Percent - Req: - rst: The total number of requests made to the L1I per normalization-unit - unit: Requests per normalization unit - Hits: - rst: The total number of L1I requests that hit on a previously loaded cache line, - per :ref:`normalization-unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were - not* already pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. 
- unit: Requests per normalization unit - Misses - Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were* already - pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Instruction Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. - unit: Cycles L1I cache accesses: - Bandwidth: - rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I - cycles `. - unit: Percent Cache Hit Rate: rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. unit: Percent - L1I-L2 Bandwidth: - rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ - \ achieved. Calculated as the ratio of the total number of requests from the\ - \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." - unit: Percent - Req: - rst: The total number of requests made to the L1I per normalization-unit - unit: Requests per normalization unit Hits: rst: The total number of L1I requests that hit on a previously loaded cache line, per :ref:`normalization-unit `. unit: Requests per normalization unit + Instruction Fetch Latency: + rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. + unit: Cycles + Misses - Duplicated: + rst: The total number of L1I requests that missed on a cache line that *were* already + pending due to another request, per :ref:`normalization-unit `. + See note in :ref:`desc-l1i-sol` for more detail. 
+ unit: Requests per normalization unit Misses - Non Duplicated: rst: The total number of L1I requests that missed on a cache line that *were not* already pending due to another request, per :ref:`normalization-unit `. See note in :ref:`desc-l1i-sol` for more detail. unit: Requests per normalization unit - Misses - Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were* already - pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. + Req: + rst: The total number of requests made to the L1I per normalization-unit unit: Requests per normalization unit - Instruction Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. - unit: Cycles L1I <-> L2 interface: - Bandwidth: - rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I - cycles `. - unit: Percent - Cache Hit Rate: - rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line - the cache. Calculated as the ratio of the number of L1I requests that hit over - the number of all L1I requests. - unit: Percent L1I-L2 Bandwidth: rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ \ achieved. Calculated as the ratio of the total number of requests from the\ \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." unit: Percent - Req: - rst: The total number of requests made to the L1I per normalization-unit - unit: Requests per normalization unit - Hits: - rst: The total number of L1I requests that hit on a previously loaded cache line, - per :ref:`normalization-unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were - not* already pending due to another request, per :ref:`normalization-unit `. 
- See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Misses - Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were* already - pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Instruction Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. - unit: Cycles Workgroup manager utilizations: Accelerator Utilization: rst: The percent of cycles in the kernel where the accelerator was actively doing any work. unit: Percent + Dispatched Wavefronts: + rst: The total number of wavefronts, summed over all workgroups, forming this + kernel launch. + unit: Wavefronts + Dispatched Workgroups: + rst: The total number of workgroups forming this kernel launch. + unit: Workgroups + SGPR Writes: + rst: The average number of cycles spent initializing :ref:`SGPRs ` at + wave creation. + unit: Cycles/wave + SIMD Utilization: + rst: The percent of :ref:`total SIMD cycles ` in the kernel where + any :ref:`SIMD ` on a CU was actively doing any work, summed over + all CUs. Low values (less than 100%) indicate that the accelerator was not + fully saturated by the kernel, or a potential load-imbalance issue. + unit: Percent Scheduler-Pipe Utilization: rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the kernel where the scheduler-pipes were actively doing any work. Note: this value is expected to range between 0% and 25%. See :ref:`desc-spi`.' unit: Percent - Workgroup Manager Utilization: - rst: The percent of cycles in the kernel where the workgroup manager was actively - doing any work. - unit: Percent Shader Engine Utilization: rst: The percent of :ref:`total shader engine cycles ` in the kernel where any CU in a shader-engine was actively doing any work, normalized over all shader-engines. 
Low values (e.g., << 100%) indicate that the accelerator was not fully saturated by the kernel, or a potential load-imbalance issue. unit: Percent - SIMD Utilization: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - any :ref:`SIMD ` on a CU was actively doing any work, summed over - all CUs. Low values (less than 100%) indicate that the accelerator was not - fully saturated by the kernel, or a potential load-imbalance issue. - unit: Percent - Dispatched Workgroups: - rst: The total number of workgroups forming this kernel launch. - unit: Workgroups - Dispatched Wavefronts: - rst: The total number of wavefronts, summed over all workgroups, forming this - kernel launch. - unit: Wavefronts VGPR Writes: rst: The average number of cycles spent initializing :ref:`VGPRs ` at wave creation. unit: Cycles/wave - SGPR Writes: - rst: The average number of cycles spent initializing :ref:`SGPRs ` at - wave creation. - unit: Cycles/wave - Not-scheduled Rate (Workgroup Manager): - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to a bottleneck within the workgroup manager rather than a lack of a CU - or :ref:`SIMD ` with sufficient resources. Note: this value is expected - to range between 0-25%. See note in :ref:`workgroup manager ` description.' - unit: Percent - Not-scheduled Rate (Scheduler-Pipe): - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to a bottleneck within the scheduler-pipes rather than a lack of a CU or - :ref:`SIMD ` with sufficient resources. Note: this value is expected - to range between 0-25%, see note in :ref:`workgroup manager ` description.' 
- unit: Percent - Scheduler-Pipe Stall Rate: - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to occupancy limitations (like a lack of a CU or :ref:`SIMD ` - with sufficient resources). Note: this value is expected to range between 0-25%, - see note in :ref:`workgroup manager ` description.' - unit: Percent - Scratch Stall Rate: - rst: The percent of :ref:`total shader-engine cycles ` in the kernel - where a workgroup could not be scheduled to a :doc:`CU ` due - to lack of :ref:`private (a.k.a., scratch) memory ` slots. While - this can reach up to 100%, note that the actual occupancy limitations on a kernel - using private memory are typically quite small (for example, less than 1% of - the total number of waves that can be scheduled to an accelerator). - unit: Percent - Insufficient SIMD Waveslots: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`waveslots `. - unit: Percent - Insufficient SIMD VGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`VGPRs `. - unit: Percent - Insufficient SIMD SGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`SGPRs `. - unit: Percent - Insufficient CU LDS: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :doc:`LDS `. - unit: Percent - Insufficient CU Barriers: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :ref:`barriers `. 
- unit: Percent - Reached CU Workgroup Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). - unit: Percent - Reached CU Wavefront Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a wavefront could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). + Workgroup Manager Utilization: + rst: The percent of cycles in the kernel where the workgroup manager was actively + doing any work. unit: Percent Workgroup Manager - Resource Allocation: - Accelerator Utilization: - rst: The percent of cycles in the kernel where the accelerator was actively doing - any work. + Insufficient CU Barriers: + rst: The percent of :ref:`total CU cycles ` in the kernel where + a workgroup could not be scheduled to a :doc:`CU ` due to lack + of available :ref:`barriers `. unit: Percent - Scheduler-Pipe Utilization: - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where the scheduler-pipes were actively doing any work. Note: this value - is expected to range between 0% and 25%. See :ref:`desc-spi`.' + Insufficient CU LDS: + rst: The percent of :ref:`total CU cycles ` in the kernel where + a workgroup could not be scheduled to a :doc:`CU ` due to lack + of available :doc:`LDS `. unit: Percent - Workgroup Manager Utilization: - rst: The percent of cycles in the kernel where the workgroup manager was actively - doing any work. - unit: Percent - Shader Engine Utilization: - rst: The percent of :ref:`total shader engine cycles ` in the kernel - where any CU in a shader-engine was actively doing any work, normalized over - all shader-engines. 
Low values (e.g., << 100%) indicate that the accelerator - was not fully saturated by the kernel, or a potential load-imbalance issue. - unit: Percent - SIMD Utilization: + Insufficient SIMD SGPRs: rst: The percent of :ref:`total SIMD cycles ` in the kernel where - any :ref:`SIMD ` on a CU was actively doing any work, summed over - all CUs. Low values (less than 100%) indicate that the accelerator was not - fully saturated by the kernel, or a potential load-imbalance issue. + a workgroup could not be scheduled to a :ref:`SIMD ` due to lack + of available :ref:`SGPRs `. unit: Percent - Dispatched Workgroups: - rst: The total number of workgroups forming this kernel launch. - unit: Workgroups - Dispatched Wavefronts: - rst: The total number of wavefronts, summed over all workgroups, forming this - kernel launch. - unit: Wavefronts - VGPR Writes: - rst: The average number of cycles spent initializing :ref:`VGPRs ` at - wave creation. - unit: Cycles/wave - SGPR Writes: - rst: The average number of cycles spent initializing :ref:`SGPRs ` at - wave creation. - unit: Cycles/wave - Not-scheduled Rate (Workgroup Manager): - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to a bottleneck within the workgroup manager rather than a lack of a CU - or :ref:`SIMD ` with sufficient resources. Note: this value is expected - to range between 0-25%. See note in :ref:`workgroup manager ` description.' + Insufficient SIMD VGPRs: + rst: The percent of :ref:`total SIMD cycles ` in the kernel where + a workgroup could not be scheduled to a :ref:`SIMD ` due to lack + of available :ref:`VGPRs `. + unit: Percent + Insufficient SIMD Waveslots: + rst: The percent of :ref:`total SIMD cycles ` in the kernel where + a workgroup could not be scheduled to a :ref:`SIMD ` due to lack + of available :ref:`waveslots `. 
unit: Percent Not-scheduled Rate (Scheduler-Pipe): rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the @@ -4589,6 +1316,25 @@ Workgroup Manager - Resource Allocation: :ref:`SIMD ` with sufficient resources. Note: this value is expected to range between 0-25%, see note in :ref:`workgroup manager ` description.' unit: Percent + Not-scheduled Rate (Workgroup Manager): + rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the + kernel where a workgroup could not be scheduled to a :doc:`CU ` + due to a bottleneck within the workgroup manager rather than a lack of a CU + or :ref:`SIMD ` with sufficient resources. Note: this value is expected + to range between 0-25%. See note in :ref:`workgroup manager ` description.' + unit: Percent + Reached CU Wavefront Limit: + rst: The percent of :ref:`total CU cycles ` in the kernel where + a wavefront could not be scheduled to a :doc:`CU ` due to limits + within the workgroup manager. This is expected to be always be zero on CDNA2 + or newer accelerators (and small for previous accelerators). + unit: Percent + Reached CU Workgroup Limit: + rst: The percent of :ref:`total CU cycles ` in the kernel where + a workgroup could not be scheduled to a :doc:`CU ` due to limits + within the workgroup manager. This is expected to be always be zero on CDNA2 + or newer accelerators (and small for previous accelerators). + unit: Percent Scheduler-Pipe Stall Rate: rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the kernel where a workgroup could not be scheduled to a :doc:`CU ` @@ -4604,121 +1350,36 @@ Workgroup Manager - Resource Allocation: using private memory are typically quite small (for example, less than 1% of the total number of waves that can be scheduled to an accelerator). unit: Percent - Insufficient SIMD Waveslots: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`waveslots `. 
- unit: Percent - Insufficient SIMD VGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`VGPRs `. - unit: Percent - Insufficient SIMD SGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`SGPRs `. - unit: Percent - Insufficient CU LDS: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :doc:`LDS `. - unit: Percent - Insufficient CU Barriers: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :ref:`barriers `. - unit: Percent - Reached CU Workgroup Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). - unit: Percent - Reached CU Wavefront Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a wavefront could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). - unit: Percent Command processor fetcher (CPF): + CPF Stall: + rst: Percent of CPF busy cycles where the CPF was stalled for any reason. + unit: Percent CPF Utilization: rst: Percent of total cycles where the CPF was busy actively doing any work. The ratio of CPF busy cycles over total cycles counted by the CPF. unit: Percent - CPF Stall: - rst: Percent of CPF busy cycles where the CPF was stalled for any reason. 
+ CPF-L2 Stall: + rst: Percent of CPF-:doc:`L2 ` L2 busy cycles where the CPF-L2 interface + was stalled for any reason. unit: Percent CPF-L2 Utilization: rst: Percent of total cycles counted by the CPF-:doc:`L2 ` interface where the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2. unit: Percent - CPF-L2 Stall: - rst: Percent of CPF-:doc:`L2 ` L2 busy cycles where the CPF-L2 interface - was stalled for any reason. - unit: Percent CPF-UTCL1 Stall: rst: Percent of CPF busy cycles where the CPF was stalled by address translation. unit: Percent - CPC Utilization: - rst: Percent of total cycles where the CPC was busy actively doing any work. The - ratio of CPC busy cycles over total cycles counted by the CPC. - unit: Percent - CPC Stall Rate: - rst: Percent of CPC busy cycles where the CPC was stalled for any reason. - unit: Percent - CPC Packet Decoding Utilization: - rst: Percent of CPC busy cycles spent decoding commands for processing. - unit: Percent - CPC-Workgroup Manager Utilization: - rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup - manager `. - unit: Percent - CPC-L2 Utilization: - rst: Percent of total cycles counted by the CPC-:doc:`L2 ` interface where - the CPC-L2 interface was active doing any work. - unit: Percent - CPC-UTCL1 Stall: - rst: Percent of CPC busy cycles where the CPC was stalled by address translation - unit: Percent - CPC-UTCL2 Utilization: - rst: Percent of total cycles counted by the CPC's :doc:`L2 ` address translation - interface where the CPC was busy doing address translation work. - unit: Percent Command processor packet processor (CPC): - CPF Utilization: - rst: Percent of total cycles where the CPF was busy actively doing any work. The - ratio of CPF busy cycles over total cycles counted by the CPF. - unit: Percent - CPF Stall: - rst: Percent of CPF busy cycles where the CPF was stalled for any reason. 
- unit: Percent - CPF-L2 Utilization: - rst: Percent of total cycles counted by the CPF-:doc:`L2 ` interface where - the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles - over total cycles counted by the CPF-L2. - unit: Percent - CPF-L2 Stall: - rst: Percent of CPF-:doc:`L2 ` L2 busy cycles where the CPF-L2 interface - was stalled for any reason. - unit: Percent - CPF-UTCL1 Stall: - rst: Percent of CPF busy cycles where the CPF was stalled by address translation. - unit: Percent - CPC Utilization: - rst: Percent of total cycles where the CPC was busy actively doing any work. The - ratio of CPC busy cycles over total cycles counted by the CPC. + CPC Packet Decoding Utilization: + rst: Percent of CPC busy cycles spent decoding commands for processing. unit: Percent CPC Stall Rate: rst: Percent of CPC busy cycles where the CPC was stalled for any reason. unit: Percent - CPC Packet Decoding Utilization: - rst: Percent of CPC busy cycles spent decoding commands for processing. - unit: Percent - CPC-Workgroup Manager Utilization: - rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup - manager `. + CPC Utilization: + rst: Percent of total cycles where the CPC was busy actively doing any work. The + ratio of CPC busy cycles over total cycles counted by the CPC. unit: Percent CPC-L2 Utilization: rst: Percent of total cycles counted by the CPC-:doc:`L2 ` interface where @@ -4731,26 +1392,74 @@ Command processor packet processor (CPC): rst: Percent of total cycles counted by the CPC's :doc:`L2 ` address translation interface where the CPC was busy doing address translation work. unit: Percent + CPC-Workgroup Manager Utilization: + rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup + manager `. + unit: Percent System Speed-of-Light: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. 
This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' - unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GOIPs - MFMA FLOPs (F8): - rst: 'The total number of 8-bit brain floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit brain floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F8 MFMA operations achievable on the specific - accelerator. It is supported on AMD Instinct MI300 series and later only.' - unit: GFLOPs + Active CUs: + rst: Total number of active compute units (CUs) on the accelerator during the + kernel execution. + unit: Number + Branch Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`branch ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`scheduler ` issuing branch instructions + over the :ref:`total CU cycles `. + unit: Percent + IPC: + rst: The ratio of the total number of instructions executed on the :doc:`CU ` + over the :ref:`total active CU cycles `. + unit: Instructions per-cycle + L1I BW: + rst: The number of bytes looked up in the L1I cache per unit time. This is also + presented as a percent of the peak theoretical bandwidth achievable on the + specific accelerator. + unit: Percent + L1I Fetch Latency: + rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. + unit: Cycles + L1I Hit Rate: + rst: The percent of L1I requests that hit on a previously loaded line the cache. 
+ Calculated as the ratio of the number of L1I requests that hit over the number + of all L1I requests. + unit: GB/s + L2 Cache BW: + rst: The number of bytes looked up in the L2 cache per unit time. The number of + bytes is calculated as the number of cache lines requested multiplied by the + cache line size. This value does not consider partial requests, so e.g., if + only a single value is requested in a cache line, the data movement will still + be counted as a full cache line. This is also presented as a percent of the + peak theoretical bandwidth achievable on the specific accelerator. + unit: GB/s + L2 Cache Hit Rate: + rst: The ratio of the number of L2 cache line requests that hit in the L2 cache + over the total number of incoming cache line requests to the L2 cache. + unit: Percent + L2-Fabric Read BW: + rst: "The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122\ + \ interface ` per unit time. This is also presented as a percent\ + \ of the peak theoretical bandwidth achievable on the specific accelerator." + unit: GB/s + L2-Fabric Read Latency: + rst: The time-averaged number of cycles read requests spent in Infinity Fabric before + data was returned to the L2. + unit: Cycles + L2-Fabric Write BW: + rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface + ` by write and atomic operations per unit time. This is also presented + as a percent of the peak theoretical bandwidth achievable on the specific accelerator. + unit: GB/s + L2-Fabric Write Latency: + rst: The time-averaged number of cycles write requests spent in Infinity Fabric + before a completion acknowledgement was returned to the L2. + unit: Cycles + LDS Bank Conflicts/Access: + rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler ` + due to bank conflicts (as determined by the conflict resolution hardware) to + the base number of cycles that would be spent in the LDS scheduler in a completely uncontended + case. 
This is also presented in normalized form (i.e., the Bank Conflict Rate). + unit: Conflicts/Access MFMA FLOPs (BF16): rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations executed per second. Note: this does not include any 16-bit brain @@ -4776,34 +1485,61 @@ System Speed-of-Light: from :ref:`VALU ` instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator.' unit: GFLOPs + MFMA FLOPs (F8): + rst: 'The total number of 8-bit brain floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from :ref:`VALU ` instructions. This is also presented + as a percent of the peak theoretical F8 MFMA operations achievable on the specific + accelerator. It is supported on AMD Instinct MI300 series and later only.' + unit: GFLOPs MFMA IOPs (Int8): rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed per second. Note: this does not include any 8-bit integer operations from :ref:`VALU ` instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.' unit: GIOPs - Active CUs: - rst: Total number of active compute units (CUs) on the accelerator during the - kernel execution. - unit: Number + MFMA Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total + CU cycles `. + unit: Percent SALU Utilization: rst: Indicates what percent of the kernel's duration the :ref:`SALU ` was busy executing instructions. Computed as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM ` instructions over the :ref:`total CU cycles `. 
unit: Percent + Theoretical LDS Bandwidth: + rst: Indicates the maximum amount of bytes that could have been loaded from, stored + to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth + ` example for more detail). This is also presented as a percent + of the peak theoretical F64 MFMA operations achievable on the specific accelerator. + unit: GB/s + VALU Active Threads: + rst: Indicates the average level of :ref:`divergence ` within + a wavefront over the lifetime of the kernel. The number of work-items that were + active in a wavefront during execution of each :ref:`VALU ` instruction, + time-averaged over all VALU instructions run on all wavefronts in the kernel. + unit: Work-items + VALU FLOPs: + rst: 'The total floating-point operations executed per second on the :ref:`VALU + `. This is also presented as a percent of the peak theoretical FLOPs + achievable on the specific accelerator. Note: this does not include any floating-point + operations from :ref:`MFMA ` instructions.' + unit: GFLOPs + VALU IOPs: + rst: 'The total integer operations executed per second on the :ref:`VALU `. + This is also presented as a percent of the peak theoretical IOPs achievable + on the specific accelerator. Note: this does not include any integer operations + from :ref:`MFMA ` instructions.' + unit: GOIPs VALU Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VALU ` was busy executing instructions. Does not include :ref:`VMEM ` operations. Computed as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing VALU instructions over the :ref:`total CU cycles `. unit: Percent - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. 
- unit: Percent VMEM Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` unit was busy executing instructions, including both global/generic and spill/scratch @@ -4812,43 +1548,21 @@ System Speed-of-Light: as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing VMEM instructions over the :ref:`total CU cycles `. unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. - unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within - a wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items - IPC: - rst: The ratio of the total number of instructions executed on the :doc:`CU ` - over the :ref:`total active CU cycles `. - unit: Instructions per-cycle Wavefront Occupancy: rst: 'The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1ms). This is also presented as a percent of the peak theoretical occupancy achievable on the specific accelerator.' unit: Wavefronts - Theoretical LDS Bandwidth: - rst: Indicates the maximum amount of bytes that could have been loaded from, stored - to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth - ` example for more detail). This is also presented as a percent - of the peak theoretical F64 MFMA operations achievable on the specific accelerator. + sL1D Cache BW: + rst: The number of bytes looked up in the sL1D cache per unit time. 
This is also + presented as a percent of the peak theoretical bandwidth achievable on the + specific accelerator. unit: GB/s - LDS Bank Conflicts/Access: - rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler ` - due to bank conflicts (as determined by the conflict resolution hardware) to - the base number of cycles that would be spent in the LDS scheduler in a completely uncontended - case. This is also presented in normalized form (i.e., the Bank Conflict Rate). - unit: Conflicts/Access - vL1D Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D cache RAM `. + sL1D Cache Hit Rate: + rst: The percent of sL1D requests that hit on a previously loaded line the cache. + Calculated as the ratio of the number of sL1D requests that hit over the number + of all sL1D requests. unit: Percent vL1D Cache BW: rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM @@ -4859,56 +1573,7 @@ System Speed-of-Light: cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. unit: GB/s - L2 Cache Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. + vL1D Cache Hit Rate: + rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache + over the total number of cache line requests to the :ref:`vL1D cache RAM `. unit: Percent - L2 Cache BW: - rst: The number of bytes looked up in the L2 cache per unit time. The number of - bytes is calculated as the number of cache lines requested multiplied by the - cache line size. This value does not consider partial requests, so e.g., if - only a single value is requested in a cache line, the data movement will still - be counted as a full cache line. 
This is also presented as a percent of the - peak theoretical bandwidth achievable on the specific accelerator. - unit: GB/s - L2-Fabric Read BW: - rst: "The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122\ - \ interface ` per unit time. This is also presented as a percent\ - \ of the peak theoretical bandwidth achievable on the specific accelerator." - unit: GB/s - L2-Fabric Write BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. This is also presented - as a percent of the peak theoretical bandwidth achievable on the specific accelerator. - unit: GB/s - L2-Fabric Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - L2-Fabric Write Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - sL1D Cache Hit Rate: - rst: The percent of sL1D requests that hit on a previously loaded line the cache. - Calculated as the ratio of the number of sL1D requests that hit over the number - of all sL1D requests. - unit: Percent - sL1D Cache BW: - rst: The number of bytes looked up in the sL1D cache per unit time. This is also - presented as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. - unit: GB/s - L1I Hit Rate: - rst: The percent of L1I requests that hit on a previously loaded line the cache. - Calculated as the ratio of the number of L1I requests that hit over the number - of all L1I requests. - unit: GB/s - L1I BW: - rst: The number of bytes looked up in the L1I cache per unit time. This is also - presented as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. - unit: Percent - L1I Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. 
- unit: Cycles diff --git a/projects/rocprofiler-compute/src/argparser.py b/projects/rocprofiler-compute/src/argparser.py index c8b6c52946..a39d2ee97b 100644 --- a/projects/rocprofiler-compute/src/argparser.py +++ b/projects/rocprofiler-compute/src/argparser.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import argparse import os diff --git a/projects/rocprofiler-compute/src/config.py b/projects/rocprofiler-compute/src/config.py index 88cde44384..c6b2da9dae 100644 --- a/projects/rocprofiler-compute/src/config.py +++ b/projects/rocprofiler-compute/src/config.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,18 +10,21 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+############################################################################## + + +import re from pathlib import Path # NB: Creating a new module to share global vars across modules @@ -30,6 +33,7 @@ PROJECT_NAME = "rocprofiler-compute" HIDDEN_COLUMNS = ["coll_level"] HIDDEN_COLUMNS_CLI = ["Description", "coll_level"] +HIDDEN_COLUMNS_TUI = ["Description", "coll_level"] HIDDEN_SECTIONS = [400, 1900, 2000] -TIME_UNITS = {"s": 10 ** 9, "ms": 10 ** 6, "us": 10 ** 3, "ns": 1} +TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1} diff --git a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py index 39709256a4..55a691b5a1 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import copy import os diff --git a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py index 4a4d852e76..5d711482ff 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + from rocprof_compute_analyze.analysis_base import OmniAnalyze_Base from utils import file_io, parser, tty diff --git a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py index c5d5c61599..47819cfb00 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import copy import os diff --git a/projects/rocprofiler-compute/src/rocprof_compute_base.py b/projects/rocprofiler-compute/src/rocprof_compute_base.py index c6ee78b765..12e5983cbe 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_base.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import argparse import importlib diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py index c7821e6590..85c03a61ba 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_base.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,23 +10,26 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import csv import glob import logging import os import re +import shlex import shutil import time from abc import ABC, abstractmethod @@ -462,7 +465,9 @@ class RocProfCompute_Base: method=self.get_args().pc_sampling_method, interval=self.get_args().pc_sampling_interval, workload_dir=self.get_args().path, - appcmd=self.get_args().remaining, + appcmd=shlex.split( + self.get_args().remaining + ), # FIXME: the right solution is applying it when argparsing once! rocprofiler_sdk_library_path=self.get_args().rocprofiler_sdk_library_path, ) end_run_prof = time.time() diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v1.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v1.py index 2fc2483535..d9dbe9a581 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v1.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v1.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import os from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v2.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v2.py index ad5544f4b9..ef57f80543 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v2.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v2.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import os import shlex diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py index 0c56d0d2bd..bb6e98a272 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprof_v3.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import os import shlex diff --git a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py index 29994b63cd..06be4e8791 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_profile/profiler_rocprofiler_sdk.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import os import shlex diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml index 754cbbb688..67c3aa1dfc 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml @@ -63,6 +63,9 @@ Panel Config: unit by the address processor summed over all compute units on the accelerator, per normalization unit. This is expected to be the sum of global/generic and spill/stack atomics in the address processor. + Write Ack Instructions: The total number of write acknowledgements submitted by + data-return unit to SQ, summed over all compute units on the accelerator, per + normalization unit. data source: - metric_table: id: 1501 diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml index d9bc1ca1a9..6e77eb8f93 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml @@ -87,6 +87,12 @@ Panel Config: by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. + Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. 
+ Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -143,6 +149,12 @@ Panel Config: Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per normalization unit. + Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe + traffic, per normalization unit. + "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read + requests due to Infinity Fabric traffic, per normalization unit. + Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM + traffic, per normalization unit. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -158,6 +170,18 @@ Panel Config: Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. + Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to + PCIe traffic, per normalization unit. + "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write + requests due to Infinity Fabric traffic, per normalization unit. + Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM + traffic, per normalization unit. + Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to + PCIe traffic, per normalization unit. 
+ "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic + requests due to Infinity Fabric traffic, per normalization unit. + Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to + HBM traffic, per normalization unit. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. Note that on current CDNA accelerators, such as the MI2XX, diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml index 4d808aecab..0d826ceb1b 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml @@ -63,6 +63,9 @@ Panel Config: unit by the address processor summed over all compute units on the accelerator, per normalization unit. This is expected to be the sum of global/generic and spill/stack atomics in the address processor. + Write Ack Instructions: The total number of write acknowledgements submitted by + data-return unit to SQ, summed over all compute units on the accelerator, per + normalization unit. 
data source: - metric_table: id: 1501 diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml index f3ecdc468c..14398e1104 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml @@ -87,6 +87,12 @@ Panel Config: by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. + Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. + Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -143,6 +149,12 @@ Panel Config: Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per normalization unit. + Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe + traffic, per normalization unit. + "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read + requests due to Infinity Fabric traffic, per normalization unit. + Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM + traffic, per normalization unit. 
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -158,6 +170,18 @@ Panel Config: Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. + Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to + PCIe traffic, per normalization unit. + "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write + requests due to Infinity Fabric traffic, per normalization unit. + Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM + traffic, per normalization unit. + Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to + PCIe traffic, per normalization unit. + "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic + requests due to Infinity Fabric traffic, per normalization unit. + Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to + HBM traffic, per normalization unit. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml index f920234926..cdbb5393aa 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml @@ -63,6 +63,9 @@ Panel Config: unit by the address processor summed over all compute units on the accelerator, per normalization unit. This is expected to be the sum of global/generic and spill/stack atomics in the address processor. + Write Ack Instructions: The total number of write acknowledgements submitted by + data-return unit to SQ, summed over all compute units on the accelerator, per + normalization unit. data source: - metric_table: id: 1501 diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml index c2b82a38ec..36d5943858 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml @@ -87,6 +87,12 @@ Panel Config: by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. + Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. 
+ Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -143,6 +149,12 @@ Panel Config: Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per normalization unit. + Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe + traffic, per normalization unit. + "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read + requests due to Infinity Fabric traffic, per normalization unit. + Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM + traffic, per normalization unit. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -158,6 +170,18 @@ Panel Config: Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. + Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to + PCIe traffic, per normalization unit. + "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write + requests due to Infinity Fabric traffic, per normalization unit. + Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM + traffic, per normalization unit. + Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to + PCIe traffic, per normalization unit. 
+ "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic + requests due to Infinity Fabric traffic, per normalization unit. + Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to + HBM traffic, per normalization unit. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. Note that on current CDNA accelerators, such as the MI2XX, diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml index f920234926..cdbb5393aa 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml @@ -63,6 +63,9 @@ Panel Config: unit by the address processor summed over all compute units on the accelerator, per normalization unit. This is expected to be the sum of global/generic and spill/stack atomics in the address processor. + Write Ack Instructions: The total number of write acknowledgements submitted by + data-return unit to SQ, summed over all compute units on the accelerator, per + normalization unit. 
data source: - metric_table: id: 1501 diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml index f1fd043df1..e7acf40a5c 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml @@ -87,6 +87,12 @@ Panel Config: by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. + Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. + Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -143,6 +149,12 @@ Panel Config: Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per normalization unit. + Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe + traffic, per normalization unit. + "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read + requests due to Infinity Fabric traffic, per normalization unit. + Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM + traffic, per normalization unit. 
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -158,6 +170,18 @@ Panel Config: Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. + Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to + PCIe traffic, per normalization unit. + "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write + requests due to Infinity Fabric traffic, per normalization unit. + Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM + traffic, per normalization unit. + Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to + PCIe traffic, per normalization unit. + "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic + requests due to Infinity Fabric traffic, per normalization unit. + Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to + HBM traffic, per normalization unit. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml index f920234926..cdbb5393aa 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml @@ -63,6 +63,9 @@ Panel Config: unit by the address processor summed over all compute units on the accelerator, per normalization unit. This is expected to be the sum of global/generic and spill/stack atomics in the address processor. + Write Ack Instructions: The total number of write acknowledgements submitted by + data-return unit to SQ, summed over all compute units on the accelerator, per + normalization unit. data source: - metric_table: id: 1501 diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml index 35777aa064..0a72362ea7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml @@ -87,6 +87,12 @@ Panel Config: by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. + Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. 
+ Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -143,6 +149,12 @@ Panel Config: Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per normalization unit. + Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe + traffic, per normalization unit. + "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read + requests due to Infinity Fabric traffic, per normalization unit. + Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM + traffic, per normalization unit. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -158,6 +170,18 @@ Panel Config: Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. + Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to + PCIe traffic, per normalization unit. + "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write + requests due to Infinity Fabric traffic, per normalization unit. + Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM + traffic, per normalization unit. + Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to + PCIe traffic, per normalization unit. 
+ "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic + requests due to Infinity Fabric traffic, per normalization unit. + Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to + HBM traffic, per normalization unit. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. Note that on current CDNA accelerators, such as the MI2XX, diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml index dfe29d7b99..a37f24eab6 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml @@ -63,6 +63,9 @@ Panel Config: unit by the address processor summed over all compute units on the accelerator, per normalization unit. This is expected to be the sum of global/generic and spill/stack atomics in the address processor. + Write Ack Instructions: The total number of write acknowledgements submitted by + data-return unit to SQ, summed over all compute units on the accelerator, per + normalization unit. 
data source: - metric_table: id: 1501 diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml index 85abb7d025..c354429c0e 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml @@ -87,6 +87,12 @@ Panel Config: by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. + Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. + Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -143,6 +149,12 @@ Panel Config: Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per normalization unit. + Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe + traffic, per normalization unit. + "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read + requests due to Infinity Fabric traffic, per normalization unit. + Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM + traffic, per normalization unit. 
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -158,6 +170,18 @@ Panel Config: Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. + Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to + PCIe traffic, per normalization unit. + "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write + requests due to Infinity Fabric traffic, per normalization unit. + Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM + traffic, per normalization unit. + Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to + PCIe traffic, per normalization unit. + "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic + requests due to Infinity Fabric traffic, per normalization unit. + Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to + HBM traffic, per normalization unit. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -628,6 +652,21 @@ Panel Config: min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom)) max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) + Read Bandwidth - PCIe: + avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) + unit: (Bytes + $normUnit) + "Read Bandwidth - Infinity Fabric\u2122": + avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) + unit: (Bytes + $normUnit) + Read Bandwidth - HBM: + avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) + unit: (Bytes + $normUnit) Write and Atomic (32B): avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) @@ -654,19 +693,19 @@ Panel Config: max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) Write Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) "Write Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) Write Bandwidth - HBM: - avg: 
AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) Atomic: avg: AVG((TCC_EA0_ATOMIC_sum / $denom)) @@ -679,17 +718,17 @@ Panel Config: max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom)) unit: (Req + $normUnit) Atomic Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) "Atomic Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) Atomic Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py index 2ed4d1c295..bc1937978b 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py +++ 
b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import ctypes import glob diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx908.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx908.py index 4263e5f778..ce670a7c09 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx908.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx908.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx90a.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx90a.py index 8fb612668f..99fd3fc775 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx90a.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx90a.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx940.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx940.py index 8145693e21..a8d984d1bb 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx940.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx940.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx941.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx941.py index 105bb1cde1..9012fb3449 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx941.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx941.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx942.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx942.py index 0fb2c08ac3..96243ec7ca 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx942.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx942.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx950.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx950.py index 2d24cd61da..019e21c413 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx950.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_gfx950.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from pathlib import Path diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/analysis_tui.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/analysis_tui.py index ae13c6edbe..a8d40a2611 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/analysis_tui.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/analysis_tui.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,24 +10,28 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import copy -import sys from pathlib import Path from rocprof_compute_analyze.analysis_base import OmniAnalyze_Base -from rocprof_compute_tui.utils.tui_utils import process_panels_to_dataframes +from rocprof_compute_tui.utils.tui_utils import ( + get_top_kernels_and_dispatch_ids, + process_panels_to_dataframes, +) from utils import file_io, parser, schema from utils.kernel_name_shortener import kernel_name_shortener from utils.logger import console_error, demarcate @@ -38,23 +42,21 @@ class tui_analysis(OmniAnalyze_Base): super().__init__(args, supported_archs) self.path = str(path) self.arch = None + self.raw_dfs = {} + self.kernel_dfs = {} # ----------------------- # Required child methods # ----------------------- @demarcate def pre_processing(self): - """Perform any pre-processing steps prior to analysis.""" - # Read profiling config self._profiling_config = file_io.load_profiling_config(self.path) - # initalize runs self._runs = self.initalize_runs() if self.get_args().random_port: console_error("--gui flag is required to enable --random-port") - # create 'mega dataframe' self._runs[self.path].raw_pmc = file_io.create_df_pmc( self.path, self.get_args().nodes, @@ -80,22 +82,33 @@ class tui_analysis(OmniAnalyze_Base): kernel_verbose=self.get_args().kernel_verbose, ) - # demangle and overwrite original 'Kernel_Name' kernel_name_shortener( self._runs[self.path].raw_pmc, self.get_args().kernel_verbose ) - # create the loaded table - parser.load_table_data( - workload=self._runs[self.path], - dir=self.path, - is_gui=False, - args=self.get_args(), - config=self._profiling_config, + # 1. load top kernel + parser.load_kernel_top( + workload=self._runs[self.path], dir=self.path, args=self.get_args() ) + # 2. 
load table data for each kernel + self.raw_dfs.clear() + for idx in self._runs[self.path].raw_pmc.index: + kernel_df = self._runs[self.path].raw_pmc.loc[[idx]] + kernel_name = kernel_df.pmc_perf["Kernel_Name"].loc[idx] + this_dfs = copy.deepcopy(self._runs[self.path].dfs) + parser.eval_metric( + this_dfs, + self._runs[self.path].dfs_type, + self._runs[self.path].sys_info.iloc[0], + kernel_df, + self.get_args().debug, + self._profiling_config, + ) + + self.raw_dfs[kernel_name] = this_dfs + def initalize_runs(self, normalization_filter=None): - # load required configs sysinfo_path = Path(self.path) sys_info = file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv")) self.arch = sys_info.iloc[0]["gpu_arch"] @@ -111,10 +124,6 @@ class tui_analysis(OmniAnalyze_Base): self.load_options(normalization_filter) w = schema.Workload() - # FIXME: - # For regular single node case, load sysinfo.csv directly - # For multi-node, either the default "all", or specified some, - # pick up the one in the 1st sub_dir. We could fix it properly later. w.sys_info = file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv")) mspec = self.get_socs()[self.arch]._mspec if args.specs_correction: @@ -127,43 +136,14 @@ class tui_analysis(OmniAnalyze_Base): return self._runs @demarcate - def run_analysis(self): - """Run TUI analysis.""" - super().run_analysis() - - roof_plot = None - # 1. 
check if not baseline && compatible soc: - if self.arch in [ - # >= MI200 - "gfx90a", - "gfx940", - "gfx941", - "gfx942", - "gfx950", - ]: - # add roofline plot to cli output - self.get_socs()[self.arch].analysis_setup( - roofline_parameters={ - "workload_dir": self.path, - "device_id": 0, - "sort_type": "kernels", - "mem_level": "ALL", - "include_kernel_names": False, - "is_standalone": False, - "roofline_data_type": "FP32", - } + def run_kernel_analysis(self): + self.kernel_dfs.clear() + for kernel_name, df in self.raw_dfs.items(): + self.kernel_dfs[kernel_name] = process_panels_to_dataframes( + self.get_args(), df, self._arch_configs[self.arch], roof_plot=None ) - roof_obj = self.get_socs()[self.arch].roofline_obj + return self.kernel_dfs - if roof_obj: - # NOTE: using default data type - roof_plot = roof_obj.cli_generate_plot(roof_obj.get_dtype()[0]) - - results = process_panels_to_dataframes( - self.get_args(), - self._runs, - self._arch_configs[self.arch], - self._profiling_config, - roof_plot=roof_plot, - ) - return results + @demarcate + def run_top_kernel(self): + return get_top_kernels_and_dispatch_ids(self._runs) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/config.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/config.py index fc51effe2a..deada28a6a 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/config.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/config.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + """ Configuration Module @@ -30,7 +32,6 @@ Central configuration for the application. 
# Application settings APP_TITLE = "ROCm Compute Profiler TUI" -VERSION = "3.2.0" # Widget configurations DEFAULT_COLLAPSIBLE_STATE = True # True = collapsed by default diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/tui_app.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/tui_app.py index 4f07be9605..e21774a26a 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/tui_app.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/tui_app.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + """ ROCm Compute Profiler TUI - Main Application with Analysis Methods @@ -39,15 +41,18 @@ from textual.binding import Binding from textual.widgets import Button, Footer, Header from textual_fspicker import SelectDirectory -from rocprof_compute_tui.config import APP_TITLE, VERSION +import config +from rocprof_compute_tui.config import APP_TITLE from rocprof_compute_tui.views.main_view import MainView from rocprof_compute_tui.widgets.menu_bar.menu_bar import DropdownMenu from utils.specs import MachineSpecs, generate_machine_specs +from utils.utils import get_version class RocprofTUIApp(App): """Main application for the performance analysis tool.""" + VERSION = get_version(config.rocprof_compute_home)["version"] TITLE = f"{APP_TITLE} v{VERSION}" SUB_TITLE = "Workload Analysis Tool" @@ -55,7 +60,8 @@ class RocprofTUIApp(App): BINDINGS = [ Binding(key="q", action="quit", description="Quit"), Binding(key="r", action="refresh", description="Refresh"), - Binding(key="a", action="analyze", description="Analyze"), + # TODO + # Binding(key="a", action="analyze", description="Analyze"), ] def __init__( diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/analyze_config.yaml b/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/analyze_config.yaml deleted file mode 100644 index daea744094..0000000000 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/analyze_config.yaml +++ /dev/null @@ -1,51 +0,0 @@ -sections: - - title: "📊 Summaries" - collapsed: true - class: "summary-section" - subsections: - - title: "Top Kernels" - data_path: ["0. Top Stats", "0.1 Top Kernels"] - collapsed: true - header_label: "Top Kernels by Duration (ns):" - header_class: "section-header" - - title: "Dispatch List" - data_path: ["0. Top Stats", "0.2 Dispatch List"] - collapsed: true - - title: "System Info" - data_path: ["1. 
System Info", "1.1"] - collapsed: true - - - title: "⚡ High Level Analysis" - collapsed: true - class: "sysinfo-section" - subsections: - - title: "System Speed-of-Light" - data_path: ["2. System Speed-of-Light", "2.1 Speed-of-Light"] - collapsed: true - - title: "Roofline" - collapsed: true - tui_style: "roofline" - widget_id: "roofline-plot" - - title: "Memory Chart" - data_path: ["3. Memory Chart", "3.1 Memory Chart"] - collapsed: true - tui_style: "mem_chart" - - - title: "🔍 Detailed Block Analysis" - collapsed: true - class: "kernels-section" - dynamic_sections: true - skip_sections: - - "0. Top Stats" - - "1. System Info" - - "2. System Speed-of-Light" - - "3. Memory Chart" - - "4. Roofline" - - - title: "🚧 Source Level Analysis" - collapsed: true - class: "source-section" - subsections: - - title: "PC Sampling" - data_path: ["21. PC Sampling", "21.1 PC Sampling"] - collapsed: true diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/kernel_view_config.yaml b/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/kernel_view_config.yaml new file mode 100644 index 0000000000..a12e2f846b --- /dev/null +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/kernel_view_config.yaml @@ -0,0 +1,35 @@ +# TODO: add System Info +# - title: "System Info" +# data_path: ["1. System Info", "1.1"] +# collapsed: true +sections: + - title: "High Level Analysis" + collapsed: true + class: "sysinfo-section" + subsections: + - title: "System Speed-of-Light" + data_path: ["2. System Speed-of-Light", "2.1 System Speed-of-Light"] + collapsed: true + - title: "Memory Chart" + data_path: ["3. Memory Chart", "3.1 Memory Chart"] + collapsed: true + tui_style: "mem_chart" + + - title: "Detailed Block Analysis" + collapsed: true + class: "kernels-section" + dynamic_sections: true + skip_sections: + - "0. Top Stats" + - "1. System Info" + - "2. System Speed-of-Light" + - "3. Memory Chart" + - "4. 
Roofline" + + - title: "Source Level Analysis" + collapsed: true + class: "source-section" + subsections: + - title: "PC Sampling" + data_path: ["21. PC Sampling", "21.1 PC Sampling"] + collapsed: true diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/tui_utils.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/tui_utils.py index de56c56607..5f5f0ce0ed 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/tui_utils.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/utils/tui_utils.py @@ -1,87 +1,27 @@ -import copy import logging -import os -import re from collections import defaultdict from datetime import datetime from enum import Enum -from pathlib import Path +import numpy as np import pandas as pd -from config import HIDDEN_COLUMNS, HIDDEN_SECTIONS - -supported_field = [ - "Value", - "Minimum", - "Maximum", - "Average", - "Median", - "Min", - "Max", - "Avg", - "Pct of Peak", - "Peak", - "Count", - "Mean", - "Pct", - "Std Dev", - "Q1", - "Q3", - "Expression", - # Special keywords for L2 channel - "Channel", - "L2 Cache Hit Rate", - "Requests", - "L2 Read", - "L2 Write", - "L2 Atomic", - "L2-Fabric Requests", - "L2-Fabric Read", - "L2-Fabric Write and Atomic", - "L2-Fabric Atomic", - "L2 Read Req", - "L2 Write Req", - "L2 Atomic Req", - "L2-Fabric Read Req", - "L2-Fabric Write and Atomic Req", - "L2-Fabric Atomic Req", - "L2-Fabric Read Latency", - "L2-Fabric Write Latency", - "L2-Fabric Atomic Latency", - "L2-Fabric Read Stall (PCIe)", - "L2-Fabric Read Stall (Infinity Fabric™)", - "L2-Fabric Read Stall (HBM)", - "L2-Fabric Write Stall (PCIe)", - "L2-Fabric Write Stall (Infinity Fabric™)", - "L2-Fabric Write Stall (HBM)", - "L2-Fabric Write Starve", -] +import config class LogLevel(str, Enum): - """Log levels for consistent logging.""" - INFO = "info" WARNING = "warning" ERROR = "error" - SUCCESS = "success" # Maintained for UI compatibility + SUCCESS = "success" class Logger: - """Centralized 
logging handler for the application.""" - def __init__(self, output_area=None): - """ - Initialize the logger. - """ self.output_area = output_area self._setup_logger() def _setup_logger(self): - """ - Setup the Python logger with proper formatting. - """ self.logger = logging.getLogger("app") self.logger.setLevel(logging.INFO) @@ -94,15 +34,9 @@ class Logger: self.logger.addHandler(handler) def set_output_area(self, output_area): - """ - Set or update the output area for displaying logs. - """ self.output_area = output_area def log(self, message, level=LogLevel.INFO, update_ui=True): - """ - Log a message with the specified level. - """ level_map = { LogLevel.INFO: logging.INFO, LogLevel.SUCCESS: logging.INFO, @@ -145,151 +79,28 @@ class Logger: self.log(message, LogLevel.ERROR, update_ui) -def split_table_line(line): - """ - Splits a table row line into a list of cell strings (trimmed). For example: +def get_top_kernels_and_dispatch_ids(runs): + if not runs: + return None - │ │ Kernel_Name │ Count │ ... - """ + base_run = next(iter(runs.values())) + if not hasattr(base_run, "dfs"): + return None - cells = line.split("│") - if cells and cells[0] == "": - cells = cells[1:] - if cells and cells[-1] == "": - cells = cells[:-1] - return [cell.strip() for cell in cells] + top_kernel_df = base_run.dfs.get(1) + dispatch_id_df = base_run.dfs.get(2) + + if top_kernel_df is None or dispatch_id_df is None: + return None + + merged_df = pd.merge( + top_kernel_df, dispatch_id_df, on="Kernel_Name", how="outer" + ).sort_values("Pct", ascending=False) + + return merged_df.to_dict("records") -def parse_ascii_table(table_lines): - """ - Given a list of lines belonging to one ASCII table (including border rows), - return a tuple (header, data_rows) where header is a list of column names and - data_rows is a list of rows (each a list of cell strings). - - Skips border/separator lines and also checks for continuation - rows (which have an empty first cell). 
Continuation rows get merged into the previous row. - """ - - header = None - data_rows = [] - - for line in table_lines: - if re.match(r"^[╒╞╘├└─]+", line): - continue - if "│" not in line: - continue - - cells = split_table_line(line) - - if header is None: - header = cells - continue - - if cells and cells[0] == "": - if data_rows: # There should be at least one row already. - for i, cell in enumerate(cells): - if cell: - data_rows[-1][i] += " " + cell - else: - continue - else: - data_rows.append(cells) - return header, data_rows - - -def parse_file(filename): - """ - Returns nested structure: - { - "0. Top Stats": { - "0.1 Top Kernels": {header: [...], data: [...]}, - "0.2 Dispatch List": {header: [...], data: [...]} - }, - "1. System Info": { - "1.1 System Information": {header: [...], data: [...]} - }, - ... - } - """ - with open(filename, "r", encoding="utf-8") as f: - lines = f.readlines() - - sections = {} - current_section = None - current_subsection = None - table_lines = [] - in_table = False - - for line in lines: - line = line.rstrip("\n") - - # Skip separator lines - if line.startswith( - "--------------------------------------------------------------------------------" - ): - continue - - # Check for section header (e.g., "0. Top Stats") - section_match = re.match(r"^\s*(\d+\. .+)$", line) - if section_match: - current_section = section_match.group(1).strip() - sections[current_section] = {} - continue - - # Check for subsection header (e.g., "0.1 Top Kernels") - # FIXME: 1. 
System Info is an exception, no subsection - subsection_match = re.match(r"^\s*(\d+\.\d+ .+)$", line) - if subsection_match: - current_subsection = subsection_match.group(1).strip() - if current_section is None: - current_section = "Uncategorized" - sections[current_section] = {} - continue - - # Table parsing logic - if line.startswith("╒"): - in_table = True - table_lines = [line] - continue - - if in_table: - table_lines.append(line) - if line.startswith("╘"): - if current_section and current_subsection: - header, data = parse_ascii_table(table_lines) - sections[current_section][current_subsection] = { - "header": header, - "data": data, - } - in_table = False - table_lines = [] - - return sections - - -def get_table_dfs(): - filename = str(Path(os.getcwd()).joinpath("analyze_output.csv")) - sections_info = parse_file(filename) - - # Convert to DataFrames while maintaining nested structure - section_dfs = {} - for section_name, subsections in sections_info.items(): - section_dfs[section_name] = {} - for subsection_name, table_data in subsections.items(): - if table_data and table_data["data"]: - try: - df = pd.DataFrame(table_data["data"], columns=table_data["header"]) - section_dfs[section_name][subsection_name] = df - except Exception as e: - print(f"Error creating DataFrame for {subsection_name}: {e}") - continue - - return section_dfs - - -def process_panels_to_dataframes( - args, runs, archConfigs, profiling_config, roof_plot=None -): +def process_panels_to_dataframes(args, kernel_df, archConfigs, roof_plot=None): """ Process panel data into pandas DataFrames. Returns a nested dictionary structure with DataFrames and tui_style information. 
@@ -305,318 +116,87 @@ def process_panels_to_dataframes( } """ - comparable_columns = build_comparable_columns(args.time_unit) - filter_panel_ids = profiling_config.get("filter_blocks", []) - if isinstance(filter_panel_ids, dict): - # For backward compatibility - filter_panel_ids = [ - name for name, type in filter_panel_ids.items() if type == "metric_id" - ] - filter_panel_ids = [ - int(convert_metric_id_to_panel_info(metric_id)[0]) - for metric_id in filter_panel_ids - ] + # TODO: add individual kernel roofline logic + # TODO: implement args logic: + # args.filter_metrics + # args.cols + # args.max_stat_num + # args.df_file_dir - # Initialize the result structure result_structure = defaultdict(dict) + decimal_precision = getattr(args, "decimal", 2) if args else 2 + for panel_id, panel in archConfigs.panel_configs.items(): - # Skip panels that don't support baseline comparison - if panel_id in HIDDEN_SECTIONS: + if panel_id in config.HIDDEN_SECTIONS: continue - # Get section name (e.g., "0. Top Stats") section_name = f"{panel_id // 100}. {panel['title']}" for data_source in panel["data source"]: for type, table_config in data_source.items(): - # Check for filtering conditions - if ( - not args.filter_metrics - and filter_panel_ids - and table_config["id"] not in filter_panel_ids - and panel_id not in filter_panel_ids - and panel_id > 100 - ): - table_id_str = ( - str(table_config["id"] // 100) - + "." 
- + str(table_config["id"] % 100) - ) + table_id = table_config["id"] + + if table_id not in kernel_df: continue - # Process the data - base_run, base_data = next(iter(runs.items())) - base_df = base_data.dfs[table_config["id"]] + base_df = kernel_df[table_id] + + if base_df is None or base_df.empty: + continue df = pd.DataFrame(index=base_df.index) - # Process columns - for header in list(base_df.keys()): - if should_process_column(header, args, type): - if header in HIDDEN_COLUMNS: - pass - elif header not in comparable_columns: - df = process_non_comparable_column( - df, header, base_df, type, table_config, runs - ) - else: - df = process_comparable_column( - df, - header, - base_df, - table_config, - runs, - base_run, - type, - args, - HIDDEN_COLUMNS, - ) + for header in list(base_df.columns): + if header in config.HIDDEN_COLUMNS_TUI: + continue + else: + df[header] = base_df[header] - if not df.empty: - # Check for empty columns - is_empty_columns_exist = check_empty_columns(df) + df = apply_rounding_logic(df, decimal_precision) - if not is_empty_columns_exist: - # Get subsection name - table_id_str = ( - str(table_config["id"] // 100) - + "." - + str(table_config["id"] % 100) - ) - subsection_name = table_id_str - if "title" in table_config and table_config["title"]: - subsection_name += " " + table_config["title"] + subsection_name = ( + str(table_config["id"] // 100) + "." 
+ str(table_config["id"] % 100) + ) + if "title" in table_config and table_config["title"]: + subsection_name += " " + table_config["title"] - # Handle special cases for top stats - if type == "raw_csv_table" and ( - table_config["source"] == "pmc_kernel_top.csv" - or table_config["source"] == "pmc_dispatch_info.csv" - ): - df = df.head(args.max_stat_num) + result_structure[section_name][subsection_name] = { + "df": df, + "tui_style": None, + } - # Check for transpose requirement - transpose = ( - type != "raw_csv_table" - and "columnwise" in table_config - and table_config.get("columnwise") == True - ) + if type == "metric_table" and "tui_style" in table_config: + result_structure[section_name][subsection_name]["tui_style"] = ( + table_config["tui_style"] + ) - if transpose: - df = df.T - - # Store the DataFrame with tui_style as separate keys - result_structure[section_name][subsection_name] = { - "df": df, - "tui_style": None, - } - - # Set tui_style if available - if type == "metric_table" and "tui_style" in table_config: - result_structure[section_name][subsection_name][ - "tui_style" - ] = table_config["tui_style"] - - # Save to CSV if requested - if args.df_file_dir: - save_dataframe_to_csv(df, table_id_str, table_config, args) - result_structure["4. 
Roofline"] = roof_plot return dict(result_structure) -def should_process_column(header, args, type): - """Check if a column should be processed based on arguments.""" - return ( - (not args.cols) - or ( - args.cols and header in args.cols - ) # Assuming args.cols is now a list of column names - or (type == "raw_csv_table") - ) +def apply_rounding_logic(df, decimal_precision): + df_copy = df.copy() + for column in df_copy.columns: + if column in ["Metric", "Tips", "coll_level", "Unit", "Kernel_Name", "Info"]: + continue -def process_non_comparable_column(df, header, base_df, type, table_config, runs): - """Process columns that are not comparable across runs.""" - if ( - type == "raw_csv_table" - and ( - table_config["source"] == "pmc_kernel_top.csv" - or table_config["source"] == "pmc_dispatch_info.csv" - ) - and header == "Kernel_Name" - ): - # Adjust kernel name width based on source - if table_config["source"] == "pmc_kernel_top.csv": - adjusted_name = base_df["Kernel_Name"].apply( - lambda x: string_multiple_lines(x, 40, 3) - ) + if df_copy[column].dtype in ["float64", "float32", "int64", "int32"]: + df_copy[column] = df_copy[column].round(decimal_precision) else: - adjusted_name = base_df["Kernel_Name"].apply( - lambda x: string_multiple_lines(x, 80, 4) - ) - df = pd.concat([df, adjusted_name], axis=1) - elif type == "raw_csv_table" and header == "Info": - for run, data in runs.items(): - cur_df = data.dfs[table_config["id"]] - df = pd.concat([df, cur_df[header]], axis=1) - else: - df = pd.concat([df, base_df[header]], axis=1) + try: + numeric_series = pd.to_numeric(df_copy[column], errors="coerce") + if not numeric_series.isna().all(): + rounded_series = numeric_series.round(decimal_precision) - return df + if df_copy[column].dtype == "object": + df_copy[column] = df_copy[column].combine( + rounded_series, + lambda orig, rounded: rounded if pd.notna(rounded) else orig, + ) + else: + df_copy[column] = rounded_series + except (ValueError, TypeError): + continue 
- -def process_comparable_column( - df, header, base_df, table_config, runs, base_run, type, args, hidden_columns -): - """Process columns that can be compared across runs.""" - for run, data in runs.items(): - cur_df = data.dfs[table_config["id"]] - if (type == "raw_csv_table") or ( - type == "metric_table" and (header not in hidden_columns) - ): - if run != base_run: - # Calculate percentage over the baseline - base_values = [float(x) if x != "" else float(0) for x in base_df[header]] - cur_values = [float(x) if x != "" else float(0) for x in cur_df[header]] - - base_df[header] = base_values - cur_df[header] = cur_values - - t_df = pd.concat( - [base_df[header], cur_df[header]], - axis=1, - ) - absolute_diff = (t_df.iloc[:, 1] - t_df.iloc[:, 0]).round(args.decimal) - t_df = absolute_diff / t_df.iloc[:, 0].replace(0, 1) - - t_df_pretty = t_df.astype(float).mul(100).round(args.decimal) - - # Show value + percentage - t_df = ( - cur_df[header].astype(float).round(args.decimal).map(str).astype(str) - + " (" - + t_df_pretty.map(str) - + "%)" - ) - df = pd.concat([df, t_df], axis=1) - - # Check for threshold violations - if ( - header in ["Value", "Count", "Avg"] - and t_df_pretty.abs().gt(args.report_diff).any() - ): - df["Abs Diff"] = absolute_diff - if args.report_diff: - violation_idx = t_df_pretty.index[ - t_df_pretty.abs() > args.report_diff - ] - else: - cur_df_copy = copy.deepcopy(cur_df) - cur_df_copy[header] = [ - (round(float(x), args.decimal) if x != "" else x) - for x in base_df[header] - ] - df = pd.concat([df, cur_df_copy[header]], axis=1) - - return df - - -def check_empty_columns(df): - """Check if any column in the DataFrame is empty.""" - return any( - [ - df.columns[col_idx] - for col_idx in range(len(df.columns)) - if df.replace("", None).iloc[:, col_idx].isnull().all() - ] - ) - - -def save_dataframe_to_csv(df, table_id_str, table_config, args): - """Save DataFrame to CSV file if directory is specified.""" - p = Path(args.df_file_dir) - if not 
p.exists(): - p.mkdir() - if p.is_dir(): - filename = table_id_str - if "title" in table_config and table_config["title"]: - filename += "_" + table_config["title"] - df.to_csv( - p.joinpath(filename.replace(" ", "_") + ".csv"), - index=False, - ) - - -def string_multiple_lines(source, width, max_rows): - """ - Adjust string with multiple lines by inserting '\n' - """ - idx = 0 - lines = [] - while idx < len(source) and len(lines) < max_rows: - lines.append(source[idx : idx + width]) - idx += width - - if idx < len(source): - last = lines[-1] - lines[-1] = last[0:-3] + "..." - return "\n".join(lines) - - -def convert_metric_id_to_panel_info(metric_id): - """ - Convert metric id into panel information. - Output is a tuples of the form (file_id, panel_id, metric_id). - - For example: - - Input: "2" - Output: ("0200", None, None) - - Input: "11" - Output: ("1100", None, None) - - Input: "11.1" - Output: ("1100", 1101, None) - - Input: "11.1.1" - Output: ("1100", 1101, 1) - - Raises exception for invalid metric id. 
- """ - tokens = metric_id.split(".") - if 0 < len(tokens) < 4: - # File id - file_id = str(int(tokens[0])) - # 4 -> 04 - if len(file_id) < 2: - file_id = f"0{file_id}" - # Multiply integer by 100 - file_id = f"{file_id}00" - # Panel id - if len(tokens) > 1: - panel_id = int(tokens[0]) * 100 - panel_id += int(tokens[1]) - else: - panel_id = None - # Metric id - if len(tokens) > 2: - metric_id = int(tokens[2]) - else: - metric_id = None - return (file_id, panel_id, metric_id) - else: - raise Exception(f"Invalid metric id: {metric_id}") - - -def build_comparable_columns(time_unit): - """ - Build comparable columns/headers for display - """ - comparable_columns = supported_field - top_stat_base = ["Count", "Sum", "Mean", "Median", "Standard Deviation"] - - for h in top_stat_base: - comparable_columns.append(h + "(" + time_unit + ")") - - return comparable_columns + return df_copy diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/views/kernel_view.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/views/kernel_view.py new file mode 100644 index 0000000000..61e3957b30 --- /dev/null +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/views/kernel_view.py @@ -0,0 +1,203 @@ +""" +Panel Widget Modules +------------------- +Contains the panel widgets used in the main layout. 
+""" + +from typing import Optional + +from textual import on +from textual.containers import Container, VerticalScroll +from textual.widgets import Label, RadioButton, RadioSet + +from config import rocprof_compute_home +from rocprof_compute_tui.widgets.collapsibles import build_all_sections + + +class KernelView(Container): + """Center panel with analysis results split into two scrollable sections.""" + + DEFAULT_CSS = """ + KernelView { + layout: vertical; + } + + #top-container { + height: 1fr; + border: none; + margin-top: 1; + } + + #bottom-container { + height: 4fr; + border: none; + margin-top: 2; + } + + .kernel-table-header { + background: $primary; + color: $text; + text-style: bold; + padding: 0 1; + offset: 5 0; + margin-top: 1; + } + + .kernel-row { + padding: 0 1; + border-bottom: solid $border; + } + + RadioSet { + border: solid $border; + } + """ + + def __init__(self, config_path: Optional[str] = None): + super().__init__(id="kernel-view") + self.status_label = None + self.dfs = {} + self.top_kernel = [] + + if rocprof_compute_home: + config_path = ( + rocprof_compute_home + / "rocprof_compute_tui" + / "utils" + / "kernel_view_config.yaml" + ) + self.config_path = config_path + + self.keys = None + self.current_selection = None + + def compose(self): + """ + Compose the split panel layout with two scrollable containers. 
+ """ + with VerticalScroll(id="top-container"): + yield Label( + "Open a workload directory to run analysis and view individual kernel analysis results.", + classes="placeholder", + ) + + with VerticalScroll(id="bottom-container"): + # empty on init + pass + + def update_results(self, per_kernel_dfs, top_kernels) -> None: + self.dfs = per_kernel_dfs + self.top_kernel = top_kernels + + top_container = self.query_one("#top-container", VerticalScroll) + top_container.remove_children() + + if self.top_kernel: + try: + header = self.build_header() + top_container.mount(header) + selector = self.build_selector() + top_container.mount(selector) + except Exception as e: + top_container.mount( + Label(f"Error displaying kernel list: {str(e)}", classes="error") + ) + else: + top_container.mount(Label("No kernels available", classes="placeholder")) + + self.current_selection = self.top_kernel[0]["Kernel_Name"] if self.top_kernel else None + self._update_bottom_content() + + def update_view(self, message: str, log_level: str) -> None: + """ + Update the view with a status message. 
+ """ + if self.status_label is None: + self.status_label = Label(f"{message}", classes=log_level) + self.mount(self.status_label) + else: + self.status_label.update(f"{message}") + self.status_label.set_classes(log_level) + + def reload_config(self, config_path: Optional[str] = None) -> None: + if config_path: + self.config_path = config_path + + if self.dfs and self.top_kernel: + self.update_results(self.dfs, self.top_kernel) + + def build_header(self): + all_keys = set() + + for kernel in self.top_kernel: + all_keys.update(kernel.keys()) + + self.keys = sorted(all_keys) + + if "Kernel_Name" in self.keys: + self.keys.remove("Kernel_Name") + self.keys.insert(0, "Kernel_Name") + + header_text = " | ".join(f"{key:25}" for key in self.keys) + header_label = Label(header_text, classes="kernel-table-header") + + return header_label + + def build_selector(self): + radio_buttons = [] + + for i, kernel in enumerate(self.top_kernel): + row_data = [] + for key in self.keys: + value = str(kernel.get(key, "N/A")) + if len(value) > 18: + value = value[:15] + "..." 
+ row_data.append(f"{value:25}") + + row_text = " | ".join(row_data) + radio_button = RadioButton(row_text, id=f"kernel-{i}") + radio_button.kernel_data = kernel + radio_buttons.append(radio_button) + + selector = RadioSet(*radio_buttons) + + return selector + + @on(RadioSet.Changed) + def on_radio_changed(self, event: RadioSet.Changed) -> None: + if event.pressed: + kernel_data = getattr(event.pressed, "kernel_data", None) + if kernel_data and "Kernel_Name" in kernel_data: + selected_kernel = kernel_data["Kernel_Name"] + self.current_selection = selected_kernel + self._update_bottom_content() + + def _update_bottom_content(self): + bottom_container = self.query_one("#bottom-container", VerticalScroll) + bottom_container.remove_children() + + bottom_container.mount( + Label(f"Toggle kernel selection to view detailed analysis.") + ) + + if self.current_selection and self.current_selection in self.dfs: + bottom_container.mount( + Label(f"Current kernel selection: {self.current_selection}") + ) + filtered_dfs = self.dfs[self.current_selection] + + try: + sections = build_all_sections(filtered_dfs, self.config_path) + for section in sections: + bottom_container.mount(section) + except Exception as e: + bottom_container.mount( + Label(f"Error displaying results: {str(e)}", classes="error") + ) + else: + bottom_container.mount( + Label( + f"No data available for kernel: {self.current_selection}", + classes="error", + ) + ) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/views/main_view.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/views/main_view.py index ba8ce495a7..7edfaef686 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/views/main_view.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/views/main_view.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright 
(c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + """ Main View Module @@ -50,10 +52,10 @@ class MainView(Horizontal): """Main view layout for the application.""" selected_path = reactive(None) - dfs = reactive({}) + per_kernel_dfs = reactive({}) + top_kernels = reactive([]) def __init__(self): - """Initialize the main view.""" super().__init__(id="main-container") self.start_path = ( # NOTE: is cwd the best choice? 
@@ -70,7 +72,6 @@ class MainView(Horizontal): pass def compose(self) -> ComposeResult: - """Compose the main view layout.""" self.logger.info("Composing main view layout", update_ui=False) yield MenuBar() @@ -80,7 +81,6 @@ class MainView(Horizontal): # Center Panel - Analysis results display center_panel = CenterPanel() yield center_panel - self.center = center_panel # Bottom Panel - Output, terminal, and metric description @@ -91,7 +91,6 @@ class MainView(Horizontal): self.metric_description = tabs.description_area self.output = tabs.output_area - # Now set the output area for the logger self.logger.set_output_area(self.output) self.logger.info("Main view layout composed") @@ -107,8 +106,9 @@ class MainView(Horizontal): try: row_data = table.get_row_at(row_idx) - content = f"Selected Row {row_idx}:\n" - content += "\n".join(f"{val}" for val in row_data) + content = f"Selected Metric ID: {row_data[0]}\n" + content += f"Selected Metric: {row_data[1]}\n" + # content += f"Metric Description:\n\t{row_data[-1]}" self.metric_description.text = content self.logger.info(f"Row {row_idx} data displayed in metric_description") @@ -122,7 +122,8 @@ class MainView(Horizontal): @work(thread=True) def run_analysis(self) -> None: - self.dfs = {} + self.per_kernel_dfs = {} + self.top_kernels = [] if not self.selected_path: error_msg = "No directory selected for analysis" @@ -173,7 +174,6 @@ class MainView(Horizontal): self.logger.info( f"Step 3: sys_info_df shape = {sys_info_df.shape if hasattr(sys_info_df, 'shape') else 'No shape attribute'}" ) - self.logger.info(f"Step 3: sys_info_df = {sys_info_df}") except Exception as e: self.logger.error(f"Step 3 failed - Error loading sys_info: {str(e)}") @@ -196,7 +196,6 @@ class MainView(Horizontal): raise TypeError(f"Unexpected type for sys_info: {type(sys_info_df)}") self.logger.info(f"Step 4: sys_info converted = {sys_info}") - self.logger.info(f"Step 4: sys_info type = {type(sys_info)}") except Exception as e: self.logger.error(f"Step 4 
failed - Error converting sys_info: {str(e)}") @@ -231,18 +230,19 @@ class MainView(Horizontal): # Step 8: Run analysis try: self.logger.info("Step 8: Running analysis") - self.dfs = analyzer.run_analysis() - if not self.dfs: - warning_msg = "Step 8: Analysis completed but no data was returned" + self.per_kernel_dfs = analyzer.run_kernel_analysis() + self.top_kernels = analyzer.run_top_kernel() + + # TODO: add per kernel Roofline support when available + + if not self.per_kernel_dfs or not self.top_kernels: + warning_msg = "Step 8: Per Kernel Analysis completed but not all data was returned" self._update_view(warning_msg, LogLevel.WARNING) self.logger.warning(warning_msg) else: self.app.call_from_thread(self.refresh_results) - self.logger.info("Step 8: Analysis completed successfully") - if self.dfs.get("4. Roofline"): - self.logger.info("Step 8: Roofline data available") - else: - self.logger.info("Step 8: Roofline data not available") + self.logger.info("Step 8: Kernel Analysis completed successfully") + # self.logger.info(f"{self.per_kernel_dfs}") except Exception as e: self.logger.error(f"Step 8 failed - Error running analysis: {str(e)}") raise @@ -257,17 +257,15 @@ class MainView(Horizontal): def _update_view(self, message: str, log_level: LogLevel) -> None: try: - # Use call_from_thread to safely update UI from background thread self.app.call_from_thread(self._safe_update_view, message, log_level) except Exception as e: - # Capture errors that might occur when scheduling the UI update self.logger.error(f"View update scheduling error: {str(e)}") def _safe_update_view(self, message: str, log_level: LogLevel) -> None: try: - analyze_view = self.query_one("#analyze-view") - if analyze_view: - analyze_view.update_view(message, log_level) + kernel_view = self.query_one("#kernel-view") + if kernel_view: + kernel_view.update_view(message, log_level) else: self.logger.warning("Analysis view not found when updating log") except Exception as e: @@ -275,24 +273,29 @@ 
class MainView(Horizontal): def refresh_results(self) -> None: try: - self.logger.info("Refreshing analysis results") - analyze_view = self.query_one("#analyze-view") - if not analyze_view: - self.logger.error("Analysis view not found") + self.logger.info("Refreshing kernel results") + kernel_view = self.query_one("#kernel-view") + if not kernel_view: + self.logger.error("Kernel view not found") return - if not hasattr(self, "dfs") or self.dfs is None: - self.logger.error("No analysis data available to display") + if ( + not hasattr(self, "per_kernel_dfs") + or self.per_kernel_dfs is None + or not hasattr(self, "top_kernels") + or self.top_kernels is None + ): + self.logger.error("No kernel analysis data available to display") return - analyze_view.update_results(self.dfs) + kernel_view.update_results(self.per_kernel_dfs, self.top_kernels) self.logger.success(f"Results displayed successfully.") except Exception as e: self.logger.error(f"Error refreshing results: {str(e)}") def refresh_view(self) -> None: self.logger.info("Refreshing view...") - if self.dfs: + if self.top_kernels: self.refresh_results() else: self.logger.warning("No data available for refresh") diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/analyze_view.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/analyze_view.py deleted file mode 100644 index c314c73c60..0000000000 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/analyze_view.py +++ /dev/null @@ -1,74 +0,0 @@ -""" -Panel Widget Modules -------------------- -Contains the panel widgets used in the main layout. 
-""" - -from importlib import resources -from typing import Any, Dict, Optional - -from textual.containers import ScrollableContainer -from textual.widgets import Label - -from rocprof_compute_tui.widgets.collapsibles import build_all_sections - - -class AnalyzeView(ScrollableContainer): - """Center panel with analysis results.""" - - def __init__(self, config_path: Optional[str] = None): - super().__init__(id="analyze-view") - self.dfs = {} - - if config_path is None: - config_path = ( - resources.files("rocprof_compute_tui.utils") / "analyze_config.yaml" - ) - - self.config_path = str(config_path) - - def compose(self): - """ - Compose the initial center panel state. - """ - yield Label( - "Open a workload directory to run analysis and view results", - classes="placeholder", - ) - - def update_results(self, dfs: Dict[str, Any]) -> None: - """ - Update the center panel with analysis results. - """ - self.dfs = dfs - self.remove_children() - - try: - sections = build_all_sections(self.dfs, self.config_path) - - # Mount all sections - for section in sections: - self.mount(section) - - except Exception as e: - self.mount(Label(f"Error displaying results: {str(e)}", classes="error")) - - def update_view(self, message: str, log_level: str) -> None: - """ - Update the view with a status message. - """ - self.remove_children() - try: - self.mount(Label(f"{message}", classes=log_level)) - except Exception as e: - self.mount(Label(f"Error displaying results: {str(e)}", classes="error")) - - def reload_config(self, config_path: str = None) -> None: - """ - Reload the configuration and update the view. 
- """ - if config_path: - self.config_path = config_path - - if self.dfs: - self.update_results(self.dfs) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/center_area.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/center_area.py index b7d588f0ed..6683e64e8c 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/center_area.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/center_panel/center_area.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + """ Panel Widget Modules @@ -29,9 +31,9 @@ Contains the panel widgets used in the main layout. """ from textual.containers import Vertical -from textual.widgets import Label, TabPane +from textual.widgets import TabPane -from rocprof_compute_tui.widgets.center_panel.analyze_view import AnalyzeView +from rocprof_compute_tui.views.kernel_view import KernelView from rocprof_compute_tui.widgets.tabbed_content import TabsTabbedContent @@ -48,15 +50,12 @@ class CenterPanel(Vertical): super().__init__() self.default_tab = "center-analyze" - self.analyze_view = AnalyzeView() + self.kernel_view = KernelView() def compose(self): - with TabsTabbedContent(initial="tab-analyze"): - with TabPane("Basic View", id="tab-analyze"): - yield self.analyze_view - # TODO: - # with TabPane("placeholder (🚧)", id="tab-1"): - # yield Label("🚧 Under Construction") + with TabsTabbedContent(initial="tab-kernel"): + with TabPane("Basic View", id="tab-kernel"): + yield self.kernel_view def on_mount(self) -> None: self.add_class("section") diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/charts.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/charts.py index 3cb710f2e2..7c6b7cab89 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/charts.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/charts.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from __future__ import annotations @@ -68,7 +70,10 @@ def simple_bar(df, title=None): w *= 100 plt.simple_bar(list(metric_dict.keys()), list(metric_dict.values()), width=w) # plt.show() - return "\n" + plt.build() + "\n" + plot_content = plt.build() + if not plot_content or plot_content.strip() == "": + return None + return "\n" + plot_content + "\n" def simple_multiple_bar(df, title=None): @@ -100,10 +105,13 @@ def simple_multiple_bar(df, title=None): h *= 300 plt.plot_size(height=h) - plt.multiple_bar(labels, data, color=["blue", "blue+", 68, 63]) + plt.multiple_bar(labels, data) # plt.show() - return "\n" + plt.build() + "\n" + plot_content = plt.build() + if not plot_content or plot_content.strip() == "": + return None + return "\n" + plot_content + "\n" def simple_box(df, orientation="v", title=None): @@ -173,7 +181,10 @@ def simple_box(df, orientation="v", title=None): plt.theme("pro") # plt.show() - return "\n" + plt.build() + "\n" + plot_content = plt.build() + if not plot_content or plot_content.strip() == "": + return None + return "\n" + plot_content + "\n" def px_simple_bar(df, title: str = None, id=None, style: dict = None, orientation="h"): @@ -284,18 +295,8 @@ class RooflinePlot(Static): super().__init__("", classes="roofline", **kwargs) self.df = df - # Disable markup rendering - self._render_markup = False - try: - plot_str = "" - try: - result = self.df["4. Roofline"] - if result: - plot_str = str(result) - except: - plot_str = "No roofline data generated" - + plot_str = str(self.df.get("4. 
Roofline", "No roofline data generated")) self.update(plot_str) except Exception as e: error_message = f"Roofline plot error: {str(e)}\n{traceback.format_exc()}" @@ -319,41 +320,37 @@ class MemoryChart(Static): """ def __init__(self, df: pd.DataFrame, **kwargs): - """Initialize the memory chart.""" super().__init__("", classes="mem-chart", **kwargs) self.df = df - # Generate the chart content on initialization try: - # Prepare data - metric_dict = ( - self.df[["Metric", "Value"]].set_index("Metric").to_dict()["Value"] - ) + if self.df is None or self.df.empty: + self.update("No chart data generated") + return + + if not {"Metric", "Value"}.issubset(self.df.columns): + self.update("Error: Missing required columns") + return + + metric_dict = dict(zip(self.df["Metric"], self.df["Value"])) - # Capture stdout original_stdout = sys.stdout - string_buffer = StringIO() - sys.stdout = string_buffer - try: - # Generate the chart - result = plot_mem_chart("", "per_kernel", metric_dict) - stdout_output = string_buffer.getvalue() - - if stdout_output: - plot_str = stdout_output - elif result: - plot_str = str(result) - else: - plot_str = "No chart data generated" + with StringIO() as string_buffer: + sys.stdout = string_buffer + result = plot_mem_chart("", "per_kernel", metric_dict) + stdout_output = string_buffer.getvalue() finally: sys.stdout = original_stdout + plot_str = next( + (x for x in [stdout_output, str(result) if result else None] if x), + "No chart data generated", + ) self.update(plot_str) except Exception as e: - error_message = f"Memory chart error: {str(e)}\n{traceback.format_exc()}" - self.update(f"Error: {str(error_message)}") + self.update(f"Memory chart error: {str(e)}") class SimpleBar(Static): @@ -372,7 +369,6 @@ class SimpleBar(Static): """ def __init__(self, df: pd.DataFrame, **kwargs): - """Initialize the simple bar.""" super().__init__("", classes="simple-bar", **kwargs) self.df = df @@ -381,13 +377,8 @@ class SimpleBar(Static): if result: plot_str = 
str(result) - # Escape markup characters escaped_content = plot_str.replace("[", r"\[").replace("]", r"\]") self.update(escaped_content) - - # Alternative - wrap in [pre] tags for preformatted text - # self.update(f"[pre]{plot_str}[/pre]") - else: self.update("No simple bar data generated") @@ -398,7 +389,6 @@ class SimpleBar(Static): class SimpleBox(Static): - """Simple Box visualization widget.""" DEFAULT_CSS = """ SimpleBox { @@ -413,7 +403,6 @@ class SimpleBox(Static): """ def __init__(self, df: pd.DataFrame, **kwargs): - """Initialize the simple box.""" super().__init__("", classes="simple-box", **kwargs) self.df = df @@ -422,7 +411,6 @@ class SimpleBox(Static): if result: plot_str = str(result) - # Escape markup characters escaped_content = plot_str.replace("[", r"\[").replace("]", r"\]") self.update(escaped_content) else: @@ -450,7 +438,6 @@ class SimpleMultiBar(Static): """ def __init__(self, df: pd.DataFrame, **kwargs): - """Initialize the simple multiple bar.""" super().__init__("", classes="simple-multi-bar", **kwargs) self.df = df @@ -459,7 +446,6 @@ class SimpleMultiBar(Static): if result: plot_str = str(result) - # Escape markup characters escaped_content = plot_str.replace("[", r"\[").replace("]", r"\]") self.update(escaped_content) else: diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/collapsibles.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/collapsibles.py index 1970354391..17ad8e79c9 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/collapsibles.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/collapsibles.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,23 +10,24 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from typing import Any, Dict, List, Optional import pandas as pd import yaml -from textual.containers import VerticalScroll from textual.widgets import Collapsible, DataTable, Label from rocprof_compute_tui.widgets.charts import ( @@ -41,12 +42,9 @@ from rocprof_compute_tui.widgets.charts import ( def create_table(df: pd.DataFrame) -> DataTable: table = DataTable(zebra_stripes=True) - # Clean the DataFrame - remove NaN and empty cells df = df.reset_index() - df = df.dropna(how="any") df = df[~df.apply(lambda row: row.astype(str).str.strip().eq("").any(), axis=1)] - # Add columns and rows str_columns = [str(col) for col in df.columns] table.add_columns(*str_columns) table.add_rows([tuple(str(x) for x in row) for row in df.itertuples(index=False)]) @@ -59,7 +57,9 @@ def load_config(config_path) -> Dict[str, Any]: with open(config_path, "r") as file: return yaml.safe_load(file) except FileNotFoundError: - raise FileNotFoundError(f"Configuration file {config_path} not found") + raise FileNotFoundError( + f"Configuration file {config_path} not found, \nplease populate the analysis_config.yaml file." 
+ ) except yaml.YAMLError as e: raise ValueError(f"Error parsing YAML configuration: {e}") @@ -167,7 +167,7 @@ def build_subsection( return collapsible -def build_dynamic_kernel_sections( +def build_kernel_sections( dfs: Dict[str, Any], skip_sections: List[str] ) -> List[Collapsible]: children = [] @@ -198,9 +198,10 @@ def build_dynamic_kernel_sections( return None try: - df = data["df"] + if data["df"] is None or data["df"].empty: + return None tui_style = data.get("tui_style") - widget = create_widget_from_data(df, tui_style) + widget = create_widget_from_data(data["df"], tui_style) if widget is None: add_warning(f"Widget creation returned None for '{subsection_name}'") @@ -277,7 +278,7 @@ def build_section_from_config( # Handle dynamic sections (like kernel sections) elif section_config.get("dynamic_sections", False): skip_sections = section_config.get("skip_sections", []) - children = build_dynamic_kernel_sections(dfs, skip_sections) + children = build_kernel_sections(dfs, skip_sections) # Handle regular sections with subsections elif "subsections" in section_config: @@ -290,7 +291,6 @@ def build_section_from_config( except Exception as e: error_msg = f"{subsection_config.get('title', 'Unknown')} error: {str(e)}" children.append(Label(error_msg, classes="warning")) - else: children = [Label("No configuration provided for this section")] diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/directory_tree.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/directory_tree.py deleted file mode 100644 index e7dc44bc6a..0000000000 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/directory_tree.py +++ /dev/null @@ -1,39 +0,0 @@ -##############################################################################bl -# MIT License -# -# Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
-# -# Permission is hereby granted, free of charge, to any person obtaining a copy -# of this software and associated documentation files (the "Software"), to deal -# in the Software without restriction, including without limitation the rights -# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -# copies of the Software, and to permit persons to whom the Software is -# furnished to do so, subject to the following conditions: -# -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el - -""" -Specialized Widget Modules -------------------------- -Contains custom widget implementations for the application. 
-""" - -from textual.widgets import DirectoryTree - - -class FolderOnlyDirectory(DirectoryTree): - """Directory tree that only shows folders.""" - - def filter_paths(self, paths): - """Filter to only show directories.""" - return [path for path in paths if path.is_dir()] diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/recent_directories.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/recent_directories.py index 5ff43b9879..1239f539b3 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/recent_directories.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/recent_directories.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + from typing import List diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/right_panel/right.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/right_panel/right.py index ba4cbd681f..8f6ef470e6 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/right_panel/right.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/right_panel/right.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + """ Panel Widget Modules diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/splitter.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/splitter.py index cd70207d68..aaec3acf4f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/splitter.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/splitter.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + """ Specialized Widget Modules diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabbed_content.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabbed_content.py index f07b734d91..3c99aec98f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabbed_content.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabbed_content.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + from textual.binding import Binding from textual.widgets import TabbedContent, Tabs diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_area.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_area.py index cc113e642b..aa31f15612 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_area.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_area.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + """ Panel Widget Modules diff --git a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_terminal.py b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_terminal.py index 8e3c1dd4a8..a10b420f5c 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_terminal.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_tui/widgets/tabs/tabs_terminal.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import os import platform diff --git a/projects/rocprofiler-compute/src/roofline.py b/projects/rocprofiler-compute/src/roofline.py index dc337d5fc3..0900913e32 100644 --- a/projects/rocprofiler-compute/src/roofline.py +++ b/projects/rocprofiler-compute/src/roofline.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import os import textwrap diff --git a/projects/rocprofiler-compute/src/utils/db_connector.py b/projects/rocprofiler-compute/src/utils/db_connector.py index b85d060942..8129d91cbb 100644 --- a/projects/rocprofiler-compute/src/utils/db_connector.py +++ b/projects/rocprofiler-compute/src/utils/db_connector.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import getpass import os diff --git a/projects/rocprofiler-compute/src/utils/file_io.py b/projects/rocprofiler-compute/src/utils/file_io.py index 3bfdd83ba7..63015379a0 100644 --- a/projects/rocprofiler-compute/src/utils/file_io.py +++ b/projects/rocprofiler-compute/src/utils/file_io.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import os import re diff --git a/projects/rocprofiler-compute/src/utils/gui.py b/projects/rocprofiler-compute/src/utils/gui.py index 556eb0a466..c6395693b2 100644 --- a/projects/rocprofiler-compute/src/utils/gui.py +++ b/projects/rocprofiler-compute/src/utils/gui.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import colorlover import pandas as pd diff --git a/projects/rocprofiler-compute/src/utils/gui_components/header.py b/projects/rocprofiler-compute/src/utils/gui_components/header.py index 917ab651b8..039db9ff06 100644 --- a/projects/rocprofiler-compute/src/utils/gui_components/header.py +++ b/projects/rocprofiler-compute/src/utils/gui_components/header.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import dash_bootstrap_components as dbc from dash import dcc, html diff --git a/projects/rocprofiler-compute/src/utils/gui_components/memchart.py b/projects/rocprofiler-compute/src/utils/gui_components/memchart.py index 5ab2027972..a6fcee86bd 100644 --- a/projects/rocprofiler-compute/src/utils/gui_components/memchart.py +++ b/projects/rocprofiler-compute/src/utils/gui_components/memchart.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + from dash import html from dash_svg import G, Path, Rect, Svg, Text diff --git a/projects/rocprofiler-compute/src/utils/kernel_name_shortener.py b/projects/rocprofiler-compute/src/utils/kernel_name_shortener.py index 81b96b6c4a..a993375d46 100644 --- a/projects/rocprofiler-compute/src/utils/kernel_name_shortener.py +++ b/projects/rocprofiler-compute/src/utils/kernel_name_shortener.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import re import subprocess diff --git a/projects/rocprofiler-compute/src/utils/logger.py b/projects/rocprofiler-compute/src/utils/logger.py index e8c450095e..ab4abe49ba 100644 --- a/projects/rocprofiler-compute/src/utils/logger.py +++ b/projects/rocprofiler-compute/src/utils/logger.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import logging import os diff --git a/projects/rocprofiler-compute/src/utils/mem_chart.py b/projects/rocprofiler-compute/src/utils/mem_chart.py index 6d3211d0b5..8679210759 100644 --- a/projects/rocprofiler-compute/src/utils/mem_chart.py +++ b/projects/rocprofiler-compute/src/utils/mem_chart.py @@ -106,8 +106,8 @@ def format_text( ) key_str = ( "{key:{key_format}}".format(key=key, key_format=key_format) - if key is not None - else None + if key and isinstance(key, (int, float)) + else str(key) if key else None ) unit_string = post_description_with_space if not "N/A" in value_str else "" @@ -1013,8 +1013,8 @@ class MemChart: block_instr_buff.y_max = self.y_max - 5.0 block_instr_buff.y_min = block_instr_buff.y_max - 24.0 - block_instr_buff.wave_occupancy = metric_dict["Wavefront Occupancy"] - block_instr_buff.wave_life = metric_dict["Wave Life"] + block_instr_buff.wave_occupancy = metric_dict.get("Wavefront Occupancy", "n/a") + block_instr_buff.wave_life = metric_dict.get("Wave Life", "n/a") block_instr_buff.draw(canvas) @@ -1037,14 +1037,14 @@ class MemChart: block_instr_disp.y_max = block_instr_buff.y_max block_instr_disp.y_min = block_instr_buff.y_min - block_instr_disp.instrs["SALU"] = metric_dict["SALU"] - block_instr_disp.instrs["SMEM"] = metric_dict["SMEM"] - block_instr_disp.instrs["VALU"] = metric_dict["VALU"] - block_instr_disp.instrs["MFMA"] = metric_dict["MFMA"] - block_instr_disp.instrs["VMEM"] = metric_dict["VMEM"] - block_instr_disp.instrs["LDS"] = metric_dict["LDS"] - block_instr_disp.instrs["GWS"] = metric_dict["GWS"] - block_instr_disp.instrs["BRANCH"] = metric_dict["BR"] + block_instr_disp.instrs["SALU"] = metric_dict.get("SALU", "n/a") + block_instr_disp.instrs["SMEM"] = metric_dict.get("SMEM", "n/a") + block_instr_disp.instrs["VALU"] = metric_dict.get("VALU", "n/a") + block_instr_disp.instrs["MFMA"] = metric_dict.get("MFMA", "n/a") + 
block_instr_disp.instrs["VMEM"] = metric_dict.get("VMEM", "n/a") + block_instr_disp.instrs["LDS"] = metric_dict.get("LDS", "n/a") + block_instr_disp.instrs["GWS"] = metric_dict.get("GWS", "n/a") + block_instr_disp.instrs["BRANCH"] = metric_dict.get("BR", "n/a") block_instr_disp.draw(canvas) @@ -1056,14 +1056,14 @@ class MemChart: block_exec.y_min = block_instr_disp.y_min - 6 block_exec.y_max = block_instr_disp.y_max - block_exec.active_cus = metric_dict["Active CUs"] - block_exec.num_cus = metric_dict["Num CUs"] - block_exec.vgprs = metric_dict["VGPR"] - block_exec.sgprs = metric_dict["SGPR"] - block_exec.lds_alloc = metric_dict["LDS Allocation"] - block_exec.scratch_alloc = metric_dict["Scratch Allocation"] - block_exec.wavefronts = metric_dict["Wavefronts"] - block_exec.workgroups = metric_dict["Workgroups"] + block_exec.active_cus = metric_dict.get("Active CUs", "n/a") + block_exec.num_cus = metric_dict.get("Num CUs", "n/a") + block_exec.vgprs = metric_dict.get("VGPR", "n/a") + block_exec.sgprs = metric_dict.get("SGPR", "n/a") + block_exec.lds_alloc = metric_dict.get("LDS Allocation", "n/a") + block_exec.scratch_alloc = metric_dict.get("Scratch Allocation", "n/a") + block_exec.wavefronts = metric_dict.get("Wavefronts", "n/a") + block_exec.workgroups = metric_dict.get("Workgroups", "n/a") block_exec.draw(canvas) @@ -1075,11 +1075,11 @@ class MemChart: wires_E_GLV.y_min = block_instr_disp.y_min wires_E_GLV.y_max = block_instr_disp.y_max - wires_E_GLV.lds_req = metric_dict["LDS Req"] - wires_E_GLV.vl1_rd = metric_dict["VL1 Rd"] - wires_E_GLV.vl1_wr = metric_dict["VL1 Wr"] - wires_E_GLV.vl1_atomic = metric_dict["VL1 Atomic"] - wires_E_GLV.sl1_rd = metric_dict["sL1D Rd"] + wires_E_GLV.lds_req = metric_dict.get("LDS Req", "n/a") + wires_E_GLV.vl1_rd = metric_dict.get("VL1 Rd", "n/a") + wires_E_GLV.vl1_wr = metric_dict.get("VL1 Wr", "n/a") + wires_E_GLV.vl1_atomic = metric_dict.get("VL1 Atomic", "n/a") + wires_E_GLV.sl1_rd = metric_dict.get("VL1D Rd", "n/a") 
wires_E_GLV.draw(canvas) @@ -1093,7 +1093,7 @@ class MemChart: y_max=block_instr_buff.y_min, ) - wire_InstrBuff_IL1Cache.il1_fetch = metric_dict["IL1 Fetch"] + wire_InstrBuff_IL1Cache.il1_fetch = metric_dict.get("IL1 Fetch", "n/a") wire_InstrBuff_IL1Cache.draw(canvas) @@ -1118,8 +1118,8 @@ class MemChart: block_lds.y_max = wires_E_GLV.y_max block_lds.y_min = block_lds.y_max - 5 - block_lds.util = metric_dict["LDS Util"] - block_lds.latency = metric_dict["LDS Latency"] + block_lds.util = metric_dict.get("LDS Util", "n/a") + block_lds.latency = metric_dict.get("LDS Latency", "n/a") block_lds.draw(canvas) @@ -1131,10 +1131,10 @@ class MemChart: block_vector_L1.y_max = block_lds.y_min - 3 block_vector_L1.y_min = block_vector_L1.y_max - 9 - block_vector_L1.hit = metric_dict["VL1 Hit"] - block_vector_L1.latency = metric_dict["VL1 Lat"] - block_vector_L1.coales = metric_dict["VL1 Coalesce"] - block_vector_L1.stall = metric_dict["VL1 Stall"] + block_vector_L1.hit = metric_dict.get("VL1 Hit", "n/a") + block_vector_L1.latency = metric_dict.get("VL1 Lat", "n/a") + block_vector_L1.coales = metric_dict.get("VL1 Coalesce", "n/a") + block_vector_L1.stall = metric_dict.get("VL1 Stall", "n/a") block_vector_L1.draw(canvas) @@ -1146,8 +1146,8 @@ class MemChart: block_const_L1.y_max = block_vector_L1.y_min - 3 block_const_L1.y_min = block_const_L1.y_max - 5 - block_const_L1.hit = metric_dict["sL1D Hit"] - block_const_L1.latency = metric_dict["sL1D Lat"] + block_const_L1.hit = metric_dict.get("sL1D Hit", "n/a") + block_const_L1.latency = metric_dict.get("sL1D Lat", "n/a") block_const_L1.draw(canvas) @@ -1159,8 +1159,8 @@ class MemChart: block_instr_L1.y_max = block_const_L1.y_min - 3 block_instr_L1.y_min = block_instr_L1.y_max - 5 - block_instr_L1.hit = metric_dict["IL1 Hit"] - block_instr_L1.latency = metric_dict["IL1 Lat"] + block_instr_L1.hit = metric_dict.get("IL1 Hit", "n/a") + block_instr_L1.latency = metric_dict.get("IL1 Lat", "n/a") block_instr_L1.draw(canvas) @@ -1171,13 
+1171,13 @@ class MemChart: wires_L1_L2.x_max = wires_L1_L2.x_min + 14 wires_L1_L2.y_min = block_instr_L1.y_min wires_L1_L2.y_max = block_vector_L1.y_max - wires_L1_L2.vl1_l2_rd = metric_dict["VL1_L2 Rd"] - wires_L1_L2.vl1_l2_wr = metric_dict["VL1_L2 Wr"] - wires_L1_L2.vl1_l2_atomic = metric_dict["VL1_L2 Atomic"] - wires_L1_L2.sl1_l2_rd = metric_dict["sL1D_L2 Rd"] - wires_L1_L2.sl1_l2_wr = metric_dict["sL1D_L2 Wr"] - wires_L1_L2.sl1_l2_atomic = metric_dict["sL1D_L2 Atomic"] - wires_L1_L2.il1_l2_req = metric_dict["IL1_L2 Rd"] + wires_L1_L2.vl1_l2_rd = metric_dict.get("VL1_L2 Rd", "n/a") + wires_L1_L2.vl1_l2_wr = metric_dict.get("VL1_L2 Wr", "n/a") + wires_L1_L2.vl1_l2_atomic = metric_dict.get("VL1_L2 Atomic", "n/a") + wires_L1_L2.sl1_l2_rd = metric_dict.get("VL1D_L2 Rd", "n/a") + wires_L1_L2.sl1_l2_wr = metric_dict.get("VL1D_L2 Wr", "n/a") + wires_L1_L2.sl1_l2_atomic = metric_dict.get("VL1D_L2 Atomic", "n/a") + wires_L1_L2.il1_l2_req = metric_dict.get("IL1_L2 Rd", "n/a") wires_L1_L2.draw(canvas) @@ -1190,12 +1190,12 @@ class MemChart: block_L2.y_min = block_instr_L1.y_min block_L2.y_max = block_lds.y_max - block_L2.hit = metric_dict["L2 Hit"] - block_L2.rd = metric_dict["L2 Rd"] - block_L2.wr = metric_dict["L2 Wr"] - block_L2.atomic = metric_dict["L2 Atomic"] - block_L2.rd_lat = metric_dict["L2 Rd Lat"] - block_L2.wr_lat = metric_dict["L2 Wr Lat"] + block_L2.hit = metric_dict.get("L2 Hit", "n/a") + block_L2.rd = metric_dict.get("L2 Rd", "n/a") + block_L2.wr = metric_dict.get("L2 Wr", "n/a") + block_L2.atomic = metric_dict.get("L2 Atomic", "n/a") + block_L2.rd_lat = metric_dict.get("L2 Rd Lat", "n/a") + block_L2.wr_lat = metric_dict.get("L2 Wr Lat", "n/a") block_L2.draw(canvas) @@ -1209,9 +1209,9 @@ class MemChart: y_max=block_L2.y_max - 10, ) - wires_L2_Fabric.rd = metric_dict["Fabric_L2 Rd"] - wires_L2_Fabric.wr = metric_dict["Fabric_L2 Wr"] - wires_L2_Fabric.atomic = metric_dict["Fabric_L2 Atomic"] + wires_L2_Fabric.rd = metric_dict.get("Fabric_L2 Rd", "n/a") + 
wires_L2_Fabric.wr = metric_dict.get("Fabric_L2 Wr", "n/a") + wires_L2_Fabric.atomic = metric_dict.get("Fabric_L2 Atomic", "n/a") wires_L2_Fabric.draw(canvas) @@ -1236,9 +1236,9 @@ class MemChart: y_min=block_xgmi_pcie.y_min - 5 - 11, ) - block_fabric.lat["Rd"] = metric_dict["Fabric Rd Lat"] - block_fabric.lat["Wr"] = metric_dict["Fabric Wr Lat"] - block_fabric.lat["Atomic"] = metric_dict["Fabric Atomic Lat"] + block_fabric.lat["Rd"] = metric_dict.get("Fabric Rd Lat", "n/a") + block_fabric.lat["Wr"] = metric_dict.get("Fabric Wr Lat", "n/a") + block_fabric.lat["Atomic"] = metric_dict.get("Fabric Atomic Lat", "n/a") block_fabric.draw(canvas) @@ -1264,8 +1264,8 @@ class MemChart: y_max=block_fabric.y_max - 4, ) - wires_Fabric_HBM.rd = metric_dict["HBM Rd"] - wires_Fabric_HBM.wr = metric_dict["HBM Wr"] + wires_Fabric_HBM.rd = metric_dict.get("HBM Rd", "n/a") + wires_Fabric_HBM.wr = metric_dict.get("HBM Wr", "n/a") wires_Fabric_HBM.draw(canvas) diff --git a/projects/rocprofiler-compute/src/utils/mi_gpu_spec.py b/projects/rocprofiler-compute/src/utils/mi_gpu_spec.py index e2b52b1f6c..fd08c48141 100644 --- a/projects/rocprofiler-compute/src/utils/mi_gpu_spec.py +++ b/projects/rocprofiler-compute/src/utils/mi_gpu_spec.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. 
# # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import os from dataclasses import dataclass diff --git a/projects/rocprofiler-compute/src/utils/parser.py b/projects/rocprofiler-compute/src/utils/parser.py index b8bfaa8519..70b86b8543 100644 --- a/projects/rocprofiler-compute/src/utils/parser.py +++ b/projects/rocprofiler-compute/src/utils/parser.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. 
# # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import ast import json @@ -137,14 +139,28 @@ def to_max(*args): def to_avg(a): if str(type(a)) == "<class 'NoneType'>": return np.nan - elif np.isnan(a).all(): - return np.nan - elif a.empty: - return np.nan elif isinstance(a, pd.core.series.Series): - return a.mean() + if a.empty: + return np.nan + elif np.isnan(a).all(): + return np.nan + else: + return a.mean() + elif isinstance(a, (np.ndarray, list)): + arr = np.array(a) + if arr.size == 0: + return np.nan + elif np.isnan(arr).all(): + return np.nan + else: + return np.nanmean(arr) + elif isinstance(a, (int, float, np.number)): + if np.isnan(a): + return np.nan + else: + return float(a) else: - raise Exception("to_avg: unsupported type.") + raise Exception(f"to_avg: unsupported type: {type(a)}") def to_median(a): @@ -313,6 +329,7 @@ def build_eval_string(equation, coll_level, config): s = re.sub(r"\'\]\[(\d+)\]", r"[\g<1>]']", s) # use .get() to catch any potential KeyErrors s = re.sub(r"raw_pmc_df\['(.*?)']", r'raw_pmc_df.get("\1")', s) + # print("--- intermediate string: ", s) # apply coll_level if config.get("format_rocprof_output") == "rocpd": # Replace SQ_ACCUM_PREV_HIRES with coll_level_ACCUM then ignore coll_level df
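Reviewer note on the `to_avg` hunk above: the sketch below is a standalone approximation of the new type dispatch, intended only to make the expected behaviour per input type easier to check. It is not the project's code. The first branch is written as `a is None`, which is an assumption about the intent of the string comparison on `type(a)`, and the nested `if`/`elif` chains are collapsed for brevity.

```python
import numpy as np
import pandas as pd


def to_avg(a):
    """Approximate the updated to_avg: average a Series, array/list, or scalar."""
    # Assumption: the original guard on type(a) is meant to catch None.
    if a is None:
        return np.nan
    if isinstance(a, pd.Series):
        # Empty or all-NaN Series have no meaningful average.
        if a.empty or np.isnan(a).all():
            return np.nan
        return a.mean()  # pandas skips NaN entries by default
    if isinstance(a, (np.ndarray, list)):
        arr = np.array(a)
        if arr.size == 0 or np.isnan(arr).all():
            return np.nan
        return np.nanmean(arr)  # ignore NaN entries, like Series.mean()
    if isinstance(a, (int, float, np.number)):
        # A scalar is its own average; NaN stays NaN.
        return np.nan if np.isnan(a) else float(a)
    raise Exception(f"to_avg: unsupported type: {type(a)}")
```

Compared with the removed branches, the type check now comes first, so `a.empty` is only evaluated on a Series, and lists, arrays, and scalars no longer fall through to the "unsupported type" exception.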
@@ -1448,7 +1465,7 @@ def load_kernel_top(workload, dir, args): def load_table_data(workload, dir, is_gui, args, config, skipKernelTop=False): """ - Load data for all "raw_csv_table" - - Load dat for "pc_sampling_table" + - Load data for "pc_sampling_table" - Calculate mertric value for all "metric_table" """ if not skipKernelTop: diff --git a/projects/rocprofiler-compute/src/utils/roofline_calc.py b/projects/rocprofiler-compute/src/utils/roofline_calc.py index 4fb715f1e5..603003bf1e 100644 --- a/projects/rocprofiler-compute/src/utils/roofline_calc.py +++ b/projects/rocprofiler-compute/src/utils/roofline_calc.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import csv from dataclasses import dataclass @@ -259,25 +261,11 @@ def calc_ai(mspec, sort_type, ret_df): df = df.sort_values(by=["Kernel_Name"]) df = df.reset_index(drop=True) - total_flops = ( - valu_flops - ) = ( - mfma_flops_f6f4 - ) = ( - mfma_flops_f8 - ) = ( - mfma_flops_bf16 - ) = ( + total_flops = valu_flops = mfma_flops_f6f4 = mfma_flops_f8 = mfma_flops_bf16 = ( mfma_flops_f16 - ) = ( - mfma_iops_i8 - ) = ( - mfma_flops_f32 - ) = ( - mfma_flops_f64 - ) = ( - lds_data - ) = L1cache_data = L2cache_data = hbm_data = calls = totalDuration = avgDuration = 0.0 + ) = mfma_iops_i8 = mfma_flops_f32 = mfma_flops_f64 = lds_data = L1cache_data = ( + L2cache_data + ) = hbm_data = calls = totalDuration = avgDuration = 0.0 kernelName = "" @@ -498,27 +486,13 @@ def calc_ai(mspec, sort_type, ret_df): kernelName, idx, calls ) ) - total_flops = ( - valu_flops - ) = ( - mfma_flops_f6f4 - ) = ( - mfma_flops_f8 - ) = ( + total_flops = valu_flops = mfma_flops_f6f4 = mfma_flops_f8 = ( mfma_flops_bf16 - ) = ( - mfma_flops_f16 - ) = ( - mfma_iops_i8 - ) = ( - mfma_flops_f32 - ) = ( - mfma_flops_f64 - ) = ( + ) = mfma_flops_f16 = mfma_iops_i8 = mfma_flops_f32 = mfma_flops_f64 = ( lds_data - ) = ( - L1cache_data - ) = L2cache_data = hbm_data = calls = totalDuration = avgDuration = 0.0 + ) = L1cache_data = L2cache_data = hbm_data = calls = totalDuration = ( + avgDuration + ) = 0.0 if sort_type == "dispatches": myList.append( @@ -542,27 +516,13 @@ def calc_ai(mspec, sort_type, ret_df): avgDuration, ) ) - total_flops = ( - valu_flops - ) = ( - mfma_flops_f6f4 - ) = ( - mfma_flops_f8 - ) = ( + total_flops = valu_flops = mfma_flops_f6f4 = mfma_flops_f8 = ( mfma_flops_bf16 - ) = ( - mfma_flops_f16 - ) = ( - mfma_iops_i8 - ) = ( - 
mfma_flops_f32 - ) = ( - mfma_flops_f64 - ) = ( + ) = mfma_flops_f16 = mfma_iops_i8 = mfma_flops_f32 = mfma_flops_f64 = ( lds_data - ) = ( - L1cache_data - ) = L2cache_data = hbm_data = calls = totalDuration = avgDuration = 0.0 + ) = L1cache_data = L2cache_data = hbm_data = calls = totalDuration = ( + avgDuration + ) = 0.0 myList.sort(key=lambda x: x.totalDuration, reverse=True) diff --git a/projects/rocprofiler-compute/src/utils/schema.py b/projects/rocprofiler-compute/src/utils/schema.py index e0fd9272a8..65c8e9b791 100644 --- a/projects/rocprofiler-compute/src/utils/schema.py +++ b/projects/rocprofiler-compute/src/utils/schema.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + # # Define all common data storage classes, diff --git a/projects/rocprofiler-compute/src/utils/specs.py b/projects/rocprofiler-compute/src/utils/specs.py index cb2c8081e8..f6c54aa99d 100644 --- a/projects/rocprofiler-compute/src/utils/specs.py +++ b/projects/rocprofiler-compute/src/utils/specs.py @@ -1,6 +1,4 @@ -"""Get host/gpu specs.""" - -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -12,17 +10,21 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + +"""Get host/gpu specs.""" + import importlib import os diff --git a/projects/rocprofiler-compute/src/utils/tty.py b/projects/rocprofiler-compute/src/utils/tty.py index 90ea11d1bd..70feb60142 100644 --- a/projects/rocprofiler-compute/src/utils/tty.py +++ b/projects/rocprofiler-compute/src/utils/tty.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import copy import textwrap diff --git a/projects/rocprofiler-compute/src/utils/utils.py b/projects/rocprofiler-compute/src/utils/utils.py index 452300fe67..2fa180354c 100644 --- a/projects/rocprofiler-compute/src/utils/utils.py +++ b/projects/rocprofiler-compute/src/utils/utils.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import glob import io @@ -1018,8 +1020,9 @@ def pc_sampling_prof( "-o", "ps_file", # todo: sync up with the name from source in 2100_.yaml "--", - appcmd, ] + options.extend(appcmd) + success, output = capture_subprocess_output( [rocprof_cmd] + options, new_env=os.environ.copy(), profileMode=True ) diff --git a/projects/rocprofiler-compute/tests/conftest.py b/projects/rocprofiler-compute/tests/conftest.py index 0de7905c60..ff0211d499 100644 --- a/projects/rocprofiler-compute/tests/conftest.py +++ b/projects/rocprofiler-compute/tests/conftest.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import subprocess from importlib.machinery import SourceFileLoader diff --git a/projects/rocprofiler-compute/tests/generate_test_analyze_workloads.py b/projects/rocprofiler-compute/tests/generate_test_analyze_workloads.py index 12b5722e51..ff67f4a469 100644 --- a/projects/rocprofiler-compute/tests/generate_test_analyze_workloads.py +++ b/projects/rocprofiler-compute/tests/generate_test_analyze_workloads.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import argparse import glob diff --git a/projects/rocprofiler-compute/tests/test_TCP_counters.py b/projects/rocprofiler-compute/tests/test_TCP_counters.py index a591539788..bd7629c121 100644 --- a/projects/rocprofiler-compute/tests/test_TCP_counters.py +++ b/projects/rocprofiler-compute/tests/test_TCP_counters.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import csv import inspect diff --git a/projects/rocprofiler-compute/tests/test_analyze_commands.py b/projects/rocprofiler-compute/tests/test_analyze_commands.py index 13106f93ca..d8a2f4bb30 100644 --- a/projects/rocprofiler-compute/tests/test_analyze_commands.py +++ b/projects/rocprofiler-compute/tests/test_analyze_commands.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,21 +10,23 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import os import shutil -from unittest.mock import Mock, patch +from unittest.mock import Mock import pandas as pd import pytest @@ -42,7 +44,7 @@ indirs = [ "tests/workloads/vcopy/MI350", ] -time_units = {"s": 10 ** 9, "ms": 10 ** 6, "us": 10 ** 3, "ns": 1} +time_units = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1} @pytest.mark.misc @@ -1308,9 +1310,9 @@ def test_mathematical_correctness_all_units(sample_time_data, original_ns_values from utils.tty import convert_time_columns test_cases = [ - ("s", 10 ** 9), # 1 second = 10^9 nanoseconds - ("ms", 10 ** 6), # 1 millisecond = 10^6 nanoseconds - ("us", 10 ** 3), # 1 microsecond = 10^3 nanoseconds + ("s", 10**9), # 1 second = 10^9 nanoseconds + ("ms", 10**6), # 1 millisecond = 10^6 nanoseconds + ("us", 10**3), # 1 microsecond = 10^3 nanoseconds ("ns", 1), # 1 nanosecond = 1 nanosecond ] diff --git a/projects/rocprofiler-compute/tests/test_analyze_workloads.py b/projects/rocprofiler-compute/tests/test_analyze_workloads.py index c8c9d0c77b..fc38d51a43 100644 --- a/projects/rocprofiler-compute/tests/test_analyze_workloads.py +++ b/projects/rocprofiler-compute/tests/test_analyze_workloads.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. 
# # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + from unittest.mock import patch diff --git a/projects/rocprofiler-compute/tests/test_db_connector.py b/projects/rocprofiler-compute/tests/test_db_connector.py index 7a6aa2171f..b8cc03d894 100644 --- a/projects/rocprofiler-compute/tests/test_db_connector.py +++ b/projects/rocprofiler-compute/tests/test_db_connector.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. @@ -10,25 +10,22 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. 
# # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## import logging -import shutil -import sys -import tempfile -from pathlib import Path -from unittest.mock import MagicMock, Mock, call, patch +from unittest.mock import MagicMock, Mock, patch import pandas as pd import pytest @@ -209,9 +206,9 @@ class TestDatabaseConnector: with patch.object(connector, "prep_import") as mock_prep: mock_prep.return_value = None - connector.connection_info[ - "db" - ] = "rocprofiler-compute_test_team_test_workload_MI100" + connector.connection_info["db"] = ( + "rocprofiler-compute_test_team_test_workload_MI100" + ) connector.db_import() diff --git a/projects/rocprofiler-compute/tests/test_gpu_specs.py b/projects/rocprofiler-compute/tests/test_gpu_specs.py index a2df04b041..5375ee3d40 100644 --- a/projects/rocprofiler-compute/tests/test_gpu_specs.py +++ b/projects/rocprofiler-compute/tests/test_gpu_specs.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import os import re diff --git a/projects/rocprofiler-compute/tests/test_import_workloads.py b/projects/rocprofiler-compute/tests/test_import_workloads.py index c92829517f..531d1f80a8 100644 --- a/projects/rocprofiler-compute/tests/test_import_workloads.py +++ b/projects/rocprofiler-compute/tests/test_import_workloads.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + from importlib.machinery import SourceFileLoader from unittest.mock import patch diff --git a/projects/rocprofiler-compute/tests/test_profile_general.py b/projects/rocprofiler-compute/tests/test_profile_general.py index 741d52d8b7..d3094caf1a 100644 --- a/projects/rocprofiler-compute/tests/test_profile_general.py +++ b/projects/rocprofiler-compute/tests/test_profile_general.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. + +############################################################################## + import inspect import os diff --git a/projects/rocprofiler-compute/tests/test_utils.py b/projects/rocprofiler-compute/tests/test_utils.py index 55afb14cfc..481509b2ad 100644 --- a/projects/rocprofiler-compute/tests/test_utils.py +++ b/projects/rocprofiler-compute/tests/test_utils.py @@ -1,4 +1,4 @@ -##############################################################################bl +############################################################################## # MIT License # # Copyright (c) 2021 - 2025 Advanced Micro Devices, Inc. All Rights Reserved. 
@@ -10,17 +10,19 @@ # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # -# The above copyright notice and this permission notice shall be included in all -# copies or substantial portions of the Software. +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -##############################################################################el +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE. 
+ +############################################################################## + import logging @@ -9270,7 +9272,7 @@ def test_pc_sampling_prof_empty_appcmd( assert mock_capture_subprocess.called options_list = mock_capture_subprocess.call_args[0][0] - assert options_list[-1] == "" + assert options_list[-1] == "--" mock_console_error.assert_not_called() mock_capture_subprocess.reset_mock() diff --git a/projects/rocprofiler-compute/utils/autogen_hash.yaml b/projects/rocprofiler-compute/utils/autogen_hash.yaml index ec28448cca..b3b20b7a8e 100644 --- a/projects/rocprofiler-compute/utils/autogen_hash.yaml +++ b/projects/rocprofiler-compute/utils/autogen_hash.yaml @@ -77,24 +77,24 @@ src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 -src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 231f9b7c09266c4aac50ac4db1b055c36eb6e563ba713c5f3aa30508d03b9170 -src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: eb1ec287cc1f9f133b80fdde072a2b86e819f96ccdf4c305e721f3466d37b156 -src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 52ae21cec4ce4990e966d7fb438ac02b7e63ad4bc428f9770cd2c08d80f712da -src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 52ae21cec4ce4990e966d7fb438ac02b7e63ad4bc428f9770cd2c08d80f712da 
-src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 52ae21cec4ce4990e966d7fb438ac02b7e63ad4bc428f9770cd2c08d80f712da -src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: f7b032202e1aea6befda0d62e3d9f04b846f473218bd62e90d59a34678b62a77 +src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 633d59aba82b3a495b7ba33fa4b2ae4da638b58632bcc37ff18be87af68ce4d4 +src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 2bdb9d7b3bea1057b3baee29ba3b428b211808261063a97bc4b6b319f4a19fb3 +src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19 +src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19 +src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19 +src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 9e56cef5b066fb575a5c530bcf9400f1291dd8636b12c8a2244cdba1defafc9f src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762 src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762 src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28 src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28 
src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28 src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: cd21327c193d2af8c18066b9c13f67e3d5dfb44731777bc5a1b6a7738c902dd1 -src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: 6aeda249093c666000b104f8631b4a85698e083dd55e77e1e1f095f222054742 -src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: a4ec667e0b827c046de207416d185dd528f030f29bdee162a2634e579bb31846 -src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: a9ac811e491fce354aef029b11a96edb589535e84224fa2e2b323623e9fd6e00 -src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 7d925c3369b366c23e638ca2b3d074672324a5b9fd0fa586a3e71dee458743a6 -src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: 7532dc55c28c809f435f5edae98632a2d99adc898b2b71a661e2c9696f674f4a -src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: a9f3146a99e74eaba5327be3cdf9361fb8b69d1640751fb05519e44dd2ec7292 +src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: 5b48c690b6069a5610d07cc0c2a5e1da65a52296205dcf48a3b6fa5e3df36e9b +src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: a9b128267a069060e891533334c52586c706f145b1e813a4081cb21d425516ad +src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: b4eea39f0e23e501ad503cdd96db377109c7f0e212949828fe06102de7355349 +src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: da0189cd7f6e1ab4b79d0c054c2cdc1f7a9c81972dae9e5285f2f3d9c30ca644 +src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: b0802f923052eb584ce138210ebf2db70fb7883926896da1861a9e857d4abe81 +src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: 58bdd965421d610567e461becd7094fa41d668b119eddab99054d2bd6dc12acf src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: 
a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05 src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05 src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: e184e3692eb0d641fb2e37fada0e58a6c4958553931d7c038b884e1e6986093f @@ -107,4 +107,4 @@ src/rocprof_compute_soc/analysis_configs/gfx940/2100_pc_sampling.yaml: 4f3af5504 src/rocprof_compute_soc/analysis_configs/gfx941/2100_pc_sampling.yaml: 4f3af55040c40bee5f1fd88d83e2324d06e5dc462c0adc3e6d5b19b3f31af5e7 src/rocprof_compute_soc/analysis_configs/gfx942/2100_pc_sampling.yaml: 4f3af55040c40bee5f1fd88d83e2324d06e5dc462c0adc3e6d5b19b3f31af5e7 src/rocprof_compute_soc/analysis_configs/gfx950/2100_pc_sampling.yaml: 4f3af55040c40bee5f1fd88d83e2324d06e5dc462c0adc3e6d5b19b3f31af5e7 -docs/data/metrics_description.yaml: 69bd9c4121e13bdda6af2dead3129a46569f37fd1c59b20f45c85593824522d2 +docs/data/metrics_description.yaml: 819c08a584ae8b418e6983aa51108b95e43eda4f3b7892eab336c61d844b20bf diff --git a/projects/rocprofiler-compute/utils/split_config.py b/projects/rocprofiler-compute/utils/split_config.py index 52bbef8b0c..ae178978d3 100644 --- a/projects/rocprofiler-compute/utils/split_config.py +++ b/projects/rocprofiler-compute/utils/split_config.py @@ -6,9 +6,9 @@ # Read utils/unified_config.yaml and split it into metric tables per documentation section # WARNING: This script will overwrite existing docs/data/metrics_description.yaml +import copy import hashlib import re -import copy from pathlib import Path import yaml @@ -34,7 +34,10 @@ def update_analysis_config(): new_panel_config = {"Panel Config": {}} new_panel_config["Panel Config"]["id"] = panel_config["id"] new_panel_config["Panel Config"]["title"] = panel_config["title"] - new_panel_config["Panel Config"]["metrics_description"] = {key: value["plain"] for key, value in panel_config.get("metrics_description", {}).items()} + 
new_panel_config["Panel Config"]["metrics_description"] = { + key: value["plain"] + for key, value in panel_config.get("metrics_description", {}).items() + } # Convert int into str with 4 digits panel_id = str(panel_config["id"]).zfill(4) # Replace parentehsis, hyphen, slash and space with underscore @@ -57,7 +60,9 @@ def update_analysis_config(): for data_source_config in panel_config["data source"]: data_source_config = copy.deepcopy(data_source_config) if "metric_table" in data_source_config: - data_source_config["metric_table"]["metric"] = data_source_config["metric_table"]["metric"][gfx_version] + data_source_config["metric_table"]["metric"] = data_source_config[ + "metric_table" + ]["metric"][gfx_version] new_panel_config["Panel Config"]["data source"].append(data_source_config) # Write panel config to file filename = Path( @@ -121,12 +126,23 @@ def update_documentation(): for data_source in panel_config["data source"]: if "metric_table" in data_source: metrics_info = {} - for key in panel_config["metrics_description"]: - metrics_info[key] = { - "rst": panel_config["metrics_description"][key]["rst"], - "unit": panel_config["metrics_description"][key]["unit"], + # Metric names from data source + metric_names = { + metric + for _, gfx_data in data_source["metric_table"]["metric"].items() + for metric in gfx_data + } + # Select metrics with descriptions available + metric_names = metric_names.intersection( + panel_config["metrics_description"].keys() + ) + # Add metrics info + for metric_name in sorted(list(metric_names)): + metrics_info[metric_name] = { + "rst": panel_config["metrics_description"][metric_name]["rst"], + "unit": panel_config["metrics_description"][metric_name]["unit"], } - panel_metric_map[data_source["metric_table"]["id"]] = metrics_info + panel_metric_map[data_source["metric_table"]["id"]] = metrics_info # Merge panel_metric_map with section_panel_map section_metric_map = {} diff --git a/projects/rocprofiler-compute/utils/unified_config.yaml 
b/projects/rocprofiler-compute/utils/unified_config.yaml index fbc585e6c8..fb6286d7ab 100644 --- a/projects/rocprofiler-compute/utils/unified_config.yaml +++ b/projects/rocprofiler-compute/utils/unified_config.yaml @@ -10913,6 +10913,13 @@ panels: This is expected to be the sum of global/generic and spill/stack atomics in the :ref:`address processor `. unit: Instructions per normalization unit + Write Ack Instructions: + plain: The total number of write acknowledgements submitted by data-return + unit to SQ, summed over all compute units on the accelerator, per normalization + unit. + rst: The total number of write acknowledgements submitted by :ref:`data-return unit ` + to SQ, summed over all compute units on the accelerator, per normalization unit. + unit: Instructions per normalization unit - id: 1600 title: Vector L1 Data Cache data source: @@ -14728,6 +14735,21 @@ panels: min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom)) max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) + Read Bandwidth - PCIe: + avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) + unit: (Bytes + $normUnit) + "Read Bandwidth - Infinity Fabric\u2122": + avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) + unit: (Bytes + $normUnit) + Read Bandwidth - HBM: + avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) + unit: (Bytes + $normUnit) Write and Atomic (32B): avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) @@ -14754,19 +14776,19 @@ panels: max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) Write Bandwidth 
- PCIe: - avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) "Write Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) Write Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) Atomic: avg: AVG((TCC_EA0_ATOMIC_sum / $denom)) @@ -14779,19 +14801,19 @@ panels: max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom)) unit: (Req + $normUnit) Atomic Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) "Atomic Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ 
$denom) + max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) Atomic Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum / $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum / $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum / $denom) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) + min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) + max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) unit: (Bytes + $normUnit) gfx908: Read (32B): @@ -15064,6 +15086,24 @@ panels: requested in a cache line, the data movement will still be counted as a full cache line. unit: Bytes per normalization unit + Read Bandwidth: + plain: Total number of bytes looked up in the L2 cache for read requests, + per normalization unit. + rst: Total number of bytes looked up in the L2 cache for read requests, + per :ref:`normalization unit `. + unit: Bytes per normalization unit + Write Bandwidth: + plain: Total number of bytes looked up in the L2 cache for write requests, + per normalization unit. + rst: Total number of bytes looked up in the L2 cache for write requests, + per :ref:`normalization unit `. + unit: Bytes per normalization unit + Atomic Bandwidth: + plain: Total number of bytes looked up in the L2 cache for atomic requests, + per normalization unit. + rst: Total number of bytes looked up in the L2 cache for atomic requests, + per :ref:`normalization unit `. + unit: Bytes per normalization unit Req: plain: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. @@ -15235,6 +15275,18 @@ panels: from any source other than the accelerator's local HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit + Read Bandwidth - PCIe: + plain: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization unit. 
+ rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization unit. + unit: Bytes per normalization unit + "Read Bandwidth - Infinity Fabric\u2122": + plain: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, per normalization unit. + rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, per normalization unit. + unit: Bytes per normalization unit + Read Bandwidth - HBM: + plain: Total number of bytes due to L2 read requests due to HBM traffic, per normalization unit. + rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization unit. + unit: Bytes per normalization unit Write and Atomic (32B): plain: The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -15273,6 +15325,30 @@ panels: HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit + Write Bandwidth - PCIe: + plain: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization unit. + rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization unit. + unit: Bytes per normalization unit + "Write Bandwidth - Infinity Fabric\u2122": + plain: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, per normalization unit. + rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, per normalization unit. + unit: Bytes per normalization unit + Write Bandwidth - HBM: + plain: Total number of bytes due to L2 write requests due to HBM traffic, per normalization unit. + rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization unit. + unit: Bytes per normalization unit + Atomic Bandwidth - PCIe: + plain: Total number of bytes due to L2 atomic requests due to PCIe traffic, per normalization unit. 
+ rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per normalization unit. + unit: Bytes per normalization unit + "Atomic Bandwidth - Infinity Fabric\u2122": + plain: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, per normalization unit. + rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, per normalization unit. + unit: Bytes per normalization unit + Atomic Bandwidth - HBM: + plain: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization unit. + rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization unit. + unit: Bytes per normalization unit Atomic: plain: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request