Update ROCTracer README for the GitHub link (#1745)

* Update README for the GitHub link

* Updating links to rocm-systems
Swati Rawat
2025-12-09 23:12:48 +05:30
Committed by GitHub
Parent 938fe1ca8e
Commit 87e61f514c
8 files changed, with 36 additions and 36 deletions
@@ -39,7 +39,7 @@ Prerequisites
* ROCm 7.x build, or
-* Early release can be `built from source <https://github.com/rocm/aqlprofile>`_
+* Early release can be `built from source <https://github.com/ROCm/rocm-systems/tree/develop/projects/aqlprofile>`_
* Otherwise, ``rocprofv3`` throws the error "INVALID_SHADER_DATA" or "Agent not supported".
@@ -48,13 +48,13 @@ Prerequisites
* For binary files, see `ROCprof trace decoder release page <https://github.com/ROCm/rocprof-trace-decoder/releases>`_.
* Default install location is ``/opt/rocm/lib``
* For custom location, use:
* Parameter ``--att-library-path``, or
* Environment variable ``ROCPROF_ATT_LIBRARY_PATH``
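The lookup described above can be sketched as a small helper. This is only an illustration: ``resolve_decoder_dir`` is a hypothetical function, not part of ``rocprofv3``, and the precedence it uses (explicit parameter first, then ``ROCPROF_ATT_LIBRARY_PATH``, then the default) is an assumption rather than documented behavior.

```python
import os
from pathlib import Path

# Default install location stated in the text above.
DEFAULT_DECODER_DIR = Path("/opt/rocm/lib")


def resolve_decoder_dir(cli_path=None, env=None):
    """Hypothetical helper: choose the directory to search for the decoder library.

    Assumed precedence (not confirmed by the source):
    explicit --att-library-path value, then ROCPROF_ATT_LIBRARY_PATH, then the default.
    """
    env = os.environ if env is None else env
    if cli_path:
        return Path(cli_path)
    env_path = env.get("ROCPROF_ATT_LIBRARY_PATH")
    if env_path:
        return Path(env_path)
    return DEFAULT_DECODER_DIR
```

Passing ``env`` explicitly keeps the helper easy to check without modifying the real process environment.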
.. _thread-trace-parameters:
@@ -163,7 +163,7 @@ If the subsequent kernels are targeted kernels, the profiler will then profile a
new targeted kernel, so it is possible for a generated ATT file to have more than ``n`` kernels profiled.
All the profiled kernels are then compiled into a single ATT file.
If a new targeted kernel is encountered after the ``rocprofv3`` tool has finished profiling a batch of kernels,
the profiler will restart profiling when encountering this new targeted kernel and create another ATT file with multiple kernels.
.. _output-files:
@@ -175,17 +175,17 @@ After the application finishes executing, ROCprof Trace Decoder runs automatical
- stats_*.csv files:
* Contains a summary of instruction latency per kernel.
- ui_output_agent_{agent_id}_dispatch_{dispatch_id} directory:
* Contains detailed tracing information in the form of .json files.
* This directory can be opened using the `ROCprof Compute Viewer <https://rocm.docs.amd.com/projects/rocprof-compute-viewer/en/amd-mainline/>`_.
- Raw files:
* .att - Raw SQTT data. Can be used with the ROCprof Trace Decoder for further analysis.
* .out - Code object binaries (executable). Can be used with ISA analysis tools.
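As a rough illustration of the layout listed above, the following sketch sorts the contents of an output directory into the four kinds of artifacts. The function name ``classify_outputs`` and the recursive search are assumptions made for the example; the name patterns come from the list above.

```python
from pathlib import Path


def classify_outputs(output_dir):
    """Group thread trace outputs by kind, based on the naming patterns listed above."""
    root = Path(output_dir)
    found = {"stats_csv": [], "ui_dirs": [], "att": [], "out": []}
    for path in sorted(root.rglob("*")):
        name = path.name
        if path.is_dir():
            # ui_output_agent_{agent_id}_dispatch_{dispatch_id} directories
            if name.startswith("ui_output_agent_") and "_dispatch_" in name:
                found["ui_dirs"].append(path)
        elif name.startswith("stats_") and path.suffix == ".csv":
            found["stats_csv"].append(path)
        elif path.suffix == ".att":
            found["att"].append(path)  # raw SQTT data
        elif path.suffix == ".out":
            found["out"].append(path)  # code object binaries
    return found
```

The .json files inside the ``ui_output_*`` directories are deliberately left unclassified, since the directory as a whole is what the viewer opens.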
.. _csv-content:
@@ -217,28 +217,28 @@ The columns of the stats_*.csv file are described here:
* **Latency:** Total latency in cycles, defined as "Stall time + Issue time" for gfx9 or "Stall time + Execute time" for gfx10+.
* **Stall:** The total number of cycles the hardware pipe couldn't issue an instruction.
* Usually caused when the hardware unit is busy, such as TCP or LDS backpressure.
* **Idle:** The total time gap between the completion of the previous instruction and the beginning of the current instruction. The idle time can be caused by:
* Arbiter loss
* Source or destination register dependency
* Instruction cache miss
* **Source:** The original source line of code assigned by the compiler.
* Requires compiling with debug symbols.
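One way to use these columns is to rank instructions by the share of their latency spent stalled. The sketch below assumes the CSV header uses the literal names ``Latency``, ``Stall``, and ``Source``; the real header spelling is not shown here, so adjust the keys to match the actual file. ``top_stalled`` is a hypothetical helper.

```python
import csv
import io


def top_stalled(csv_text, n=3):
    """Return up to n (stall_fraction, source) pairs, highest stall fraction first.

    Column names are assumptions based on the descriptions above.
    """
    ranked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        latency = int(row["Latency"])
        stall = int(row["Stall"])
        # Guard against rows with zero latency to avoid division by zero.
        fraction = stall / latency if latency else 0.0
        ranked.append((fraction, row["Source"]))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked[:n]
```

To apply it to a real file, read the file contents first, for example with ``Path(...).read_text()``, and pass the text in.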
Troubleshooting
===============
For some applications, the stats_*.csv file could be empty even for a valid kernel dispatch.
Thread trace is limited to a single CU per SE (``att-target-cu``). If a kernel dispatch doesn't launch enough waves to populate the whole GPU, it is possible that no wave is assigned to the ``target_cu``. In such cases, there's nothing to be traced.
Here are some options to handle this:
* Launch more waves.
@@ -248,10 +248,10 @@ Here are some options to handle this:
* Set the ``--att-shader-engine-mask`` to 0x11111111, or possibly to 0xFFFFFFFF
* A mask that enables too many shader engines can cause packet losses and/or lead to a full buffer.
* Set the ``HSA_CU_MASK`` to mask out all CUs but the target. For more details, see `setting CUs <https://rocm.docs.amd.com/en/latest/how-to/setting-cus.html>`_.
* If only the ``target_cu`` (or a few CUs) are not masked out, then all or most waves will be assigned to the ``target_cu``.
* This can reduce performance in demanding kernels.
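To see why the two suggested ``--att-shader-engine-mask`` values differ so much in trace volume, it helps to expand them bit by bit. The sketch below assumes each set bit in the mask selects one shader engine; ``enabled_shader_engines`` is a hypothetical helper used only for illustration.

```python
def enabled_shader_engines(mask):
    """Return the indices of the bits set in a shader engine mask.

    Assumption: bit i of the mask corresponds to shader engine i.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


sparse = enabled_shader_engines(0x11111111)  # every fourth bit is set
dense = enabled_shader_engines(0xFFFFFFFF)   # all 32 low bits are set
print(len(sparse), len(dense))
```

Under that assumption, 0x11111111 selects one in every four mask positions, while 0xFFFFFFFF selects all of them, which is where the risk of packet loss or a full buffer comes from.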