[rocprof-compute] Documentation changes for move to super-repo for 7.1 (#1329)

- also remove json output mention in docs
2025-10-15 15:32:54 -04:00
@@ -1,11 +1,11 @@
 ## How to fork from us

-To keep our development fast and conflict free, we recommend you to [fork](https://github.com/ROCm/rocprofiler-compute/fork) our repository and start your work from our `develop` branch in your private repository.
+To keep our development fast and conflict free, we recommend you to [fork](https://github.com/ROCm/rocm-systems/forks) our repository and start your work from our `develop` branch in your private repository.

 Afterwards, git clone your repository to your local machine. But that is not it! To keep track of the original develop repository, add it as another remote.

 ```
-git remote add mainline https://github.com/ROCm/rocprofiler-compute.git
+git remote add mainline https://github.com/ROCm/rocm-systems.git
 git checkout develop
 ```

@@ -21,17 +21,17 @@ and apply your changes there. For more help reference GitHub's ['About Forking']

 ### Did you find a bug?

- Ensure the bug was not already reported by searching on GitHub under [Issues](https://github.com/ROCm/rocprofiler-compute/issues).
+- Ensure the bug was not already reported by searching on GitHub under [Issues](https://github.com/ROCm/rocm-systems/issues).

- If you're unable to find an open issue addressing the problem, [open a new one](https://github.com/ROCm/rocprofiler-compute/issues/new).
+- If you're unable to find an open issue addressing the problem, [open a new one](https://github.com/ROCm/rocm-systems/issues/new).

 ### Did you write a patch that fixes a bug?

- Open a new GitHub [pull request](https://github.com/ROCm/rocprofiler-compute/compare) with the patch.
+- Open a new GitHub [pull request](https://github.com/ROCm/rocm-systems/compare) with the patch.

 - Ensure the PR description clearly describes the problem and solution. If there is an existing GitHub issue open describing this bug, please include it in the description so we can close it.

- Ensure the PR is based on the `develop` branch of the ROCm Compute Profiler GitHub repository.
+- Ensure the PR is based on the `develop` branch of the ROCm Systems GitHub repository.

 > [!TIP]
 > To ensure you meet all formatting requirements before publishing, we recommend you utilize our included [*pre-commit hooks*](https://pre-commit.com/#introduction). For more information on how to use pre-commit hooks please see the [section below](#using-pre-commit-hooks).
@@ -184,7 +184,7 @@ You can also disable specific linting rules for a line by using `# noqa: <rule_c

 ### Coding guidelines

-Below are some repository specific guidelines which are followed througout the repository.
+Below are some repository specific guidelines which are followed throughout the repository.
 Any future contributions should adhere to these guidelines:
 * Use the `pathlib` library functions instead of `os.path` for manipulating the file paths.
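
Reviewer sketch (not part of the patch): the `pathlib` guideline above might look like the following in practice. The `workload_csv` helper and the `workloads/vcopy/MI200` directory are hypothetical names used only for illustration; `pmc_perf.csv` is the merged counter file mentioned in the profiling docs.

```python
from pathlib import Path


def workload_csv(workload_dir: str, name: str = "pmc_perf.csv") -> Path:
    """Build the path to a CSV file inside a workload directory.

    Hypothetical helper: joins with the `/` operator instead of os.path.join.
    """
    return Path(workload_dir) / name


# Hypothetical workload directory, used only to show the path operations.
csv_path = workload_csv("workloads/vcopy/MI200")

# Path attributes replace os.path.basename / splitext / dirname.
file_name = csv_path.name            # final path component
extension = csv_path.suffix          # file extension, including the dot
arch_dir = csv_path.parent.name      # name of the containing directory
```

Working with `Path` objects end to end avoids manual separator handling and repeated string splitting.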

@@ -20,13 +20,22 @@ contribution process.

 ## Development

-ROCm Compute Profiler follows a
-[main-dev](https://nvie.com/posts/a-successful-git-branching-model/)
-branching model. As a result, our latest stable release is shipped
-from the `amd-mainline` branch, while new features are developed in our
-`develop` branch.
+ROCm Compute Profiler is now included in the rocm-systems super-repo. The latest sources are in the `develop` branch. You can find a particular release in the corresponding `release/rocm-rel-X.Y` branch.

-Users may checkout `amd-staging` to preview upcoming features.
+### Pulling the source using sparse-checkout
+
+If you only want to pull the source for a particular project from the super-repo, do a sparse checkout:
+
+```bash
+git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-systems.git
+cd rocm-systems
+git sparse-checkout init --cone
+git sparse-checkout set projects/rocprofiler-compute
+git checkout develop
+
+cd projects/rocprofiler-compute
+python3 -m pip install -r requirements.txt
+```

 ## Testing

@@ -147,7 +147,7 @@ The latter issue is discussed in more detail in our ['internal' IPC](Internal_ip
 CDNA accelerators, such as the MI100 and [MI2XX](2xxnote), contain specialized hardware to accelerate matrix-matrix multiplications, also known as Matrix Fused Multiply-Add (MFMA) operations.
 The exact operation types and supported formats may vary by accelerator.
 The reader is referred to the [AMD matrix cores](https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/) blog post on GPUOpen for a general discussion of these hardware units.
-In addition, to explore the available MFMA instructions in-depth on various AMD accelerators (including the CDNA line), we recommend the [AMD Matrix Instruction Calculator](https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator).
+In addition, to explore the available MFMA instructions in-depth on various AMD accelerators (including the CDNA line), we recommend the [AMD Matrix Instruction Calculator](https://github.com/ROCm/amd_matrix_instruction_calculator).

 ```{code-block} shell-session
 :name: matrix_calc_ex
@@ -185,7 +185,7 @@ The exact details of VALU and MFMA operation co-execution vary by instruction, a
  - 'Can co-execute with VALU'
  - 'VALU co-execution cycles possible'

-fields in the [AMD Matrix Instruction Calculator](https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator#example-of-querying-instruction-information)'s detailed instruction information.
+fields in the [AMD Matrix Instruction Calculator](https://github.com/ROCm/amd_matrix_instruction_calculator#example-of-querying-instruction-information)'s detailed instruction information.
 ```

 #### Non-pipeline resources
@@ -210,7 +210,7 @@ AGPRs are not available on all AMD Instinct(tm) accelerators.
 GCN GPUs, such as the AMD Instinct(tm) MI50 had a 256 KiB VGPR file.
 The AMD Instinct(tm) MI100 (CDNA) has a 2x256 KiB register file, where one half is available as general-purpose VGPRs, and the other half is for matrix math accumulation VGPRs (AGPRs).
 The AMD Instinct(tm) [MI2XX](2xxnote) (CDNA2) has a 512 KiB VGPR file per CU, where each wave can dynamically request up to 256 KiB of VGPRs and an additional 256 KiB of AGPRs.
-For more detail, the reader is referred to the [following comment](https://github.com/RadeonOpenCompute/ROCm/issues/1689#issuecomment-1553751913).
+For more detail, the reader is referred to the [following comment](https://github.com/ROCm/ROCm/issues/1689#issuecomment-1553751913).

 (ERM)=
 ### Pipeline Metrics
@@ -562,7 +562,7 @@ The reader is referred to the [Instructions per-cycle and Utilizations](IPC_exam
  - Indicates what percent of the kernel's duration the [MFMA](mfma) unit was busy executing instructions.  Computed as the ratio of the total number of cycles the [MFMA](mfma) unit was busy over the [total CU cycles](TotalCUCycles).
  - Percent
 * - MFMA Instruction Cycles
-  - The average duration of [MFMA](mfma) instructions in this kernel in cycles.  Computed as the ratio of the total number of cycles the [MFMA](mfma) unit was busy over the total number of [MFMA](mfma) instructions.  Compare to e.g., the [AMD Matrix Instruction Calculator](https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator).
+  - The average duration of [MFMA](mfma) instructions in this kernel in cycles.  Computed as the ratio of the total number of cycles the [MFMA](mfma) unit was busy over the total number of [MFMA](mfma) instructions.  Compare to e.g., the [AMD Matrix Instruction Calculator](https://github.com/ROCm/amd_matrix_instruction_calculator).
  - Cycles per instruction
 * - VMEM Latency
  - The average number of round-trip cycles (i.e., from issue to data-return / acknowledgment) required for a VMEM instruction to complete.
@@ -3522,7 +3522,7 @@ The MFMA assembly operations used in this example are inherently unportable to o
 ```

 Unlike the simple quad-cycle `v_mov_b32` operation discussed in our [previous example](VALU_ipc), some operations take many quad-cycles to execute.
-For example, using the [AMD Matrix Instruction Calculator](https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator#example-of-querying-instruction-information) we can see that some [MFMA](mfma) operations take 64 cycles, e.g.:
+For example, using the [AMD Matrix Instruction Calculator](https://github.com/ROCm/amd_matrix_instruction_calculator#example-of-querying-instruction-information) we can see that some [MFMA](mfma) operations take 64 cycles, e.g.:

 ```shell-session
 $ ./matrix_calculator.py --arch CDNA2 --detail-instruction --instruction v_mfma_f32_32x32x8bf16_1k
@@ -173,7 +173,7 @@ external_projects_current_project = "rocprofiler-compute"
 # frequently used external resources
 extlinks = {
    "dev-sample": (
-        "https://github.com/ROCm/rocprofiler-compute/blob/amd-mainline/sample/%s",
+        "https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-compute/sample/%s",
        "%s",
    ),
    "prod-page": (
@@ -128,7 +128,7 @@ There are three high-level GPU analysis views:

 3. Choose your own customized subset of metrics with the ``-b`` (or ``--block``)
   option. Or, build your own configuration following
-   `config_template <https://github.com/ROCm/rocprofiler-compute/blob/amd-mainline/src/rocprof_compute_soc/analysis_configs/panel_config_template.yaml>`_.
+   `config_template <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/panel_config_template.yaml>`_.
   The following snippet shows how to generate a report containing only metric 2
   (:doc:`System Speed-of-Light </conceptual/system-speed-of-light>`).

@@ -47,7 +47,7 @@ Run ``rocprof-compute profile -h`` for more details. See
 Profiling example
 -----------------

-The `<https://github.com/ROCm/rocprofiler-compute/blob/amd-mainline/sample/vcopy.cpp>`__ repository
+The `<https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-compute/sample/vcopy.cpp>`__ repository
 includes source code for a sample GPU compute workload, ``vcopy.cpp``. A copy of
 this file is available in the ``share/sample`` subdirectory after a normal
 ROCm Compute Profiler installation, or via the ``$ROCPROFCOMPUTE_SHARE/sample`` directory when
@@ -239,11 +239,6 @@ of the underlying ``rocprof`` tool. The following formats are supported:
   * The generated csv files across multiple runs of rocprof are processed and dumped into the workload directory as csv files.
   * Multiple csv files are merged into single pmc_perf.csv file in workload directory.

-* ``json`` format:
-   * Ask underlying rocprof tool to dump raw performance counter data in json format.
-   * The generated json files across multiple runs of rocprof are processed and dumped into the workload directory as csv files.
-   * Multiple csv files are merged into single pmc_perf.csv file in workload directory.
-
 * ``rocpd`` format:
   * Ask underlying rocprof tool to dump raw performance counter data in rocpd format.
   * Multiple ``rocpd`` database files containing counter collection data are merged into a single csv under the workload folder.
@@ -15,7 +15,11 @@ If you're new to ROCm Compute Profiler, familiarize yourself with the tool by re
 chapters that follow and gradually learn its more advanced features. To get
 started, see :doc:`What is ROCm Compute Profiler? <what-is-rocprof-compute>`.

-ROCm Compute Profiler is open source and hosted at `<https://github.com/ROCm/rocprofiler-compute>`__.
+ROCm Compute Profiler is open source and hosted at `<https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-compute>`__.
+
+.. note::
+
+   The rocprofiler-compute repository for ROCm 7.0 and earlier is located at `<https://github.com/ROCm/rocprofiler-compute>`_.

 .. grid:: 2
   :gutter: 3
@@ -111,7 +111,7 @@ Install from source
 -------------------

 #. A typical install begins by downloading the latest release tarball available
-   from `<https://github.com/ROCm/rocprofiler-compute/releases>`__. From there, untar and
+   from `<https://github.com/ROCm/rocm-systems/releases>`__. From there, untar and
   navigate into the top-level directory.

   ..
@@ -623,7 +623,7 @@ manner. See
 for further reading on this instruction type.

 We develop a `simple
-kernel <https://github.com/ROCm/rocprofiler-compute/blob/amd-mainline/sample/stack.hip>`__
+kernel <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-compute/sample/stack.hip>`__
 that uses stack memory:

 .. code-block:: cpp
@@ -7,7 +7,7 @@ Profiling by example
 ********************

 The following examples refer to sample :doc:`HIP <hip:index>` code located in
-:fab:`github` :dev-sample:`ROCm/rocprofiler-compute/blob/amd-mainline/sample <>`
+:fab:`github` :dev-sample:`ROCm/rocm-systems/blob/develop/projects/rocprofiler-compute/sample <>`
 and distributed as part of ROCm Compute Profiler.

 .. include:: ./includes/valu-arithmetic-instruction-mix.rst