changing markdown to rst format (#259)

* changing markdown extension to rst extension * updating callback services * updating all services, samples and installation * Fix build * More fixes * more fixes * minor fixes * more fixes * merging changes for SWDEV-510794 from pr 227
2025-03-20 21:39:53 +05:30
@@ -1,238 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK API reference, Buffered services API"
---
-
-# ROCprofiler-SDK buffered services
-
-In the buffered approach, the internal (background) thread sends callbacks for batches of records.
-Supported buffer record categories are enumerated in `rocprofiler_buffer_category_t` category field and supported buffer tracing services are enumerated in  `rocprofiler_buffer_tracing_kind_t`. Configuring
-a buffered tracing service requires buffer creation. Flushing the buffer implicitly or explicitly invokes a callback to the tool, which provides an array of one or more buffer records.
-To flush a buffer explicitly, use `rocprofiler_flush_buffer` function.
-
-## Subscribing to buffer tracing services
-
-During tool initialization, the tool configures callback tracing using `rocprofiler_configure_buffer_tracing_service`
-function. However, before invoking `rocprofiler_configure_buffer_tracing_service`, the tool must create a buffer for the tracing records as shown in the following section.
-
-### Creating a buffer
-
-```cpp
-rocprofiler_status_t
-rocprofiler_create_buffer(rocprofiler_context_id_t        context,
-                          size_t                          size,
-                          size_t                          watermark,
-                          rocprofiler_buffer_policy_t     policy,
-                          rocprofiler_buffer_tracing_cb_t callback,
-                          void*                           callback_data,
-                          rocprofiler_buffer_id_t*        buffer_id);
-```
-
-Here are the parameters required to create a buffer:
-
- `size`: Size of the buffer in bytes, which is rounded up to the nearest
-memory page size (defined by `sysconf(_SC_PAGESIZE)`). The default memory page size on Linux
-is 4096 bytes (4 KB).
-
- `watermark`: Specifies the number of bytes at which the buffer should be flushed. To flush the buffer, the records in the buffer must invoke the `callback` parameter to deliver the records to the tool. For example, for a buffer of size 4096 bytes with the watermark set to 48 bytes, six 8-byte records can be placed in the
-buffer before `callback` is invoked. However, every 64-byte record that is placed in the
-buffer will trigger a flush. It is safe to set the `watermark` to any value between
-zero and the buffer size.
-
- `policy`: Specifies the behavior when a record is larger than the
-amount of free space in the current buffer. For example, for a buffer of size 4000 bytes with the watermark set to 4000 bytes and 3998 bytes populated with records, the `policy` dictates how to handle an incoming record greater than 2 bytes. If the environment variable `ROCPROFILER_BUFFER_POLICY_DISCARD` is enabled, all records greater than 2 bytes are dropped until the tool _explicitly_ flushes the buffer using `rocprofiler_flush_buffer` function call whereas, if the environment variable `ROCPROFILER_BUFFER_POLICY_LOSSLESS` is enabled, the current buffer is swapped out for an empty buffer and placed in the new buffer while the former (full) buffer is _implicitly_ flushed.
-
- `callback`: Invoked to flush the buffer.
-
- `callback_data`: Value passed as one of the arguments to the `callback` function.
-
- `buffer_id`: Output parameter for the function call to contain a
-non-zero handle field after successful buffer creation.
-
-### Creating a dedicated thread for buffer callbacks
-
-By default, all buffers use the same (default) background thread created by ROCprofiler-SDK to
-invoke their callback. However, ROCprofiler-SDK provides an interface to allow the tools to create an additional background thread for one or more of their buffers.
-
-To create callback threads for buffers, use `rocprofiler_create_callback_thread` function:
-
-```cpp
-rocprofiler_status_t
-rocprofiler_create_callback_thread(rocprofiler_callback_thread_t* cb_thread_id);
-```
-
-To assign buffers to that callback thread, use `rocprofiler_assign_callback_thread` function:
-
-```cpp
-rocprofiler_status_t
-rocprofiler_assign_callback_thread(rocprofiler_buffer_id_t       buffer_id,
-                                   rocprofiler_callback_thread_t cb_thread_id);
-```
-
-**Example:**
-
-```cpp
-{
-    // create a context
-    auto context_id = rocprofiler_context_id_t{0};
-    rocprofiler_create_context(&context_id);
-
-    // create a buffer associated with the context
-    auto buffer_id  = rocprofiler_buffer_id_t{};
-    rocprofiler_create_buffer(context_id, ..., &buffer_id);
-
-    // specify that a new callback thread should be created and provide
-    // and assign the identifier for it to the "thr_id" variable
-    auto thr_id = rocprofiler_callback_thread_t{};
-    rocprofiler_create_callback_thread(&thr_id);
-
-    // assign the buffer callback to be delivered on this thread
-    rocprofiler_assign_callback_thread(buffer_id, thr_id);
-}
-```
-
-### Configuring buffer tracing services
-
-To configure buffer tracing services, use:
-
-```cpp
-rocprofiler_status_t
-rocprofiler_configure_buffer_tracing_service(rocprofiler_context_id_t          context_id,
-                                             rocprofiler_buffer_tracing_kind_t kind,
-                                             rocprofiler_tracing_operation_t*  operations,
-                                             size_t                            operations_count,
-                                             rocprofiler_buffer_id_t           buffer_id);
-```
-
-Here are the parameters required to configure buffer tracing services:
-
- `kind`: A high-level specification of the services to be traced. This parameter is also known as "domain".
-Domain examples include, but not limited to, the HIP API, HSA API, and kernel dispatches.
-
- `operations`: For each domain, there are often various `operations` that can be used to restrict the callbacks to a subset within the domain. For domains corresponding to APIs, the `operations` are the functions
-composing the API. To trace all operations in a domain, set the `operations` and `operations_count`
-parameters to `nullptr` and `0` respectively. To restrict the tracing domain to a subset
-of operations, the tool library must specify a C-array of type `rocprofiler_tracing_operation_t` for `operations` and size of the array for the `operations_count` parameter.
-
-Similar to the `rocprofiler_configure_callback_tracing_service`,
-`rocprofiler_configure_buffer_tracing_service` returns an error if a buffer service for the specified context
-and domain is configured more than once.
-
-**Example:**
-
-```cpp
-{
-    auto ctx = rocprofiler_context_id_t{};
-    // ... creation of context, etc. ...
-
-    // buffer parameters
-    constexpr auto KB          = 1024;  // 1024 bytes
-    constexpr auto buffer_size = 16 * KB;
-    constexpr auto watermark   = 15 * KB;
-    constexpr auto policy      = ROCPROFILER_BUFFER_POLICY_LOSSLESS;
-
-    // buffer handle
-    auto buffer_id = rocprofiler_buffer_id_t{};
-
-    // create a buffer associated with the context
-    rocprofiler_create_buffer(
-        context_id, buffer_size, watermark, policy, callback_func, nullptr, &buffer_id);
-
-    // configure HIP runtime API function records to be placed in buffer
-    rocprofiler_configure_buffer_tracing_service(
-        ctx, ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API, nullptr, 0, buffer_id);
-
-    // configure kernel dispatch records to be placed in buffer
-    // (more than one service can use the same buffer)
-    rocprofiler_configure_buffer_tracing_service(
-        ctx, ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH, nullptr, 0, buffer_id);
-
-    // ... etc. ...
-}
-```
-
-## Buffer tracing callback function
-
-Here is the buffer tracing callback function:
-
-```cpp
-typedef void (*rocprofiler_buffer_tracing_cb_t)(rocprofiler_context_id_t      context,
-                                                rocprofiler_buffer_id_t       buffer_id,
-                                                rocprofiler_record_header_t** headers,
-                                                size_t                        num_headers,
-                                                void*                         data,
-                                                uint64_t                      drop_count);
-```
-
-The `rocprofiler_record_header_t` data type contains the following information:
-
- `category` (`rocprofiler_buffer_category_t`): The `category` is used to classify the buffer record. For all
-services configured via `rocprofiler_configure_buffer_tracing_service`, the `category` is equal to the value of `ROCPROFILER_BUFFER_CATEGORY_TRACING`. The other available categories are `ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING` and `ROCPROFILER_BUFFER_CATEGORY_COUNTERS`.
-
- `kind`: The `kind` field is dependent on the `category`. For example, for `category` `ROCPROFILER_BUFFER_CATEGORY_TRACING`, the value of `kind` depicts the tracing type such as HSA core API in `ROCPROFILER_BUFFER_TRACING_HSA_CORE_API`.
-
- `payload`: The `payload` is casted after the category and kind have been determined.
-
-```cpp
-{
-    if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
-        header->kind == ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API)
-    {
-        auto* record =
-            static_cast<rocprofiler_buffer_tracing_hip_api_record_t*>(header->payload);
-
-        // ... etc. ...
-    }
-}
-```
-
-**Example:**
-
-```cpp
-void
-buffer_callback_func(rocprofiler_context_id_t      context,
-                     rocprofiler_buffer_id_t       buffer_id,
-                     rocprofiler_record_header_t** headers,
-                     size_t                        num_headers,
-                     void*                         user_data,
-                     uint64_t                      drop_count)
-{
-    for(size_t i = 0; i < num_headers; ++i)
-    {
-        auto* header = headers[i];
-
-        if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
-           header->kind == ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API)
-        {
-            auto* record =
-                static_cast<rocprofiler_buffer_tracing_hip_api_record_t*>(header->payload);
-
-            // ... etc. ...
-        }
-        else if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
-                header->kind == ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH)
-        {
-            auto* record =
-                static_cast<rocprofiler_buffer_tracing_kernel_dispatch_record_t*>(header->payload);
-
-            // ... etc. ...
-        }
-        else
-        {
-            throw std::runtime_error{"unhandled record header category + kind"};
-        }
-    }
-}
-```
-
-## Buffer tracing record
-
-Unlike callback tracing records, there is no common set of data for each buffer tracing record. However,
-many buffer tracing records contain a `kind` and an `operation` field.
-You can obtain the value for the `kind` of tracing using `rocprofiler_query_buffer_tracing_kind_name` function and the value for the `operation` specific to a tracing kind using the `rocprofiler_query_buffer_tracing_kind_operation_name`
-function. You can also iterate over all the buffer tracing `kinds` and `operations` for each tracing kind using the
-`rocprofiler_iterate_buffer_tracing_kinds` and `rocprofiler_iterate_buffer_tracing_kind_operations` functions.
-
-The buffer tracing record data types are available in the [rocprofiler-sdk/buffer_tracing.h](https://github.com/ROCm/rocprofiler-sdk/blob/amd-mainline/source/include/rocprofiler-sdk/buffer_tracing.h) header.
@@ -0,0 +1,245 @@
+.. ---
+.. myst:
+..     html_meta:
+..         "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
+..         "keywords": "ROCprofiler-SDK API reference, Buffered services API"
+.. ---
+
+ROCprofiler-SDK buffered services
+=================================
+
+In the buffered approach, the internal (background) thread sends callbacks for batches of records.
+Supported buffer record categories are enumerated in ``rocprofiler_buffer_category_t``, and supported buffer tracing services are enumerated in ``rocprofiler_buffer_tracing_kind_t``. Configuring
+a buffered tracing service requires buffer creation. Flushing the buffer implicitly or explicitly invokes a callback to the tool, which provides an array of one or more buffer records.
+To flush a buffer explicitly, use the ``rocprofiler_flush_buffer`` function.
+
+Subscribing to buffer tracing services
+--------------------------------------
+
+During tool initialization, the tool configures buffer tracing using the ``rocprofiler_configure_buffer_tracing_service``
+function. However, before invoking ``rocprofiler_configure_buffer_tracing_service``, the tool must create a buffer for the tracing records as shown in the following section.
+
+Creating a buffer
+-----------------
+
+.. code-block:: cpp
+
+    rocprofiler_status_t
+    rocprofiler_create_buffer(rocprofiler_context_id_t        context,
+                              size_t                          size,
+                              size_t                          watermark,
+                              rocprofiler_buffer_policy_t     policy,
+                              rocprofiler_buffer_tracing_cb_t callback,
+                              void*                           callback_data,
+                              rocprofiler_buffer_id_t*        buffer_id);
+
+Here are the parameters required to create a buffer:
+
+- ``size``: Size of the buffer in bytes, which is rounded up to the nearest
+  memory page size (defined by ``sysconf(_SC_PAGESIZE)``). The default memory page size on Linux
+  is 4096 bytes (4 KB).
+
+- ``watermark``: Specifies the number of bytes at which the buffer should be flushed. Flushing the buffer invokes the ``callback`` to deliver the buffered records to the tool. For example, for a buffer of size 4096 bytes with the watermark set to 48 bytes, six 8-byte records can be placed in the
+  buffer before ``callback`` is invoked. However, every 64-byte record that is placed in the
+  buffer triggers a flush. It is safe to set the ``watermark`` to any value between
+  zero and the buffer size.
+
+- ``policy``: Specifies the behavior when a record is larger than the
+  amount of free space in the current buffer. For example, for a buffer of size 4000 bytes with the watermark set to 4000 bytes and 3998 bytes populated with records, the ``policy`` dictates how to handle an incoming record greater than 2 bytes. If the policy is ``ROCPROFILER_BUFFER_POLICY_DISCARD``, all records greater than 2 bytes are dropped until the tool *explicitly* flushes the buffer using the ``rocprofiler_flush_buffer`` function. If the policy is ``ROCPROFILER_BUFFER_POLICY_LOSSLESS``, the current buffer is swapped out for an empty buffer, the record is placed in the new buffer, and the former (full) buffer is *implicitly* flushed.
+
+- ``callback``: Invoked to flush the buffer.
+
+- ``callback_data``: Value passed as one of the arguments to the ``callback`` function.
+
+- ``buffer_id``: Output parameter that contains a
+  non-zero handle field after successful buffer creation.
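
The rounding, watermark, and policy rules above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the ROCprofiler-SDK implementation: ``toy_buffer``, ``toy_policy``, ``round_up_to_page``, and ``system_page_size`` are hypothetical names, the model tracks byte counts only, and it assumes a flush happens as soon as the bytes in use reach the watermark.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>  // sysconf, _SC_PAGESIZE (POSIX)

// Query the memory page size; fall back to 4096 if sysconf reports an error.
inline std::size_t
system_page_size()
{
    const long value = sysconf(_SC_PAGESIZE);
    return (value > 0) ? static_cast<std::size_t>(value) : 4096;
}

// Round a requested buffer size up to a whole number of pages.
inline std::size_t
round_up_to_page(std::size_t requested, std::size_t page_size)
{
    assert(page_size > 0);
    return ((requested + page_size - 1) / page_size) * page_size;
}

// Hypothetical stand-ins for the two policies described above.
enum class toy_policy
{
    discard,
    lossless
};

// Toy model of a single buffer (assumption: byte accounting only).
struct toy_buffer
{
    std::size_t capacity  = 0;
    std::size_t watermark = 0;
    toy_policy  policy    = toy_policy::lossless;
    std::size_t used      = 0;  // bytes currently held
    std::size_t flushes   = 0;  // number of times the callback would run
    std::size_t dropped   = 0;  // records discarded

    // Deliver the held records (here: just count the flush) and empty the buffer.
    void flush()
    {
        if(used > 0)
        {
            ++flushes;
            used = 0;
        }
    }

    void push(std::size_t record_size)
    {
        if(record_size > capacity)
        {
            ++dropped;  // can never fit, regardless of policy
            return;
        }
        if(record_size > capacity - used)
        {
            if(policy == toy_policy::discard)
            {
                ++dropped;  // dropped until the tool flushes explicitly
                return;
            }
            flush();  // lossless: the full buffer is flushed implicitly
        }
        used += record_size;
        if(used >= watermark) flush();  // watermark reached
    }
};
```

With ``capacity = 4096`` and ``watermark = 48``, the sixth 8-byte ``push`` is the first one that flushes, while every 64-byte ``push`` flushes immediately. With ``capacity = 4000``, ``watermark = 4000``, and 3998 bytes in use, a 4-byte record is counted in ``dropped`` under ``toy_policy::discard`` and triggers a flush under ``toy_policy::lossless``.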
+
+Creating a dedicated thread for buffer callbacks
+------------------------------------------------
+
+By default, all buffers use the same (default) background thread created by ROCprofiler-SDK to
+invoke their callback. However, ROCprofiler-SDK provides an interface to allow the tools to create an additional background thread for one or more of their buffers.
+
+To create callback threads for buffers, use the ``rocprofiler_create_callback_thread`` function:
+
+.. code-block:: cpp
+
+    rocprofiler_status_t
+    rocprofiler_create_callback_thread(rocprofiler_callback_thread_t* cb_thread_id);
+
+To assign buffers to that callback thread, use the ``rocprofiler_assign_callback_thread`` function:
+
+.. code-block:: cpp
+
+    rocprofiler_status_t
+    rocprofiler_assign_callback_thread(rocprofiler_buffer_id_t       buffer_id,
+                                       rocprofiler_callback_thread_t cb_thread_id);
+
+**Example:**
+
+.. code-block:: cpp
+
+    {
+        // create a context
+        auto context_id = rocprofiler_context_id_t{0};
+        rocprofiler_create_context(&context_id);
+
+        // create a buffer associated with the context
+        auto buffer_id  = rocprofiler_buffer_id_t{};
+        rocprofiler_create_buffer(context_id, ..., &buffer_id);
+
+        // create a new callback thread and store its identifier
+        // in the "thr_id" variable
+        auto thr_id = rocprofiler_callback_thread_t{};
+        rocprofiler_create_callback_thread(&thr_id);
+
+        // assign the buffer callback to be delivered on this thread
+        rocprofiler_assign_callback_thread(buffer_id, thr_id);
+    }
+
+Configuring buffer tracing services
+-----------------------------------
+
+To configure buffer tracing services, use:
+
+.. code-block:: cpp
+
+    rocprofiler_status_t
+    rocprofiler_configure_buffer_tracing_service(rocprofiler_context_id_t          context_id,
+                                                 rocprofiler_buffer_tracing_kind_t kind,
+                                                 rocprofiler_tracing_operation_t*  operations,
+                                                 size_t                            operations_count,
+                                                 rocprofiler_buffer_id_t           buffer_id);
+
+Here are the parameters required to configure buffer tracing services:
+
+- ``kind``: A high-level specification of the services to be traced. This parameter is also known as "domain".
+  Domain examples include, but are not limited to, the HIP API, HSA API, and kernel dispatches.
+
+- ``operations``: For each domain, there are often various ``operations`` that can be used to restrict the callbacks to a subset within the domain. For domains corresponding to APIs, the ``operations`` are the functions
+  composing the API. To trace all operations in a domain, set the ``operations`` and ``operations_count``
+  parameters to ``nullptr`` and ``0`` respectively. To restrict the tracing domain to a subset
+  of operations, the tool library must specify a C-array of type ``rocprofiler_tracing_operation_t`` for ``operations`` and the size of the array for the ``operations_count`` parameter.
+
+Similar to ``rocprofiler_configure_callback_tracing_service``,
+``rocprofiler_configure_buffer_tracing_service`` returns an error if a buffer service for the specified context
+and domain is configured more than once.
+
+**Example:**
+
+.. code-block:: cpp
+
+    {
+        auto ctx = rocprofiler_context_id_t{};
+        // ... creation of context, etc. ...
+
+        // buffer parameters
+        constexpr auto KB          = 1024;  // 1024 bytes
+        constexpr auto buffer_size = 16 * KB;
+        constexpr auto watermark   = 15 * KB;
+        constexpr auto policy      = ROCPROFILER_BUFFER_POLICY_LOSSLESS;
+
+        // buffer handle
+        auto buffer_id = rocprofiler_buffer_id_t{};
+
+        // create a buffer associated with the context
+        rocprofiler_create_buffer(
+            ctx, buffer_size, watermark, policy, callback_func, nullptr, &buffer_id);
+
+        // configure HIP runtime API function records to be placed in buffer
+        rocprofiler_configure_buffer_tracing_service(
+            ctx, ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API, nullptr, 0, buffer_id);
+
+        // configure kernel dispatch records to be placed in buffer
+        // (more than one service can use the same buffer)
+        rocprofiler_configure_buffer_tracing_service(
+            ctx, ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH, nullptr, 0, buffer_id);
+
+        // ... etc. ...
+    }
+
+Buffer tracing callback function
+--------------------------------
+
+Here is the buffer tracing callback function:
+
+.. code-block:: cpp
+
+    typedef void (*rocprofiler_buffer_tracing_cb_t)(rocprofiler_context_id_t      context,
+                                                    rocprofiler_buffer_id_t       buffer_id,
+                                                    rocprofiler_record_header_t** headers,
+                                                    size_t                        num_headers,
+                                                    void*                         data,
+                                                    uint64_t                      drop_count);
+
+The ``rocprofiler_record_header_t`` data type contains the following information:
+
+- ``category`` (``rocprofiler_buffer_category_t``): The ``category`` is used to classify the buffer record. For all
+  services configured via ``rocprofiler_configure_buffer_tracing_service``, the ``category`` is equal to the value of ``ROCPROFILER_BUFFER_CATEGORY_TRACING``. The other available categories are ``ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING`` and ``ROCPROFILER_BUFFER_CATEGORY_COUNTERS``.
+
+- ``kind``: The ``kind`` field is dependent on the ``category``. For example, for ``category`` ``ROCPROFILER_BUFFER_CATEGORY_TRACING``, the value of ``kind`` indicates the tracing type, such as ``ROCPROFILER_BUFFER_TRACING_HSA_CORE_API`` for the HSA core API.
+
+- ``payload``: The ``payload`` is cast to the matching record type after the category and kind have been determined.
+
+.. code-block:: cpp
+
+    {
+        if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
+            header->kind == ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API)
+        {
+            auto* record =
+                static_cast<rocprofiler_buffer_tracing_hip_api_record_t*>(header->payload);
+
+            // ... etc. ...
+        }
+    }
+
+**Example:**
+
+.. code-block:: cpp
+
+    void
+    buffer_callback_func(rocprofiler_context_id_t      context,
+                         rocprofiler_buffer_id_t       buffer_id,
+                         rocprofiler_record_header_t** headers,
+                         size_t                        num_headers,
+                         void*                         user_data,
+                         uint64_t                      drop_count)
+    {
+        for(size_t i = 0; i < num_headers; ++i)
+        {
+            auto* header = headers[i];
+
+            if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
+               header->kind == ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API)
+            {
+                auto* record =
+                    static_cast<rocprofiler_buffer_tracing_hip_api_record_t*>(header->payload);
+
+                // ... etc. ...
+            }
+            else if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
+                    header->kind == ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH)
+            {
+                auto* record =
+                    static_cast<rocprofiler_buffer_tracing_kernel_dispatch_record_t*>(header->payload);
+
+                // ... etc. ...
+            }
+            else
+            {
+                throw std::runtime_error{"unhandled record header category + kind"};
+            }
+        }
+    }
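
The dispatch pattern above can also be exercised without the SDK. The following is a minimal sketch under stated assumptions: every ``mock_`` type and enumerator, ``callback_stats``, and ``mock_buffer_callback`` are hypothetical stand-ins rather than SDK definitions, and ``drop_count`` is assumed to report how many records were dropped. It passes a statistics object through the ``void*`` user data argument and counts unrecognized records instead of throwing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in enumerations (hypothetical; the real ones come from the SDK headers).
enum mock_category : std::uint32_t
{
    MOCK_CATEGORY_TRACING = 1
};

enum mock_kind : std::uint32_t
{
    MOCK_KIND_HIP_RUNTIME_API = 1,
    MOCK_KIND_KERNEL_DISPATCH = 2
};

// Stand-in for a record header: category, kind, and an untyped payload.
struct mock_record_header
{
    std::uint32_t category;
    std::uint32_t kind;
    void*         payload;
};

// Stand-in record types (hypothetical fields).
struct mock_hip_api_record
{
    std::uint64_t start_timestamp;
    std::uint64_t end_timestamp;
};

struct mock_kernel_dispatch_record
{
    std::uint64_t start_timestamp;
    std::uint64_t end_timestamp;
};

// Tool-side state handed to the callback through the void* argument.
struct callback_stats
{
    std::size_t   hip_api_records = 0;
    std::size_t   kernel_records  = 0;
    std::size_t   unhandled       = 0;
    std::uint64_t dropped         = 0;
    std::uint64_t hip_api_ns      = 0;
};

void
mock_buffer_callback(mock_record_header** headers,
                     std::size_t          num_headers,
                     void*                user_data,
                     std::uint64_t        drop_count)
{
    assert(headers != nullptr || num_headers == 0);

    auto* stats = static_cast<callback_stats*>(user_data);
    if(stats == nullptr) return;

    stats->dropped += drop_count;

    for(std::size_t i = 0; i < num_headers; ++i)
    {
        auto* header = headers[i];
        if(header == nullptr) continue;

        // Decide the payload type from category + kind before casting.
        if(header->category == MOCK_CATEGORY_TRACING &&
           header->kind == MOCK_KIND_HIP_RUNTIME_API)
        {
            auto* record = static_cast<mock_hip_api_record*>(header->payload);
            ++stats->hip_api_records;
            stats->hip_api_ns += record->end_timestamp - record->start_timestamp;
        }
        else if(header->category == MOCK_CATEGORY_TRACING &&
                header->kind == MOCK_KIND_KERNEL_DISPATCH)
        {
            ++stats->kernel_records;
        }
        else
        {
            ++stats->unhandled;  // count instead of throwing
        }
    }
}
```

Counting unhandled records is a design choice for this sketch: the callback is delivered on a background thread, where an escaping exception may be difficult for the tool to handle.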
+
+Buffer tracing record
+---------------------
+
+Unlike callback tracing records, there is no common set of data for each buffer tracing record. However,
+many buffer tracing records contain a ``kind`` and an ``operation`` field.
+You can obtain the name of the tracing ``kind`` using the ``rocprofiler_query_buffer_tracing_kind_name`` function and the name of an ``operation`` specific to a tracing kind using the ``rocprofiler_query_buffer_tracing_kind_operation_name``
+function. You can also iterate over all the buffer tracing kinds and the operations for each tracing kind using the
+``rocprofiler_iterate_buffer_tracing_kinds`` and ``rocprofiler_iterate_buffer_tracing_kind_operations`` functions.
+
+The buffer tracing record data types are available in the ``rocprofiler-sdk/buffer_tracing.h`` header.
@@ -1,339 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK callback services, Callback services API"
---
-
-# ROCprofiler-SDK callback tracing services
-
-Callback tracing services provide immediate callbacks to a tool on the current CPU thread on the occurrence of an event.
-For example, when tracing an API function such as `hipSetDevice`, callback tracing invokes a user-specified callback
-before and after the traced function executes on the thread invoking the API function.
-
-## Subscribing to callback tracing services
-
-During tool initialization, tools configure callback tracing using:
-
-```cpp
-rocprofiler_status_t
-rocprofiler_configure_callback_tracing_service(rocprofiler_context_id_t            context_id,
-                                               rocprofiler_callback_tracing_kind_t kind,
-                                               rocprofiler_tracing_operation_t*    operations,
-                                               size_t                              operations_count,
-                                               rocprofiler_callback_tracing_cb_t   callback,
-                                               void*                               callback_args);
-```
-
-Here are the parameters required to configure callback tracing services:
-
- `kind`: A high-level specification of the services to be traced. This parameter is also known as "domain".
-Domain examples include, but not limited to, the HIP API, HSA API, and kernel dispatches.
-
- `operations`: For each domain, there are often various `operations` that can be used to restrict the callbacks to a subset within the domain. For domains corresponding to APIs, the `operations` are the functions
-composing the API. To trace all operations in a domain, set the `operations` and `operations_count`
-parameters to `nullptr` and `0` respectively. To restrict the tracing domain to a subset
-of operations, the tool library must specify a C-array of type `rocprofiler_tracing_operation_t` for `operations` and size of the array for the `operations_count` parameter.
-
-`rocprofiler_configure_callback_tracing_service` returns an error if a callback service for the specified context and domain is configured more than once.
-
-**Example:** To trace only two functions within
-the HIP runtime API, `hipGetDevice` and `hipSetDevice`:
-
-```cpp
-{
-    auto ctx = rocprofiler_context_id_t{};
-    // ... creation of context, etc. ...
-
-    // array of operations (i.e. API functions)
-    auto operations = std::array<rocprofiler_tracing_operation_t, 2>{
-        ROCPROFILER_HIP_RUNTIME_API_ID_hipSetDevice,
-        ROCPROFILER_HIP_RUNTIME_API_ID_hipGetDevice
-    };
-
-    rocprofiler_configure_callback_tracing_service(ctx,
-                                                   ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API,
-                                                   operations.data(),
-                                                   operations.size(),
-                                                   callback_func,
-                                                   nullptr);
-    // ... etc. ...
-}
-```
-
-The following code returns error `ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED` as the callback service is already configured:
-
-```cpp
-{
-    auto ctx = rocprofiler_context_id_t{};
-    // ... creation of context, etc. ...
-
-    // array of operations (i.e. API functions)
-    auto operations = std::array<rocprofiler_tracing_operation_t, 2>{
-        ROCPROFILER_HIP_RUNTIME_API_ID_hipSetDevice,
-        ROCPROFILER_HIP_RUNTIME_API_ID_hipGetDevice
-    };
-
-    for(auto op : operations)
-    {
-        // after the first iteration, returns ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED
-        rocprofiler_configure_callback_tracing_service(ctx,
-                                                       ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API,
-                                                       &op,
-                                                       1,
-                                                       callback_func,
-                                                       nullptr);
-    }
-
-    // ... etc. ...
-}
-```
-
-## Callback tracing callback function
-
-Here is the callback tracing callback function:
-
-```cpp
-typedef void (*rocprofiler_callback_tracing_cb_t)(rocprofiler_callback_tracing_record_t record,
-                                                  rocprofiler_user_data_t*              user_data,
-                                                  void* callback_data)
-```
-
-The parameters `record` and `user_data` are discussed here:
-
- `record`: Contains the information to uniquely identify a tracing record type. Here is the definition:
-
-  ```cpp
-  typedef struct rocprofiler_callback_tracing_record_t
-  {
-    rocprofiler_context_id_t            context_id;
-    rocprofiler_thread_id_t             thread_id;
-    rocprofiler_correlation_id_t        correlation_id;
-    rocprofiler_callback_tracing_kind_t kind;
-    uint32_t                            operation;
-    rocprofiler_callback_phase_t        phase;
-    void*                               payload;
-  } rocprofiler_callback_tracing_record_t;
-  ```
-  The underlying type of `payload` field is typically unique to a domain and, less frequently, an operation.
-  For example, for the `ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API` and `ROCPROFILER_CALLBACK_TRACING_HIP_COMPILER_API`,
-  the payload must be casted to `rocprofiler_callback_tracing_hip_api_data_t*`, which contains the arguments
-  to the function and the return value when exiting the function. The payload field is a valid
-  pointer only during the invocation of the callback function(s).
-
- `user_data`: Stores data in between callback phases. This value is unique for every
-instance of an operation. For example, for a tool library to store the timestamp of the
-`ROCPROFILER_CALLBACK_PHASE_ENTER` phase for the ensuing `ROCPROFILER_CALLBACK_PHASE_EXIT` callback,
-the data can be stored using:
-
-  ```cpp
-  void
-  callback_func(rocprofiler_callback_tracing_record_t record,
-              rocprofiler_user_data_t*              user_data,
-              void*                                 cb_data)
-  {
-    auto ts = rocprofiler_timestamp_t{};
-    rocprofiler_get_timestamp(&ts);
-
-    if(record.phase == ROCPROFILER_CALLBACK_PHASE_ENTER)
-    {
-        user_data->value = ts;
-    }
-    else if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT)
-    {
-        auto delta_ts = (ts - user_data->value);
-        // ... etc. ...
-    }
-    else
-    {
-        // ... etc. ...
-    }
-  }
-  ```
-
-  The `callback_data` is passed to `rocprofiler_configure_callback_tracing_service` as the value of `callback_args` to [subscribe to callback tracing services](#subscribing-to-callback-tracing-services).
-
-## Callback tracing record
-
-To obtain the name of the `kind` of tracing, you can use `rocprofiler_query_callback_tracing_kind_name` function and to obtain the name of an `operation` specific to a tracing kind, use `rocprofiler_query_callback_tracing_kind_operation_name`
-function. To iterate over all the callback tracing kinds and operations for each tracing kind, use `rocprofiler_iterate_callback_tracing_kinds` and `rocprofiler_iterate_callback_tracing_kind_operations` functions.
-
-Lastly, for a specified `rocprofiler_callback_tracing_record_t` object, ROCprofiler-SDK supports generically iterating over the arguments of the payload field for many domains. Within the `rocprofiler_callback_tracing_record_t` object, the domain-specific information is available in
-an opaque `void* payload`.
-The data types generally follow the naming convention of `rocprofiler_callback_tracing_<DOMAIN>_data_t`. For example, for the tracing kinds `ROCPROFILER_BUFFER_TRACING_HSA_{CORE,AMD_EXT,IMAGE_EXT,FINALIZE_EXT}_API`,
-cast the payload to `rocprofiler_callback_tracing_hsa_api_data_t*`:
-
-```cpp
-void
-callback_func(rocprofiler_callback_tracing_record_t record,
-              rocprofiler_user_data_t*              user_data,
-              void*                                 cb_data)
-{
-    static auto hsa_domains = std::unordered_set<rocprofiler_buffer_tracing_kind_t>{
-        ROCPROFILER_BUFFER_TRACING_HSA_CORE_API,
-        ROCPROFILER_BUFFER_TRACING_HSA_AMD_EXT_API,
-        ROCPROFILER_BUFFER_TRACING_HSA_IMAGE_EXT_API,
-        ROCPROFILER_BUFFER_TRACING_HSA_FINALIZER_API};
-
-    if(hsa_domains.count(record.kind) > 0)
-    {
-        auto* payload = static_cast<rocprofiler_callback_tracing_hsa_api_data_t*>(record.payload);
-
-        hsa_status_t status = payload->retval.hsa_status_t_retval;
-        if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT && status != HSA_STATUS_SUCCESS)
-        {
-            const char* _kind = nullptr;
-            const char* _operation = nullptr;
-
-            rocprofiler_query_callback_tracing_kind_name(record.kind, &_kind, nullptr);
-            rocprofiler_query_callback_tracing_kind_operation_name(
-                record.kind, record.operation, &_operation, nullptr);
-
-            // message that
-            fprintf(stderr, "[domain=%s] %s returned a non-zero exit code: %i\n", _kind, _operation, status);
-        }
-    }
-    else if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT)
-    {
-        auto delta_ts = (ts - user_data->value);
-        // ... etc. ...
-    }
-    else
-    {
-        // ... etc. ...
-    }
-}
-```
-
-**Example:** Iterating over all the callback tracing kinds and operations for each tracing kind using `rocprofiler_iterate_callback_tracing_kind_operation_args`:
-
-```cpp
-int
-print_args(rocprofiler_callback_tracing_kind_t domain_idx,
-           uint32_t                            op_idx,
-           uint32_t                            arg_num,
-           const void* const                   arg_value_addr,
-           int32_t                             arg_indirection_count,
-           const char*                         arg_type,
-           const char*                         arg_name,
-           const char*                         arg_value_str,
-           int32_t                             arg_dereference_count,
-           void*                               data)
-{
-    if(arg_num == 0)
-    {
-        const char* _kind      = nullptr;
-        const char* _operation = nullptr;
-
-        rocprofiler_query_callback_tracing_kind_name(domain_idx, &_kind, nullptr);
-        rocprofiler_query_callback_tracing_kind_operation_name(
-            domain_idx, op_idx, &_operation, nullptr);
-
-        fprintf(stderr, "\n[%s] %s\n", _kind, _operation);
-    }
-
-    char* _arg_type = abi::__cxa_demangle(arg_type, nullptr, nullptr, nullptr);
-
-    fprintf(stderr, "    %u: %-18s %-16s = %s\n", arg_num, _arg_type, arg_name, arg_value_str);
-
-    free(_arg_type);
-
-    // unused in example
-    (void) arg_value_addr;
-    (void) arg_indirection_count;
-    (void) arg_dereference_count;
-    (void) data;
-
-    return 0;
-}
-
-void
-callback_func(rocprofiler_callback_tracing_record_t record,
-              rocprofiler_user_data_t*              user_data,
-              void*                                 cb_data)
-{
-    if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT &&
-       record.kind == ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API &&
-       (record.operation == ROCPROFILER_HIP_RUNTIME_API_ID_hipLaunchKernel ||
-        record.operation == ROCPROFILER_HIP_RUNTIME_API_ID_hipMemcpyAsync))
-    {
-        rocprofiler_iterate_callback_tracing_kind_operation_args(
-                             record, print_args, record.phase, nullptr));
-    }
-}
-```
-
-**Sample output:**
-
-```console
-
-[HIP_RUNTIME_API] hipLaunchKernel
-    0: void const*        function_address = 0x219308
-    1: rocprofiler_dim3_t numBlocks        = {z=1, y=310, x=310}
-    2: rocprofiler_dim3_t dimBlocks        = {z=1, y=32, x=32}
-    3: void**             args             = 0x7ffe6d8dd3c0
-    4: unsigned long      sharedMemBytes   = 0
-    5: hipStream_t*      stream           = 0x17b40c0
-
-[HIP_RUNTIME_API] hipMemcpyAsync
-    0: void*              dst              = 0x7f06c7bbb010
-    1: void const*        src              = 0x7f0698800000
-    2: unsigned long      sizeBytes        = 393625600
-    3: hipMemcpyKind      kind             = DeviceToHost
-    4: hipStream_t*      stream           = 0x25dfcf0
-```
-
-## Code object tracing
-
-The code object tracing service is a critical component for obtaining information regarding
-asynchronous activity on the GPU. The `rocprofiler_callback_tracing_code_object_load_data_t`
-payload (kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`, operation=`ROCPROFILER_CODE_OBJECT_LOAD`)
-provides a unique identifier for a bundle of one or more GPU kernel symbols that are loaded
-for a specific GPU agent. For example, if your application leverages a multi-GPU system
-consisting of four Vega20 GPUs and four MI100 GPUs, at least eight code objects will be loaded: one code
-object for each GPU. Each code object will be associated with a set of kernel symbols.
-The `rocprofiler_callback_tracing_code_object_kernel_symbol_register_data_t` payload
-(kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`, operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`)
-provides a globally unique identifier for the specific kernel symbol along with the kernel name and
-several other static properties of the kernel such as scratch size, scalar general purpose register count, and so on.
-
-:::{note}
-The kernel identifiers for two identical kernel symbols with the same properties (kernel name, scratch size, and so on) that are part of similar code objects loaded for different GPU agents will still be unique. Furthermore, the identifier for a code object and its kernel symbols after being unloaded and then
-reloaded, will also be unique.
-:::
-
-Here is the general sequence of events when a code object is loaded and unloaded:
-
-1. Callback: load code object
-    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
-    - operation=`ROCPROFILER_CODE_OBJECT_LOAD`
-    - phase=`ROCPROFILER_CALLBACK_PHASE_LOAD`
-2. Callback: load kernel symbol
-    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
-    - operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`
-    - phase=`ROCPROFILER_CALLBACK_PHASE_LOAD`
-    - Repeats for each kernel symbol in code object
-3. Execute application
-4. Callback: unload kernel symbol
-    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
-    - operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`
-    - phase=`ROCPROFILER_CALLBACK_PHASE_UNLOAD`
-    - Repeats for each kernel symbol in code object
-5. Callback: unload code object
-    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
-    - operation=`ROCPROFILER_CODE_OBJECT_LOAD`
-    - phase=`ROCPROFILER_CALLBACK_PHASE_UNLOAD`
-
-:::{note}
-ROCprofiler-SDK doesn't provide an interface to query information outside of the
-code object tracing service. If you wish to associate kernel names with kernel tracing records,
-the tool must be configured to create a copy of the relevant information when the code objects and
-kernel symbol are loaded. However, any constant string fields like `const char* kernel_name`
-don't need to be copied as these are guaranteed to be valid pointers until after ROCprofiler-SDK finalization.
-If a tool decides to delete its copy of the data associated with a code object or kernel symbol
-identifier when the code object and kernel symbols are unloaded, it is highly recommended to flush
-all buffers that might contain references to that code object or kernel symbol identifier before
-deleting the associated data.
-:::
-
-For a sample of code object tracing, see [samples/code_object_tracing](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples/code_object_tracing).
@@ -0,0 +1,344 @@
+.. meta::
+    :description: ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software.
+    :keywords: ROCprofiler-SDK API reference, ROCprofiler-SDK callback services, Callback services API
+
+.. _rocprofiler_sdk_callback_tracing_services:
+
+ROCprofiler-SDK callback tracing services
+=========================================
+
+Callback tracing services provide immediate callbacks to a tool on the current CPU thread on the occurrence of an event.
+For example, when tracing an API function such as ``hipSetDevice``, callback tracing invokes a user-specified callback
+before and after the traced function executes on the thread invoking the API function.
+
+Subscribing to callback tracing services
+----------------------------------------
+
+During tool initialization, tools configure callback tracing using:
+
+.. code-block:: cpp
+
+    rocprofiler_status_t
+    rocprofiler_configure_callback_tracing_service(rocprofiler_context_id_t            context_id,
+                                                   rocprofiler_callback_tracing_kind_t kind,
+                                                   rocprofiler_tracing_operation_t*    operations,
+                                                   size_t                              operations_count,
+                                                   rocprofiler_callback_tracing_cb_t   callback,
+                                                   void*                               callback_args);
+
+Here are the parameters required to configure callback tracing services:
+
+- ``kind``: A high-level specification of the services to be traced. This parameter is also known as "domain".
+  Examples of domains include, but are not limited to, the HIP API, HSA API, and kernel dispatches.
+
+- ``operations``: For each domain, there are often various ``operations`` that can be used to restrict the callbacks to a subset within the domain. For domains corresponding to APIs, the ``operations`` are the functions
+  composing the API. To trace all operations in a domain, set the ``operations`` and ``operations_count``
+  parameters to ``nullptr`` and ``0`` respectively. To restrict the tracing domain to a subset
+  of operations, the tool library must specify a C-array of type ``rocprofiler_tracing_operation_t`` for ``operations`` and the size of that array for ``operations_count``.
+
+``rocprofiler_configure_callback_tracing_service`` returns an error if a callback service for the specified context and domain is configured more than once.
+
+**Example:** To trace only two functions within
+the HIP runtime API, ``hipGetDevice`` and ``hipSetDevice``:
+
+.. code-block:: cpp
+
+    {
+        auto ctx = rocprofiler_context_id_t{};
+        // ... creation of context, etc. ...
+
+        // array of operations (i.e. API functions)
+        auto operations = std::array<rocprofiler_tracing_operation_t, 2>{
+            ROCPROFILER_HIP_RUNTIME_API_ID_hipSetDevice,
+            ROCPROFILER_HIP_RUNTIME_API_ID_hipGetDevice
+        };
+
+        rocprofiler_configure_callback_tracing_service(ctx,
+                                                       ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API,
+                                                       operations.data(),
+                                                       operations.size(),
+                                                       callback_func,
+                                                       nullptr);
+        // ... etc. ...
+    }
+
+The following code returns error ``ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED`` as the callback service is already configured:
+
+.. code-block:: cpp
+
+    {
+        auto ctx = rocprofiler_context_id_t{};
+        // ... creation of context, etc. ...
+
+        // array of operations (i.e. API functions)
+        auto operations = std::array<rocprofiler_tracing_operation_t, 2>{
+            ROCPROFILER_HIP_RUNTIME_API_ID_hipSetDevice,
+            ROCPROFILER_HIP_RUNTIME_API_ID_hipGetDevice
+        };
+
+        for(auto op : operations)
+        {
+            // after the first iteration, returns ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED
+            rocprofiler_configure_callback_tracing_service(ctx,
+                                                           ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API,
+                                                           &op,
+                                                           1,
+                                                           callback_func,
+                                                           nullptr);
+        }
+
+        // ... etc. ...
+    }
+
+Callback tracing callback function
+----------------------------------
+
+Here is the callback tracing callback function:
+
+.. code-block:: cpp
+
+    typedef void (*rocprofiler_callback_tracing_cb_t)(rocprofiler_callback_tracing_record_t record,
+                                                      rocprofiler_user_data_t*              user_data,
+                                                      void* callback_data)
+
+The parameters ``record`` and ``user_data`` are discussed here:
+
+- ``record``: Contains the information to uniquely identify a tracing record type. Here is the definition:
+
+  .. code-block:: cpp
+
+      typedef struct rocprofiler_callback_tracing_record_t
+      {
+        rocprofiler_context_id_t            context_id;
+        rocprofiler_thread_id_t             thread_id;
+        rocprofiler_correlation_id_t        correlation_id;
+        rocprofiler_callback_tracing_kind_t kind;
+        uint32_t                            operation;
+        rocprofiler_callback_phase_t        phase;
+        void*                               payload;
+      } rocprofiler_callback_tracing_record_t;
+
+  The underlying type of the ``payload`` field is typically unique to a domain and, less frequently, to an operation.
+  For example, for the ``ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API`` and ``ROCPROFILER_CALLBACK_TRACING_HIP_COMPILER_API`` kinds,
+  the payload must be cast to ``rocprofiler_callback_tracing_hip_api_data_t*``, which contains the arguments
+  to the function and, when exiting the function, the return value. The payload field is a valid
+  pointer only during the invocation of the callback function(s).
+
+- ``user_data``: Stores data in between callback phases. This value is unique for every
+  instance of an operation. For example, for a tool library to store the timestamp of the
+  ``ROCPROFILER_CALLBACK_PHASE_ENTER`` phase for the ensuing ``ROCPROFILER_CALLBACK_PHASE_EXIT`` callback,
+  the data can be stored using:
+
+  .. code-block:: cpp
+
+      void
+      callback_func(rocprofiler_callback_tracing_record_t record,
+                    rocprofiler_user_data_t*              user_data,
+                    void*                                 cb_data)
+      {
+          auto ts = rocprofiler_timestamp_t{};
+          rocprofiler_get_timestamp(&ts);
+
+          if(record.phase == ROCPROFILER_CALLBACK_PHASE_ENTER)
+          {
+              user_data->value = ts;
+          }
+          else if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT)
+          {
+              auto delta_ts = (ts - user_data->value);
+              // ... etc. ...
+          }
+          else
+          {
+              // ... etc. ...
+          }
+      }
+
+  The ``callback_data`` parameter is the value passed as ``callback_args`` to ``rocprofiler_configure_callback_tracing_service`` when :ref:`subscribing to callback tracing services <rocprofiler_sdk_callback_tracing_services>`.
+
+Callback tracing record
+-----------------------
+
+To obtain the name of a tracing ``kind``, use the ``rocprofiler_query_callback_tracing_kind_name`` function. To obtain the name of an ``operation`` specific to a tracing kind, use the ``rocprofiler_query_callback_tracing_kind_operation_name``
+function. To iterate over all the callback tracing kinds and the operations for each tracing kind, use the ``rocprofiler_iterate_callback_tracing_kinds`` and ``rocprofiler_iterate_callback_tracing_kind_operations`` functions.
+
+Lastly, for a specified ``rocprofiler_callback_tracing_record_t`` object, ROCprofiler-SDK supports generically iterating over the arguments of the payload field for many domains. Within the ``rocprofiler_callback_tracing_record_t`` object, the domain-specific information is available in
+an opaque ``void* payload``.
+The data types generally follow the naming convention of ``rocprofiler_callback_tracing_<DOMAIN>_data_t``. For example, for the tracing kinds ``ROCPROFILER_CALLBACK_TRACING_HSA_{CORE,AMD_EXT,IMAGE_EXT,FINALIZE_EXT}_API``,
+cast the payload to ``rocprofiler_callback_tracing_hsa_api_data_t*``:
+
+.. code-block:: cpp
+
+    void
+    callback_func(rocprofiler_callback_tracing_record_t record,
+                  rocprofiler_user_data_t*              user_data,
+                  void*                                 cb_data)
+    {
+        static auto hsa_domains = std::unordered_set<rocprofiler_callback_tracing_kind_t>{
+            ROCPROFILER_CALLBACK_TRACING_HSA_CORE_API,
+            ROCPROFILER_CALLBACK_TRACING_HSA_AMD_EXT_API,
+            ROCPROFILER_CALLBACK_TRACING_HSA_IMAGE_EXT_API,
+            ROCPROFILER_CALLBACK_TRACING_HSA_FINALIZE_EXT_API};
+
+        if(hsa_domains.count(record.kind) > 0)
+        {
+            auto* payload = static_cast<rocprofiler_callback_tracing_hsa_api_data_t*>(record.payload);
+
+            hsa_status_t status = payload->retval.hsa_status_t_retval;
+            if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT && status != HSA_STATUS_SUCCESS)
+            {
+                const char* _kind = nullptr;
+                const char* _operation = nullptr;
+
+                rocprofiler_query_callback_tracing_kind_name(record.kind, &_kind, nullptr);
+                rocprofiler_query_callback_tracing_kind_operation_name(
+                    record.kind, record.operation, &_operation, nullptr);
+
+                // report the HSA function that returned a failure status
+                fprintf(stderr, "[domain=%s] %s returned a non-zero exit code: %i\n", _kind, _operation, static_cast<int>(status));
+            }
+        }
+        else if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT)
+        {
+            // exit phase of a non-HSA domain
+            // ... etc. ...
+        }
+        else
+        {
+            // ... etc. ...
+        }
+    }
+
+**Example:** Iterating over the arguments of a callback tracing record using ``rocprofiler_iterate_callback_tracing_kind_operation_args``:
+
+.. code-block:: cpp
+
+    int
+    print_args(rocprofiler_callback_tracing_kind_t domain_idx,
+               uint32_t                            op_idx,
+               uint32_t                            arg_num,
+               const void* const                   arg_value_addr,
+               int32_t                             arg_indirection_count,
+               const char*                         arg_type,
+               const char*                         arg_name,
+               const char*                         arg_value_str,
+               int32_t                             arg_dereference_count,
+               void*                               data)
+    {
+        if(arg_num == 0)
+        {
+            const char* _kind      = nullptr;
+            const char* _operation = nullptr;
+
+            rocprofiler_query_callback_tracing_kind_name(domain_idx, &_kind, nullptr);
+            rocprofiler_query_callback_tracing_kind_operation_name(
+                domain_idx, op_idx, &_operation, nullptr);
+
+            fprintf(stderr, "\n[%s] %s\n", _kind, _operation);
+        }
+
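+        // demangle the type name (requires <cxxabi.h>); fall back to the raw name on failure
+        char* _arg_type = abi::__cxa_demangle(arg_type, nullptr, nullptr, nullptr);
+
+        fprintf(stderr, "    %u: %-18s %-16s = %s\n", arg_num, _arg_type ? _arg_type : arg_type, arg_name, arg_value_str);
+
+        free(_arg_type);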
+
+        // unused in example
+        (void) arg_value_addr;
+        (void) arg_indirection_count;
+        (void) arg_dereference_count;
+        (void) data;
+
+        return 0;
+    }
+
+    void
+    callback_func(rocprofiler_callback_tracing_record_t record,
+                  rocprofiler_user_data_t*              user_data,
+                  void*                                 cb_data)
+    {
+        if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT &&
+           record.kind == ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API &&
+           (record.operation == ROCPROFILER_HIP_RUNTIME_API_ID_hipLaunchKernel ||
+            record.operation == ROCPROFILER_HIP_RUNTIME_API_ID_hipMemcpyAsync))
+        {
+            rocprofiler_iterate_callback_tracing_kind_operation_args(
+                record, print_args, record.phase, nullptr);
+        }
+    }
+
+**Sample output:**
+
+.. code-block:: console
+
+    [HIP_RUNTIME_API] hipLaunchKernel
+        0: void const*        function_address = 0x219308
+        1: rocprofiler_dim3_t numBlocks        = {z=1, y=310, x=310}
+        2: rocprofiler_dim3_t dimBlocks        = {z=1, y=32, x=32}
+        3: void**             args             = 0x7ffe6d8dd3c0
+        4: unsigned long      sharedMemBytes   = 0
+        5: hipStream_t*      stream           = 0x17b40c0
+
+    [HIP_RUNTIME_API] hipMemcpyAsync
+        0: void*              dst              = 0x7f06c7bbb010
+        1: void const*        src              = 0x7f0698800000
+        2: unsigned long      sizeBytes        = 393625600
+        3: hipMemcpyKind      kind             = DeviceToHost
+        4: hipStream_t*      stream           = 0x25dfcf0
+
+Code object tracing
+-------------------
+
+The code object tracing service is a critical component for obtaining information regarding
+asynchronous activity on the GPU. The ``rocprofiler_callback_tracing_code_object_load_data_t``
+payload (kind=``ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT``, operation=``ROCPROFILER_CODE_OBJECT_LOAD``)
+provides a unique identifier for a bundle of one or more GPU kernel symbols that are loaded
+for a specific GPU agent. For example, if your application leverages a multi-GPU system
+consisting of four Vega20 GPUs and four MI100 GPUs, at least eight code objects will be loaded: one code
+object for each GPU. Each code object will be associated with a set of kernel symbols.
+The ``rocprofiler_callback_tracing_code_object_kernel_symbol_register_data_t`` payload
+(kind=``ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT``, operation=``ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER``)
+provides a globally unique identifier for the specific kernel symbol along with the kernel name and
+several other static properties of the kernel such as scratch size, scalar general purpose register count, and so on.
+
+.. note::
+    The kernel identifiers for two identical kernel symbols with the same properties (kernel name, scratch size, and so on) that are part of similar code objects loaded for different GPU agents will still be unique.
+    Furthermore, when a code object and its kernel symbols are unloaded and then reloaded, the identifiers assigned after the reload will also be unique.
+
+Here is the general sequence of events when a code object is loaded and unloaded:
+
+1. Callback: load code object
+
+   - kind= ``ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT``
+   - operation= ``ROCPROFILER_CODE_OBJECT_LOAD``
+   - phase= ``ROCPROFILER_CALLBACK_PHASE_LOAD``
+
+2. Callback: load kernel symbol
+
+   - kind= ``ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT``
+   - operation= ``ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER``
+   - phase= ``ROCPROFILER_CALLBACK_PHASE_LOAD``
+   - Repeats for each kernel symbol in code object
+
+3. Execute application
+
+4. Callback: unload kernel symbol
+
+   - kind= ``ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT``
+   - operation= ``ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER``
+   - phase= ``ROCPROFILER_CALLBACK_PHASE_UNLOAD``
+   - Repeats for each kernel symbol in code object
+
+5. Callback: unload code object
+
+   - kind= ``ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT``
+   - operation= ``ROCPROFILER_CODE_OBJECT_LOAD``
+   - phase= ``ROCPROFILER_CALLBACK_PHASE_UNLOAD``
+
+.. note::
+    ROCprofiler-SDK doesn't provide an interface to query information outside of the
+    code object tracing service. If you wish to associate kernel names with kernel tracing records,
+    the tool must be configured to create a copy of the relevant information when the code objects and
+    kernel symbols are loaded. However, any constant string fields like ``const char* kernel_name``
+    don't need to be copied as these are guaranteed to be valid pointers until after ROCprofiler-SDK finalization.
+    If a tool decides to delete its copy of the data associated with a code object or kernel symbol
+    identifier when the code object and kernel symbols are unloaded, it is highly recommended to flush
+    all buffers that might contain references to that code object or kernel symbol identifier before
+    deleting the associated data.
+
+For a sample of code object tracing, see `samples/code_object_tracing <https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples/code_object_tracing>`_.
@@ -1,415 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK API reference, Counter collection services API"
---
-
-# ROCprofiler-SDK counter collection services
-
-There are two modes of counter collection service:
-
- Dispatch counting: In this mode, counters are collected on a per-kernel launch basis. This mode is useful for collecting highly detailed counters for a specific kernel execution in isolation. Note that dispatch counting allows only a single kernel to execute in hardware at a time.
-
- Device counting: In this mode, counters are collected on a device level. This mode is useful for collecting device level counters not tied to a specific kernel execution, which encompasses collecting counter values for a specific time range.
-
-This topic explains how to setup dispatch and device counting and use common counter collection APIs. For details on the APIs including the less commonly used counter collection APIs, see the API library. For fully functional examples of both dispatch and device counting, see [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples).
-
-## Definitions
-
-Profile Config: A configuration to specify the counters to be collected on an agent. This must be supplied to various counter collection APIs to initiate collection of counter data. Profiles are agent-specific and can't be used on different agents.
-
-Counter ID: Unique Id (per-architecture) that specifies the counter. The counter Id can be used to fetch counter information such as its name or expression.
-
-Instance ID: Unique record Id that encodes the counter Id and dimension for a collected value.
-
-Dimension: Dimensions help to provide context to the raw counter values by specifying the hardware register that is the source of counter collection such as a shader engine. All counter values have dimension data encoded in their instance Id, which allows you to extract the values for individual dimensions using functions in the counter interface. The following dimensions are supported:
-
-```c
-    ROCPROFILER_DIMENSION_XCC,            ///< XCC dimension of result
-    ROCPROFILER_DIMENSION_AID,            ///< AID dimension of result
-    ROCPROFILER_DIMENSION_SHADER_ENGINE,  ///< SE dimension of result
-    ROCPROFILER_DIMENSION_AGENT,          ///< Agent dimension
-    ROCPROFILER_DIMENSION_SHADER_ARRAY,   ///< Number of shader arrays
-    ROCPROFILER_DIMENSION_WGP,            ///< Number of workgroup processors
-    ROCPROFILER_DIMENSION_INSTANCE,       ///< From unspecified hardware register
-```
-
-
-## Using the counter collection service
-
-The setup for dispatch and device counting is similar with only minor changes needed to adapt code from one to another.
-Here are the steps required to configure the counter collection services:
-
-### tool_init() setup
-
-Similar to tracing services, you must create a context and a buffer to collect the output when initializing the tool.
-:::{note}
-`Buffered_callback` in `rocprofiler_create_buffer` is invoked with a vector of collected counter samples, when the buffer is full. For details, see the [Buffered callback](#buffered-callback) section.
-:::
-
-```CPP
-rocprofiler_context_id_t ctx{0};
-rocprofiler_buffer_id_t buff;
-ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
-ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
-                                            4096,
-                                            2048,
-                                            ROCPROFILER_BUFFER_POLICY_LOSSLESS,
-                                            buffered_callback, // Callback to process data
-                                            user_data,
-                                            &buff),
-                    "buffer creation failed");
-```
-
-
-After creating a context and buffer to store results in `tool_init`, it is highly recommended but not mandatory for you to construct the profiles for each agent, containing the counters for collection. Profile creation should be avoided in the time critical dispatch counting callback as it involves validating if the counters can be collected on the agent. After profile setup, you can set up the collection service for dispatch or device counting. To set up either dispatch or device counting (only one can be used at a time), use:
-
-```CPP
-    /* For Dispatch Counting */
-    // Setup the dispatch profile counting service. This service will trigger the dispatch_callback
-    // when a kernel dispatch is enqueued into the HSA queue. The callback will specify what
-    // counters to collect by returning a profile config id.
-    ROCPROFILER_CALL(rocprofiler_configure_buffered_dispatch_counting_service(
-                         ctx, buff, dispatch_callback, nullptr),
-                     "Could not setup buffered service");
-
-    /* For Agent Counting */
-    // set_profile is a callback that is use to select the profile to use when
-    // the context is started. It is called at every rocprofiler_ctx_start() call.
-    ROCPROFILER_CALL(rocprofiler_configure_device_counting_service(
-                         ctx, buff, agent_id, set_profile, nullptr),
-                     "Could not setup buffered service");
-```
-
-#### Profile setup
-
-1. The first step in constructing a counter collection profile is to find the GPU agents on the machine. You must create a profile for each set of counters to be collected on every agent on the machine. You can use `rocprofiler_query_available_agents` to find agents on the system. The following example collects all GPU agents on the device and stores them in the vector agents:
-
-```CPP
-    std::vector<rocprofiler_agent_v0_t> agents;
-
-    // Callback used by rocprofiler_query_available_agents to return
-    // agents on the device. This can include CPU agents as well. We
-    // select GPU agents only (i.e. type == ROCPROFILER_AGENT_TYPE_GPU)
-    rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
-                                                            const void**                agents_arr,
-                                                            size_t                      num_agents,
-                                                            void*                       udata) {
-        if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
-            throw std::runtime_error{"unexpected rocprofiler agent version"};
-        auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
-        for(size_t i = 0; i < num_agents; ++i)
-        {
-            const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
-            if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
-        }
-        return ROCPROFILER_STATUS_SUCCESS;
-    };
-
-    // Query the agents, only a single callback is made that contains a vector
-    // of all agents.
-    ROCPROFILER_CALL(
-        rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
-                                           iterate_cb,
-                                           sizeof(rocprofiler_agent_t),
-                                           const_cast<void*>(static_cast<const void*>(&agents))),
-        "query available agents");
-```
-
-2. To identify the counters supported by an agent, query the available counters with `rocprofiler_iterate_agent_supported_counters`. Here is an example of a single agent returning the available counters in `gpu_counters`:
-
-```CPP
-    std::vector<rocprofiler_counter_id_t> gpu_counters;
-
-    // Iterate all the counters on the agent and store them in gpu_counters.
-    ROCPROFILER_CALL(rocprofiler_iterate_agent_supported_counters(
-                         agent,
-                         [](rocprofiler_agent_id_t,
-                            rocprofiler_counter_id_t* counters,
-                            size_t                    num_counters,
-                            void*                     user_data) {
-                             std::vector<rocprofiler_counter_id_t>* vec =
-                                 static_cast<std::vector<rocprofiler_counter_id_t>*>(user_data);
-                             for(size_t i = 0; i < num_counters; i++)
-                             {
-                                 vec->push_back(counters[i]);
-                             }
-                             return ROCPROFILER_STATUS_SUCCESS;
-                         },
-                         static_cast<void*>(&gpu_counters)),
-                     "Could not fetch supported counters");
-```
-
-3. `rocprofiler_counter_id_t` is a handle to a counter. To fetch information about the counter such as its name, use `rocprofiler_query_counter_info`:
-
-```CPP
-    for(auto& counter : gpu_counters)
-    {
-        // Contains name and other attributes about the counter.
-        // See API documentation for more info on the contents of this struct.
-        rocprofiler_counter_info_v0_t version;
-        ROCPROFILER_CALL(
-            rocprofiler_query_counter_info(
-                counter, ROCPROFILER_COUNTER_INFO_VERSION_0, static_cast<void*>(&version)),
-            "Could not query info for counter");
-    }
-```
-
-4. After identifying the counters to be collected, construct a profile by passing a list of these counters to `rocprofiler_create_profile_config`.
-
-```C++
-    // Create and return the profile
-    rocprofiler_profile_config_id_t profile;
-    ROCPROFILER_CALL(rocprofiler_create_profile_config(
-                         agent, counters_array, counters_array_count, &profile),
-                     "Could not construct profile cfg");
-```
-
-5. You can use the created profile for both dispatch and agent counter collection services.
-
-:::{note}
-
-Points to note on profile behavior:
-
- Profile created is *only valid* for the agent it was created for.
- Profiles are immutable. To collect a new counter set, construct a new profile.
- A single profile can be used multiple times on the same agent.
- Counter Ids supplied to `rocprofiler_create_profile_config` are *agent-specific* and can't be used to construct profiles for other agents.
-:::
-
-### Dispatch counting callback
-
-When a kernel is dispatched, a dispatch callback is issued to the tool to allow selection of counters to be collected for the dispatch by supplying a profile.
-
-```CPP
-void
-dispatch_callback(rocprofiler_dispatch_counting_service_data_t dispatch_data,
-                  rocprofiler_profile_config_id_t*             config,
-                  rocprofiler_user_data_t* user_data,
-                  void* /*callback_data_args*/)
-```
-
-`dispatch_data` contains information about the dispatch being launched such as its name. `config` is used by the tool to specify the profile, which allows counter collection for the dispatch. If no profile is supplied, no counters are collected for this dispatch. `user_data` contains user data supplied to `rocprofiler_configure_buffered_dispatch_profile_counting_service`.
-
-### Agent set profile callback
-
-This callback is invoked after the context starts and allows the tool to specify the profile to be used.
-
-```CPP
-void
-set_profile(rocprofiler_context_id_t                 context_id,
-            rocprofiler_agent_id_t                   agent,
-            rocprofiler_agent_set_profile_callback_t set_config,
-            void*)
-```
-
-The profile to be used for this agent is specified by calling `set_config(agent, profile)`.
-
-### Buffered callback
-
-Data from collected counter values is returned through a buffered callback. The buffered callback routines are similar for dispatch and device counting except that some data such as kernel launch Ids is not available in device counting mode. Here is a sample iteration to print out counter collection data:
-
-```CPP
-    for(size_t i = 0; i < num_headers; ++i)
-    {
-        auto* header = headers[i];
-        if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
-           header->kind == ROCPROFILER_COUNTER_RECORD_PROFILE_COUNTING_DISPATCH_HEADER)
-        {
-            // Print the returned counter data.
-            auto* record =
-                static_cast<rocprofiler_dispatch_counting_service_record_t*>(header->payload);
-            ss << "[Dispatch_Id: " << record->dispatch_info.dispatch_id
-               << " Kernel_ID: " << record->dispatch_info.kernel_id
-               << " Corr_Id: " << record->correlation_id.internal << ")]\n";
-        }
-        else if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
-                header->kind == ROCPROFILER_COUNTER_RECORD_VALUE)
-        {
-            // Print the returned counter data.
-            auto* record = static_cast<rocprofiler_record_counter_t*>(header->payload);
-            rocprofiler_counter_id_t counter_id = {.handle = 0};
-
-            rocprofiler_query_record_counter_id(record->id, &counter_id);
-
-            ss << "  (Dispatch_Id: " << record->dispatch_id << " Counter_Id: " << counter_id.handle
-               << " Record_Id: " << record->id << " Dimensions: [";
-
-            for(auto& dim : counter_dimensions(counter_id))
-            {
-                size_t pos = 0;
-                rocprofiler_query_record_dimension_position(record->id, dim.id, &pos);
-                ss << "{" << dim.name << ": " << pos << "},";
-            }
-            ss << "] Value [D]: " << record->counter_value << "),";
-        }
-    }
-```
-
-## Counter definitions
-
-Counters are defined in yaml format in the `counter_defs.yaml` file. The counter definition has the following format:
-
-```yaml
-counter_name:       # Counter name
-  architectures:
-    gfx90a:         # Architecture name
-      block:        # Block information (SQ/etc)
-      event:        # Event ID (used by AQLProfile to identify counter register)
-      expression:   # Formula for the counter (if derived counter)
-      description:  # Per-arch description (optional)
-    gfx1010:
-       ...
-  description:      # Description of the counter
-```
-
-You can separately define the counters for different architectures as shown in the preceding example for gfx90a and gfx1010. If two or more architectures share the same block, event, or expression definition, they can be specified together using "/" delimiter ("gfx90a/gfx1010:").
-Hardware metrics have the elements block, event, and description defined. Derived metrics have the element expression defined and can't have block or event defined.
-
-## Derived metrics
-
-Derived metrics are expressions performing computation on collected hardware metrics. These expressions produce result similar to a real hardware counter.
-
-```yaml
-GPU_UTIL:
-  architectures:
-    gfx942/gfx941/gfx10/gfx1010/gfx1030/gfx1031/gfx11/gfx1032/gfx1102/gfx906/gfx1100/gfx1101/gfx940/gfx908/gfx90a/gfx9:
-      expression: 100*GRBM_GUI_ACTIVE/GRBM_COUNT
-  description: Percentage of the time that GUI is active
-```
-
-In the preceding example, `GPU_UTIL` is a derived metric that uses a mathematic expression to calculate the utilization rate of the GPU using values of two GRBM hardware counters `GRBM_GUI_ACTIVE` and `GRBM_COUNT`. Expressions support the standard set of math operators (/,*,-,+) along with a set of special functions such as reduce and accumulate.
-
-### Reduce function
-
-```yaml
-Expression: 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum))
-```
-
-The reduce function reduces counter values across all dimensions such as shader engine, SIMD, and so on, to produce a single output value. This helps to collect and compare values across the entire device.
-Here are the common reduction operations:
- `sum`: Sums to create a single output. For example, `reduce(GL2C_HIT,sum)` sums all `GL2C_HIT` hardware register values.
- `avr`: Calculates the average across all dimensions.
- `min`: Selects minimum value across all dimensions.
- `max`: Selects the maximum value across all dimensions.
-
-```yaml
-expression: reduce(X,sum,[DIMENSION_XCC])
-```
-Reduce() also supports dimension wise reduction, when provided dimensions in 3rd parameter. In the expression above, if `X` has two dimensions `DIMENSION_XCC`, `DIMENSION_SHADER_ARRAY`, and `DIMENSION_WGP`, the reduce happens across counter values where `DIMENSION_SHADER_ARRAY` and `DIMENSION_WGP` dimensions are same as shown below.
-
-Let's say DIM sizes of XCC, SHADER_ARRAY(SH), WGP be 2, 4, 4 respectively.
-
-Raw Counter Data in 3D space:
-
-#### XCC[0]:
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-| SH[0] |   1  |   2  |   3  |   4  |
-| SH[1] |   5  |   6  |   7  |   8  |
-| SH[2] |   9  |   10 |   11 |   12 |
-| SH[3] |   13 |   14 |   15 |   16 |
-
-#### XCC[1]:
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-| SH[0] |   1  |   2  |   3  |   4  |
-| SH[1] |   5  |   6  |   7  |   8  |
-| SH[2] |   9  |   10 |   11 |   12 |
-| SH[3] |   13 |   14 |   15 |   16 |
-
-Reducing XCC dim with sum, results to 2D space with only WGP and SH.
-
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-| SH[0] |  2   |   4  |   6  |   8  |
-| SH[1] |  10  |   12 |   14 |   16 |
-| SH[2] |  18  |   20 |   22 |   24 |
-| SH[3] |  26  |   28 |   30 |   32 |
-
-similarly, for `reduce(X,sum,[DIMENSION_XCC,DIMENSION_SHADER_ARRAY])` results in only WGP dimension.
-
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-|       |  56  |  64  |  72  |  80  |
-
-### Select Function
-
-```yaml
-expression: select(Y, [DIMENSION_XCC=[0],DIMENSION_SHADER_ENGINE=[2]])
-```
-
-select() only returns counter values which match the dimension indexes provided by the user in expression. This operation is to allow a user to state they only want to select specific dimensions index. Supported dimensions include ```DIMENSION_XCC, DIMENSION_AID, DIMENSION_SHADER_ENGINE, DIMENSION_AGENT, DIMENSION_SHADER_ARRAY, DIMENSION_WGP, DIMENSION_INSTANCE```. For example ``select(Y, [DIMENSION_XCC=[0],DIMENSION_SHADER_ENGINE=[2]])`` gives counter values which are from DIMENSION_XCC= 0 and DIMENSION_SHADER_ENGINE= 2 for Y Metric.
-
-Let's say Y has XCC, SHADER_ENGINE(SE), WGP dimensions with sizes 2, 4, 4 respectively.
-
-Raw Counter Data in 3D space:
-
-#### XCC[0]:
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-| SE[0] |   1  |   2  |   3  |   4  |
-| SE[1] |   5  |   6  |   7  |   8  |
-| SE[2] |   9  |   10 |   11 |   12 |
-| SE[3] |   13 |   14 |   15 |   16 |
-
-#### XCC[1]:
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-| SE[0] |   17 |   18 |   19 |   20 |
-| SE[1] |   21 |   22 |   23 |   24 |
-| SE[2] |   25 |   26 |   27 |   28 |
-| SE[3] |   29 |   30 |   31 |   32 |
-
-Selecting at XCC=0 results to 2D space with WGP and SH dimensions, as shown below.
-
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-| SE[0] |   1  |   2  |   3  |   4  |
-| SE[1] |   5  |   6  |   7  |   8  |
-| SE[2] |   9  |   10 |   11 |   12 |
-| SE[3] |   13 |   14 |   15 |   16 |
-
-similarly, for `select(Y, [DIMENSION_XCC=[0],DIMENSION_SHADER_ENGINE=[2]])` results in only WGP dimension with XCC=0 and SE=2.
-
-|       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
-|-------|------|------|------|------|
-|       |  9   |  10  |  11  |  12  |
-
-### Accumulate function
-
-```yaml
-Expression: accumulate(<basic_level_counter>, <resolution>)
-```
-
- The accumulate function sums the values of a basic level counter over the specified number of cycles. The `resolution` parameter allows you to control the frequency of the following summing operation:
-
-  - `HIGH_RES`: Sums up the basic level counter every clock cycle. Captures the value every cycle for higher accuracy, which helps in fine-grained analysis.
-  - `LOW_RES`: Sums up the basic level counter every four clock cycles. Reduces the data points and provides less detailed summing, which helps in reducing data volume.
-  - `NONE`: Does nothing and is equivalent to collecting basic level counter. Outputs the value of the basic level counter without performing any summing operation.
-
-**Example:**
-
-```yaml
-MeanOccupancyPerCU:
-  architectures:
-    gfx942/gfx941/gfx940:
-      expression: accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM
-  description: Mean occupancy per compute unit.
-```
-
-<metric name="MeanOccupancyPerCU" expr=accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM descr="Mean occupancy per compute unit."></metric>
-
- `MeanOccupancyPerCU`: In the preceding example, the `MeanOccupancyPerCU` metric calculates the mean occupancy per compute unit. It uses the accumulate function with `HIGH_RES` to sum the `SQ_LEVEL_WAVES` counter every clock cycle.
-This sum is then divided by the maximum value of GRBM_GUI_ACTIVE and the number of compute units `CU_NUM` to derive the mean occupancy.
-
-## Kernel serialization
-
-Counter collection in *dispatch counting* mode requires serialized execution of kernels on a target device. Kernel serialization isolates kernel executions, which helps to collect performance counter data. However, for applications requiring two kernels to execute on the same device simultaneously (co-dependent kernels), kernel serialization leads to deadlock in dispatch counter collection mode. To avoid deadlock in such applications, opt for any of the following options:
-
- Avoid co-dependent kernels in application.
-
- Don't collect performance data for co-dependent kernels by using kernel filtration methods in the rocprofv3’s input configuration PMC file.
-
- Use ROCprofiler-SDK's device-wide counter collection mode to collect performance data. You can use tools such as RDC and PAPI to collect information. Note that the device-wide counter collection captures data for all executions on the device and not specific to the kernels.
@@ -0,0 +1,439 @@
+.. _rocprofiler_sdk_counter_collection_services:
+
+ROCprofiler-SDK Counter Collection Services
+===========================================
+
+There are two modes of counter collection service:
+
+- **Dispatch counting**: In this mode, counters are collected on a per-kernel launch basis. This mode is useful for collecting highly detailed counters for a specific kernel execution in isolation. Note that dispatch counting allows only a single kernel to execute in hardware at a time.
+
+- **Device counting**: In this mode, counters are collected on a device level. This mode is useful for collecting device level counters not tied to a specific kernel execution, which encompasses collecting counter values for a specific time range.
+
+This topic explains how to set up dispatch and device counting and use the common counter collection APIs. For details on the APIs, including the less commonly used counter collection APIs, see the API library. For fully functional examples of both dispatch and device counting, see `Samples <https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples>`_.
+
+Definitions
+-----------
+
+**Profile Config**: A configuration to specify the counters to be collected on an agent. This must be supplied to various counter collection APIs to initiate collection of counter data. Profiles are agent-specific and can't be used on different agents.
+
+**Counter ID**: Unique Id (per-architecture) that specifies the counter. The counter Id can be used to fetch counter information such as its name or expression.
+
+**Instance ID**: Unique record Id that encodes the counter Id and dimension for a collected value.
+
+**Dimension**: Dimensions help to provide context to the raw counter values by specifying the hardware register that is the source of counter collection such as a shader engine. All counter values have dimension data encoded in their instance Id, which allows you to extract the values for individual dimensions using functions in the counter interface. The following dimensions are supported:
+
+.. code-block:: c
+
+    ROCPROFILER_DIMENSION_XCC,            ///< XCC dimension of result
+    ROCPROFILER_DIMENSION_AID,            ///< AID dimension of result
+    ROCPROFILER_DIMENSION_SHADER_ENGINE,  ///< SE dimension of result
+    ROCPROFILER_DIMENSION_AGENT,          ///< Agent dimension
+    ROCPROFILER_DIMENSION_SHADER_ARRAY,   ///< Number of shader arrays
+    ROCPROFILER_DIMENSION_WGP,            ///< Number of workgroup processors
+    ROCPROFILER_DIMENSION_INSTANCE,       ///< From unspecified hardware register
+
+Using the Counter Collection Service
+------------------------------------
+
+The setup for dispatch and device counting is similar with only minor changes needed to adapt code from one to another. Here are the steps required to configure the counter collection services:
+
+tool_init() setup
+++++++++++++++++++
+
+Similar to tracing services, you must create a context and a buffer to collect the output when initializing the tool.
+
+.. code-block:: cpp
+
+    rocprofiler_context_id_t ctx{0};
+    rocprofiler_buffer_id_t buff;
+    ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
+    ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
+                                                4096,
+                                                2048,
+                                                ROCPROFILER_BUFFER_POLICY_LOSSLESS,
+                                                buffered_callback, // Callback to process data
+                                                user_data,
+                                                &buff),
+                        "buffer creation failed");
+
+
+After creating a context and a buffer to store results in ``tool_init``, construct the profiles for each agent, containing the counters to collect. Constructing the profiles in ``tool_init`` is highly recommended but not mandatory. Avoid creating profiles in the time-critical dispatch counting callback, because profile creation involves validating whether the counters can be collected on the agent. After profile setup, configure the collection service for either dispatch or device counting (only one can be used at a time):
+
+.. code-block:: cpp
+
+    /* For Dispatch Counting */
+    // Setup the dispatch profile counting service. This service will trigger the dispatch_callback
+    // when a kernel dispatch is enqueued into the HSA queue. The callback will specify what
+    // counters to collect by returning a profile config id.
+    ROCPROFILER_CALL(rocprofiler_configure_buffered_dispatch_counting_service(
+                         ctx, buff, dispatch_callback, nullptr),
+                     "Could not setup buffered service");
+
+    /* For Device (Agent) Counting */
+    // set_profile is a callback that is used to select the profile to use when
+    // the context is started. It is called at every rocprofiler_start_context() call.
+    ROCPROFILER_CALL(rocprofiler_configure_device_counting_service(
+                         ctx, buff, agent_id, set_profile, nullptr),
+                     "Could not setup buffered service");
+
+
+Profile Setup
+-------------
+
+1. The first step in constructing a counter collection profile is to find the GPU agents on the machine. You must create a profile for each set of counters to be collected on every agent on the machine. You can use ``rocprofiler_query_available_agents`` to find agents on the system. The following example collects all GPU agents on the system and stores them in the vector ``agents``:
+
+.. code-block:: cpp
+
+    std::vector<rocprofiler_agent_v0_t> agents;
+
+    // Callback used by rocprofiler_query_available_agents to return
+    // agents on the device. This can include CPU agents as well. We
+    // select GPU agents only (i.e. type == ROCPROFILER_AGENT_TYPE_GPU)
+    rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
+                                                            const void**                agents_arr,
+                                                            size_t                      num_agents,
+                                                            void*                       udata) {
+        if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
+            throw std::runtime_error{"unexpected rocprofiler agent version"};
+        auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
+        for(size_t i = 0; i < num_agents; ++i)
+        {
+            const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
+            if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
+        }
+        return ROCPROFILER_STATUS_SUCCESS;
+    };
+
+    // Query the agents. Only a single callback is made, and it receives an
+    // array containing all agents.
+    ROCPROFILER_CALL(
+        rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
+                                           iterate_cb,
+                                           sizeof(rocprofiler_agent_t),
+                                           const_cast<void*>(static_cast<const void*>(&agents))),
+        "query available agents");
+
+2. To identify the counters supported by an agent, query the available counters with ``rocprofiler_iterate_agent_supported_counters``. The following example stores the counters available on a single agent in ``gpu_counters``:
+
+.. code-block:: cpp
+
+    std::vector<rocprofiler_counter_id_t> gpu_counters;
+
+    // Iterate all the counters on the agent and store them in gpu_counters.
+    ROCPROFILER_CALL(rocprofiler_iterate_agent_supported_counters(
+                         agent,
+                         [](rocprofiler_agent_id_t,
+                            rocprofiler_counter_id_t* counters,
+                            size_t                    num_counters,
+                            void*                     user_data) {
+                             std::vector<rocprofiler_counter_id_t>* vec =
+                                 static_cast<std::vector<rocprofiler_counter_id_t>*>(user_data);
+                             for(size_t i = 0; i < num_counters; i++)
+                             {
+                                 vec->push_back(counters[i]);
+                             }
+                             return ROCPROFILER_STATUS_SUCCESS;
+                         },
+                         static_cast<void*>(&gpu_counters)),
+                     "Could not fetch supported counters");
+
+3. ``rocprofiler_counter_id_t`` is a handle to a counter. To fetch information about the counter such as its name, use ``rocprofiler_query_counter_info``:
+
+.. code-block:: cpp
+
+    for(auto& counter : gpu_counters)
+    {
+        // Contains name and other attributes about the counter.
+        // See API documentation for more info on the contents of this struct.
+        rocprofiler_counter_info_v0_t version;
+        ROCPROFILER_CALL(
+            rocprofiler_query_counter_info(
+                counter, ROCPROFILER_COUNTER_INFO_VERSION_0, static_cast<void*>(&version)),
+            "Could not query info for counter");
+    }
+
+
+4. After identifying the counters to be collected, construct a profile by passing a list of these counters to ``rocprofiler_create_profile_config``.
+
+.. code-block:: cpp
+
+    // Create and return the profile
+    rocprofiler_profile_config_id_t profile;
+    ROCPROFILER_CALL(rocprofiler_create_profile_config(
+                         agent, counters_array, counters_array_count, &profile),
+                     "Could not construct profile cfg");
+
+
+5. You can use the created profile for both dispatch and agent counter collection services.
+
+.. note::
+    Points to note on profile behavior:
+
+    - A profile is *only valid* for the agent it was created for.
+    - Profiles are immutable. To collect a new counter set, construct a new profile.
+    - A single profile can be used multiple times on the same agent.
+    - Counter Ids supplied to ``rocprofiler_create_profile_config`` are *agent-specific* and can't be used to construct profiles for other agents.
+
+Dispatch Counting Callback
+--------------------------
+
+When a kernel is dispatched, a dispatch callback is issued to the tool to allow selection of counters to be collected for the dispatch by supplying a profile.
+
+.. code-block:: cpp
+
+    void
+    dispatch_callback(rocprofiler_dispatch_counting_service_data_t dispatch_data,
+                      rocprofiler_profile_config_id_t*             config,
+                      rocprofiler_user_data_t* user_data,
+                      void* /*callback_data_args*/)
+
+``dispatch_data`` contains information about the dispatch being launched, such as its name. The tool uses ``config`` to specify the profile, which enables counter collection for the dispatch. If no profile is supplied, no counters are collected for this dispatch. ``user_data`` contains user data supplied to ``rocprofiler_configure_buffered_dispatch_counting_service``.
+
+Agent Set Profile Callback
+--------------------------
+
+This callback is invoked after the context starts and allows the tool to specify the profile to be used.
+
+.. code-block:: cpp
+
+    void
+    set_profile(rocprofiler_context_id_t                 context_id,
+                rocprofiler_agent_id_t                   agent,
+                rocprofiler_agent_set_profile_callback_t set_config,
+                void*)
+
+The profile to be used for this agent is specified by calling ``set_config(agent, profile)``.
+
+Buffered Callback
+-----------------
+
+Data from collected counter values is returned through a buffered callback. The buffered callback routines are similar for dispatch and device counting except that some data such as kernel launch Ids is not available in device counting mode. Here is a sample iteration to print out counter collection data:
+
+.. code-block:: cpp
+    
+    for(size_t i = 0; i < num_headers; ++i)
+    {
+        auto* header = headers[i];
+        if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
+           header->kind == ROCPROFILER_COUNTER_RECORD_PROFILE_COUNTING_DISPATCH_HEADER)
+        {
+            // Print the returned counter data.
+            auto* record =
+                static_cast<rocprofiler_dispatch_counting_service_record_t*>(header->payload);
+            ss << "[Dispatch_Id: " << record->dispatch_info.dispatch_id
+               << " Kernel_ID: " << record->dispatch_info.kernel_id
+               << " Corr_Id: " << record->correlation_id.internal << "]\n";
+        }
+        else if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
+                header->kind == ROCPROFILER_COUNTER_RECORD_VALUE)
+        {
+            // Print the returned counter data.
+            auto* record = static_cast<rocprofiler_record_counter_t*>(header->payload);
+            rocprofiler_counter_id_t counter_id = {.handle = 0};
+
+            rocprofiler_query_record_counter_id(record->id, &counter_id);
+
+            ss << "  (Dispatch_Id: " << record->dispatch_id << " Counter_Id: " << counter_id.handle
+               << " Record_Id: " << record->id << " Dimensions: [";
+
+            for(auto& dim : counter_dimensions(counter_id))
+            {
+                size_t pos = 0;
+                rocprofiler_query_record_dimension_position(record->id, dim.id, &pos);
+                ss << "{" << dim.name << ": " << pos << "},";
+            }
+            ss << "] Value [D]: " << record->counter_value << "),";
+        }
+    }
+
+Counter Definitions
+-------------------
+
+Counters are defined in YAML format in the ``counter_defs.yaml`` file. A counter definition has the following format:
+
+.. code-block:: yaml
+
+    counter_name:       # Counter name
+      architectures:
+        gfx90a:         # Architecture name
+          block:        # Block information (SQ/etc)
+          event:        # Event ID (used by AQLProfile to identify counter register)
+          expression:   # Formula for the counter (if derived counter)
+          description:  # Per-arch description (optional)
+        gfx1010:
+           ...
+      description:      # Description of the counter
+
+You can define the counters separately for different architectures, as shown in the preceding example for gfx90a and gfx1010. If two or more architectures share the same block, event, or expression definition, they can be specified together using the "/" delimiter ("gfx90a/gfx1010:"). Hardware metrics have the elements block, event, and description defined. Derived metrics have the element expression defined and can't have block or event defined.
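+
+The rule that separates hardware metrics from derived metrics can be sketched in a few lines of standalone C++17. This is only an illustration of the rule as stated above, not the parser used by ROCprofiler-SDK; the ``ArchDefinition`` struct, the ``CounterKind`` enum, and the ``classify`` function are hypothetical names introduced for this sketch.
+
+.. code-block:: cpp
+
+    #include <optional>
+    #include <string>
+
+    // Hypothetical result of checking one per-architecture entry.
+    enum class CounterKind { Hardware, Derived, Invalid };
+
+    // Hypothetical in-memory form of one per-architecture entry from
+    // counter_defs.yaml. An element that is absent is std::nullopt.
+    struct ArchDefinition
+    {
+        std::optional<std::string> block;
+        std::optional<std::string> event;
+        std::optional<std::string> expression;
+    };
+
+    // Hardware metrics define block and event; derived metrics define
+    // expression and must not define block or event.
+    inline CounterKind classify(const ArchDefinition& def)
+    {
+        const bool has_block = def.block.has_value();
+        const bool has_event = def.event.has_value();
+        if(def.expression.has_value())
+            return (has_block || has_event) ? CounterKind::Invalid : CounterKind::Derived;
+        return (has_block && has_event) ? CounterKind::Hardware : CounterKind::Invalid;
+    }
+
+For example, an entry with only ``expression`` set is classified as derived, while an entry that sets both ``expression`` and ``block`` is rejected.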
+
+Derived Metrics
+---------------
+
+Derived metrics are expressions that perform computations on collected hardware metrics. These expressions produce results similar to those of a real hardware counter.
+
+.. code-block:: yaml
+
+    GPU_UTIL:
+      architectures:
+        gfx942/gfx941/gfx10/gfx1010/gfx1030/gfx1031/gfx11/gfx1032/gfx1102/gfx906/gfx1100/gfx1101/gfx940/gfx908/gfx90a/gfx9:
+          expression: 100*GRBM_GUI_ACTIVE/GRBM_COUNT
+      description: Percentage of the time that GUI is active
+
+In the preceding example, ``GPU_UTIL`` is a derived metric that uses a mathematical expression to calculate the utilization rate of the GPU from the values of two GRBM hardware counters, ``GRBM_GUI_ACTIVE`` and ``GRBM_COUNT``. Expressions support the standard set of math operators (``/``, ``*``, ``-``, ``+``) along with a set of special functions such as reduce and accumulate.
+
+Reduce Function
++++++++++++++++
+
+.. code-block:: yaml
+
+    expression: 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum))
+
+The reduce function reduces counter values across all dimensions such as shader engine, SIMD, and so on, to produce a single output value. This helps to collect and compare values across the entire device. Here are the common reduction operations:
+
+- ``sum``: Sums to create a single output. For example, ``reduce(GL2C_HIT,sum)`` sums all ``GL2C_HIT`` hardware register values.
+- ``avr``: Calculates the average across all dimensions.
+- ``min``: Selects the minimum value across all dimensions.
+- ``max``: Selects the maximum value across all dimensions.
+
+.. code-block:: yaml
+
+    expression: reduce(X,sum,[DIMENSION_XCC])
+
+``reduce()`` also supports dimension-wise reduction when the dimensions to reduce are provided as the third parameter. In the preceding expression, if ``X`` has three dimensions, ``DIMENSION_XCC``, ``DIMENSION_SHADER_ARRAY``, and ``DIMENSION_WGP``, the reduction combines the counter values that share the same ``DIMENSION_SHADER_ARRAY`` and ``DIMENSION_WGP`` indexes, as shown in the following example.
+
+Assume that the sizes of the XCC, SHADER_ARRAY (SH), and WGP dimensions are 2, 4, and 4, respectively.
+
+Raw Counter Data in 3D space:
+
+**XCC[0]:**
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    | SH[0] |   1  |   2  |   3  |   4  |
+    | SH[1] |   5  |   6  |   7  |   8  |
+    | SH[2] |   9  |   10 |   11 |   12 |
+    | SH[3] |   13 |   14 |   15 |   16 |
+
+**XCC[1]:**
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    | SH[0] |   1  |   2  |   3  |   4  |
+    | SH[1] |   5  |   6  |   7  |   8  |
+    | SH[2] |   9  |   10 |   11 |   12 |
+    | SH[3] |   13 |   14 |   15 |   16 |
+
+Reducing the XCC dimension with ``sum`` results in a 2D space with only the WGP and SH dimensions.
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    | SH[0] |  2   |   4  |   6  |   8  |
+    | SH[1] |  10  |   12 |   14 |   16 |
+    | SH[2] |  18  |   20 |   22 |   24 |
+    | SH[3] |  26  |   28 |   30 |   32 |
+
+Similarly, ``reduce(X,sum,[DIMENSION_XCC,DIMENSION_SHADER_ARRAY])`` results in only the WGP dimension.
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    |       |  56  |  64  |  72  |  80  |
+
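The dimension-wise reduction above can be sketched in plain C++. This is only an illustrative model, not the ROCprofiler-SDK implementation: the ``Sample`` type, the ``Dim`` enum, and the ``reduce_sum`` and ``make_example`` helpers are hypothetical names introduced here. It assumes each counter value carries its index in every dimension and that reducing a dimension folds together all values that agree on the remaining dimensions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model: the three dimensions used in the example tables.
enum Dim : std::size_t { XCC = 0, SH = 1, WGP = 2, NUM_DIMS = 3 };

// One counter value together with its index in each dimension.
struct Sample
{
    std::array<int, NUM_DIMS> index;
    long                      value;
};

// Sums over every dimension listed in `reduced`. Samples that agree on all
// remaining dimensions are folded into a single output sample.
std::vector<Sample>
reduce_sum(const std::vector<Sample>& in, const std::vector<Dim>& reduced)
{
    std::map<std::array<int, NUM_DIMS>, long> acc;
    for(const auto& s : in)
    {
        auto key = s.index;
        for(Dim d : reduced)
        {
            assert(d < NUM_DIMS);
            key[d] = 0;  // collapse the reduced dimension to a single slot
        }
        acc[key] += s.value;
    }

    std::vector<Sample> out;
    for(const auto& kv : acc)
        out.push_back({kv.first, kv.second});
    return out;  // ordered by the remaining dimension indexes
}

// Builds the 2 x 4 x 4 data set from the tables (both XCCs hold 1..16).
std::vector<Sample>
make_example()
{
    std::vector<Sample> data;
    for(int xcc = 0; xcc < 2; ++xcc)
        for(int sh = 0; sh < 4; ++sh)
            for(int wgp = 0; wgp < 4; ++wgp)
                data.push_back({{xcc, sh, wgp}, sh * 4 + wgp + 1});
    return data;
}
```

Under these assumptions, ``reduce_sum(make_example(), {XCC})`` yields the 16 values of the 2D table, and ``reduce_sum(make_example(), {XCC, SH})`` yields the four values 56, 64, 72, and 80.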
+Select Function
++++++++++++++++
+
+.. code-block:: yaml
+
+    expression: select(Y, [DIMENSION_XCC=[0],DIMENSION_SHADER_ENGINE=[2]])
+
+``select()`` returns only the counter values that match the dimension indexes specified in the expression. This lets you restrict a counter to specific dimension indexes. Supported dimensions include ``DIMENSION_XCC``, ``DIMENSION_AID``, ``DIMENSION_SHADER_ENGINE``, ``DIMENSION_AGENT``, ``DIMENSION_SHADER_ARRAY``, ``DIMENSION_WGP``, and ``DIMENSION_INSTANCE``. For example, ``select(Y, [DIMENSION_XCC=[0],DIMENSION_SHADER_ENGINE=[2]])`` returns the values of metric ``Y`` for which ``DIMENSION_XCC`` is 0 and ``DIMENSION_SHADER_ENGINE`` is 2.
+
+Assume that ``Y`` has the XCC, SHADER_ENGINE (SE), and WGP dimensions with sizes 2, 4, and 4, respectively.
+
+Raw Counter Data in 3D space:
+
+**XCC[0]:**
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    | SE[0] |   1  |   2  |   3  |   4  |
+    | SE[1] |   5  |   6  |   7  |   8  |
+    | SE[2] |   9  |   10 |   11 |   12 |
+    | SE[3] |   13 |   14 |   15 |   16 |
+
+**XCC[1]:**
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    | SE[0] |   17 |   18 |   19 |   20 |
+    | SE[1] |   21 |   22 |   23 |   24 |
+    | SE[2] |   25 |   26 |   27 |   28 |
+    | SE[3] |   29 |   30 |   31 |   32 |
+
+Selecting XCC=0 results in a 2D space with the WGP and SE dimensions, as shown below.
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    | SE[0] |   1  |   2  |   3  |   4  |
+    | SE[1] |   5  |   6  |   7  |   8  |
+    | SE[2] |   9  |   10 |   11 |   12 |
+    | SE[3] |   13 |   14 |   15 |   16 |
+
+Similarly, ``select(Y, [DIMENSION_XCC=[0],DIMENSION_SHADER_ENGINE=[2]])`` results in only the WGP dimension, with XCC=0 and SE=2.
+
+.. code-block:: text
+
+    |       |WGP[0]|WGP[1]|WGP[2]|WGP[3]|
+    |-------|------|------|------|------|
+    |       |  9   |  10  |  11  |  12  |
+
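The selection above can be modelled the same way. The sketch below is an assumption-laden illustration rather than the library's code: ``Sample``, ``Dim``, ``Filter``, ``select_dims``, and ``make_example`` are hypothetical names. Each filter pins one dimension to a list of allowed indexes, mirroring the ``DIMENSION_XCC=[0]`` syntax.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: the three dimensions used in the example tables.
enum Dim : std::size_t { XCC = 0, SE = 1, WGP = 2, NUM_DIMS = 3 };

// One counter value together with its index in each dimension.
struct Sample
{
    std::array<int, NUM_DIMS> index;
    long                      value;
};

// A filter pins one dimension to a list of allowed indexes.
using Filter = std::pair<Dim, std::vector<int>>;

// Keeps only the samples whose indexes satisfy every filter.
std::vector<Sample>
select_dims(const std::vector<Sample>& in, const std::vector<Filter>& filters)
{
    std::vector<Sample> out;
    for(const auto& s : in)
    {
        bool keep = true;
        for(const auto& f : filters)
        {
            assert(f.first < NUM_DIMS);
            const auto& allowed = f.second;
            if(std::find(allowed.begin(), allowed.end(), s.index[f.first]) == allowed.end())
                keep = false;
        }
        if(keep) out.push_back(s);
    }
    return out;
}

// Builds the 2 x 4 x 4 data set from the tables (values 1..32).
std::vector<Sample>
make_example()
{
    std::vector<Sample> data;
    for(int xcc = 0; xcc < 2; ++xcc)
        for(int se = 0; se < 4; ++se)
            for(int wgp = 0; wgp < 4; ++wgp)
                data.push_back({{xcc, se, wgp}, xcc * 16 + se * 4 + wgp + 1});
    return data;
}
```

With a filter on XCC index 0, the model returns the 16 values of the ``XCC[0]`` table. Adding a filter on SE index 2 leaves the four values 9, 10, 11, and 12.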
+Accumulate Function
++++++++++++++++++++
+
+.. code-block:: yaml
+
+    expression: accumulate(<basic_level_counter>, <resolution>)
+
+- The accumulate function sums the values of a basic level counter over the specified number of cycles. The ``resolution`` parameter controls how often the counter is summed:
+
+  - ``HIGH_RES``: Sums up the basic level counter every clock cycle. Captures the value every cycle for higher accuracy, which helps in fine-grained analysis.
+  - ``LOW_RES``: Sums up the basic level counter every four clock cycles. Reduces the data points and provides less detailed summing, which helps in reducing data volume.
+  - ``NONE``: Does nothing and is equivalent to collecting basic level counter. Outputs the value of the basic level counter without performing any summing operation.
+
+**Example:**
+
+.. code-block:: yaml
+
+    MeanOccupancyPerCU:
+      architectures:
+        gfx942/gfx941/gfx940:
+          expression: accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM
+      description: Mean occupancy per compute unit.
+
+
+- ``MeanOccupancyPerCU``: In the preceding example, the ``MeanOccupancyPerCU`` metric calculates the mean occupancy per compute unit. It uses the accumulate function with ``HIGH_RES`` to sum the ``SQ_LEVEL_WAVES`` counter every clock cycle. This sum is then divided by the maximum value of ``GRBM_GUI_ACTIVE`` and the number of compute units ``CU_NUM`` to derive the mean occupancy.
+
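The arithmetic behind ``MeanOccupancyPerCU`` can be illustrated with a small software model. This is a sketch under stated assumptions, not a description of the hardware: it assumes the level counter can be represented as one value per clock cycle, that ``HIGH_RES`` adds the value on every cycle, and that ``LOW_RES`` adds the value observed on every fourth cycle. ``Resolution``, ``accumulate_level``, and ``mean_occupancy_per_cu`` are hypothetical names, and ``NONE`` is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the resolution parameter (NONE is not modelled).
enum class Resolution { LowRes, HighRes };

// Sums a level counter over time: every cycle for HighRes,
// every fourth cycle for LowRes (assumed sampling model).
std::uint64_t
accumulate_level(const std::vector<std::uint32_t>& level_per_cycle, Resolution res)
{
    const std::size_t stride = (res == Resolution::HighRes) ? 1 : 4;
    std::uint64_t     total  = 0;
    for(std::size_t cycle = 0; cycle < level_per_cycle.size(); cycle += stride)
        total += level_per_cycle[cycle];
    return total;
}

// accumulate(SQ_LEVEL_WAVES,HIGH_RES) / reduce(GRBM_GUI_ACTIVE,max) / CU_NUM
double
mean_occupancy_per_cu(std::uint64_t accumulated_waves,
                      std::uint64_t active_cycles,
                      std::uint32_t cu_num)
{
    assert(active_cycles > 0 && cu_num > 0);
    return static_cast<double>(accumulated_waves) / static_cast<double>(active_cycles) / cu_num;
}
```

For example, if eight waves are resident for four cycles and four waves for the next four cycles, the ``HighRes`` sum in this model is 48. Dividing by 8 active cycles and 2 compute units gives a mean occupancy of 3 waves per compute unit.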
+Kernel Serialization
+--------------------
+
+Counter collection in *dispatch counting* mode requires serialized execution of kernels on a target device. Kernel serialization isolates kernel executions, which helps to collect performance counter data. However, for applications requiring two kernels to execute on the same device simultaneously (co-dependent kernels), kernel serialization leads to deadlock in dispatch counter collection mode. To avoid deadlock in such applications, opt for any of the following options:
+
+- Avoid co-dependent kernels in the application.
+
+- Don't collect performance data for co-dependent kernels by using kernel filtration methods in the rocprofv3’s input configuration PMC file.
+
+- Use ROCprofiler-SDK's device-wide counter collection mode to collect performance data. You can use tools such as RDC and PAPI to collect information. Note that device-wide counter collection captures data for all executions on the device and is not specific to individual kernels.
@@ -1,93 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK intercept table, Intercept table API"
---
-
-# ROCprofiler-SDK runtime intercept tables
-
-While tools commonly leverage the callback or buffer tracing services for tracing the HIP, HSA, and ROCTx
-APIs, ROCprofiler-SDK also provides access to the raw API dispatch tables.
-
-## Forward declaration of public C API function
-
-All the aforementioned APIs are designed similar to the following sample:
-
-```cpp
-extern "C"
-{
-// forward declaration of public C API function
-int
-foo(int) __attribute__((visibility("default")));
-}
-```
-
-## Internal implementation of API function
-
-```cpp
-namespace impl
-{
-int
-foo(int val)
-{
-    // real implementation
-    return (2 * val);
-}
-}
-```
-
-## Dispatch table implementation
-
-```cpp
-namespace impl
-{
-struct dispatch_table
-{
-    int (*foo_fn)(int) = nullptr;
-};
-
-// Invoked once: populates the dispatch_table with function pointers to implementation
-dispatch_table*&
-construct_dispatch_table()
-{
-    static dispatch_table* tbl = new dispatch_table{};
-    tbl->foo_fn                = impl::foo;
-
-    // In between, ROCprofiler-SDK gets passed the pointer
-    // to the dispatch table and has the opportunity to wrap the function
-    // pointers for interception
-
-    return tbl;
-}
-
-// Constructs dispatch table and stores it in static variable
-dispatch_table*
-get_dispatch_table()
-{
-    static dispatch_table*& tbl = construct_dispatch_table();
-    return tbl;
-}
-}  // namespace impl
-```
-
-## Implementation of public C API function
-
-```cpp
-extern "C"
-{
-// implementation of public C API function
-int
-foo(int val)
-{
-    return impl::get_dispatch_table()->foo_fn(val);
-}
-}
-```
-
-## Dispatch table chaining
-
-ROCprofiler-SDK can save the original values of the function pointers such as `foo_fn` in `impl::construct_dispatch_table()` and install its own function pointers in its place. This results in the public C API function `foo` calling into the ROCprofiler-SDK function pointer, which in turn, calls the original function pointer to `impl::foo`. This phenomenon is named chaining. Once ROCprofiler-SDK
-makes necessary modifications to the dispatch table, tools requesting access to the raw dispatch table via `rocprofiler_at_intercept_table_registration`, are provided the pointer to the dispatch table.
-
-For an example of dispatch table chaining, see [samples/intercept_table](https://github.com/ROCm/rocprofiler-sdk-internal/tree/amd-staging/samples/intercept_table).
@@ -0,0 +1,101 @@
+.. ---
+.. myst:
+..     html_meta:
+..         "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
+..         "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK intercept table, Intercept table API"
+.. ---
+
+.. _ROCprofiler-SDK runtime intercept tables:
+
+Runtime Intercept Tables
+=========================
+
+While tools commonly leverage the callback or buffer tracing services for tracing the HIP, HSA, and ROCTx
+APIs, ROCprofiler-SDK also provides access to the raw API dispatch tables.
+
+Forward declaration of public C API function:
+----------------------------------------------
+
+All the aforementioned APIs are designed similarly to the following sample:
+
+.. code-block:: cpp
+
+    extern "C"
+    {
+    // forward declaration of public C API function
+    int
+    foo(int) __attribute__((visibility("default")));
+    }
+
+Internal implementation of API function:
+-----------------------------------------
+
+.. code-block:: cpp
+
+    namespace impl
+    {
+    int
+    foo(int val)
+    {
+        // real implementation
+        return (2 * val);
+    }
+    }
+
+Dispatch table implementation:
+-------------------------------
+
+.. code-block:: cpp
+
+    namespace impl
+    {
+    struct dispatch_table
+    {
+        int (*foo_fn)(int) = nullptr;
+    };
+
+    // Invoked once: populates the dispatch_table with function pointers to implementation
+    dispatch_table*&
+    construct_dispatch_table()
+    {
+        static dispatch_table* tbl = new dispatch_table{};
+        tbl->foo_fn                = impl::foo;
+
+        // In between, ROCprofiler-SDK gets passed the pointer
+        // to the dispatch table and has the opportunity to wrap the function
+        // pointers for interception
+
+        return tbl;
+    }
+
+    // Constructs dispatch table and stores it in static variable
+    dispatch_table*
+    get_dispatch_table()
+    {
+        static dispatch_table*& tbl = construct_dispatch_table();
+        return tbl;
+    }
+    }  // namespace impl
+
+Implementation of public C API function:
+-----------------------------------------
+
+.. code-block:: cpp
+
+    extern "C"
+    {
+    // implementation of public C API function
+    int
+    foo(int val)
+    {
+        return impl::get_dispatch_table()->foo_fn(val);
+    }
+    }
+
+Dispatch table chaining:
+-------------------------
+
+ROCprofiler-SDK can save the original values of the function pointers such as ``foo_fn`` in ``impl::construct_dispatch_table()`` and install its own function pointers in their place. As a result, the public C API function ``foo`` calls into the ROCprofiler-SDK function pointer, which in turn calls the original function pointer to ``impl::foo``. This technique is called chaining. Once ROCprofiler-SDK makes the necessary modifications to the dispatch table, tools that request access to the raw dispatch table via ``rocprofiler_at_intercept_table_registration`` are provided the pointer to the dispatch table.
+
+For an example of dispatch table chaining, see `samples/intercept_table <https://github.com/ROCm/rocprofiler-sdk-internal/tree/amd-staging/samples/intercept_table>`_.
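Chaining itself can be shown with a self-contained sketch that reuses the ``foo`` example from this page. Everything in the ``tool`` namespace is hypothetical and only stands in for what an interceptor might do. The table construction is also simplified compared with the snippets above, and no ROCprofiler-SDK API is called.

```cpp
#include <cassert>

namespace impl
{
// real implementation
int
foo(int val)
{
    return 2 * val;
}

struct dispatch_table
{
    int (*foo_fn)(int) = nullptr;
};

dispatch_table
make_dispatch_table()
{
    dispatch_table tbl;
    tbl.foo_fn = &impl::foo;
    return tbl;
}

// Constructed once; later modifications (wrapping) persist.
dispatch_table&
get_dispatch_table()
{
    static dispatch_table tbl = make_dispatch_table();
    return tbl;
}
}  // namespace impl

namespace tool
{
int (*original_foo)(int) = nullptr;  // saved pointer to the next function in the chain
int call_count           = 0;

// Wrapper installed in the table: records the call, then forwards it.
int
foo_wrapper(int val)
{
    ++call_count;
    return original_foo(val);
}

// Saves the current table entry and replaces it with the wrapper.
void
install(impl::dispatch_table& tbl)
{
    assert(tbl.foo_fn != nullptr);
    original_foo = tbl.foo_fn;
    tbl.foo_fn   = &foo_wrapper;
}
}  // namespace tool

extern "C"
{
// public C API function: always goes through the dispatch table
int
foo(int val)
{
    return impl::get_dispatch_table().foo_fn(val);
}
}
```

After ``tool::install(impl::get_dispatch_table())``, a call to ``foo(21)`` still returns 42, but it passes through ``tool::foo_wrapper`` first, which is where a tool could record timestamps or arguments.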
@@ -1,175 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK API reference, Program counter sampling, PC sampling"
---
-
-# ROCprofiler-SDK PC sampling method
-
-Program Counter (PC) sampling is a profiling method that uses statistical approximation of the kernel execution by sampling GPU program counters. Furthermore, this method periodically chooses an active wave in a round robin manner and snapshots its PC. This process takes place on every compute unit simultaneously, making it device-wide PC sampling. The outcome is the histogram of samples, explaining how many times each kernel instruction was sampled.
-
-:::{warning}
-
-Risk acknowledgment: The PC sampling feature is under development and might not be completely stable. Use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.
-
-By activating this feature through `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` environment variable, you acknowledge and accept the following potential risks:
-
- Hardware freeze: This beta feature could cause your hardware to freeze unexpectedly.
- Need for cold restart: In the event of a hardware freeze, you might need to perform a cold restart (turning the hardware off and on) to restore normal operations.
-:::
-
-## ROCprofiler-SDK PC sampling service
-
-This section describes how to use ROCProfiler-SDK PC sampling API to configure and use PC sampling service. For fully functional examples, see [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples).
-
-### tool_init() setup
-
-Here are the steps to set up ``tool_init()``:
-
-1. As the PC sampling service belongs to the group of [buffered services](buffered_services.md), it requires a buffer and a context to be set up in this phase.
-
-```cpp
-rocprofiler_context_id_t ctx{0};
-rocprofiler_buffer_id_t buff;
-ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
-ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
-                                            8192,
-                                            2048,
-                                            ROCPROFILER_BUFFER_POLICY_LOSSLESS,
-                                            pc_sampling_callback, // Callback to process PC samples
-                                            user_data,
-                                            &buff),
-                    "buffer creation failed");
-```
-
-For more details on buffer creation, see [buffered services](buffered_services.md).
-
-2. The PC sampling service is tied to a GPU agent. To extract the list of available agents, use the `rocprofiler_query_available_agents` as shown in the following code snippet:
-
-```cpp
-std::vector<rocprofiler_agent_v0_t> agents;
-
-// Callback used by rocprofiler_query_available_agents to return
-// agents on the device. This can include CPU agents as well.
-// Select GPU agents only (type == ROCPROFILER_AGENT_TYPE_GPU)
-rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
-                                                        const void**                agents_arr,
-                                                        size_t                      num_agents,
-                                                        void*                       udata) {
-    if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
-        throw std::runtime_error{"unexpected rocprofiler agent version"};
-    auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
-    for(size_t i = 0; i < num_agents; ++i)
-    {
-        const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
-        if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
-    }
-    return ROCPROFILER_STATUS_SUCCESS;
-};
-
-// Query the agents. Only a single callback is made that contains a vector
-// of all agents.
-ROCPROFILER_CALL(
-    rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
-                                       iterate_cb,
-                                       sizeof(rocprofiler_agent_t),
-                                       const_cast<void*>(static_cast<const void*>(&agents))),
-    "query available agents");
-```
-
-3. Only newer GPU architectures (MI200 onwards) support this feature. To determine whether an agent with `agent_id` supports the PC sampling and the available configurations (`rocprofiler_pc_sampling_configuration_t`), use the `rocprofiler_query_pc_sampling_agent_configurations`.
-
-```cpp
-std::vector<rocprofiler_pc_sampling_configuration_t> available_configurations;
-
-auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs,
-             size_t                                         num_config,
-             void*                                          user_data) {
-    auto* avail_configs = static_cast<avail_configs_vec_t*>(user_data);
-    for(size_t i = 0; i < num_config; i++)
-    {
-        avail_configs->emplace_back(configs[i]);
-    }
-    return ROCPROFILER_STATUS_SUCCESS;
-};
-
-auto status = rocprofiler_query_pc_sampling_agent_configurations(
-    agent_id, cb, &available_configurations);
-```
-
-Assuming the `available_configurations` contain a single element:
-
-```cpp
-rocprofiler_pc_sampling_configuration_t {
-    .method = ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP,
-    .unit = ROCPROFILER_PC_SAMPLING_UNIT_TIME,
-    .min_interval = 1,
-    .max_interval = 10000
-};
-```
-
-4. Configure the PC sampling service on an agent with `agent_id` to generate samples every 1000 micro-seconds as shown here:
-
-```cpp
-auto status = rocprofiler_configure_pc_sampling_service(ctx,
-                                                        agent_id,
-                                                        picked_cfg->method,
-                                                        picked_cfg->unit,
-                                                        1000,  // 1000 us
-                                                        buffer_id,
-                                                        0);
-if (status == ROCPROFILER_STATUS_SUCCESS)
-{
-    // PC Sampling service has been configured successfully.
-}
-else
-{
-    // code for error handling
-}
-```
-
-:::{note}
-Multiple processes can share the same GPU agent simultaneously, so the following A->B->A problem is possible on shared systems. For example, process A can query available configurations and opt to configure the service with configuration CA. However, if process B manages to finish configuring the service with configuration CB, then proess A will fail. Thus, it is advisable for the process A to repeat the querying process to observe configuration CB and reuse it for configuring the PC sampling service. For more details, refer to the [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples).
-:::
-
-### Processing PC samples
-
-The PC sampling service asynchronously delivers samples via a dedicated callback (`pc_sampling_callback`). The following code snippet outlines the process of iterating over samples.
-
-```cpp
-void
-pc_sampling_callback(rocprofiler_context_id_t ctx,
-                     rocprofiler_buffer_id_t buff,
-                     rocprofiler_record_header_t** headers,
-                     size_t                        num_headers,
-                     void* data,
-                     uint64_t drop_count)
-{
-    for(size_t i = 0; i < num_headers; i++)
-    {
-        auto* cur_header = headers[i];
-
-        if(cur_header->category == ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING)
-        {
-            if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_HOST_TRAP_V0_SAMPLE)
-            {
-                auto* pc_sample = static_cast<rocprofiler_pc_sampling_record_host_trap_v0_t*>(
-                    cur_header->payload);
-
-                // Processing a single sample...
-            }
-            else
-            {
-                // ...
-            }
-        }
-    }
-}
-```
-
-For more information on the data comprising a single sample, see [pc_sampling.h](https://github.com/ROCm/rocprofiler-sdk/blob/amd-mainline/source/include/rocprofiler-sdk/pc_sampling.h).
-
-:::{note}
-A user can synchronously flush buffers via `rocprofiler_buffer_flush` that triggers `pc_sampling_callback`.
-:::
@@ -0,0 +1,181 @@
+.. ---
+.. myst:
+..    html_meta:
+..        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
+..        "keywords": "ROCprofiler-SDK API reference, Program counter sampling, PC sampling"
+.. ---
+
+ROCprofiler-SDK PC sampling method
+===================================
+
+Program Counter (PC) sampling is a profiling method that uses statistical approximation of the kernel execution by sampling GPU program counters. This method periodically chooses an active wave in a round-robin manner and snapshots its PC. This process takes place on every compute unit simultaneously, making it device-wide PC sampling. The outcome is a histogram of samples that shows how many times each kernel instruction was sampled.
+
+.. warning::
+
+    Risk acknowledgment: The PC sampling feature is under development and might not be completely stable. Use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.
+
+    By activating this feature through ``ROCPROFILER_PC_SAMPLING_BETA_ENABLED`` environment variable, you acknowledge and accept the following potential risks:
+
+    - Hardware freeze: This beta feature could cause your hardware to freeze unexpectedly.
+    - Need for cold restart: In the event of a hardware freeze, you might need to perform a cold restart (turning the hardware off and on) to restore normal operations.
+
+ROCprofiler-SDK PC sampling service
+------------------------------------
+
+This section describes how to use the ROCprofiler-SDK PC sampling API to configure and use the PC sampling service. For fully functional examples, see `Samples <https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples>`_.
+
+tool_init() setup
++++++++++++++++++
+
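+Here are the steps to set up ``tool_init()``:
+
+As the PC sampling service belongs to the group of buffered services, it requires a buffer and a context to be set up in this phase.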
+
+.. code-block:: cpp
+
+    rocprofiler_context_id_t ctx{0};
+    rocprofiler_buffer_id_t buff;
+    ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
+    ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
+                                                8192,
+                                                2048,
+                                                ROCPROFILER_BUFFER_POLICY_LOSSLESS,
+                                                pc_sampling_callback, // Callback to process PC samples
+                                                user_data,
+                                                &buff),
+                        "buffer creation failed");
+
+For more details on buffer creation, see `buffered services <buffered_services.md>`_.
+
+The PC sampling service is tied to a GPU agent. To extract the list of available agents, use the ``rocprofiler_query_available_agents`` as shown in the following code snippet:
+
+.. code-block:: cpp
+
+    std::vector<rocprofiler_agent_v0_t> agents;
+
+    // Callback used by rocprofiler_query_available_agents to return
+    // agents on the device. This can include CPU agents as well.
+    // Select GPU agents only (type == ROCPROFILER_AGENT_TYPE_GPU)
+    rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
+                                                            const void**                agents_arr,
+                                                            size_t                      num_agents,
+                                                            void*                       udata) {
+        if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
+            throw std::runtime_error{"unexpected rocprofiler agent version"};
+        auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
+        for(size_t i = 0; i < num_agents; ++i)
+        {
+            const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
+            if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
+        }
+        return ROCPROFILER_STATUS_SUCCESS;
+    };
+
+    // Query the agents. Only a single callback is made that contains a vector
+    // of all agents.
+    ROCPROFILER_CALL(
+        rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
+                                           iterate_cb,
+                                           sizeof(rocprofiler_agent_t),
+                                           const_cast<void*>(static_cast<const void*>(&agents))),
+        "query available agents");
+
+Only newer GPU architectures (MI200 onwards) support this feature. To determine whether an agent with ``agent_id`` supports PC sampling and which configurations (``rocprofiler_pc_sampling_configuration_t``) are available, use ``rocprofiler_query_pc_sampling_agent_configurations``.
+
+.. code-block:: cpp
+
+    std::vector<rocprofiler_pc_sampling_configuration_t> available_configurations;
+
+    auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs,
+                 size_t                                         num_config,
+                 void*                                          user_data) {
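+        // user_data points to the available_configurations vector declared above
+        auto* avail_configs =
+            static_cast<std::vector<rocprofiler_pc_sampling_configuration_t>*>(user_data);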
+        for(size_t i = 0; i < num_config; i++)
+        {
+            avail_configs->emplace_back(configs[i]);
+        }
+        return ROCPROFILER_STATUS_SUCCESS;
+    };
+
+    auto status = rocprofiler_query_pc_sampling_agent_configurations(
+        agent_id, cb, &available_configurations);
+
+Assume that ``available_configurations`` contains a single element:
+
+.. code-block:: cpp
+
+    rocprofiler_pc_sampling_configuration_t {
+        .method = ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP,
+        .unit = ROCPROFILER_PC_SAMPLING_UNIT_TIME,
+        .min_interval = 1,
+        .max_interval = 10000
+    };
+
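One way to choose among the returned configurations is to look for an entry whose interval range contains the desired sampling interval. The sketch below is illustrative only: ``SamplingConfig`` is a hypothetical stand-in that mirrors the four fields shown above, and ``pick_config`` is not part of the ROCprofiler-SDK API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the fields of the configuration shown above.
struct SamplingConfig
{
    int           method;
    int           unit;
    std::uint64_t min_interval;
    std::uint64_t max_interval;
};

// Returns the first configuration whose [min_interval, max_interval] range
// contains `interval`, or nullptr if none does.
const SamplingConfig*
pick_config(const std::vector<SamplingConfig>& configs, std::uint64_t interval)
{
    for(const auto& cfg : configs)
    {
        assert(cfg.min_interval <= cfg.max_interval);
        if(interval >= cfg.min_interval && interval <= cfg.max_interval) return &cfg;
    }
    return nullptr;
}
```

With the single configuration above (``min_interval = 1``, ``max_interval = 10000``), an interval of 1000 is accepted, while an interval of 20000 finds no match.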
+
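+Configure the PC sampling service on an agent with ``agent_id`` to generate samples every 1000 microseconds as shown here. In this snippet, ``picked_cfg`` points to the configuration chosen from ``available_configurations``, and ``buffer_id`` is the ID of the buffer created earlier (``buff`` in the first snippet):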
+
+.. code-block:: cpp
+
+    auto status = rocprofiler_configure_pc_sampling_service(ctx,
+                                                            agent_id,
+                                                            picked_cfg->method,
+                                                            picked_cfg->unit,
+                                                            1000,  // 1000 us
+                                                            buffer_id,
+                                                            0);
+    if (status == ROCPROFILER_STATUS_SUCCESS)
+    {
+        // PC Sampling service has been configured successfully.
+    }
+    else
+    {
+        // code for error handling
+    }
+
+.. note::
+
+    Multiple processes can share the same GPU agent simultaneously, so the following A->B->A problem is possible on shared systems. For example, process A can query available configurations and opt to configure the service with configuration CA. However, if process B manages to finish configuring the service with configuration CB, then process A will fail. Thus, it is advisable for process A to repeat the querying process to observe configuration CB and reuse it for configuring the PC sampling service. For more details, refer to the `Samples <https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples>`_.
+
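The query-and-retry approach described in the note can be sketched generically. This is a minimal model, not SDK code: ``Config`` is a hypothetical placeholder, and the ``query`` and ``configure`` callables stand in for whatever wrappers a tool writes around the query and configure calls shown above.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical placeholder for a PC sampling configuration.
struct Config
{
    int id;
};

// Queries the available configurations and tries to configure the service.
// If configuring fails (for example, because another process configured the
// agent in the meantime), the query is repeated to observe the new state.
bool
configure_with_retry(const std::function<std::vector<Config>()>& query,
                     const std::function<bool(const Config&)>&  configure,
                     int                                        max_attempts)
{
    assert(max_attempts > 0);
    for(int attempt = 0; attempt < max_attempts; ++attempt)
    {
        const std::vector<Config> configs = query();
        if(configs.empty()) return false;  // nothing to configure with
        if(configure(configs.front())) return true;
    }
    return false;
}
```

In the scenario from the note, the first query returns configuration CA, configuring with CA fails because process B has already installed CB, and the second query returns CB, which then succeeds.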
+Processing PC samples
+++++++++++++++++++++++
+
+The PC sampling service asynchronously delivers samples via a dedicated callback (``pc_sampling_callback``). The following code snippet outlines the process of iterating over samples.
+
+.. code-block:: cpp
+
+    void
+    pc_sampling_callback(rocprofiler_context_id_t ctx,
+                         rocprofiler_buffer_id_t buff,
+                         rocprofiler_record_header_t** headers,
+                         size_t num_headers,
+                         void* data,
+                         uint64_t drop_count)
+    {
+        for(size_t i = 0; i < num_headers; i++)
+        {
+            auto* cur_header = headers[i];
+
+            if(cur_header->category == ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING)
+            {
+                if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_HOST_TRAP_V0_SAMPLE)
+                {
+                    auto* pc_sample = static_cast<rocprofiler_pc_sampling_record_host_trap_v0_t*>(
+                        cur_header->payload);
+
+                    // Processing a single sample...
+                }
+                else
+                {
+                    // ...
+                }
+            }
+        }
+    }
+
+
+
+For more information on the data comprising a single sample, see `pc_sampling.h <https://github.com/ROCm/rocprofiler-sdk/blob/amd-mainline/source/include/rocprofiler-sdk/pc_sampling.h>`_.
+
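Since the outcome of PC sampling is a histogram of sampled instructions, the per-sample processing step could simply count how often each program counter value appears. The sketch below assumes a simplified, hypothetical ``PcSample`` type that carries only a PC value. The real record type has more fields, which are described in ``pc_sampling.h``.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified sample: only the sampled program counter.
struct PcSample
{
    std::uint64_t pc;
};

// Counts how many times each program counter value was sampled.
std::map<std::uint64_t, std::uint64_t>
build_histogram(const std::vector<PcSample>& samples)
{
    std::map<std::uint64_t, std::uint64_t> histogram;
    for(const auto& s : samples)
        ++histogram[s.pc];
    return histogram;
}
```

Instructions with the highest counts are the ones most frequently observed in flight, which makes them natural starting points for investigation.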
+.. note::
+    A user can synchronously flush buffers via ``rocprofiler_flush_buffer``, which triggers ``pc_sampling_callback``.
+
+
+
@@ -1,239 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK API reference, Tool library API"
---
-
-# ROCprofiler-SDK tool library
-
-The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.X.Y.Z`, which uses these libraries.
-
-## ROCm runtimes design
-
-The ROCm runtimes are designed to directly communicate with a helper library named `rocprofiler-register` during initialization. This library performs cursory checks to find if a tool requires ROCprofiler-SDK services. This detection is based on the presence of one or more instances of `rocprofiler_configure` in the tool or `ROCP_TOOL_LIBRARIES` environment variable. This design provides drastic improvement over previous designs, which relied solely on a tool racing to set runtime-specific environment variables like `HSA_TOOLS_LIB` before the runtime initialization.
-
-## Tool library design
-
-When ROCprofiler-SDK detects `rocprofiler_configure` in a tool's symbol table, ROCprofiler-SDK invokes `rocprofiler-configure` with parameters such as ROCprofiler-SDK version that invokes the function, number of tools already invoked, and a unique identifier for the tool. The tool returns a pointer to a `rocprofiler_tool_configure_result_t` struct, which, if non-null, provides ROCprofiler-SDK with:
-
- Function to be called for tool initialization, which is also the opportunity for context creation.
- Function to be called when ROCprofiler-SDK is finalized.
- A pointer to data to be provided to the tool when ROCprofiler-SDK calls the initialization and finalization functions.
-
-ROCprofiler-SDK provides a `rocprofiler-sdk/registration.h` header file, which forward declares the `rocprofiler_configure` function with the necessary compiler function attributes to ensure that the `rocprofiler-configure` symbol is publicly visible.
-
-```cpp
-#include <rocprofiler-sdk/registration.h>
-
-namespace
-{
-// saves the data provided to rocprofiler_configure
-struct ToolData
-{
-    uint32_t                              version;
-    const char*                           runtime_version;
-    uint32_t                              priority;
-    rocprofiler_client_id_t               client_id;
-};
-
-// tool initialization function
-int
-tool_init(rocprofiler_client_finalize_t fini_func,
-          void* tool_data_v);
-
-// tool finalization function
-void
-tool_fini(void* tool_data_v);
-}
-
-extern "C"
-{
-rocprofiler_tool_configure_result_t*
-rocprofiler_configure(uint32_t                 version,
-                      const char*              runtime_version,
-                      uint32_t                 priority,
-                      rocprofiler_client_id_t* client_id)
-{
-    //If not the first tool to register, indicate that the tool doesn't want to do anything
-    if(priority > 0) return nullptr;
-
-    // (optional) Provide a name for this tool to rocprofiler
-    client_id->name = "ExampleTool";
-
-    // (optional) create configure data
-    static auto data = ToolData{ version,
-                                 runtime_version,
-                                 priority,
-                                 client_id };
-
-    // construct configure result
-    static auto cfg =
-        rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t),
-                                             &tool_init,
-                                             &tool_fini,
-                                             static_cast<void*>(&data) };
-
-    return &cfg;
-}
-```
-
-## Tool initialization
-
-:::{note}
-ROCprofiler-SDK does NOT support calls to any runtime function (HSA, HIP, and so on) during tool initialization.
-Invoking any functions from the runtimes results in a deadlock.
-:::
-
-For each tool that contains a `rocprofiler_configure` function and returns a non-null pointer to a `rocprofiler_tool_configure_result_t` struct, ROCprofiler-SDK invokes the `initialize` callback after completing the scan for all `rocprofiler_configure` symbols. In other words, ROCprofiler-SDK
-collects all `rocprofiler_tool_configure_result_t` instances before invoking the `initialize` member of any of these instances.
-When ROCprofiler-SDK invokes `initialize` function in a tool, this is the opportunity to create contexts:
-
-```cpp
-#include <rocprofiler-sdk/rocprofiler.h>
-
-namespace
-{
-int
-tool_init(rocprofiler_client_finalize_t fini_func,
-          void* data_v)
-{
-    // create a context
-    auto ctx = rocprofiler_context_id_t{0};
-    rocprofiler_create_context(&ctx);
-
-    // ... associate services with context ...
-
-    // start the context (optional)
-    rocprofiler_start_context(ctx);
-
-    return 0;
-}
-}
-```
-
-Although not mandatory, it is recommended that tools store the context handles to control the data collection for the services associated with the context.
-
-## Tool finalization
-
-When the `initialize` callback is invoked in the tool, ROCprofiler-SDK provides a function pointer of type `rocprofiler_client_finalize_t`.
-The tool can invoke this function pointer to explicitly invoke the `finalize` callback from the `rocprofiler_tool_configure_result_t` instance:
-
-```cpp
-#include <rocprofiler-sdk/rocprofiler.h>
-
-namespace
-{
-int
-tool_init(rocprofiler_client_finalize_t fini_func,
-          void* data_v)
-{
-    // ... see initialization section ...
-
-    // function, which finalizes the tool after 10 seconds
-    auto explicit_finalize = [](rocprofiler_client_finalize_t finalizer,
-                                rocprofiler_client_id_t* client_id)
-    {
-        std::this_thread::sleep_for(std::chrono::seconds{ 10 });
-        finalizer(client_id);
-    };
-
-    // start the context
-    rocprofiler_start_context(ctx);
-
-    // dispatch a background thread to explicitly finalize after 10 seconds
-    std::thread{ explicit_finalize, fini_func, static_cast<ToolData*>(data_v)->client_id }.detach();
-
-    return 0;
-}
-}
-```
-
-Otherwise, ROCprofiler-SDK invokes the `finalize` callback via an `atexit` handler.
-
-## Full rocprofiler-configure sample
-
-All the code snippets from the previous sections are combined here to demonstrate complete ROCProfiler configuration.
-
-```cpp
-#include <rocprofiler-sdk/registration.h>
-
-namespace
-{
-struct rocp_tool_data
-{
-    uint32_t                              version;
-    const char*                           runtime_version;
-    uint32_t                              priority;
-    rocprofiler_client_id_t               client_id;
-    rocprofiler_client_finalize_t         finalizer;
-    std::vector<rocprofiler_context_id_t> contexts;
-};
-
-void
-tool_tracing_callback(rocprofiler_callback_tracing_record_t record,
-                      rocprofiler_user_data_t*              user_data,
-                      void*                                 callback_data);
-
-int
-tool_init(rocprofiler_client_finalize_t fini_func,
-          void* tool_data_v)
-{
-    rocp_tool_data* tool_data = static_cast<rocp_tool_data*>(tool_data_v);
-
-    // Save the finalizer function
-    tool_data->finalizer = fini_func;
-
-    // create a context
-    auto ctx = rocprofiler_context_id_t{0};
-    rocprofiler_create_context(&ctx);
-
-    // Save your contexts
-    tool_data->contexts.emplace_back(ctx);
-
-    // Associate code object tracing with this context
-    rocprofiler_configure_callback_tracing_service(
-        ctx,
-        ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT,
-        nullptr,
-        0,
-        tool_tracing_callback,
-        tool_data);
-
-    // ... Associate services with contexts ...
-
-    return 0;
-}
-
-void
-tool_fini(void* tool_data);
-}
-
-extern "C"
-{
-rocprofiler_tool_configure_result_t*
-rocprofiler_configure(uint32_t                 version,
-                      const char*              runtime_version,
-                      uint32_t                 priority,
-                      rocprofiler_client_id_t* client_id)
-{
-    // (optional) Provide a name for this tool to rocprofiler
-    client_id->name = "ExampleTool";
-
-    // Info provided back to tool_init and tool_fini
-    auto* my_tool_data = new rocp_tool_data{ version,
-                                             runtime_version,
-                                             priority,
-                                             client_id,
-                                             nullptr };
-
-    // Create configure data
-    static auto cfg =
-        rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t),
-                                             &tool_init,
-                                             &tool_fini,
-                                             my_tool_data };
-
-    return &cfg;
-}
-```
@@ -0,0 +1,242 @@
+.. meta::
+    :description: ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software.
+    :keywords: ROCprofiler-SDK API reference, Tool library API
+
+.. _ROCprofiler-SDK tool library:
+
+ROCprofiler-SDK tool library
+============================
+
+The tool library uses APIs from the ``rocprofiler-sdk`` and ``rocprofiler-register`` libraries to profile and trace HIP applications. This document helps you design a tool that uses these two libraries efficiently. The ``rocprofv3`` command-line tool is also built on ``librocprofiler-sdk-tool.so.X.Y.Z``, which uses these libraries.
+
+ROCm runtimes design
+---------------------
+
+The ROCm runtimes are designed to communicate directly with a helper library named ``rocprofiler-register`` during initialization. This library performs cursory checks to determine whether a tool requires ROCprofiler-SDK services. Detection is based on the presence of one or more instances of ``rocprofiler_configure`` in the tool, or on the ``ROCP_TOOL_LIBRARIES`` environment variable. This design is a significant improvement over previous designs, which relied solely on a tool racing to set runtime-specific environment variables like ``HSA_TOOLS_LIB`` before the runtime was initialized.
+
+Tool library design
+--------------------
+
+When ROCprofiler-SDK detects ``rocprofiler_configure`` in a tool's symbol table, it invokes ``rocprofiler_configure`` with parameters such as the version of ROCprofiler-SDK invoking the function, the number of tools already invoked, and a unique identifier for the tool. The tool returns a pointer to a ``rocprofiler_tool_configure_result_t`` struct, which, if non-null, provides ROCprofiler-SDK with:
+
+- Function to be called for tool initialization, which is also the opportunity for context creation.
+- Function to be called when ROCprofiler-SDK is finalized.
+- A pointer to data to be provided to the tool when ROCprofiler-SDK calls the initialization and finalization functions.
+
+ROCprofiler-SDK provides a ``rocprofiler-sdk/registration.h`` header file, which forward declares the ``rocprofiler_configure`` function with the necessary compiler function attributes to ensure that the ``rocprofiler_configure`` symbol is publicly visible.
+
+.. code-block:: cpp
+
+    #include <rocprofiler-sdk/registration.h>
+
+    #include <cstdint>
+
+    namespace
+    {
+    // saves the data provided to rocprofiler_configure
+    struct ToolData
+    {
+        uint32_t                 version;
+        const char*              runtime_version;
+        uint32_t                 priority;
+        rocprofiler_client_id_t* client_id;
+    };
+
+    // tool initialization function
+    int
+    tool_init(rocprofiler_client_finalize_t fini_func,
+              void* tool_data_v);
+
+    // tool finalization function
+    void
+    tool_fini(void* tool_data_v);
+    }
+
+    extern "C"
+    {
+    rocprofiler_tool_configure_result_t*
+    rocprofiler_configure(uint32_t                 version,
+                          const char*              runtime_version,
+                          uint32_t                 priority,
+                          rocprofiler_client_id_t* client_id)
+    {
+        // If not the first tool to register, indicate that the tool doesn't want to do anything
+        if(priority > 0) return nullptr;
+
+        // (optional) Provide a name for this tool to rocprofiler
+        client_id->name = "ExampleTool";
+
+        // (optional) create configure data
+        static auto data = ToolData{ version,
+                                     runtime_version,
+                                     priority,
+                                     client_id };
+
+        // construct configure result
+        static auto cfg =
+            rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t),
+                                                 &tool_init,
+                                                 &tool_fini,
+                                                 static_cast<void*>(&data) };
+
+        return &cfg;
+    }
+    }
+
+Tool initialization
+-------------------
+
+.. note::
+    ROCprofiler-SDK does NOT support calls to any runtime function (HSA, HIP, and so on) during tool initialization.
+    Invoking any functions from the runtimes results in a deadlock.
+
+For each tool that contains a ``rocprofiler_configure`` function and returns a non-null pointer to a ``rocprofiler_tool_configure_result_t`` struct, ROCprofiler-SDK invokes the ``initialize`` callback only after it has finished scanning for all ``rocprofiler_configure`` symbols. In other words, ROCprofiler-SDK collects all ``rocprofiler_tool_configure_result_t`` instances before invoking the ``initialize`` member of any of them.
+The ``initialize`` callback is the tool's opportunity to create contexts:
+
+.. code-block:: cpp
+
+    #include <rocprofiler-sdk/rocprofiler.h>
+
+    namespace
+    {
+    int
+    tool_init(rocprofiler_client_finalize_t fini_func,
+              void* data_v)
+    {
+        // create a context
+        auto ctx = rocprofiler_context_id_t{0};
+        rocprofiler_create_context(&ctx);
+
+        // ... associate services with context ...
+
+        // start the context (optional)
+        rocprofiler_start_context(ctx);
+
+        return 0;
+    }
+    }
+
+Although not mandatory, it is recommended that tools store the context handles to control the data collection for the services associated with the context.
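+
+The collect-then-initialize ordering described in this section can be illustrated with a small, self-contained sketch. It does not use the SDK: ``configure_result_t``, ``configure_fn_t``, ``configure_a``, ``configure_b``, and ``register_tools`` are hypothetical stand-ins chosen for illustration, and the real registration logic may differ in detail:
+
+.. code-block:: cpp
+
+    #include <string>
+    #include <vector>
+
+    // Hypothetical stand-in for rocprofiler_tool_configure_result_t (not the SDK type)
+    struct configure_result_t
+    {
+        int (*initialize)(void* tool_data);
+        void* tool_data;
+    };
+
+    // Hypothetical stand-in for a tool's rocprofiler_configure entry point
+    using configure_fn_t = configure_result_t* (*)(unsigned priority);
+
+    // records the order in which the stand-in callbacks run
+    std::vector<std::string> event_log;
+
+    int init_a(void*) { event_log.emplace_back("init A"); return 0; }
+    int init_b(void*) { event_log.emplace_back("init B"); return 0; }
+
+    configure_result_t* configure_a(unsigned)
+    {
+        event_log.emplace_back("configure A");
+        static auto cfg = configure_result_t{ &init_a, nullptr };
+        return &cfg;
+    }
+
+    configure_result_t* configure_b(unsigned)
+    {
+        event_log.emplace_back("configure B");
+        static auto cfg = configure_result_t{ &init_b, nullptr };
+        return &cfg;
+    }
+
+    // phase 1 collects every non-null result; phase 2 runs the initialize callbacks
+    void register_tools(const std::vector<configure_fn_t>& tools)
+    {
+        auto     results  = std::vector<configure_result_t*>{};
+        unsigned priority = 0;
+        for(auto configure : tools)
+            if(auto* cfg = configure(priority++)) results.emplace_back(cfg);
+        for(auto* cfg : results)
+            cfg->initialize(cfg->tool_data);
+    }
+
+Calling ``register_tools({ &configure_a, &configure_b })`` leaves ``event_log`` holding ``configure A``, ``configure B``, ``init A``, ``init B`` in that order, so no tool is initialized before every tool has been asked to configure.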
+
+Tool finalization
+------------------
+
+When the ``initialize`` callback is invoked in the tool, ROCprofiler-SDK provides a function pointer of type ``rocprofiler_client_finalize_t``.
+The tool can invoke this function pointer to explicitly invoke the ``finalize`` callback from the ``rocprofiler_tool_configure_result_t`` instance:
+
+.. code-block:: cpp
+
+    #include <rocprofiler-sdk/rocprofiler.h>
+
+    #include <chrono>
+    #include <thread>
+
+    namespace
+    {
+        int
+        tool_init(rocprofiler_client_finalize_t fini_func,
+                  void* data_v)
+        {
+            // ... see initialization section ...
+
+            // function, which finalizes the tool after 10 seconds
+            auto explicit_finalize = [](rocprofiler_client_finalize_t finalizer,
+                                        rocprofiler_client_id_t* client_id)
+            {
+                std::this_thread::sleep_for(std::chrono::seconds{ 10 });
+                // the finalizer takes the client identifier by value
+                finalizer(*client_id);
+            };
+
+            // start the context
+            rocprofiler_start_context(ctx);
+
+            // dispatch a background thread to explicitly finalize after 10 seconds
+            std::thread{ explicit_finalize, fini_func, static_cast<ToolData*>(data_v)->client_id }.detach();
+
+            return 0;
+        }
+    }
+
+Otherwise, ROCprofiler-SDK invokes the ``finalize`` callback via an ``atexit`` handler.
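+
+The delayed, explicit finalization pattern can be tried without the SDK installed. The following minimal sketch assumes hypothetical stand-ins (``client_id_t``, ``finalize_fn_t``, ``fake_finalizer``, and ``finalize_after`` are not ROCprofiler-SDK names) and returns the thread instead of detaching it, so that the caller can wait for the finalizer deterministically:
+
+.. code-block:: cpp
+
+    #include <atomic>
+    #include <chrono>
+    #include <cstdio>
+    #include <thread>
+
+    // Hypothetical stand-ins for the SDK client identifier and finalizer types
+    struct client_id_t
+    {
+        const char* name;
+    };
+    using finalize_fn_t = void (*)(client_id_t);
+
+    // counts finalizer invocations so that the effect can be observed
+    std::atomic<int> finalize_calls{ 0 };
+
+    // stand-in for the function pointer passed to the initialize callback
+    void
+    fake_finalizer(client_id_t id)
+    {
+        ++finalize_calls;
+        std::printf("finalizing %s\n", id.name);
+    }
+
+    // starts a thread that invokes the finalizer once the delay has elapsed
+    std::thread
+    finalize_after(finalize_fn_t finalizer, client_id_t* id, std::chrono::milliseconds delay)
+    {
+        return std::thread{ [=]() {
+            std::this_thread::sleep_for(delay);
+            finalizer(*id);
+        } };
+    }
+
+For example, given ``static auto id = client_id_t{ "ExampleTool" };``, the call ``finalize_after(&fake_finalizer, &id, std::chrono::milliseconds{ 50 }).join()`` blocks until the stand-in finalizer has run exactly once.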
+
+Full rocprofiler_configure sample
+----------------------------------
+
+All the code snippets from the previous sections are combined here to demonstrate a complete ROCprofiler-SDK tool configuration.
+
+.. code-block:: cpp
+
+    #include <rocprofiler-sdk/registration.h>
+    #include <rocprofiler-sdk/rocprofiler.h>
+
+    #include <cstdint>
+    #include <vector>
+
+    namespace
+    {
+    struct rocp_tool_data
+    {
+        uint32_t                              version;
+        const char*                           runtime_version;
+        uint32_t                              priority;
+        rocprofiler_client_id_t*              client_id;
+        rocprofiler_client_finalize_t         finalizer;
+        std::vector<rocprofiler_context_id_t> contexts;
+    };
+
+    void
+    tool_tracing_callback(rocprofiler_callback_tracing_record_t record,
+                          rocprofiler_user_data_t*              user_data,
+                          void*                                 callback_data);
+
+    int
+    tool_init(rocprofiler_client_finalize_t fini_func,
+              void* tool_data_v)
+    {
+        rocp_tool_data* tool_data = static_cast<rocp_tool_data*>(tool_data_v);
+
+        // Save the finalizer function
+        tool_data->finalizer = fini_func;
+
+        // create a context
+        auto ctx = rocprofiler_context_id_t{0};
+        rocprofiler_create_context(&ctx);
+
+        // Save your contexts
+        tool_data->contexts.emplace_back(ctx);
+
+        // Associate code object tracing with this context
+        rocprofiler_configure_callback_tracing_service(
+            ctx,
+            ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT,
+            nullptr,
+            0,
+            tool_tracing_callback,
+            tool_data);
+
+        // ... Associate services with contexts ...
+
+        return 0;
+    }
+
+    void
+    tool_fini(void* tool_data);
+    }
+
+    extern "C"
+    {
+    rocprofiler_tool_configure_result_t*
+    rocprofiler_configure(uint32_t                 version,
+                          const char*              runtime_version,
+                          uint32_t                 priority,
+                          rocprofiler_client_id_t* client_id)
+    {
+        // (optional) Provide a name for this tool to rocprofiler
+        client_id->name = "ExampleTool";
+
+        // Info provided back to tool_init and tool_fini
+        auto* my_tool_data = new rocp_tool_data{ version,
+                                                 runtime_version,
+                                                 priority,
+                                                 client_id,
+                                                 nullptr };
+
+        // Create configure data
+        static auto cfg =
+            rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t),
+                                                 &tool_init,
+                                                 &tool_fini,
+                                                 my_tool_data };
+
+        return &cfg;
+    }
+    }
+
@@ -1,51 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "ROCprofiler-SDK, ROCProfiler-SDK samples"
---
-
-# ROCprofiler-SDK samples
-
-The samples are provided to help you see the profiler in action.
-
-## Finding samples
-
-The ROCm installation provides sample programs and `rocprofv3` tool.
-
- Sample programs are installed here:
-
-    ```bash
-    /opt/rocm/share/rocprofiler-sdk/samples
-    ```
-
- `rocprofv3` tool is installed here:
-
-    ```bash
-    /opt/rocm/bin
-    ```
-
-## Building Samples
-
-To build samples from any directory, run:
-
-```bash
-cmake -B build-rocprofiler-sdk-samples /opt/rocm/share/rocprofiler-sdk/samples -DCMAKE_PREFIX_PATH=/opt/rocm
-cmake --build build-rocprofiler-sdk-samples --target all --parallel 8
-```
-
-## Running samples
-
-To run the built samples, `cd` into the `build-rocprofiler-sdk-samples` directory and run:
-
-```bash
-ctest -V
-```
-
-:::{note}
-Running a few of these tests require you to install [pandas](https://pandas.pydata.org/) and [pytest](https://docs.pytest.org/en/stable/) first.
-:::
-
-```bash
-/usr/local/bin/python -m pip install -r requirements.txt
-```
@@ -0,0 +1,49 @@
+.. meta::
+    :description: ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software.
+    :keywords: ROCprofiler-SDK, ROCProfiler-SDK samples
+
+ROCprofiler-SDK samples
+========================
+
+The samples are provided to help you see the profiler in action.
+
+Finding samples
+---------------
+
+The ROCm installation provides sample programs and the ``rocprofv3`` tool.
+
+- Sample programs are installed here:
+
+  .. code-block:: bash
+
+      /opt/rocm/share/rocprofiler-sdk/samples
+
+- The ``rocprofv3`` tool is installed here:
+
+  .. code-block:: bash
+
+      /opt/rocm/bin
+
+Building samples
+----------------
+
+To build samples from any directory, run:
+
+.. code-block:: bash
+
+    cmake -B build-rocprofiler-sdk-samples /opt/rocm/share/rocprofiler-sdk/samples -DCMAKE_PREFIX_PATH=/opt/rocm
+    cmake --build build-rocprofiler-sdk-samples --target all --parallel 8
+
+
+Running samples
+---------------
+
+To run the built samples, ``cd`` into the ``build-rocprofiler-sdk-samples`` directory and run:
+
+.. code-block:: bash
+
+    ctest -V
+
@@ -41,11 +41,10 @@ In the preceding example, an extra agent info file is generated for the ``mpirun

 .. code-block:: bash

-    3000020_agent_info.csv
-    3000019_agent_info.csv
-    3000020_hip_api_trace.csv
-    3000019_hip_api_trace.csv
-    3164458_agent_info.csv
+    ubuntu-latest.3000020.1/3000020_agent_info.csv
+    ubuntu-latest.3000020.0/3000019_agent_info.csv
+    ubuntu-latest.3000020.1/3000020_hip_api_trace.csv
+    ubuntu-latest.3000020.0/3000019_hip_api_trace.csv

 ROCTx annotations
 ===================
@@ -1,80 +0,0 @@
---
-myst:
-    html_meta:
-        "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
-        "keywords": "Installing ROCprofiler-SDK, Install ROCprofiler-SDK, Build ROCprofiler-SDK"
---
-
-# ROCprofiler-SDK installation
-
-This document provides information required to install ROCprofiler-SDK from source.
-
-## Supported systems
-
-ROCprofiler-SDK is supported only on Linux. The following distributions are tested:
-
- Ubuntu 20.04
- Ubuntu 22.04
- OpenSUSE 15.4
- RedHat 8.8
-
-ROCprofiler-SDK might operate as expected on other [Linux distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems), but has not been tested.
-
-### Identifying the operating system
-
-To identify the Linux distribution and version, see the `/etc/os-release` and `/usr/lib/os-release` files:
-
-```shell
-$ cat /etc/os-release
-NAME="Ubuntu"
-VERSION="20.04.4 LTS (Focal Fossa)"
-ID=ubuntu
-...
-VERSION_ID="20.04"
-...
-```
-
-The relevant fields are `ID` and the `VERSION_ID`.
-
-## Build requirements
-
-Install [CMake](https://cmake.org/) version 3.21 (or later).
-
-:::{note}
-If the `CMake` installed on the system is too old, you can install a new version using various methods. One of the easiest options is to use PyPi (Python’s pip).
-:::
-
-```bash
-pip install --user 'cmake==3.22.0'
-export PATH=${HOME}/.local/bin:${PATH}
-```
-
-## Building ROCprofiler-SDK
-
-```bash
-git clone https://github.com/ROCm/rocprofiler-sdk.git rocprofiler-sdk-source
-cmake                                         \
-      -B rocprofiler-sdk-build                \
-      -D ROCPROFILER_BUILD_TESTS=ON           \
-      -D ROCPROFILER_BUILD_SAMPLES=ON         \
-      -D CMAKE_INSTALL_PREFIX=/opt/rocm       \
-       rocprofiler-sdk-source
-
-cmake --build rocprofiler-sdk-build --target all --parallel 8
-```
-
-## Installing ROCprofiler-SDK
-
-To install ROCprofiler-SDK from the `rocprofiler-sdk-build` directory, run:
-
-```bash
-cmake --build rocprofiler-sdk-build --target install
-```
-
-## Testing ROCprofiler-SDK
-
-To run the built tests, `cd` into the `rocprofiler-sdk-build` directory and run:
-
-```bash
-ctest --output-on-failure -O ctest.all.log
-```
@@ -0,0 +1,98 @@
+.. meta::
+    :description: ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software.
+    :keywords: Installing ROCprofiler-SDK, Install ROCprofiler-SDK, Build ROCprofiler-SDK
+
+ROCprofiler-SDK installation
+============================
+
+This document provides information required to install ROCprofiler-SDK from source.
+
+Supported systems
+-----------------
+
+ROCprofiler-SDK is supported only on Linux. The following distributions are tested:
+
+- Ubuntu 20.04
+- Ubuntu 22.04
+- OpenSUSE 15.4
+- RedHat 8.8
+
+ROCprofiler-SDK might work as expected on other `Linux distributions <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems>`_, but it has not been tested on them.
+
+Identifying the operating system
+--------------------------------
+
+To identify the Linux distribution and version, see the ``/etc/os-release`` and ``/usr/lib/os-release`` files:
+
+.. code-block:: bash
+
+    $ cat /etc/os-release
+    NAME="Ubuntu"
+    VERSION="20.04.4 LTS (Focal Fossa)"
+    ID=ubuntu
+    ...
+    VERSION_ID="20.04"
+    ...
+
+The relevant fields are ``ID`` and ``VERSION_ID``.
+
+Build requirements
+------------------
+
+Install CMake
+~~~~~~~~~~~~~
+
+Install `CMake <https://cmake.org/>`_ version 3.21 (or later).
+
+.. note::
+    If the CMake version installed on the system is too old, you can install a newer version in several ways. One of the easiest is to install it from PyPI using Python's pip.
+
+.. code-block:: bash
+
+    /usr/local/bin/python -m pip install --user 'cmake==3.22.0'
+    export PATH=${HOME}/.local/bin:${PATH}
+
+
+Building ROCprofiler-SDK
+------------------------
+
+.. code-block:: bash
+
+    git clone https://github.com/ROCm/rocprofiler-sdk.git rocprofiler-sdk-source
+    cmake                                   \
+        -B rocprofiler-sdk-build            \
+        -D ROCPROFILER_BUILD_TESTS=ON       \
+        -D ROCPROFILER_BUILD_SAMPLES=ON     \
+        -D CMAKE_INSTALL_PREFIX=/opt/rocm   \
+        rocprofiler-sdk-source
+
+    cmake --build rocprofiler-sdk-build --target all --parallel 8
+
+Installing ROCprofiler-SDK
+--------------------------
+
+To install ROCprofiler-SDK from the ``rocprofiler-sdk-build`` directory, run:
+
+.. code-block:: bash
+
+    cmake --build rocprofiler-sdk-build --target install
+
+Testing ROCprofiler-SDK
+-----------------------
+
+To run the built tests, ``cd`` into the ``rocprofiler-sdk-build`` directory and run:
+
+.. code-block:: bash
+
+    ctest --output-on-failure -O ctest.all.log
+
+
+.. note::
+    Some of these tests require `pandas <https://pandas.pydata.org/>`_ and `pytest <https://docs.pytest.org/en/stable/>`_. Install them before running the tests:
+
+    .. code-block:: bash
+
+        /usr/local/bin/python -m pip install -r requirements.txt