rocprofv3 doc updates (#982)

* updating rocprofv3 * using rocprofv3 * review updates * naming standardization * Update source/docs/how-to/using-rocprofv3.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * review comments * adding API references * kernel filtering * Remove Sphinx warn as error To bypass false warning for linking between rst and md * remove unused (duplicate) refs in _toc.yml.in --------- Co-authored-by: Gopesh Bhardwaj <gopesh.bhardwaj@amd.com> Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com> Co-authored-by: Peter Jun Park <peter.park@amd.com>
2024-08-03 00:38:04 +05:30
@@ -0,0 +1,251 @@
+# Buffered services
+
+For the buffered approach, the supported buffer record categories are enumerated in the `rocprofiler_buffer_category_t` type.
+
+## Overview
+
+In the buffered approach, callbacks are received for batches of records from an internal (background) thread.
+Supported buffered tracing services are enumerated in `rocprofiler_buffer_tracing_kind_t`. Configuring
+a buffer tracing service requires the creation of a buffer. When the buffer is "flushed", either implicitly
+or explicitly, a callback to the tool will be invoked which provides an array of one or more buffer records.
+A buffer can be explicitly flushed via the `rocprofiler_flush_buffer` function.
+
+## Subscribing to Buffer Tracing Services
+
+During tool initialization, tools configure buffer tracing via the `rocprofiler_configure_buffer_tracing_service`
+function. However, before invoking `rocprofiler_configure_buffer_tracing_service`, the tool must create a buffer
+for the tracing records.
+
+### Creating a Buffer
+
+```cpp
+rocprofiler_status_t
+rocprofiler_create_buffer(rocprofiler_context_id_t        context,
+                          size_t                          size,
+                          size_t                          watermark,
+                          rocprofiler_buffer_policy_t     policy,
+                          rocprofiler_buffer_tracing_cb_t callback,
+                          void*                           callback_data,
+                          rocprofiler_buffer_id_t*        buffer_id);
+```
+
+The `size` parameter is the size of the buffer in bytes and will be rounded up to the nearest
+memory page size (defined by `sysconf(_SC_PAGESIZE)`); the default memory page size on Linux
+is 4096 bytes (4 KB).
+
+The `watermark` parameter specifies the number of bytes at which
+the buffer should be "flushed", i.e. when the records in the buffer should invoke the
+`callback` parameter to deliver the records to the tool. For example, if a buffer has a size
+of 4096 bytes and the watermark is set to 48 bytes, six 8-byte records can be placed in the
+buffer before `callback` is invoked. However, every 64-byte record that is placed in the
+buffer will trigger a flush. It is safe to set the `watermark` to any value between
+zero and the buffer size.
+
+The `policy` parameter specifies the behavior for when a record is larger than the
+amount of free space in the current buffer. For example, if a buffer has a size of
+4000 bytes with a watermark set to 4000 bytes and 3998 of the bytes in the buffer
+have been populated with records, the `policy` dictates how to handle an incoming record larger
+than 2 bytes. The `ROCPROFILER_BUFFER_POLICY_DISCARD` policy dictates that all records larger
+than 2 bytes should be dropped until the tool _explicitly_ flushes the buffer via
+a `rocprofiler_flush_buffer` function call, whereas the `ROCPROFILER_BUFFER_POLICY_LOSSLESS`
+policy dictates that the current buffer should be swapped out for an empty buffer, the record
+placed in that new buffer, and the former (full) buffer _implicitly_ flushed.
+
+The `callback` parameter is the function that rocprofiler-sdk should invoke when flushing
+the buffer; the value of the `callback_data` parameter will be passed as one of the arguments
+to the `callback` function.
+
+The `buffer_id` parameter is an output parameter for the function call and will have a
+non-zero handle field after successful buffer creation.
+
+### Creating a Dedicated Thread for Buffer Callbacks
+
+By default, all buffers will use the same (default) background thread created by rocprofiler-sdk to
+invoke their callback. However, rocprofiler-sdk provides an interface for tools to specify the
+creation of an additional background thread for one or more of their buffers.
+
+Callback threads for buffers are created via the `rocprofiler_create_callback_thread` function:
+
+```cpp
+rocprofiler_status_t
+rocprofiler_create_callback_thread(rocprofiler_callback_thread_t* cb_thread_id);
+```
+
+Buffers are assigned to that callback thread via the `rocprofiler_assign_callback_thread` function:
+
+```cpp
+rocprofiler_status_t
+rocprofiler_assign_callback_thread(rocprofiler_buffer_id_t       buffer_id,
+                                   rocprofiler_callback_thread_t cb_thread_id);
+```
+
+#### Buffer Callback Thread Creation and Assignment Example
+
+```cpp
+{
+    // create a context
+    auto context_id = rocprofiler_context_id_t{};
+    rocprofiler_create_context(&context_id);
+
+    // create a buffer associated with the context
+    auto buffer_id  = rocprofiler_buffer_id_t{};
+    rocprofiler_create_buffer(context_id, ..., &buffer_id);
+
+    // specify that a new callback thread should be created and
+    // assign its identifier to the "thr_id" variable
+    auto thr_id = rocprofiler_callback_thread_t{};
+    rocprofiler_create_callback_thread(&thr_id);
+
+    // assign the buffer callback to be delivered on this thread
+    rocprofiler_assign_callback_thread(buffer_id, thr_id);
+}
+```
+
+### Configuring Buffer Tracing Services
+
+```cpp
+rocprofiler_status_t
+rocprofiler_configure_buffer_tracing_service(rocprofiler_context_id_t          context_id,
+                                             rocprofiler_buffer_tracing_kind_t kind,
+                                             rocprofiler_tracing_operation_t*  operations,
+                                             size_t                            operations_count,
+                                             rocprofiler_buffer_id_t           buffer_id);
+```
+
+The `kind` parameter is a high-level specifier of which service to trace (also known as a "domain").
+Domain examples include, but are not limited to, the HIP API, the HSA API, and kernel dispatches.
+For each domain, there are (often) various "operations", which can be used to restrict the callbacks
+to a subset within the domain. For domains which correspond to APIs, the "operations" are the functions
+which compose the API. If all operations in a domain should be traced, the `operations` and `operations_count`
+parameters can be set to `nullptr` and `0`, respectively. If the tracing domain should be restricted to a subset
+of operations, the tool library should specify a C-array of type `rocprofiler_tracing_operation_t` and the
+size of the array for the `operations` and `operations_count` parameters.
+
+Similar to `rocprofiler_configure_callback_tracing_service`,
+`rocprofiler_configure_buffer_tracing_service` will return an error if a buffer service for a given context
+and domain is configured more than once.
+
+#### Example
+
+```cpp
+{
+    auto ctx = rocprofiler_context_id_t{};
+    // ... creation of context, etc. ...
+
+    // buffer parameters
+    constexpr auto KB          = 1024;  // 1024 bytes
+    constexpr auto buffer_size = 16 * KB;
+    constexpr auto watermark   = 15 * KB;
+    constexpr auto policy      = ROCPROFILER_BUFFER_POLICY_LOSSLESS;
+
+    // buffer handle
+    auto buffer_id = rocprofiler_buffer_id_t{};
+
+    // create a buffer associated with the context
+    rocprofiler_create_buffer(
+        ctx, buffer_size, watermark, policy, callback_func, nullptr, &buffer_id);
+
+    // configure HIP runtime API function records to be placed in buffer
+    rocprofiler_configure_buffer_tracing_service(
+        ctx, ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API, nullptr, 0, buffer_id);
+
+    // configure kernel dispatch records to be placed in buffer
+    // (more than one service can use the same buffer)
+    rocprofiler_configure_buffer_tracing_service(
+        ctx, ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH, nullptr, 0, buffer_id);
+
+    // ... etc. ...
+}
+```
+
+## Buffer Tracing Callback Function
+
+Rocprofiler-sdk buffer tracing callback functions have the signature:
+
+```cpp
+typedef void (*rocprofiler_buffer_tracing_cb_t)(rocprofiler_context_id_t      context,
+                                                rocprofiler_buffer_id_t       buffer_id,
+                                                rocprofiler_record_header_t** headers,
+                                                size_t                        num_headers,
+                                                void*                         data,
+                                                uint64_t                      drop_count);
+```
+
+The `rocprofiler_record_header_t` data type provides three pieces of information:
+
+1. Category (`rocprofiler_buffer_category_t`)
+2. Kind
+3. Payload
+
+The category is used to distinguish the classification of the buffer record. For all
+services configured via `rocprofiler_configure_buffer_tracing_service`, the category will
+be equal to the value of `ROCPROFILER_BUFFER_CATEGORY_TRACING`. The meaning of the kind
+field depends on the category, but when the category is `ROCPROFILER_BUFFER_CATEGORY_TRACING`,
+the kind value will be equivalent to the
+`rocprofiler_buffer_tracing_kind_t` value passed to
+`rocprofiler_configure_buffer_tracing_service`, e.g. `ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH`.
+Once the category and kind have been determined, the payload can be cast:
+
+```cpp
+{
+    if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
+       header->kind == ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API)
+    {
+        auto* record =
+            static_cast<rocprofiler_buffer_tracing_hip_api_record_t*>(header->payload);
+
+        // ... etc. ...
+    }
+}
+```
+
+### Buffer Tracing Callback Function Example
+
+```cpp
+void
+buffer_callback_func(rocprofiler_context_id_t      context,
+                     rocprofiler_buffer_id_t       buffer_id,
+                     rocprofiler_record_header_t** headers,
+                     size_t                        num_headers,
+                     void*                         user_data,
+                     uint64_t                      drop_count)
+{
+    for(size_t i = 0; i < num_headers; ++i)
+    {
+        auto* header = headers[i];
+
+        if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
+           header->kind == ROCPROFILER_BUFFER_TRACING_HIP_RUNTIME_API)
+        {
+            auto* record =
+                static_cast<rocprofiler_buffer_tracing_hip_api_record_t*>(header->payload);
+
+            // ... etc. ...
+        }
+        else if(header->category == ROCPROFILER_BUFFER_CATEGORY_TRACING &&
+                header->kind == ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH)
+        {
+            auto* record =
+                static_cast<rocprofiler_buffer_tracing_kernel_dispatch_record_t*>(header->payload);
+
+            // ... etc. ...
+        }
+        else
+        {
+            throw std::runtime_error{"unhandled record header category + kind"};
+        }
+    }
+}
+```
+
+## Buffer Tracing Record
+
+Unlike callback tracing records, there is no common set of data for each buffer tracing record. However,
+many buffer tracing records contain a `kind` field and an `operation` field.
+The name of a tracing kind can be obtained via the `rocprofiler_query_buffer_tracing_kind_name` function.
+The name of an operation specific to a tracing kind can be obtained via the `rocprofiler_query_buffer_tracing_kind_operation_name`
+function. One can also iterate over all the buffer tracing kinds and operations for each tracing kind via the
+`rocprofiler_iterate_buffer_tracing_kinds` and `rocprofiler_iterate_buffer_tracing_kind_operations` functions.
+
+The buffer tracing record data types can be found in the `rocprofiler-sdk/buffer_tracing.h` header
+(`source/include/rocprofiler-sdk/buffer_tracing.h` in the [rocprofiler-sdk GitHub repository](https://github.com/ROCm/rocprofiler-sdk)).
@@ -0,0 +1,337 @@
+# Callback tracing services
+
+## Overview
+
+Callback tracing services provide immediate callbacks to a tool on the current CPU thread when a given event occurs.
+For example, when tracing an API function, e.g. `hipSetDevice`, callback tracing invokes a user-specified callback
+before and after the traced function executes on the thread which is invoking the API function.
+
+## Subscribing to Callback Tracing Services
+
+During tool initialization, tools configure callback tracing via the `rocprofiler_configure_callback_tracing_service`
+function:
+
+```cpp
+rocprofiler_status_t
+rocprofiler_configure_callback_tracing_service(rocprofiler_context_id_t            context_id,
+                                               rocprofiler_callback_tracing_kind_t kind,
+                                               rocprofiler_tracing_operation_t*    operations,
+                                               size_t                              operations_count,
+                                               rocprofiler_callback_tracing_cb_t   callback,
+                                               void*                               callback_args);
+```
+
+The `kind` parameter is a high-level specifier of which service to trace (also known as a "domain").
+Domain examples include, but are not limited to, the HIP API, the HSA API, and kernel dispatches.
+For each domain, there are (often) various "operations", which can be used to restrict the callbacks
+to a subset within the domain. For domains which correspond to APIs, the "operations" are the functions
+which compose the API. If all operations in a domain should be traced, the `operations` and `operations_count`
+parameters can be set to `nullptr` and `0`, respectively. If the tracing domain should be restricted to a subset
+of operations, the tool library should specify a C-array of type `rocprofiler_tracing_operation_t` and the
+size of the array for the `operations` and `operations_count` parameters.
+
+`rocprofiler_configure_callback_tracing_service` will return an error if a callback service for a given context
+and given domain is configured more than once. For example, if one only wanted to trace two functions within
+the HIP runtime API, `hipGetDevice` and `hipSetDevice`, the following code would accomplish this objective:
+
+```cpp
+{
+    auto ctx = rocprofiler_context_id_t{};
+    // ... creation of context, etc. ...
+
+    // array of operations (i.e. API functions)
+    auto operations = std::array<rocprofiler_tracing_operation_t, 2>{
+        ROCPROFILER_HIP_RUNTIME_API_ID_hipSetDevice,
+        ROCPROFILER_HIP_RUNTIME_API_ID_hipGetDevice
+    };
+
+    rocprofiler_configure_callback_tracing_service(ctx,
+                                                   ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API,
+                                                   operations.data(),
+                                                   operations.size(),
+                                                   callback_func,
+                                                   nullptr);
+    // ... etc. ...
+}
+```
+
+But the following code would be invalid:
+
+```cpp
+{
+    auto ctx = rocprofiler_context_id_t{};
+    // ... creation of context, etc. ...
+
+    // array of operations (i.e. API functions)
+    auto operations = std::array<rocprofiler_tracing_operation_t, 2>{
+        ROCPROFILER_HIP_RUNTIME_API_ID_hipSetDevice,
+        ROCPROFILER_HIP_RUNTIME_API_ID_hipGetDevice
+    };
+
+    for(auto op : operations)
+    {
+        // after the first iteration, will return ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED
+        rocprofiler_configure_callback_tracing_service(ctx,
+                                                       ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API,
+                                                       &op,
+                                                       1,
+                                                       callback_func,
+                                                       nullptr);
+    }
+
+    // ... etc. ...
+}
+```
+
+## Callback Tracing Callback Function
+
+Rocprofiler-sdk callback tracing callback functions have the signature:
+
+```cpp
+typedef void (*rocprofiler_callback_tracing_cb_t)(rocprofiler_callback_tracing_record_t record,
+                                                  rocprofiler_user_data_t*              user_data,
+                                                  void*                                 callback_data);
+```
+
+The `record` parameter contains the information to uniquely identify a tracing record type and has the
+following definition:
+
+```cpp
+typedef struct rocprofiler_callback_tracing_record_t
+{
+    rocprofiler_context_id_t            context_id;
+    rocprofiler_thread_id_t             thread_id;
+    rocprofiler_correlation_id_t        correlation_id;
+    rocprofiler_callback_tracing_kind_t kind;
+    uint32_t                            operation;
+    rocprofiler_callback_phase_t        phase;
+    void*                               payload;
+} rocprofiler_callback_tracing_record_t;
+```
+
+The underlying type of the `payload` field above is typically unique to a domain and, less frequently, an operation.
+For example, for the `ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API` and `ROCPROFILER_CALLBACK_TRACING_HIP_COMPILER_API` domains,
+the payload should be cast to `rocprofiler_callback_tracing_hip_api_data_t*`, which will contain the arguments
+to the function and (in the exit phase) the return value of the function. The payload field will only be a valid
+pointer during the invocation of the callback function(s).
+
+The `user_data` parameter can be used to store data in between callback phases. It is unique for every
+instance of an operation. For example, if the tool library wishes to store the timestamp of the
+`ROCPROFILER_CALLBACK_PHASE_ENTER` phase for the ensuing `ROCPROFILER_CALLBACK_PHASE_EXIT` callback,
+this data can be stored in a manner similar to the following:
+
+```cpp
+void
+callback_func(rocprofiler_callback_tracing_record_t record,
+              rocprofiler_user_data_t*              user_data,
+              void*                                 cb_data)
+{
+    auto ts = rocprofiler_timestamp_t{};
+    rocprofiler_get_timestamp(&ts);
+
+    if(record.phase == ROCPROFILER_CALLBACK_PHASE_ENTER)
+    {
+        user_data->value = ts;
+    }
+    else if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT)
+    {
+        auto delta_ts = (ts - user_data->value);
+        // ... etc. ...
+    }
+    else
+    {
+        // ... etc. ...
+    }
+}
+```
+
+The `callback_data` argument will be the value of `callback_args` passed to `rocprofiler_configure_callback_tracing_service`
+in [the previous section](#subscribing-to-callback-tracing-services).
+
+## Callback Tracing Record
+
+The name of a tracing kind can be obtained via the `rocprofiler_query_callback_tracing_kind_name` function.
+The name of an operation specific to a tracing kind can be obtained via the `rocprofiler_query_callback_tracing_kind_operation_name`
+function. One can also iterate over all the callback tracing kinds and operations for each tracing kind via the
+`rocprofiler_iterate_callback_tracing_kinds` and `rocprofiler_iterate_callback_tracing_kind_operations` functions.
+Lastly, for a given `rocprofiler_callback_tracing_record_t` object, rocprofiler-sdk supports generically iterating over
+the arguments of the payload field for many domains.
+
+As mentioned above, within the `rocprofiler_callback_tracing_record_t` object,
+an opaque `void* payload` is provided for accessing domain-specific information.
+The data types generally follow the naming convention of `rocprofiler_callback_tracing_<DOMAIN>_data_t`,
+e.g., for the tracing kinds `ROCPROFILER_CALLBACK_TRACING_HSA_{CORE,AMD_EXT,IMAGE_EXT,FINALIZE_EXT}_API`,
+the payload should be cast to `rocprofiler_callback_tracing_hsa_api_data_t*`:
+
+```cpp
+void
+callback_func(rocprofiler_callback_tracing_record_t record,
+              rocprofiler_user_data_t*              user_data,
+              void*                                 cb_data)
+{
+    static auto hsa_domains = std::unordered_set<rocprofiler_callback_tracing_kind_t>{
+        ROCPROFILER_CALLBACK_TRACING_HSA_CORE_API,
+        ROCPROFILER_CALLBACK_TRACING_HSA_AMD_EXT_API,
+        ROCPROFILER_CALLBACK_TRACING_HSA_IMAGE_EXT_API,
+        ROCPROFILER_CALLBACK_TRACING_HSA_FINALIZE_EXT_API};
+
+    if(hsa_domains.count(record.kind) > 0)
+    {
+        auto* payload = static_cast<rocprofiler_callback_tracing_hsa_api_data_t*>(record.payload);
+
+        hsa_status_t status = payload->retval.hsa_status_t_retval;
+        if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT && status != HSA_STATUS_SUCCESS)
+        {
+            const char* _kind = nullptr;
+            const char* _operation = nullptr;
+
+            rocprofiler_query_callback_tracing_kind_name(record.kind, &_kind, nullptr);
+            rocprofiler_query_callback_tracing_kind_operation_name(
+                record.kind, record.operation, &_operation, nullptr);
+
+            // report that the HSA function returned a non-success status
+            fprintf(stderr, "[domain=%s] %s returned a non-zero exit code: %i\n", _kind, _operation, status);
+        }
+    }
+    else if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT)
+    {
+        // e.g. compute the elapsed time from a timestamp stored in user_data->value
+        // ... etc. ...
+    }
+    else
+    {
+        // ... etc. ...
+    }
+}
+```
+
+### Sample `rocprofiler_iterate_callback_tracing_kind_operation_args`
+
+```cpp
+int
+print_args(rocprofiler_callback_tracing_kind_t domain_idx,
+           uint32_t                            op_idx,
+           uint32_t                            arg_num,
+           const void* const                   arg_value_addr,
+           int32_t                             arg_indirection_count,
+           const char*                         arg_type,
+           const char*                         arg_name,
+           const char*                         arg_value_str,
+           int32_t                             arg_dereference_count,
+           void*                               data)
+{
+    if(arg_num == 0)
+    {
+        const char* _kind      = nullptr;
+        const char* _operation = nullptr;
+
+        rocprofiler_query_callback_tracing_kind_name(domain_idx, &_kind, nullptr);
+        rocprofiler_query_callback_tracing_kind_operation_name(
+            domain_idx, op_idx, &_operation, nullptr);
+
+        fprintf(stderr, "\n[%s] %s\n", _kind, _operation);
+    }
+
+    char* _arg_type = abi::__cxa_demangle(arg_type, nullptr, nullptr, nullptr);
+
+    fprintf(stderr, "    %u: %-18s %-16s = %s\n", arg_num, _arg_type, arg_name, arg_value_str);
+
+    free(_arg_type);
+
+    // unused in example
+    (void) arg_value_addr;
+    (void) arg_indirection_count;
+    (void) arg_dereference_count;
+    (void) data;
+
+    return 0;
+}
+
+void
+callback_func(rocprofiler_callback_tracing_record_t record,
+              rocprofiler_user_data_t*              user_data,
+              void*                                 cb_data)
+{
+    if(record.phase == ROCPROFILER_CALLBACK_PHASE_EXIT &&
+       record.kind == ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API &&
+       (record.operation == ROCPROFILER_HIP_RUNTIME_API_ID_hipLaunchKernel ||
+        record.operation == ROCPROFILER_HIP_RUNTIME_API_ID_hipMemcpyAsync))
+    {
+        rocprofiler_iterate_callback_tracing_kind_operation_args(
+                             record, print_args, record.phase, nullptr);
+    }
+}
+```
+
+Sample Output:
+
+```console
+
+[HIP_RUNTIME_API] hipLaunchKernel
+    0: void const*        function_address = 0x219308
+    1: rocprofiler_dim3_t numBlocks        = {z=1, y=310, x=310}
+    2: rocprofiler_dim3_t dimBlocks        = {z=1, y=32, x=32}
+    3: void**             args             = 0x7ffe6d8dd3c0
+    4: unsigned long      sharedMemBytes   = 0
+    5: ihipStream_t*      stream           = 0x17b40c0
+
+[HIP_RUNTIME_API] hipMemcpyAsync
+    0: void*              dst              = 0x7f06c7bbb010
+    1: void const*        src              = 0x7f0698800000
+    2: unsigned long      sizeBytes        = 393625600
+    3: hipMemcpyKind      kind             = DeviceToHost
+    4: ihipStream_t*      stream           = 0x25dfcf0
+```
+
+## Code Object Tracing
+
+The code object tracing service is a critical component for obtaining information regarding
+asynchronous activity on the GPU. The `rocprofiler_callback_tracing_code_object_load_data_t`
+payload (kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`, operation=`ROCPROFILER_CODE_OBJECT_LOAD`)
+provides a unique identifier for a bundle of one or more GPU kernel symbols which have been loaded
+for a specific GPU agent. For example, if your application is leveraging a multi-GPU system
+containing 4 Vega20 GPUs and 4 MI100 GPUs, there will be at least 8 code objects loaded: one code
+object for each GPU. Each code object will be associated with a set of kernel symbols:
+the `rocprofiler_callback_tracing_code_object_kernel_symbol_register_data_t` payload
+(kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`, operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`)
+provides a globally unique identifier for the specific kernel symbol along with the kernel name and
+several other static properties of the kernel (e.g. scratch size, scalar general purpose register count, etc.).
+Note: two otherwise identical kernel symbols (same kernel name, scratch size, etc.) which are part of
+otherwise identical code objects loaded for different GPU agents ***will*** have unique
+kernel identifiers. Furthermore, if the same code object (and its kernel symbols) is unloaded and then
+re-loaded, that code object and all of its kernel symbols ***will*** be given new unique identifiers.
+
+In general, when a code object is loaded and unloaded, here is the sequence of events:
+
+1. Callback: code object load
+    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
+    - operation=`ROCPROFILER_CODE_OBJECT_LOAD`
+    - phase=`ROCPROFILER_CALLBACK_PHASE_LOAD`
+2. Callback: kernel symbol load
+    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
+    - operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`
+    - phase=`ROCPROFILER_CALLBACK_PHASE_LOAD`
+    - Repeats for each kernel symbol in code object
+3. Application Execution
+4. Callback: kernel symbol unload
+    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
+    - operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`
+    - phase=`ROCPROFILER_CALLBACK_PHASE_UNLOAD`
+    - Repeats for each kernel symbol in code object
+5. Callback: code object unload
+    - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`
+    - operation=`ROCPROFILER_CODE_OBJECT_LOAD`
+    - phase=`ROCPROFILER_CALLBACK_PHASE_UNLOAD`
+
+Note: rocprofiler-sdk does not provide an interface to query this information outside of the
+code object tracing service. To associate kernel names with kernel tracing records,
+a tool is responsible for making a copy of the relevant information when the code objects and
+kernel symbols are loaded. However, constant string fields such as the `const char* kernel_name` field
+need not be copied; these are guaranteed to be valid pointers until after rocprofiler-sdk finalization.
+If a tool decides to delete its copy of the data associated with a given code object or kernel symbol
+identifier when the code object and kernel symbols are unloaded, it is highly recommended to flush
+any/all buffers which might contain references to that code object or kernel symbol identifier before
+deleting the associated data.
+
+For a sample of code object tracing, please see the `samples/code_object_tracing` example in the
+[rocprofiler-sdk GitHub repository](https://github.com/ROCm/rocprofiler-sdk).
@@ -0,0 +1,287 @@
+# Counter collection services
+
+## Definitions
+
+*Profile Config*: A configuration to specify what counters should be collected on an agent. This needs to be supplied to various counter collection APIs to initiate collection of counter data. Profiles are agent specific and cannot be used on different agents.
+
+*Counter ID*: Unique ID (per-architecture) that specifies the counter. The counter interface can be used to fetch information about the counter (such as its name or expression).
+
+*Instance ID*: Unique record id encoding both the counter id and dimension for a specific collected value.
+
+*Dimension*: Dimensions provide context to the raw counter values to specify the specific hardware register (such as shader engine) that the value was collected from. All counter values have dimension data encoded in their instance ID, and functions in the counter interface can be used to extract the values for individual dimensions. The following dimensions are currently supported by rocprofiler-sdk:
+
+```c
+    ROCPROFILER_DIMENSION_XCC,            ///< XCC dimension of result
+    ROCPROFILER_DIMENSION_AID,            ///< AID dimension of result
+    ROCPROFILER_DIMENSION_SHADER_ENGINE,  ///< SE dimension of result
+    ROCPROFILER_DIMENSION_AGENT,          ///< Agent dimension
+    ROCPROFILER_DIMENSION_SHADER_ARRAY,   ///< Number of shader arrays
+    ROCPROFILER_DIMENSION_WGP,            ///< Number of workgroup processors
+    ROCPROFILER_DIMENSION_INSTANCE,       ///< From unspecified hardware register
+```
+
+## Using The Counter Collection Service
+
+There are two modes for the counter collection service: *dispatch profiling*, where counters are collected on a per-kernel-launch basis, and *agent profiling*, where counters are collected at the device level. Dispatch profiling is useful for collecting highly detailed counters for a specific kernel execution in isolation (Note: dispatch profiling allows only a single kernel to execute in hardware at a time). Agent profiling is useful for collecting device-level counters not tied to a specific kernel execution (i.e. collecting counter values for a specific time range).
+
+This guide explains how to set up dispatch and agent profiling and describes the usage of the common counter collection APIs. More detail on the APIs themselves (as well as non-common options) is available in the API documentation. Fully functional examples of both dispatch and agent profiling can be found in the samples directory of rocprofiler-sdk.
+
+### tool_init() setup
+
+The setup for dispatch and agent profiling is similar (with only minor changes needed to adapt code from one to the other). In tool_init, similar to tracing services, you need to create a context and a buffer to collect the output. Important Note: buffered_callback in rocprofiler_create_buffer is called with a vector of collected counter samples when the buffer is full; see the buffered callback section below for processing.
+
+```CPP
+rocprofiler_context_id_t ctx;
+rocprofiler_buffer_id_t buff;
+ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
+ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
+                                            4096,
+                                            2048,
+                                            ROCPROFILER_BUFFER_POLICY_LOSSLESS,
+                                            buffered_callback, // Callback to process data
+                                            user_data,
+                                            &buff),
+                    "buffer creation failed");
+```
+
+After creating a context and a buffer to store results, it is highly recommended (but not required) that you construct the profiles for each agent, containing the counters you wish to collect, in tool_init. Profile creation has a high time cost because it validates that the counters can be collected on the agent, so it should be avoided in the time-critical dispatch profiling callback. After profile setup, the collection service for dispatch or agent profiling can be set up. The following two calls can be used to set up either dispatch or agent profiling (only one can be in use at a time).
+
+```CPP
+    /* For Dispatch Profiling */
+    // Setup the dispatch profile counting service. This service will trigger the dispatch_callback
+    // when a kernel dispatch is enqueued into the HSA queue. The callback will specify what
+    // counters to collect by returning a profile config id. 
+    ROCPROFILER_CALL(rocprofiler_configure_buffered_dispatch_profile_counting_service(
+                         ctx, buff, dispatch_callback, nullptr),
+                     "Could not setup buffered service");
+
+    /* For Agent Profiling */
+    // set_profile is a callback that is used to select the profile to use when
+    // the context is started. It is called at every rocprofiler_start_context() call.
+    ROCPROFILER_CALL(rocprofiler_configure_agent_profile_counting_service(
+                         ctx, buff, agent_id, set_profile, nullptr),
+                     "Could not setup buffered service");
+```
+
+#### Profile Setup
+
+The first step in constructing a counter collection profile is to find the GPU agents on the machine. A profile needs to be created for each set of counters you want to collect on every agent on the machine. You can use rocprofiler_query_available_agents to find the agents on the system. The example below collects all GPU agents on the system and stores them in the vector agents.
+
+```CPP
+    std::vector<rocprofiler_agent_v0_t> agents;
+
+    // Callback used by rocprofiler_query_available_agents to return
+    // agents on the device. This can include CPU agents as well. We
+    // select GPU agents only (i.e. type == ROCPROFILER_AGENT_TYPE_GPU)
+    rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
+                                                            const void**                agents_arr,
+                                                            size_t                      num_agents,
+                                                            void*                       udata) {
+        if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
+            throw std::runtime_error{"unexpected rocprofiler agent version"};
+        auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
+        for(size_t i = 0; i < num_agents; ++i)
+        {
+            const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
+            if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
+        }
+        return ROCPROFILER_STATUS_SUCCESS;
+    };
+
+    // Query the agents, only a single callback is made that contains a vector
+    // of all agents.
+    ROCPROFILER_CALL(
+        rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
+                                           iterate_cb,
+                                           sizeof(rocprofiler_agent_t),
+                                           const_cast<void*>(static_cast<const void*>(&agents))),
+        "query available agents");
+```
+
+To identify the counters that an agent supports, you can query the available counters with rocprofiler_iterate_agent_supported_counters. An example with a single agent (returning the available counters in gpu_counters) would be the following:
+
+```CPP
+    std::vector<rocprofiler_counter_id_t> gpu_counters;
+
+    // Iterate all the counters on the agent and store them in gpu_counters.
+    ROCPROFILER_CALL(rocprofiler_iterate_agent_supported_counters(
+                         agent,
+                         [](rocprofiler_agent_id_t,
+                            rocprofiler_counter_id_t* counters,
+                            size_t                    num_counters,
+                            void*                     user_data) {
+                             std::vector<rocprofiler_counter_id_t>* vec =
+                                 static_cast<std::vector<rocprofiler_counter_id_t>*>(user_data);
+                             for(size_t i = 0; i < num_counters; i++)
+                             {
+                                 vec->push_back(counters[i]);
+                             }
+                             return ROCPROFILER_STATUS_SUCCESS;
+                         },
+                         static_cast<void*>(&gpu_counters)),
+                     "Could not fetch supported counters");
+```
+
+rocprofiler_counter_id_t is a handle to a counter. The information about the counter (such as its name) can be fetched using rocprofiler_query_counter_info.
+
+```CPP
+    for(auto& counter : gpu_counters)
+    {
+        // Contains name and other attributes about the counter.
+        // See API documentation for more info on the contents of this struct.
+        rocprofiler_counter_info_v0_t version;
+        ROCPROFILER_CALL(
+            rocprofiler_query_counter_info(
+                counter, ROCPROFILER_COUNTER_INFO_VERSION_0, static_cast<void*>(&version)),
+            "Could not query info for counter");
+    }
+```
+
+After you have identified a set of counters you wish to collect, a profile can be constructed by passing a list of these counters to rocprofiler_create_profile_config.
+
+```C++
+    // Create and return the profile
+    rocprofiler_profile_config_id_t profile;
+    ROCPROFILER_CALL(rocprofiler_create_profile_config(
+                         agent, counters_array, counters_array_count, &profile),
+                     "Could not construct profile cfg");
+```
+
+The created profile can in turn be used for both dispatch and agent counter collection services. 
+
+##### Special Notes On Profile Behavior
+- A profile is *only valid* for the agent it was created for.
+- Profiles are immutable. To collect a different set of counters, construct a new profile.
+- A single profile can be used multiple times on the same agent.
+- Counter IDs that are supplied to rocprofiler_create_profile_config are *agent specific* and cannot be used to construct profiles for other agents.
+
+### Dispatch Profiling Callback
+
+When a kernel is dispatched, a dispatch callback is issued to the tool to allow for the selection of counters to collect for the dispatch (via supplying a profile). 
+
+```CPP
+void
+dispatch_callback(rocprofiler_profile_counting_dispatch_data_t dispatch_data,
+                  rocprofiler_profile_config_id_t*             config,
+                  rocprofiler_user_data_t* user_data,
+                  void* /*callback_data_args*/)
+```
+
+Dispatch data contains information about the dispatch that is being launched (such as its name) and config is where the tool can specify the profile (and in turn counters) to collect for the dispatch. If no profile is supplied, no counters are collected for this dispatch. User data contains user data supplied to rocprofiler_configure_buffered_dispatch_profile_counting_service. 
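+
+The recommended split (build profiles once in tool_init, then do only a cheap lookup in the dispatch callback) can be sketched as follows. This is a self-contained illustration, not SDK code: `agent_id_t`, `profile_id_t`, `dispatch_data_t`, `profile_cache`, and `select_profile` are hypothetical stand-ins for the corresponding rocprofiler-sdk types and are assumptions made so that the lookup logic compiles on its own.
+
+```cpp
+#include <cstdint>
+#include <unordered_map>
+
+// Hypothetical stand-ins for the SDK handle types (illustration only).
+struct agent_id_t      { uint64_t handle; };
+struct profile_id_t    { uint64_t handle; };
+struct dispatch_data_t { agent_id_t agent_id; };
+
+// Filled once during tool_init: agent handle -> profile created for that agent.
+using profile_cache = std::unordered_map<uint64_t, profile_id_t>;
+
+// Body of a dispatch callback: only a map lookup happens on the hot path.
+// Returns true and writes *config when a profile exists for the agent.
+// Returns false and leaves *config untouched otherwise, which corresponds
+// to "no counters are collected for this dispatch".
+bool select_profile(const profile_cache&   cache,
+                    const dispatch_data_t& data,
+                    profile_id_t*          config)
+{
+    auto it = cache.find(data.agent_id.handle);
+    if(it == cache.end()) return false;
+    *config = it->second;
+    return true;
+}
+```
+
+Keying the cache by agent reflects the rule that a profile is only valid for the agent it was created for.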
+
+### Agent Set Profile Callback
+
+This callback is called when the context is started and allows for the tool to specify the profile to be used. 
+
+```CPP
+void
+set_profile(rocprofiler_context_id_t                 context_id,
+            rocprofiler_agent_id_t                   agent,
+            rocprofiler_agent_set_profile_callback_t set_config,
+            void*)
+```
+
+The profile to be used for this agent is specified by calling set_config(agent, profile). 
+
+### Buffered Callback
+
+Data from collected counter values is returned via a buffered callback. The buffered callback routines are similar between dispatch and agent profiling with the exception that some data (such as kernel launch ids) are not available in agent profiling mode. A sample iteration to print out counter collection data is the following:
+
+```CPP
+    for(size_t i = 0; i < num_headers; ++i)
+    {
+        auto* header = headers[i];
+        if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
+           header->kind == ROCPROFILER_COUNTER_RECORD_PROFILE_COUNTING_DISPATCH_HEADER)
+        {
+            // Print the returned counter data.
+            auto* record =
+                static_cast<rocprofiler_profile_counting_dispatch_record_t*>(header->payload);
+            ss << "[Dispatch_Id: " << record->dispatch_info.dispatch_id
+               << " Kernel_ID: " << record->dispatch_info.kernel_id
+               << " Corr_Id: " << record->correlation_id.internal << ")]\n";
+        }
+        else if(header->category == ROCPROFILER_BUFFER_CATEGORY_COUNTERS &&
+                header->kind == ROCPROFILER_COUNTER_RECORD_VALUE)
+        {
+            // Print the returned counter data.
+            auto* record = static_cast<rocprofiler_record_counter_t*>(header->payload);
+            rocprofiler_counter_id_t counter_id = {.handle = 0};
+
+            rocprofiler_query_record_counter_id(record->id, &counter_id);
+
+            ss << "  (Dispatch_Id: " << record->dispatch_id << " Counter_Id: " << counter_id.handle
+               << " Record_Id: " << record->id << " Dimensions: [";
+
+            for(auto& dim : counter_dimensions(counter_id))
+            {
+                size_t pos = 0;
+                rocprofiler_query_record_dimension_position(record->id, dim.id, &pos);
+                ss << "{" << dim.name << ": " << pos << "},";
+            }
+            ss << "] Value [D]: " << record->counter_value << "),";
+        }
+    }
+```
+
+## Counter Definitions
+
+Counters are defined in YAML format in the file counter_defs.yaml. A counter definition has the following format:
+
+```yaml
+counter_name:       # Counter name
+  architectures:
+    gfx90a:         # Architecture name 
+      block:        # Block information (SQ/etc)
+      event:        # Event ID (used by AQLProfile to identify counter register)
+      expression:   # Formula for the counter (if derived counter)
+      description:  # Per-arch description (optional)
+    gfx1010:
+       ...
+  description:      # Description of the counter
+```
+
+Architectures can be defined separately with their own definitions (for example, gfx90a and gfx1010 in the above example). If two or more architectures share the same block/event/expression definition, they can be "/" delimited on a single line (for example, "gfx90a/gfx1010:"). Hardware metrics have the elements block, event, and description defined. Derived metrics have the element expression defined (and cannot have block or event defined).
+
+## Derived Metrics
+
+Derived metrics allow computations (via expressions) to be performed on collected hardware metrics, with the result returned as if it were a real hardware counter.
+
+```yaml
+GPU_UTIL:
+  architectures:
+    gfx942/gfx941/gfx10/gfx1010/gfx1030/gfx1031/gfx11/gfx1032/gfx1102/gfx906/gfx1100/gfx1101/gfx940/gfx908/gfx90a/gfx9:
+      expression: 100*GRBM_GUI_ACTIVE/GRBM_COUNT
+  description: Percentage of the time that GUI is active
+```
+
+GPU_UTIL is an example of a derived metric that takes the values of two GRBM hardware counters (GRBM_GUI_ACTIVE and GRBM_COUNT) and uses a mathematical expression to calculate the utilization rate of the GPU. Expressions support the standard set of math operators (/,*,-,+) along with a set of special functions (reduce and accumulate).
+
+### Reduce Function
+
+```yaml
+expression: 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum))
+```
+
+The reduce() function reduces counter values across all dimensions (shader engine, SIMD, etc.) to produce a single output value. This is useful when you want to collect and compare values across the entire device. The supported reduction operations are sum, average (avr), minimum (min, selects the minimum value across all dimensions), and maximum (max, selects the maximum value across all dimensions). For example, reduce(GL2C_HIT,sum) sums all GL2C_HIT hardware register values to return a single output value.
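+
+To make the arithmetic concrete, the following is a minimal software model of the reduction operations and of the hit-rate expression above. It is a sketch only: the function names `reduce` and `l2_hit_rate` and the representation of a counter as one `double` per hardware instance are assumptions for illustration, and do not describe how rocprofiler-sdk evaluates expressions internally.
+
+```cpp
+#include <algorithm>
+#include <numeric>
+#include <stdexcept>
+#include <string>
+#include <vector>
+
+// values holds one entry per hardware instance (hypothetical layout).
+double reduce(const std::vector<double>& values, const std::string& op)
+{
+    if(values.empty()) throw std::invalid_argument{"no values to reduce"};
+    if(op == "sum") return std::accumulate(values.begin(), values.end(), 0.0);
+    if(op == "avr")
+        return std::accumulate(values.begin(), values.end(), 0.0) / values.size();
+    if(op == "min") return *std::min_element(values.begin(), values.end());
+    if(op == "max") return *std::max_element(values.begin(), values.end());
+    throw std::invalid_argument{"unknown reduction: " + op};
+}
+
+// Mirrors 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum))
+double l2_hit_rate(const std::vector<double>& hits, const std::vector<double>& misses)
+{
+    const double h = reduce(hits, "sum");
+    const double m = reduce(misses, "sum");
+    return 100.0 * h / (h + m);
+}
+```
+
+For instance, with per-instance hits of 30 and 50 and misses of 10 and 10, the sums are 80 and 20, so the modeled hit rate is 80%.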
+
+### Accumulate Function
+```yaml
+expression: accumulate(<basic_level_counter>, <resolution>)
+```
+#### Description
+- The accumulate metric is used to sum the values of a basic level counter over a specified number of cycles. By setting the resolution parameter, you can control the frequency of the summing operation:
+    - HIGH_RES: Sums up the basic counter every clock cycle. Captures the value every single cycle for higher accuracy, suitable for fine-grained analysis.
+    - LOW_RES: Sums up the basic counter every four clock cycles. Reduces the data points and provides less detailed summing, useful for reducing data volume.
+    - NONE: Does nothing and is equivalent to collecting basic_level_counter. Outputs the value of the basic counter without any summing operation.
+
+#### Usage
+```yaml
+MeanOccupancyPerCU:
+  architectures:
+    gfx942/gfx941/gfx940:
+      expression: accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM
+  description: Mean occupancy per compute unit.
+```
+- MeanOccupancyPerCU: This metric calculates the mean occupancy per compute unit. It uses the accumulate function with HIGH_RES to sum the SQ_LEVEL_WAVES counter at every clock cycle. This sum is then divided by the maximum GRBM_GUI_ACTIVE value across all dimensions (reduce(GRBM_GUI_ACTIVE,max)) and by the number of compute units (CU_NUM) to derive the mean occupancy.
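+
+The arithmetic behind this metric can be illustrated with a simplified software model. This is an assumption-laden sketch, not the hardware or SDK implementation: `accumulate_level`, `mean_occupancy_per_cu`, and the `resolution` enum are hypothetical names, and the level counter is modeled as a plain vector with one value per clock cycle.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+enum class resolution { high_res, low_res };
+
+// level[i] is the modeled value of the level counter at cycle i.
+// high_res adds the value every cycle, low_res every fourth cycle.
+double accumulate_level(const std::vector<double>& level, resolution res)
+{
+    const std::size_t stride = (res == resolution::high_res) ? 1 : 4;
+    double            total  = 0.0;
+    for(std::size_t i = 0; i < level.size(); i += stride)
+        total += level[i];
+    return total;
+}
+
+// Mirrors accumulate(SQ_LEVEL_WAVES,HIGH_RES)/reduce(GRBM_GUI_ACTIVE,max)/CU_NUM
+double mean_occupancy_per_cu(double accumulated_waves, double active_cycles, double cu_num)
+{
+    return accumulated_waves / active_cycles / cu_num;
+}
+```
+
+With eight modeled cycles holding 4, 4, 8, 8, 8, 8, 4, 4 waves, the high-resolution sum is 48; dividing by 8 active cycles and 2 compute units gives a mean occupancy of 3 waves per compute unit.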
@@ -0,0 +1,96 @@
+# Runtime intercept tables
+
+Although most tools will want to leverage the callback or buffer tracing services for tracing the HIP, HSA, and ROCTx
+APIs, rocprofiler-sdk does provide access to the raw API dispatch tables. Each of the aforementioned APIs is
+designed similarly to the following sample.
+
+## Dispatch Table Overview
+
+### Forward Declaration of public C API function
+
+```cpp
+extern "C"
+{
+// forward declaration of public C API function
+int
+foo(int) __attribute__((visibility("default")));
+}
+```
+
+### Internal Implementation of API function
+
+```cpp
+namespace impl
+{
+int
+foo(int val)
+{
+    // real implementation
+    return (2 * val);
+}
+}
+```
+
+### Dispatch Table Implementation
+
+```cpp
+namespace impl
+{
+struct dispatch_table
+{
+    int (*foo_fn)(int) = nullptr;
+};
+
+// invoked once: populates the dispatch_table with function pointers to implementation
+dispatch_table*&
+construct_dispatch_table()
+{
+    static dispatch_table* tbl = new dispatch_table{};
+    tbl->foo_fn                = impl::foo;
+
+    // in between above and below, rocprofiler-sdk gets passed the pointer
+    // to the dispatch table and has the opportunity to wrap the function
+    // pointers for interception
+
+    return tbl;
+}
+
+// constructs dispatch table and stores it in static variable
+dispatch_table*
+get_dispatch_table()
+{
+    static dispatch_table*& tbl = construct_dispatch_table();
+    return tbl;
+}
+}  // namespace impl
+```
+
+### Implementation of public C API function
+
+```cpp
+extern "C"
+{
+// implementation of public C API function
+int
+foo(int val)
+{
+    return impl::get_dispatch_table()->foo_fn(val);
+}
+}
+```
+
+### Dispatch Table Chaining
+
+rocprofiler-sdk is given an opportunity within `impl::construct_dispatch_table()` to
+save the original value(s) of the function pointers such as `foo_fn` and install
+its own function pointers in their place. This results in the public C API function `foo`
+calling into the rocprofiler-sdk function pointer, which in turn calls the original
+function pointer to `impl::foo` (this is called "chaining"). Once rocprofiler-sdk
+has made any necessary modifications to the dispatch table, tools that indicated
+they also want access to the raw dispatch table via `rocprofiler_at_intercept_table_registration`
+will be passed the pointer to the dispatch table.
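+
+Chaining itself can be shown without any runtime involved. The sketch below reuses the `impl::dispatch_table` and `impl::foo` shapes from the sample above; the `interceptor` namespace, `foo_wrapper`, `install`, and `call_count` are hypothetical names introduced only for this illustration and are not part of rocprofiler-sdk.
+
+```cpp
+namespace impl
+{
+struct dispatch_table
+{
+    int (*foo_fn)(int) = nullptr;
+};
+
+// real implementation
+int
+foo(int val)
+{
+    return (2 * val);
+}
+}  // namespace impl
+
+namespace interceptor
+{
+int (*original_foo)(int) = nullptr;  // saved original function pointer
+int call_count           = 0;        // number of intercepted calls
+
+// wrapper: does its own work, then chains to the original function
+int
+foo_wrapper(int val)
+{
+    ++call_count;
+    return original_foo(val);
+}
+
+// saves the original pointer and installs the wrapper in its place
+void
+install(impl::dispatch_table* tbl)
+{
+    original_foo = tbl->foo_fn;
+    tbl->foo_fn  = foo_wrapper;
+}
+}  // namespace interceptor
+```
+
+After `install` runs, a call through `tbl->foo_fn` reaches `foo_wrapper` first and `impl::foo` second, so the caller still receives the original result while the interceptor observes every call.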
+
+## Sample
+
+For a demo of dispatch table chaining, please see the `samples/intercept_table` example in the
+[rocprofiler-sdk GitHub repository](https://github.com/ROCm/rocprofiler-sdk).
@@ -0,0 +1,14 @@
+# PC sampling method
+
+PC sampling is a profiling method that builds a statistical approximation of kernel execution by sampling GPU program counters. The method periodically chooses an active wave (in a round-robin manner) and snapshots its program counter (PC). This process takes place on every compute unit simultaneously, which makes it device-wide PC sampling. The outcome is a histogram of samples that shows how many times each kernel instruction was sampled.
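+
+The idea of building a histogram from round-robin samples can be modeled in a few lines. This is a toy simulation under strong assumptions, not the PC sampling API: `pc_t` and `sample_histogram` are hypothetical names, and each wave is assumed to stay at a fixed program counter for the duration of the sampling.
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <map>
+#include <vector>
+
+using pc_t = uint64_t;
+
+// wave_pcs[i] is the program counter of active wave i (held fixed in this model).
+// Takes num_samples samples, visiting the waves round-robin, and counts how
+// often each program counter is observed.
+std::map<pc_t, std::size_t>
+sample_histogram(const std::vector<pc_t>& wave_pcs, std::size_t num_samples)
+{
+    std::map<pc_t, std::size_t> histogram;
+    if(wave_pcs.empty()) return histogram;
+    for(std::size_t i = 0; i < num_samples; ++i)
+        ++histogram[wave_pcs[i % wave_pcs.size()]];
+    return histogram;
+}
+```
+
+Instructions on which more waves spend time collect more samples, which is what makes the histogram a statistical picture of where execution time goes.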
+
+**Note**: The PC sampling feature is still under development and may not be completely stable.
+
+ **Risk Acknowledgment**:
+ 
+  - By activating this feature through `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` environment variable, you acknowledge and accept the following potential risks:
+     
+     - **Hardware Freeze**: This beta feature could cause your hardware to freeze unexpectedly.
+     - **Need for Cold Restart**: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations.
+        
+ Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.
@@ -0,0 +1,231 @@
+# Tool library
+
+The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries.
+
+## ROCm runtimes design
+
+The ROCm runtimes are designed to directly communicate with a helper library named `rocprofiler-register` during initialization. This library performs cursory checks to find if a tool requires ROCprofiler-SDK services. This detection is based on the presence of one or more instances of `rocprofiler_configure` in the tool or `ROCP_TOOL_LIBRARIES` environment variable. This design provides drastic improvement over previous designs, which relied solely on a tool racing to set runtime-specific environment variables like `HSA_TOOLS_LIB` before the runtime initialization.
+
+## Tool library design
+
+When ROCprofiler-SDK detects `rocprofiler_configure` in a tool's symbol table, it invokes `rocprofiler_configure` with parameters such as the version of the ROCprofiler-SDK invoking the function, the number of tools already invoked, and a unique identifier for the tool. The tool returns a pointer to a `rocprofiler_tool_configure_result_t` struct, which, if non-null, provides ROCprofiler-SDK with:
+- Function to be called for tool initialization, which is also the opportunity for context creation.
+- Function to be called when ROCprofiler-SDK is finalized.
+- A pointer to data to be provided to the tool when ROCprofiler-SDK calls the initialization and finalization functions.
+
+ROCprofiler-SDK provides a `rocprofiler-sdk/registration.h` header file, which forward declares the `rocprofiler_configure` function with the necessary compiler function attributes to ensure that the `rocprofiler_configure` symbol is publicly visible.
+
+```cpp
+#include <rocprofiler-sdk/registration.h>
+
+namespace
+{
+// saves the data provided to rocprofiler_configure
+struct ToolData
+{
+    uint32_t                              version;
+    const char*                           runtime_version;
+    uint32_t                              priority;
+    rocprofiler_client_id_t*              client_id;
+};
+
+// tool initialization function
+int
+tool_init(rocprofiler_client_finalize_t fini_func,
+          void* tool_data_v);
+
+// tool finalization function
+void
+tool_fini(void* tool_data_v);
+}
+
+extern "C"
+{
+rocprofiler_tool_configure_result_t*
+rocprofiler_configure(uint32_t                 version,
+                      const char*              runtime_version,
+                      uint32_t                 priority,
+                      rocprofiler_client_id_t* client_id)
+{
+    //If not the first tool to register, indicate that the tool doesn't want to do anything
+    if(priority > 0) return nullptr;
+
+    // (optional) Provide a name for this tool to rocprofiler
+    client_id->name = "ExampleTool";
+
+    // (optional) create configure data
+    static auto data = ToolData{ version,
+                                 runtime_version,
+                                 priority,
+                                 client_id };
+
+    // construct configure result
+    static auto cfg =
+        rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t),
+                                             &tool_init,
+                                             &tool_fini,
+                                             static_cast<void*>(&data) };
+
+    return &cfg;
+}
+}  // extern "C"
+```
+
+## Tool initialization
+
+:::{note}
+ROCprofiler-SDK does NOT support calls to any runtime function (HSA, HIP, and so on) during tool initialization.
+Invoking any functions from the runtimes results in a deadlock.
+:::
+
+For each tool that contains a `rocprofiler_configure` function and returns a non-null pointer to a `rocprofiler_tool_configure_result_t` struct, ROCprofiler-SDK invokes the `initialize` callback after completing the scan for all `rocprofiler_configure` symbols. In other words, ROCprofiler-SDK
+collects all `rocprofiler_tool_configure_result_t` instances before invoking the `initialize` member of any of these instances.
+When ROCprofiler-SDK invokes `initialize` function in a tool, this is the opportunity to create contexts:
+
+```cpp
+#include <rocprofiler-sdk/rocprofiler.h>
+
+namespace
+{
+int
+tool_init(rocprofiler_client_finalize_t fini_func,
+          void* data_v)
+{
+    // create a context
+    auto ctx = rocprofiler_context_id_t{};
+    rocprofiler_create_context(&ctx);
+
+    // ... associate services with context ...
+
+    // start the context (optional)
+    rocprofiler_start_context(ctx);
+
+    return 0;
+}
+}
+```
+
+Although not mandatory, it is recommended that tools store the context handles to control the data collection for the services associated with the context.
+
+## Tool finalization
+
+When the `initialize` callback is invoked in the tool, ROCprofiler-SDK provides a function pointer of type `rocprofiler_client_finalize_t`.
+The tool can invoke this function pointer to explicitly invoke the `finalize` callback from the `rocprofiler_tool_configure_result_t` instance:
+
+```cpp
+#include <rocprofiler-sdk/rocprofiler.h>
+
+#include <chrono>
+#include <thread>
+
+namespace
+{
+int
+tool_init(rocprofiler_client_finalize_t fini_func,
+          void* data_v)
+{
+    // ... see initialization section ...
+
+    // function, which finalizes the tool after 10 seconds
+    auto explicit_finalize = [](rocprofiler_client_finalize_t finalizer,
+                                rocprofiler_client_id_t* client_id)
+    {
+        std::this_thread::sleep_for(std::chrono::seconds{ 10 });
+        finalizer(*client_id);
+    };
+
+    // start the context
+    rocprofiler_start_context(ctx);
+
+    // dispatch a background thread to explicitly finalize after 10 seconds
+    std::thread{ explicit_finalize, fini_func, static_cast<ToolData*>(data_v)->client_id }.detach();
+
+    return 0;
+}
+}
+```
+
+Otherwise, ROCprofiler-SDK invokes the `finalize` callback via an `atexit` handler.
+
+## Full `rocprofiler_configure` Sample
+
+All of the snippets from the previous sections have been combined here for convenience.
+
+```cpp
+#include <rocprofiler-sdk/registration.h>
+#include <rocprofiler-sdk/rocprofiler.h>
+
+#include <cstdint>
+#include <vector>
+
+namespace
+{
+struct rocp_tool_data
+{
+    uint32_t                              version;
+    const char*                           runtime_version;
+    uint32_t                              priority;
+    rocprofiler_client_id_t*              client_id;
+    rocprofiler_client_finalize_t         finalizer;
+    std::vector<rocprofiler_context_id_t> contexts;
+};
+
+void
+tool_tracing_callback(rocprofiler_callback_tracing_record_t record,
+                      rocprofiler_user_data_t*              user_data,
+                      void*                                 callback_data);
+
+int
+tool_init(rocprofiler_client_finalize_t fini_func,
+          void* tool_data_v)
+{
+    rocp_tool_data* tool_data = static_cast<rocp_tool_data*>(tool_data_v);
+
+    // Save the finalizer function
+    tool_data->finalizer = fini_func;
+
+    // create a context
+    auto ctx = rocprofiler_context_id_t{};
+    rocprofiler_create_context(&ctx);
+
+    // Save your contexts
+    tool_data->contexts.emplace_back(ctx);
+
+    // associate code object tracing with this context
+    rocprofiler_configure_callback_tracing_service(
+        ctx,
+        ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT,
+        nullptr,
+        0,
+        tool_tracing_callback,
+        tool_data);
+
+    // ... associate services with contexts ...
+
+    return 0;
+}
+
+void
+tool_fini(void* tool_data);
+}
+
+extern "C"
+{
+rocprofiler_tool_configure_result_t*
+rocprofiler_configure(uint32_t                 version,
+                      const char*              runtime_version,
+                      uint32_t                 priority,
+                      rocprofiler_client_id_t* client_id)
+{
+    // (optional) Provide a name for this tool to rocprofiler
+    client_id->name = "ExampleTool";
+
+    // info provided back to tool_init and tool_fini
+    auto* my_tool_data = new rocp_tool_data{ version,
+                                             runtime_version,
+                                             priority,
+                                             client_id,
+                                             nullptr };
+
+    // create configure data
+    static auto cfg =
+        rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t),
+                                             &tool_init,
+                                             &tool_fini,
+                                             my_tool_data };
+
+    return &cfg;
+}
+}  // extern "C"
+```