diff --git a/projects/rocprofiler-sdk/README.md b/projects/rocprofiler-sdk/README.md index 8d2a50065d..40ddcfa068 100644 --- a/projects/rocprofiler-sdk/README.md +++ b/projects/rocprofiler-sdk/README.md @@ -73,13 +73,24 @@ Please report in the Github Issues. - **Need for Cold Restart**: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations. Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk. -- At this point, We do not recommend stress-testing the beta implementation. + - At this point, We do not recommend stress-testing the beta implementation. -- Correlation IDs provided by the PC sampling service are verified only for HIP API calls. + - Correlation IDs provided by the PC sampling service are verified only for HIP API calls. -- Timestamps in PC sampling records might not be 100% accurate. + - Timestamps in PC sampling records might not be 100% accurate. -- Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`.Furthermore, if three or more threads launch operations to the same agent, and if PC sampling is enabled, the `HSA_STATUS_ERROR_EXCEPTION` might appear. + - Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`.Furthermore, if three or more threads launch operations to the same agent, and if PC sampling is enabled, the `HSA_STATUS_ERROR_EXCEPTION` might appear. + +- Navi3x requires a stable power state for counter collection. + Currently, this state needs to be set by the user. 
+ To do so, make "power_dpm_force_performance_level" writable by non-root users, then set the performance level to "profile_standard": + + ```bash + sudo chmod 777 /sys/class/drm/card0/device/power_dpm_force_performance_level + echo profile_standard >> /sys/class/drm/card0/device/power_dpm_force_performance_level + ``` + + Recommended: "profile_standard" for counter collection and "auto" for all other profiling. Use rocm-smi to verify the current power state. For multi-GPU systems (including systems with integrated graphics), replace "card0" with the desired card. > [!WARNING] > The latest mainline version of AQLprofile can be found at [https://repo.radeon.com/rocm/misc/aqlprofile/](https://repo.radeon.com/rocm/misc/aqlprofile/). However, it's important to note that updates to the public AQLProfile may not occur as frequently as updates to the rocprofiler-sdk. This discrepancy could lead to a potential mismatch between the AQLprofile binary and the rocprofiler-sdk source. diff --git a/projects/rocprofiler-sdk/source/docs/api-reference/tool_library.md b/projects/rocprofiler-sdk/source/docs/api-reference/tool_library.md index cce13d509b..9ce1660e31 100644 --- a/projects/rocprofiler-sdk/source/docs/api-reference/tool_library.md +++ b/projects/rocprofiler-sdk/source/docs/api-reference/tool_library.md @@ -7,7 +7,7 @@ myst: # ROCprofiler-SDK tool library -The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries. +The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. 
This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.X.Y.Z`, which uses these libraries. ## ROCm runtimes design diff --git a/projects/rocprofiler-sdk/source/docs/conceptual/comparing-with-legacy-tools.rst b/projects/rocprofiler-sdk/source/docs/conceptual/comparing-with-legacy-tools.rst index 619a244c09..166511963d 100644 --- a/projects/rocprofiler-sdk/source/docs/conceptual/comparing-with-legacy-tools.rst +++ b/projects/rocprofiler-sdk/source/docs/conceptual/comparing-with-legacy-tools.rst @@ -383,4 +383,10 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more Timing Difference Between rocprofv3 and rocprofv1/v2 ======================================================== -Rocprofv3 has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single digit percentage time with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify the timing information outputted by rocprofv3 to ensure that the values we are returning are accurate. +``rocprofv3`` has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. 
The result of this work is a reduction in the variance of kernel times reported for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single-digit percentage range, with a much tighter variance of results on rocprofv3. The test suite includes tests that verify the timing information output by rocprofv3 to ensure that the reported values are accurate. + +======================================================== +Default run of rocprofv3 and rocprofv1/v2 +======================================================== + +``rocprofv3`` behaves differently from rocprofv1/v2 when run without any options. By default, rocprofv3 collects all available agents on the system and outputs them in ``csv`` format. By default, rocprofv1/v2 output the kernel traces in CSV format. In rocprofv3, kernel traces can be obtained by using the ``--kernel-trace`` option. 
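The timing comparison described above (mean execution time and variance over repeated runs of the same kernel) can be sketched in a few lines of Python. This is only an illustration: the helper names `summarize` and `relative_difference_pct` are hypothetical, and the duration values are made-up placeholders, not measurements from either tool.

```python
from statistics import fmean, stdev


def summarize(durations_ns):
    """Return (mean, sample standard deviation) of kernel durations in ns."""
    return fmean(durations_ns), stdev(durations_ns)


def relative_difference_pct(mean_a, mean_b):
    """Difference of mean_a from mean_b, as a percentage of mean_b."""
    return 100.0 * (mean_a - mean_b) / mean_b


# Hypothetical, made-up durations (ns) for repeated runs of one kernel.
# They are NOT real rocprof measurements; they only exercise the helpers.
legacy_ns = [102, 112, 92]
v3_ns = [100, 101, 99]

legacy_mean, legacy_sd = summarize(legacy_ns)
v3_mean, v3_sd = summarize(v3_ns)
diff_pct = relative_difference_pct(legacy_mean, v3_mean)

print(f"legacy: mean={legacy_mean:.1f} ns, stdev={legacy_sd:.1f} ns")
print(f"v3:     mean={v3_mean:.1f} ns, stdev={v3_sd:.1f} ns")
print(f"difference in mean: {diff_pct:.1f}%")
```

In practice the two lists would be filled from the kernel durations each tool reports for the same kernel, so that the spread and the difference in averages can be compared directly.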
diff --git a/projects/rocprofiler-sdk/source/docs/data/hip_domain_stats.csv b/projects/rocprofiler-sdk/source/docs/data/hip_domain_stats.csv new file mode 100644 index 0000000000..55cd0fc2af --- /dev/null +++ b/projects/rocprofiler-sdk/source/docs/data/hip_domain_stats.csv @@ -0,0 +1,2 @@ +"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev" +"HIP_API",13,458514859,35270373.769231,100.00,2300,352276613,99315857.546240 \ No newline at end of file diff --git a/projects/rocprofiler-sdk/source/docs/data/rccl_trace.csv b/projects/rocprofiler-sdk/source/docs/data/rccl_trace.csv new file mode 100644 index 0000000000..6fee219c99 --- /dev/null +++ b/projects/rocprofiler-sdk/source/docs/data/rccl_trace.csv @@ -0,0 +1,22 @@ +"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" +"RCCL_API","ncclGetVersion",1834151,1834151,416,18413845573432,18413845577374 +"RCCL_API","ncclGetUniqueId",1834151,1834151,1116,18413961300878,18413963267869 +"RCCL_API","ncclGetUniqueId",1834151,1834151,1481,18414166449182,18414166720831 +"RCCL_API","ncclGroupStart",1834151,1834151,1482,18414166723772,18414166726834 +"RCCL_API","ncclGroupEnd",1834151,1834151,1490,18414166823575,18414380520973 +"RCCL_API","ncclCommInitAll",1834151,1834151,1477,18414166402665,18414380522536 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89098,18414380660695,18414380661652 +"RCCL_API","ncclAllReduce",1834151,1834151,89097,18414380653860,18414380693574 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89108,18414380694631,18414380694659 +"RCCL_API","ncclAllReduce",1834151,1834151,89107,18414380694212,18414380704722 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89117,18414380706650,18414380706677 +"RCCL_API","ncclAllReduce",1834151,1834151,89116,18414380705574,18414380715055 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89126,18414380715749,18414380715774 
+"RCCL_API","ncclAllReduce",1834151,1834151,89125,18414380715463,18414380723944 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89135,18414380724688,18414380724715 +"RCCL_API","ncclAllReduce",1834151,1834151,89134,18414380724395,18414380732209 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89154,18414380746383,18414380746411 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89157,18414380749863,18414380749889 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89160,18414380751671,18414380751696 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89163,18414380753326,18414380753353 +"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89166,18414380755128,18414380755154 diff --git a/projects/rocprofiler-sdk/source/docs/data/rocprofv3_hip_memcpy_summary.png b/projects/rocprofiler-sdk/source/docs/data/rocprofv3_hip_memcpy_summary.png new file mode 100644 index 0000000000..c7b048a783 Binary files /dev/null and b/projects/rocprofiler-sdk/source/docs/data/rocprofv3_hip_memcpy_summary.png differ diff --git a/projects/rocprofiler-sdk/source/docs/data/rocprofv3_memcpy_summary.png b/projects/rocprofiler-sdk/source/docs/data/rocprofv3_memcpy_summary.png new file mode 100644 index 0000000000..22f1f5134f Binary files /dev/null and b/projects/rocprofiler-sdk/source/docs/data/rocprofv3_memcpy_summary.png differ diff --git a/projects/rocprofiler-sdk/source/docs/data/rocprofv3_summary.png b/projects/rocprofiler-sdk/source/docs/data/rocprofv3_summary.png new file mode 100644 index 0000000000..7ea55c5b04 Binary files /dev/null and b/projects/rocprofiler-sdk/source/docs/data/rocprofv3_summary.png differ diff --git a/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst b/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst index 68b0236bf8..50452171f8 100644 --- a/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst +++ b/projects/rocprofiler-sdk/source/docs/how-to/using-rocprofv3.rst @@ -173,7 +173,7 @@ To use ``rocprofv3`` for application 
tracing, run: .. code-block:: bash - rocprofv3 -- + rocprofv3 -- HIP trace +++++++++++ @@ -184,7 +184,7 @@ To trace HIP runtime APIs, use: .. code-block:: bash - rocprofv3 --hip-trace -- < app_relative_path > + rocprofv3 --hip-trace -- The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID. @@ -203,7 +203,7 @@ To trace HIP compile time APIs, use: .. code-block:: shell - rocprofv3 --hip-compiler-trace -- < app_relative_path > + rocprofv3 --hip-compiler-trace -- The above command generates a ``hip_api_trace.csv`` file prefixed with the process ID. @@ -229,7 +229,7 @@ HSA trace contains the start and end time of HSA runtime API calls and their asy .. code-block:: bash - rocprofv3 --hsa-trace -- < app_relative_path > + rocprofv3 --hsa-trace -- The above command generates a ``hsa_api_trace.csv`` file prefixed with process ID. Note that the contents of this file have been truncated for demonstration purposes. @@ -295,7 +295,7 @@ To trace the API calls enclosed within the range, use: .. code-block:: bash - rocprofv3 --marker-trace -- < app_relative_path > + rocprofv3 --marker-trace -- Running the preceding command generates a ``marker_api_trace.csv`` file prefixed with the process ID. @@ -312,6 +312,74 @@ Here are the contents of ``marker_api_trace.csv`` file: For the description of the fields in the output file, see :ref:`output-file-fields`. +Kernel Rename +++++++++++++++ + +The ``--kernel-rename`` option renames kernels with the message of their enclosing roctxRangePush/roctxRangePop range. This option was known as --roctx-rename in earlier rocprof versions. + +The following code snippet shows how to use the ``--kernel-rename`` option: + +.. 
code-block:: cpp + + #include + + roctxRangePush("HIP_Kernel-1"); + + // Launching kernel from host + hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH); + + // Memory transfer from device to host + roctxRangePush("hipMemCpy-DeviceToHost"); + + hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost); + + roctxRangePop(); // for "hipMemCpy-DeviceToHost" + roctxRangePop(); // for "HIP_Kernel-1" + roctxRangeStop(rangeId); + +To rename the kernel, use: + +.. code-block:: bash + + rocprofv3 --marker-trace --kernel-rename -- + +The above command generates a ``marker_api_trace.csv`` file prefixed with the process ID. + +.. code-block:: shell + + $ cat 210_marker_api_trace.csv + "Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" + "MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247 + "MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502 + "MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267 + "MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787 + "MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586 + "MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270 + "MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485 + + +Kokkos Trace +++++++++++++++ + +rocprofv3 has built-in `Kokkos Tools library `_ support for tracing Kokkos API calls. `Kokkos `_ is a C++ library for writing performance-portable applications. It is used in many scientific applications to write performance-portable code that can run on CPUs, GPUs, and other accelerators. +rocprofv3 loads a built-in Kokkos tools library that emits roctx ranges with the labels passed through the API, e.g. 
Kokkos::parallel_for("MyParallelForLabel", …); internally calls roctxRangePush. Kernel renaming is also enabled so that the highly templated kernel names are replaced by the Kokkos labels. +To enable built-in marker support, use the ``--kokkos-trace`` option. Internally, this option enables ``--marker-trace`` and ``--kernel-rename``: + +.. code-block:: bash + + rocprofv3 --kokkos-trace -- + +The above command generates a ``marker_api_trace.csv`` file prefixed with the process ID. + +.. code-block:: shell + + $ cat 210_marker_api_trace.csv + "Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp" + "MARKER_CORE_API","Kokkos::Initialization Complete",4069256,4069256,1,56728499773965,56728499773965 + "MARKER_CORE_API","Kokkos::Impl::CombinedFunctorReducer, CountFunctor, long int>::Reducer, void>",4069256,4069256,2,56728501756088,56728501764241 + "MARKER_CORE_API","Kokkos::parallel_reduce: fence due to result being value, not view",4069256,4069256,4,56728501767957,56728501769600 + "MARKER_CORE_API","Kokkos::Finalization Complete",4069256,4069256,6,56728502054554,56728502054554 + Kernel trace ++++++++++++++ @@ -319,7 +387,7 @@ To trace kernel dispatch traces, use: .. code-block:: shell - rocprofv3 --kernel-trace -- < app_relative_path > + rocprofv3 --kernel-trace -- The above command generates a ``kernel_trace.csv`` file prefixed with the process ID. @@ -343,7 +411,7 @@ To trace memory moves across the application, use: .. code-block:: shell - rocprofv3 –-memory-copy-trace -- < app_relative_path > + rocprofv3 --memory-copy-trace -- The above command generates a ``memory_copy_trace.csv`` file prefixed with the process ID. @@ -400,7 +468,7 @@ memory operations (copies and scratch). .. 
code-block:: shell - rocprofv3 –-runtime-trace -- < app_relative_path > + rocprofv3 --runtime-trace -- Running the above command generates ``hip_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID. @@ -411,7 +479,7 @@ This is an all-inclusive option to collect all the above-mentioned traces. .. code-block:: shell - rocprofv3 –-sys-trace -- < app_relative_path > + rocprofv3 --sys-trace -- Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID. @@ -422,19 +490,45 @@ This option collects scratch memory operation's traces. Scratch is an address sp .. code-block:: shell - rocprofv3 --scratch-memory-trace -- < app_relative_path > + rocprofv3 --scratch-memory-trace -- -Stats -++++++++ + +RCCL trace +++++++++++++ + +`RCCL `_ (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs. This option traces those communication routines. + +.. code-block:: shell + + rocprofv3 --rccl-trace -- + +The above command generates a ``rccl_api_trace.csv`` file prefixed with the process ID. + +.. code-block:: shell + + $ cat 197_rccl_api_trace.csv + +Here are the contents of the ``rccl_api_trace.csv`` file: + +.. csv-table:: RCCL trace + :file: /data/rccl_trace.csv + :widths: 10,10,10,10,10,20,20 + :header-rows: 1 + +Post-processing tracing options +++++++++++++++++++++++++++++++++ + +1. Stats ++++++++++ This option collects statistics for the enabled tracing types. For example, to collect statistics of HIP APIs, when HIP trace is enabled. 
A higher percentage in statistics can help user focus on the API/function that has taken the most time: .. code-block:: shell - rocprofv3 --stats --hip-trace -- < app_relative_path > + rocprofv3 --stats --hip-trace -- -The above command generates a ``hip_api_stats.csv`` and ``hip_api_trace`` file prefixed with the process ID. +The above command generates ``hip_api_stats.csv``, ``domain_stats.csv``, and ``hip_api_trace.csv`` files prefixed with the process ID. .. code-block:: shell @@ -447,8 +541,60 @@ Here are the contents of ``hip_api_stats.csv`` file: :widths: 10,10,20,20,10,10,10,10 :header-rows: 1 +Here are the contents of the ``domain_stats.csv`` file: + +.. csv-table:: Domain stats + :file: /data/hip_domain_stats.csv + :widths: 10,10,20,20,10,10,10,10 + :header-rows: 1 + For the description of the fields in the output file, see :ref:`output-file-fields`. +2. Summary ++++++++++++ + +Outputs a single summary of the tracing data at the conclusion of the profiling session: + +.. code-block:: shell + + rocprofv3 -S --hip-trace -- + +.. image:: /data/rocprofv3_summary.png + + +2.1 Summary per domain +++++++++++++++++++++++ + +Outputs the summary of each tracing domain at the end of the profiling session. + +.. code-block:: shell + + rocprofv3 -D --hsa-trace --hip-trace -- + +The above command generates ``hip_trace.csv`` and ``hsa_trace.csv`` files prefixed with the process ID, along with a summary of each domain in the terminal. + +2.2 Summary groups ++++++++++++++++++++ + +Users can create a summary of multiple domains by specifying the domain names on the command line. The summary groups are separated by a pipe (|) symbol. +To create a summary for the ``MEMORY_COPY`` domain, use: + +.. code-block:: shell + + rocprofv3 --summary-groups MEMORY_COPY --sys-trace -- + +.. image:: /data/rocprofv3_memcpy_summary.png + + +To create a summary for the ``MEMORY_COPY`` and ``HIP_API`` domains, use: + +.. code-block:: shell + + rocprofv3 --summary-groups 'MEMORY_COPY|HIP_API' --sys-trace -- + +.. 
image:: /data/rocprofv3_hip_memcpy_summary.png + + Kernel profiling ------------------- @@ -540,7 +686,7 @@ Properties { "jobs": [ { - "pmc": ["SQ_WAVES", "GRBM_COUNT", "GUI_ACTIVE"] + "pmc": ["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"] }, { "pmc": ["FETCH_SIZE", "WRITE_SIZE"], @@ -564,7 +710,7 @@ Properties - pmc: - SQ_WAVES - GRBM_COUNT - - GUI_ACTIVE + - GRBM_GUI_ACTIVE - 'TCC_HIT[1]' - 'TCC_HIT[2]' - pmc: @@ -581,7 +727,7 @@ To supply the counters via ``command-line`` options, use: .. code-block:: shell - rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- + rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- .. note:: 1. Please note that more than 1 counters should be separated by a space or a comma. @@ -594,7 +740,7 @@ To supply the input file for kernel profiling, use: .. code-block:: shell - rocprofv3 -i input.txt -- + rocprofv3 -i input.txt -- Running the above command generates a ``./pmc_n/counter_collection.csv`` file prefixed with the process ID. For each ``pmc`` row, a directory ``pmc_n`` containing a ``counter_collection.csv`` file is generated, where n = 1 for the first row and so on. @@ -666,7 +812,7 @@ To collect counters for the kernels matching the filters specified in the preced .. 
code-block:: shell - rocprofv3 -i input.yml -- + rocprofv3 -i input.yml -- $ cat pass_1/312_counter_collection.csv "Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp" diff --git a/projects/rocprofiler-sdk/source/docs/rocprofiler-sdk.dox.in b/projects/rocprofiler-sdk/source/docs/rocprofiler-sdk.dox.in index 7188a97c6d..e8fdd7609c 100644 --- a/projects/rocprofiler-sdk/source/docs/rocprofiler-sdk.dox.in +++ b/projects/rocprofiler-sdk/source/docs/rocprofiler-sdk.dox.in @@ -139,7 +139,8 @@ FILE_PATTERNS = *.h \ *.tcc \ conf.py RECURSIVE = YES -EXCLUDE = +EXCLUDE = @SOURCE_DIR@/README.md \ + @SOURCE_DIR@/include/rocprofiler-sdk/rccl/details/rccl.h EXCLUDE_SYMLINKS = YES EXCLUDE_PATTERNS = */.git/* \ @SOURCE_DIR@/**/tests/* \ @@ -266,7 +267,7 @@ LATEX_CMD_NAME = latex MAKEINDEX_CMD_NAME = makeindex LATEX_MAKEINDEX_CMD = makeindex COMPACT_LATEX = NO -PAPER_TYPE = +PAPER_TYPE = a4 EXTRA_PACKAGES = float LATEX_HEADER = LATEX_FOOTER = @@ -332,7 +333,8 @@ MACRO_EXPANSION = YES EXPAND_ONLY_PREDEF = NO SEARCH_INCLUDES = NO INCLUDE_PATH = @SOURCE_DIR@/source/include -INCLUDE_FILE_PATTERNS = *.h +INCLUDE_FILE_PATTERNS = *.h \ + *.hpp PREDEFINED = "ROCPROFILER_API=" \ "ROCPROFILER_EXPORT=" \ "ROCPROFILER_IMPORT=" \ diff --git a/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rccl/details/rccl.h b/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rccl/details/rccl.h index 304f60256c..e114a26d78 100644 --- a/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rccl/details/rccl.h +++ b/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rccl/details/rccl.h @@ -47,7 +47,7 @@ typedef struct } ncclUniqueId; /*! 
@defgroup rccl_result_code Result Codes - @details The various result codes that RCCL API calls may return + @brief The various result codes that RCCL API calls may return @{ */ /*! @brief Result type @@ -71,12 +71,14 @@ typedef enum #define NCCL_SPLIT_NOCOLOR -1 /*! @defgroup rccl_config_type Communicator Configuration - @details Structure that allows for customizing Communicator behavior via + @brief Structure that allows for customizing Communicator behavior via ncclCommInitRankConfig @{ */ -/*! @brief Communicator configuration - @details Users can assign value to attributes to specify the behavior of a communicator */ +/** + * @defgroup Communicator configuration + * @brief Users can assign value to attributes to specify the behavior of a communicator. + */ typedef struct ncclConfig_v21700 { /* attributes that users should never touch. */ @@ -122,7 +124,7 @@ ncclResult_t pncclMemFree(void* ptr); /*! @defgroup rccl_api_version Version Information - @details API call that returns RCCL version + @brief API call that returns RCCL version @{ */ /*! @brief Return the RCCL_VERSION_CODE of RCCL in the supplied integer. @@ -140,7 +142,7 @@ pncclGetVersion(int* version); /*! @} */ /*! @defgroup rccl_api_communicator Communicator Initialization/Destruction - @details API calls that operate on communicators. + @brief API calls that operate on communicators. Communicators objects are used to launch collective communication operations. Unique ranks between 0 and N-1 must be assigned to each HIP device participating in the same Communicator. @@ -292,7 +294,7 @@ pncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclCon /*! @} */ /*! @defgroup rccl_api_errcheck Error Checking Calls - @details API calls that check for errors + @brief API calls that check for errors @{ */ /*! @brief Returns a string for each result code. @@ -330,7 +332,7 @@ pncclCommGetAsyncError(ncclComm_t comm, ncclResult_t* asyncError); /*! @} */ /*! 
@defgroup rccl_api_comminfo Communicator Information - @details API calls that query communicator information + @brief API calls that query communicator information @{ */ /*! @brief Gets the number of ranks in the communicator clique. @@ -391,7 +393,7 @@ pncclCommDeregister(const ncclComm_t comm, void* handle); /*! @endcond */ /*! @defgroup rccl_api_enumerations API Enumerations - @details Enumerations used by collective communication calls + @brief Enumerations used by collective communication calls @{ */ /*! @brief Dummy reduction enumeration @@ -454,7 +456,7 @@ typedef enum /*! @} */ /*! @defgroup rccl_api_custom_redop Custom Reduction Operator - @details API calls relating to creation/destroying custom reduction operator + @brief API calls relating to creation/destroying custom reduction operator that pre-multiplies local source arrays prior to reduction @{ */ @@ -516,7 +518,7 @@ pncclRedOpDestroy(ncclRedOp_t op, ncclComm_t comm); /*! @} */ /*! @defgroup rccl_collective_api Collective Communication Operations - @details Collective communication operations must be called separately for each + @brief Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the HIP stream. @@ -935,7 +937,7 @@ pncclAllToAllv(const void* sendbuff, /*! @} */ /*! @defgroup msccl_api MSCCL Algorithm - @details API calls relating to the optional MSCCL algorithm datapath + @brief API calls relating to the optional MSCCL algorithm datapath @{ */ /*! @brief Opaque handle to MSCCL algorithm */ @@ -1030,7 +1032,7 @@ pmscclUnloadAlgo(mscclAlgoHandle_t mscclAlgoHandle); /*! @} */ /*! 
@defgroup rccl_group_api Group semantics - @details When managing multiple GPUs from a single thread, and since RCCL collective + @brief When managing multiple GPUs from a single thread, and since RCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. diff --git a/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/registration.h b/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/registration.h index 837367af27..a1a8f81338 100644 --- a/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/registration.h +++ b/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/registration.h @@ -30,7 +30,7 @@ ROCPROFILER_EXTERN_C_INIT /** * @defgroup REGISTRATION_GROUP Tool registration * - * Data types and functions for tool registration with rocprofiler + * @brief Data types and functions for tool registration with rocprofiler * @{ */ diff --git a/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rocprofiler.h b/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rocprofiler.h index e129b92f12..b2ddf54339 100644 --- a/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rocprofiler.h +++ b/projects/rocprofiler-sdk/source/include/rocprofiler-sdk/rocprofiler.h @@ -86,7 +86,7 @@ ROCPROFILER_EXTERN_C_INIT /** * @defgroup MISCELLANEOUS_GROUP Miscellaneous Utility Functions - * + * @brief Utility functions for the library * @{ */
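The stats columns introduced in the `using-rocprofv3.rst` hunk of this diff ("Calls", "TotalDurationNs", "AverageNs", "Percentage", "MinNs", "MaxNs") can be approximated from any trace CSV that carries "Function", "Start_Timestamp", and "End_Timestamp" columns. The following is a minimal Python sketch, not the way rocprofv3 itself computes its stats files. It assumes the timestamps are in nanoseconds and that "Percentage" is each function's share of the total duration; `function_stats` is a hypothetical helper name, and the sample rows are copied from the `rccl_trace.csv` added in this diff.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the rccl_trace.csv sample in this diff.
SAMPLE = '''"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"RCCL_API","ncclGetVersion",1834151,1834151,416,18413845573432,18413845577374
"RCCL_API","ncclGetUniqueId",1834151,1834151,1116,18413961300878,18413963267869
"RCCL_API","ncclGetUniqueId",1834151,1834151,1481,18414166449182,18414166720831
'''


def function_stats(csv_text):
    """Aggregate per-function durations from trace CSV text.

    Assumption: duration = End_Timestamp - Start_Timestamp, in nanoseconds.
    """
    durations = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        durations[row["Function"]].append(duration)

    grand_total = sum(sum(values) for values in durations.values())
    stats = {}
    for name, values in durations.items():
        total = sum(values)
        stats[name] = {
            "Calls": len(values),
            "TotalDurationNs": total,
            "AverageNs": total / len(values),
            # Assumed definition: share of the summed duration of all rows.
            "Percentage": 100.0 * total / grand_total if grand_total else 0.0,
            "MinNs": min(values),
            "MaxNs": max(values),
        }
    return stats


stats = function_stats(SAMPLE)
for name, entry in stats.items():
    print(name, entry)
```

The "StdDev" column is left out on purpose, because the source does not say whether it is a sample or a population standard deviation.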