From 987ae3cc47d77358c6f0685a00685f5699fcecec Mon Sep 17 00:00:00 2001 From: Ammar ELWazir Date: Fri, 24 May 2024 09:49:44 -0500 Subject: [PATCH] PC Sampling Support (#715) * cmake formatting (cmake-format) (#188) Co-authored-by: vlaindic * source formatting (clang-format v11) (#189) Co-authored-by: vlaindic * pcs: design of the pc sampling data struct; guarding parts of code that uses ROCr marker packets * source formatting (clang-format v11) (#191) Co-authored-by: vlaindic * cmake formatting (cmake-format) (#192) Co-authored-by: vlaindic * pcs: shadow variable fix * pcs: fix for compiler errors reported by CI/CD * source formatting (clang-format v11) (#193) Co-authored-by: vlaindic * pcs: docs fix; samples uses rocprofiler::rocprofiler library * cmake formatting (cmake-format) (#195) Co-authored-by: vlaindic * pcs: client in samples folder fixed * pcs: client requires rocprofiler package as dependency * pcs: client uses single context * source formatting (clang-format v11) (#196) Co-authored-by: vlaindic * pcs: client using single buffer; no buffer destroy in client * pcs: client::setup explicitly called from the example * pcs: rocprofiler_pc_sample_record_t updated * pcs: fixed init of external correlation id * source formatting (clang-format v11) (#198) Co-authored-by: vlaindic * pcs: remove outdated files; update CMakeLists * cmake formatting (cmake-format) (#212) Co-authored-by: vlaindic * pcs: using rocprofiler_agent_id_t * pcs: Removing trailing whitespaces Co-authored-by: Jonathan R. 
Madsen * source formatting (clang-format v11) (#214) Co-authored-by: vlaindic * pcs: mapping agent_id to the agent * source formatting (clang-format v11) (#215) Co-authored-by: vlaindic * pcs: const while iterating over agents * source formatting (clang-format v11) (#216) Co-authored-by: vlaindic * pcs: calling get_buffer instead of get_buffers * pcs: workgroup typo * pcs: documentation for the public PC sampling API * pcs: queue_cb_t signature adaptation * pcs: mocks removed * pcs: updating HsaApiTable with HSA/ROCr PC sampling API * pcs: querying available PC sampling configs through IOCTL * pcs: create the PCS session in IOCTL * pcs: first actual PC samples delivered to the rocprofiler's client :) * pcs: works with marker packet too * pcs: using HSA table to call pc sampling related functions * pcs: using ioctl instead of kfd in naming * pcs: configuration service test fixed * pcs: sample processing test fixed * pcs: marker packet macro wrapper removed * pcs: marker packet is part of the rocprofiler_packet union * pcs: one fixme added * pcs: client that uses pc-sampling and code obj tracing * pcs: client that supprts PC sampling and code obj tracing refactored * pcs: show more info for each PC sample * pcs: hex output for the samples that do not belong to the matmul kernel * pcs: querying avail configuration happens immediately before configuring * pcs: hsa_ven_amd_pcs_create_from_id renamed * pcs: using hsa_stop; accessing a buffer by id from parser * pcs: includes reworked, tests returned to life * pcs: rocrofiler dir removed as outdated * cmake formatting (cmake-format) (#271) Co-authored-by: vlaindic * source formatting (clang-format v11) (#272) Co-authored-by: vlaindic * pcs: some warnings fixed * source formatting (clang-format v11) (#273) Co-authored-by: vlaindic * cmake formatting (cmake-format) (#274) Co-authored-by: vlaindic * pcs: show MI200 relevant information in the sample * pcs: queue cb fixed; rocr.h include fixed * source formatting 
(clang-format v11) (#296) Co-authored-by: vlaindic * pcs: getting hsa_agent and the doorbell_id from hsa_queue * source formatting (clang-format v11) (#297) Co-authored-by: vlaindic * pcs: correlation ID logic fixed * source formatting (clang-format v11) (#303) Co-authored-by: vlaindic * pcs: pure pc sampling example fixed * source formatting (clang-format v11) (#307) Co-authored-by: vlaindic * cmake formatting (cmake-format) (#308) Co-authored-by: vlaindic * pcs: interval value if the PC sampling is already configured * pcs: ROCPROFILER_STATUS_ERROR_PC_SAMPLING_ALREADY_CONFIGURED New status code if another process configured PC sampling service with different configuration. Samples are extended to consider this case and retry if it happens. * pcs: hsa_amd_queue_get_info mocked in tests * source formatting (clang-format v11) (#328) Co-authored-by: vlaindic * pcs (tests): query configs after configuring service * source formatting (clang-format v11) (#329) Co-authored-by: vlaindic * pcs: sample checks workgroup_id_* and wave_id * source formatting (clang-format v11) (#330) Co-authored-by: vlaindic * pcs samples: running samples on the device 0 * pcs: kfd_ioctl updated * pcs: ioctl config struct changed fields names * pcs: status when PC sampling is configured by another process is renamed * pcs: HSA PC sampling API table fixed * pcs: tmp hack to be able to use HSA pc sampling table * source formatting (clang-format v11) (#443) Co-authored-by: vlaindic * pcs service use CIDs generated by HIP API tracing service * source formatting (clang-format v11) (#455) Co-authored-by: vlaindic * cmake formatting (cmake-format) (#456) Co-authored-by: vlaindic * pcs: CID manager * pcs: explicit flush with no delivered data executes retirement logic * source formatting (clang-format v11) (#464) Co-authored-by: vlaindic * pcs: rocprofiler_query_pc_sampling_agent_configurations docs update * source formatting (clang-format v11) (#465) Co-authored-by: vlaindic * pcs: 
rocprofiler_configure_pc_sampling_service docs update * pcs: explicit sync introduced in PCSCIDManager * pcs: new logic for retiring CIDs in PC sampling service documented * pcs: queue interception cb signature updated * source formatting (clang-format v11) (#471) Co-authored-by: vlaindic * pcs: if no agents supports PC sampling, fail gracefully * elaborating when KFD returns EBUSY and EEXIST * pcs: the second PC sampling examples fails gracefully * code samples use only single kernel for now * pcs: CID manager refactored * source formatting (clang-format v11) (#481) Co-authored-by: vlaindic * pcs: ioctl update * source formatting (clang-format v11) (#531) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs:code sample to test PC sampling applied on concurrent kernels * source formatting (clang-format v11) (#533) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: pc sampling strest test included * cmake formatting (cmake-format) (#539) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * source formatting (clang-format v11) (#540) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: standalone benchmark * cmake formatting (cmake-format) (#555) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: glance in external correlation IDs * source formatting (clang-format v11) (#557) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * another change in ioctl interface * pcs: update queue interceptor callbacks and samples accroding to the agent 0 version * source formatting (clang-format v11) (#611) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: avoid running problematic PC sampling test * pcs: guarding tests not to fail on architectures not supporting PC sampling * source formatting (clang-format v11) (#617) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: check IOCTL 
version prior to each KFD call * pcs: ioctl refactoring * pcs: PC sampling service increases the ref_count of the correlation ID of the kernel dispatch * cmake formatting (cmake-format) (#631) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * source formatting (clang-format v11) (#632) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: PC sampling service provides external correlation IDs * source formatting (clang-format v11) (#644) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: use rocprofiler_dim3_t for workgrou_ip * source formatting (clang-format v11) (#645) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: minor fixes * pcs: updating the documentation for the pc sampling API functions * pcs: api table and queue controller fix * pcs: don't generate marker packets for the agent if PC sampling is not configured on it * pcs: multi-GPU and single-GPU clients * source formatting (clang-format v11) (#700) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: warning and errors fixed * source formatting (clang-format v11) (#702) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: clang compiler errors and warnings fixed * source formatting (clang-format v11) (#716) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: const reference in cid manager * source formatting (clang-format v11) (#717) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: const & func in manager explicit * pcs: test to cover creating PC sampling service of agent that does not exist * pcs: generate marker packets if service is active * source formatting (clang-format v11) (#719) Co-authored-by: vlaindic <139573562+vlaindic@users.noreply.github.com> * pcs: refactoring hsa_adapter; use the correlation_id->thread_idx * Update source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp * Update 
source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp * Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp * Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp * Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp * Update source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp * Update source/lib/rocprofiler-sdk/pc_sampling/utils.cpp * Update utils.cpp * moving pc-sampling tests and samples to pc-sampling label * Format fix * pcs: use configured instead of active service * Update source/lib/rocprofiler-sdk/pc_sampling/service.cpp * pcs: ensure configuring PC sampling on the HSA level is called only once * pcs: minor fix * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * pcs: refactoring IOCTL integration * Update source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt Co-authored-by: Ammar ELWazir * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * pcs: reverting back what bot doubled * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * pcs: retesting the bot * Update 
source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * pcs: why bot fails on this IOCTL status * pcs: why failing on * Update source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * pcs: returning commits removed by bot * pcs: formatting locally * pcs: clients are flushing buffers inside the tool_fini * pcs: sync function in public API * pcs: sync prior to unloading the code object * pcs: sync function requires context * pcs: client uses CID retirement service * pcs: test for flusing internal ROCr buffers * pcs: source formatting * Update source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * pcs: code samples refactoring * pcs: public API header refactored * pcs: rocprofiler_buffer_flush drains internal PC sampling buffers too * pcs: remove unnecessary functions * pcs: do not call hsa's copytables * pcs: include reordering * pcs: using ROCP_ERROR inside PC sampling implementation * pcs: pc_sampling sample uses ostream instean of printfs * pcs: pc_sampling_codeobj tracing using ostream instead of prints * pcs: registering once for interceptor callbacks * pcs: do not generate internal CIDs if not in debug mode * pcs: rebasing fixed; missing external correlation IDs * pcs: code formatting * enable kernel tracing service to receive external correlation IDs * pcs: using ROCPROFILER_STATUS_ERROR_INCOMPATIBLE_KERNEL * pcs: polishing parser * formatting * updating parser to use workgroup_id * kfd_ioctl.h extracted in details folder * refactoring * pcs: preparing to generate code object information * flush internal buffers prior to unloading code object * pcs: generating marker records * pcs: wrap code_object's shutdown function * ROCR_VISIBLE_DEVICES and 
HIP_VISISBLE_DEVICES unsupported at the moment * documenting the ignorance of ROCR/HIP_VISIBLE_DEVICES * pcs: separate structs for code object loading/unloading markers * pcs: inst_pkt_t changed the namespace * pcs: removing wrapper around the shutdown function * pcs: size in record field * pcs: documentation refactoring + typdefs * renaming PCSAgentConfig to PCSAgentSession * pcs: service does not keep a pointer to the context * pcs: static assertions related to the versioning * pcs: rocprofiler_pc_sampling_configuration_t size field * pcs: report API unimplemented unleass explicitly enabled * pcs: skip tests if KFD does not support PC sampling * pcs: if ROCr hides some devices, no PC samples will be delivered for it * pcs: hip error check after kernel launch * formatting * removing PCS info from agent.h * fix based on review * Update continuous integration workflow - use mi200 runner for code coverage (supports PC sampling) - split sanitizer jobs across navi3, vega20, and mi300 * Updating pc sampling test labels * ROCP_PC_SAMPLING_ENABLED env in CI * ROCP_PC_SAMPLING_ENABLED for all CI mi200 jobs * Rearrange sanitizer assignments * fixes according to review * removed unused functions * pcs: rocprofiler_agent_id_t instead of handle as a key in map * Update source/lib/rocprofiler-sdk/context/context.hpp Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * removing drm_fd from the agent.h * pcs: removing one sample due to complexity * pcs: refactoring sample * simplifying sample * new lines * Improve queue_control enable intercepter logic * Update lib/rocprofiler-sdk/hsa/types.hpp - handle amd_ext size for HSA 1.12.0 * ROCP_PC_SAMPLING_ENABLED -> ROCPROFILER_PC_SAMPLING_BETA_ENABLED * Update hsa_adapter.cpp - anonymous namespace + remove debug * parser update * Apply suggestions from code review --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: vlaindic 
Co-authored-by: vlaindic Co-authored-by: vlaindic Co-authored-by: Vladimir Indic <139573562+vlaindic@users.noreply.github.com> Co-authored-by: Jonathan R. Madsen Co-authored-by: gobhardw Co-authored-by: Jonathan R. Madsen --- .github/workflows/continuous_integration.yml | 32 +- samples/CMakeLists.txt | 1 + samples/pc_sampling/CMakeLists.txt | 63 + samples/pc_sampling/client.cpp | 230 +++ samples/pc_sampling/main.cpp | 412 ++++ samples/pc_sampling/pcs.cpp | 317 +++ samples/pc_sampling/pcs.hpp | 79 + samples/pc_sampling/utils.cpp | 37 + samples/pc_sampling/utils.hpp | 36 + source/include/rocprofiler-sdk/agent.h | 15 +- .../rocprofiler-sdk/cxx/serialization.hpp | 1 - source/include/rocprofiler-sdk/fwd.h | 20 +- source/include/rocprofiler-sdk/pc_sampling.h | 299 +-- source/lib/rocprofiler-sdk/CMakeLists.txt | 1 + source/lib/rocprofiler-sdk/agent.cpp | 27 - source/lib/rocprofiler-sdk/buffer.cpp | 5 + .../lib/rocprofiler-sdk/context/context.cpp | 8 + .../lib/rocprofiler-sdk/context/context.hpp | 14 + .../rocprofiler-sdk/details/CMakeLists.txt | 8 + .../lib/rocprofiler-sdk/details/kfd_ioctl.h | 1822 +++++++++++++++++ source/lib/rocprofiler-sdk/hsa/queue.cpp | 8 + .../rocprofiler-sdk/hsa/queue_controller.cpp | 29 +- .../hsa/rocprofiler_packet.hpp | 5 + source/lib/rocprofiler-sdk/hsa/types.hpp | 4 +- .../page_migration/CMakeLists.txt | 2 - .../page_migration/details/CMakeLists.txt | 7 - .../page_migration/details/kfd_ioctl.h | 1711 ---------------- .../page_migration/page_migration.cpp | 2 +- .../rocprofiler-sdk/page_migration/utils.hpp | 2 +- source/lib/rocprofiler-sdk/pc_sampling.cpp | 85 +- .../pc_sampling/CMakeLists.txt | 13 + .../pc_sampling/cid_manager.cpp | 142 ++ .../pc_sampling/cid_manager.hpp | 119 ++ .../pc_sampling/code_object.cpp | 190 ++ .../pc_sampling/code_object.hpp | 40 + .../pc_sampling/hsa_adapter.cpp | 378 ++++ .../pc_sampling/hsa_adapter.hpp | 56 + .../pc_sampling/ioctl/CMakeLists.txt | 6 + .../pc_sampling/ioctl/ioctl_adapter.cpp | 383 ++++ 
.../pc_sampling/ioctl/ioctl_adapter.hpp | 50 + .../pc_sampling/ioctl/ioctl_adapter_types.hpp | 108 + .../pc_sampling/parser/correlation.hpp | 7 +- .../pc_sampling/parser/parser_types.h | 2 +- .../parser/pc_record_interface.cpp | 20 +- .../parser/pc_record_interface.hpp | 55 +- .../parser/tests/benchmark_test.cpp | 10 +- .../parser/tests/correlation_id_test.cpp | 24 +- .../pc_sampling/parser/tests/gfx9test.cpp | 26 +- .../pc_sampling/parser/tests/mocks.hpp | 8 +- .../pc_sampling/parser/translation.hpp | 28 +- .../rocprofiler-sdk/pc_sampling/service.cpp | 268 +++ .../rocprofiler-sdk/pc_sampling/service.hpp | 66 + .../pc_sampling/tests/CMakeLists.txt | 29 + .../pc_sampling/tests/configure_service.cpp | 453 ++++ .../tests/pc_sampling_internals.hpp | 63 + .../pc_sampling/tests/query_configuration.cpp | 364 ++++ .../pc_sampling/tests/samples_processing.cpp | 437 ++++ .../lib/rocprofiler-sdk/pc_sampling/types.hpp | 44 + .../lib/rocprofiler-sdk/pc_sampling/utils.cpp | 79 + .../lib/rocprofiler-sdk/pc_sampling/utils.hpp | 62 + source/lib/rocprofiler-sdk/registration.cpp | 16 + source/lib/rocprofiler-sdk/rocprofiler.cpp | 5 + source/lib/rocprofiler-sdk/tests/agent.cpp | 8 +- .../rocprofiler-sdk/tests/page_migration.cpp | 2 +- 64 files changed, 6831 insertions(+), 2012 deletions(-) create mode 100644 samples/pc_sampling/CMakeLists.txt create mode 100644 samples/pc_sampling/client.cpp create mode 100644 samples/pc_sampling/main.cpp create mode 100644 samples/pc_sampling/pcs.cpp create mode 100644 samples/pc_sampling/pcs.hpp create mode 100644 samples/pc_sampling/utils.cpp create mode 100644 samples/pc_sampling/utils.hpp create mode 100644 source/lib/rocprofiler-sdk/details/CMakeLists.txt create mode 100644 source/lib/rocprofiler-sdk/details/kfd_ioctl.h delete mode 100644 source/lib/rocprofiler-sdk/page_migration/details/CMakeLists.txt delete mode 100644 source/lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h create mode 100644 
source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/cid_manager.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/code_object.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/code_object.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/ioctl/CMakeLists.txt create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/service.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/service.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/tests/configure_service.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/tests/query_configuration.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/tests/samples_processing.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/types.hpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/utils.cpp create mode 100644 source/lib/rocprofiler-sdk/pc_sampling/utils.hpp diff --git a/.github/workflows/continuous_integration.yml b/.github/workflows/continuous_integration.yml index 29b1df0c4d..418ea7a8ca 100644 --- a/.github/workflows/continuous_integration.yml +++ b/.github/workflows/continuous_integration.yml @@ -70,6 +70,12 @@ jobs: run: | echo 'EXCLUDED_TESTS=${{ env.PC_SAMPLING_TESTS_REGEX }}' >> $GITHUB_ENV + - name: Enable PC Sampling + if: ${{ contains(matrix.runner, 'mi200') }} + shell: bash + run: | + echo 
'ROCPROFILER_PC_SAMPLING_BETA_ENABLED=1' >> $GITHUB_ENV + - name: Configure, Build, and Test timeout-minutes: 30 shell: bash @@ -153,7 +159,7 @@ jobs: strategy: fail-fast: false matrix: - runner: ['navi3'] + runner: ['mi200'] os: ['ubuntu-22.04'] build-type: ['Release'] @@ -164,6 +170,7 @@ jobs: env: GIT_DISCOVERY_ACROSS_FILESYSTEM: 1 GCC_COMPILER_VERSION: 11 + ROCPROFILER_PC_SAMPLING_BETA_ENABLED: 1 steps: - name: Patch Git @@ -225,12 +232,6 @@ jobs: for i in python3 git cmake ctest gcc g++ gcov; do which-realpath $i; done ls -la - - name: Exclude PC Sampling Tests - if: ${{ !contains(matrix.runner, 'mi200') && !contains(matrix.runner, 'mi300') }} - shell: bash - run: | - echo 'EXCLUDED_TESTS=${{ env.PC_SAMPLING_TESTS_REGEX }}' >> $GITHUB_ENV - - name: Configure, Build, and Test (Total Code Coverage) timeout-minutes: 30 shell: bash @@ -384,11 +385,17 @@ jobs: strategy: fail-fast: false matrix: - runner: ['mi200'] + runner: ['navi3', 'vega20', 'mi300'] sanitizer: ['AddressSanitizer', 'ThreadSanitizer', 'LeakSanitizer'] os: ['ubuntu-22.04'] build-type: ['RelWithDebInfo'] - ci-flags: [''] + exclude: + - { runner: 'navi3', sanitizer: 'ThreadSanitizer' } + - { runner: 'navi3', sanitizer: 'LeakSanitizer' } + - { runner: 'vega20', sanitizer: 'AddressSanitizer' } + - { runner: 'vega20', sanitizer: 'LeakSanitizer' } + - { runner: 'mi300', sanitizer: 'AddressSanitizer' } + - { runner: 'mi300', sanitizer: 'ThreadSanitizer' } if: ${{ contains(github.event_name, 'pull_request') }} runs-on: ${{ matrix.runner }}-runner-set @@ -427,6 +434,12 @@ jobs: run: | echo 'EXCLUDED_TESTS=${{ env.PC_SAMPLING_TESTS_REGEX }}' >> $GITHUB_ENV + - name: Enable PC Sampling + if: ${{ contains(matrix.runner, 'mi200') }} + shell: bash + run: | + echo 'ROCPROFILER_PC_SAMPLING_BETA_ENABLED=1' >> $GITHUB_ENV + - name: Configure, Build, and Test timeout-minutes: 45 shell: bash @@ -438,7 +451,6 @@ jobs: --gpu-targets ${{ env.GPU_TARGETS }} --memcheck ${{ matrix.sanitizer }} --run-attempt ${{ 
github.run_attempt }} - ${{ matrix.ci-flags }} -- -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -DCMAKE_INSTALL_PREFIX="${{ env.ROCM_PATH }}" diff --git a/samples/CMakeLists.txt b/samples/CMakeLists.txt index 870305dc64..32ec96b35c 100644 --- a/samples/CMakeLists.txt +++ b/samples/CMakeLists.txt @@ -32,3 +32,4 @@ add_subdirectory(intercept_table) add_subdirectory(code_object_isa_decode) add_subdirectory(advanced_thread_trace) add_subdirectory(external_correlation_id_request) +add_subdirectory(pc_sampling) diff --git a/samples/pc_sampling/CMakeLists.txt b/samples/pc_sampling/CMakeLists.txt new file mode 100644 index 0000000000..552db6392d --- /dev/null +++ b/samples/pc_sampling/CMakeLists.txt @@ -0,0 +1,63 @@ +# +# +# +cmake_minimum_required(VERSION 3.21.0 FATAL_ERROR) + +if(NOT CMAKE_HIP_COMPILER) + find_program( + amdclangpp_EXECUTABLE + NAMES amdclang++ + HINTS ${ROCM_PATH} ENV ROCM_PATH /opt/rocm + PATHS ${ROCM_PATH} ENV ROCM_PATH /opt/rocm + PATH_SUFFIXES bin llvm/bin NO_CACHE) + mark_as_advanced(amdclangpp_EXECUTABLE) + + if(amdclangpp_EXECUTABLE) + set(CMAKE_HIP_COMPILER "${amdclangpp_EXECUTABLE}") + endif() +endif() + +project(rocprofiler-sdk-samples-pc-sampling LANGUAGES CXX HIP) + +foreach(_TYPE DEBUG MINSIZEREL RELEASE RELWITHDEBINFO) + if("${CMAKE_HIP_FLAGS_${_TYPE}}" STREQUAL "") + set(CMAKE_HIP_FLAGS_${_TYPE} "${CMAKE_CXX_FLAGS_${_TYPE}}") + endif() +endforeach() + +find_package(rocprofiler-sdk REQUIRED) + +add_library(pc-sampling-client SHARED) +target_sources(pc-sampling-client PRIVATE client.cpp pcs.hpp pcs.cpp utils.hpp utils.cpp) +target_link_libraries( + pc-sampling-client + PRIVATE rocprofiler-sdk::rocprofiler-sdk rocprofiler::samples-build-flags + rocprofiler::samples-common-library) + +set_source_files_properties(main.cpp PROPERTIES LANGUAGE HIP) +find_package(Threads REQUIRED) + +add_executable(pc-sampling) +target_sources(pc-sampling PRIVATE main.cpp) +target_link_libraries(pc-sampling PRIVATE pc-sampling-client Threads::Threads + 
rocprofiler::samples-build-flags) + +rocprofiler_samples_get_preload_env(PRELOAD_ENV pc-sampling-client) +rocprofiler_samples_get_ld_library_path_env(LIBRARY_PATH_ENV) + +set(pc-sampling-env ${PRELOAD_ENV} ${LIBRARY_PATH_ENV}) + +add_test(NAME pc-sampling COMMAND $<TARGET_FILE:pc-sampling>) + +set_tests_properties( + pc-sampling + PROPERTIES TIMEOUT + 45 + LABELS + "samples;pc-sampling" + ENVIRONMENT + "${pc-sampling-env}" + FAIL_REGULAR_EXPRESSION + "${ROCPROFILER_DEFAULT_FAIL_REGEX}" + SKIP_REGULAR_EXPRESSION + "PC sampling unavailable") diff --git a/samples/pc_sampling/client.cpp b/samples/pc_sampling/client.cpp new file mode 100644 index 0000000000..66ad74df7b --- /dev/null +++ b/samples/pc_sampling/client.cpp @@ -0,0 +1,230 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +// undefine NDEBUG so asserts are implemented +#ifdef NDEBUG +# undef NDEBUG +#endif + +/** + * @file samples/pc_sampling/client.cpp + * + * @brief Example rocprofiler client (tool) + */ + +#include "pcs.hpp" +#include "utils.hpp" + +#include +#include +#include +#include +#include + +#include "common/defines.hpp" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace client +{ +namespace +{ +rocprofiler_client_id_t* client_id = nullptr; +rocprofiler_client_finalize_t client_fini_func = nullptr; +rocprofiler_context_id_t client_ctx; + +int +tool_init(rocprofiler_client_finalize_t fini_func, void* /*tool_data*/) +{ + client_fini_func = fini_func; + + client::pcs::find_all_gpu_agents_supporting_pc_sampling(); + + if(client::pcs::gpu_agents.empty()) + { + *utils::get_output_stream() << "No available gpu agents supporting PC sampling" << std::endl; + // Exit with no error if none of the GPUs support PC sampling. + exit(0); + } + + // The relations assumed: + // - One context for all gpu agents + // - a buffer per agent + // - a callback thread per buffer + // - a pc sampling service per agent/buffer + + ROCPROFILER_CHECK(rocprofiler_create_context(&client_ctx)); + + for(auto& gpu_agent : pcs::gpu_agents) + { + // creating a buffer that will hold pc sampling information + rocprofiler_buffer_policy_t drop_buffer_action = ROCPROFILER_BUFFER_POLICY_LOSSLESS; + auto buffer_id = rocprofiler_buffer_id_t{}; + ROCPROFILER_CHECK(rocprofiler_create_buffer(client_ctx, + client::pcs::BUFFER_SIZE_BYTES, + client::pcs::WATERMARK, + drop_buffer_action, + client::pcs::rocprofiler_pc_sampling_callback, + nullptr, + &buffer_id)); + + client::pcs::configure_pc_sampling_prefer_stochastic( + gpu_agent.get(), client_ctx, buffer_id); + + // One helper thread per GPU agent's buffer. 
+ auto client_agent_thread = rocprofiler_callback_thread_t{}; + ROCPROFILER_CHECK(rocprofiler_create_callback_thread(&client_agent_thread)); + + ROCPROFILER_CHECK(rocprofiler_assign_callback_thread(buffer_id, client_agent_thread)); + + client::pcs::buffer_ids.emplace_back(buffer_id); + } + + int valid_ctx = 0; + ROCPROFILER_CHECK(rocprofiler_context_is_valid(client_ctx, &valid_ctx)); + if(valid_ctx == 0) + { + // notify rocprofiler that initialization failed + // and all the contexts, buffers, etc. created + // should be ignored + return -1; + } + + // Start PC sampling + ROCPROFILER_CHECK(rocprofiler_start_context(client_ctx)); + + return 0; +} + +void +tool_fini(void* /*tool_data*/) +{ + if(client_id) + { + // Assert the context is inactive. + int state = -1; + ROCPROFILER_CHECK(rocprofiler_context_is_active(client_ctx, &state)) + assert(state == 0); + + // No need to stop the context, since it has been stopped implicitly by the rocprofiler-SDK. + for(size_t i = 0; i < client::pcs::buffer_ids.size(); i++) + { + // Flush the buffer explicitly + ROCPROFILER_CHECK(rocprofiler_flush_buffer(client::pcs::buffer_ids.at(i))); + // Destroying the buffer + rocprofiler_status_t status = rocprofiler_destroy_buffer(client::pcs::buffer_ids.at(i)); + if(status == ROCPROFILER_STATUS_ERROR_BUFFER_BUSY) + { + *utils::get_output_stream() + << "The buffer is busy, so we cannot destroy it at the moment." 
<< std::endl; + } + else + { + ROCPROFILER_CHECK(status); + } + } + } +} + +} // namespace + +// forward declaration +void +setup(); + +void +setup() +{ + if(int status = 0; + rocprofiler_is_initialized(&status) == ROCPROFILER_STATUS_SUCCESS && status == 0) + { + ROCPROFILER_CHECK(rocprofiler_force_configure(&rocprofiler_configure)); + } +} + +void +shutdown() +{} + +} // namespace client + +extern "C" rocprofiler_tool_configure_result_t* +rocprofiler_configure(uint32_t version, + const char* runtime_version, + uint32_t priority, + rocprofiler_client_id_t* id) +{ + // only activate if main tool + if(priority > 0) return nullptr; + + // set the client name + id->name = "PCSamplingExampleTool"; + + // store client info + client::client_id = id; + + // compute major/minor/patch version info + uint32_t major = version / 10000; + uint32_t minor = (version % 10000) / 100; + uint32_t patch = version % 100; + + // generate info string + auto info = std::stringstream{}; + info << id->name << " is using rocprofiler v" << major << "." << minor << "." 
<< patch << " (" + << runtime_version << ")"; + + std::clog << info.str() << std::endl; + + std::ostream* output_stream = nullptr; + std::string filename = "pc_sampling.log"; + if(auto* outfile = getenv("ROCPROFILER_SAMPLE_OUTPUT_FILE"); outfile) filename = outfile; + if(filename == "stdout") + output_stream = &std::cout; + else if(filename == "stderr") + output_stream = &std::cerr; + else + output_stream = new std::ofstream{filename}; + + client::utils::get_output_stream() = output_stream; + + // create configure data + static auto cfg = + rocprofiler_tool_configure_result_t{sizeof(rocprofiler_tool_configure_result_t), + &client::tool_init, + &client::tool_fini, + static_cast<void*>(output_stream)}; + + // return pointer to configure data + return &cfg; +} diff --git a/samples/pc_sampling/main.cpp b/samples/pc_sampling/main.cpp new file mode 100644 index 0000000000..d78efed384 --- /dev/null +++ b/samples/pc_sampling/main.cpp @@ -0,0 +1,412 @@ +// MIT License +// +// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in +// all copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +// THE SOFTWARE. + +#include "hip/hip_runtime.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#define HIP_API_CALL(CALL) \ + { \ + hipError_t error_ = (CALL); \ + if(error_ != hipSuccess) \ + { \ + auto _hip_api_print_lk = auto_lock_t{print_lock}; \ + fprintf(stderr, \ + "%s:%d :: HIP error : %s\n", \ + __FILE__, \ + __LINE__, \ + hipGetErrorString(error_)); \ + throw std::runtime_error("hip_api_call"); \ + } \ + } + +namespace +{ +using auto_lock_t = std::unique_lock<std::mutex>; +auto print_lock = std::mutex{}; +size_t nthread_per_device = 2; +size_t nitr = 500; +size_t nsync = 10; +constexpr unsigned shared_mem_tile_dim = 32; + +void +check_hip_error(void); + +void +verify(int* in, int* out, int M, int N); +} // namespace + +__global__ void +transpose(const int* in, int* out, int M, int N); + +void +run(int rank, int tid, int devid, int argc, char** argv); + +void +run_transpose(int rank, int tid, hipStream_t stream, int argc, char** argv); + +void +run_migrate(int rank, int tid, hipStream_t stream, int, char** argv); + +void +run_scratch(int rank, int tid, hipStream_t stream, int argc, char** argv); + +int +main(int argc, char** argv) +{ + auto* exe_name = ::basename(argv[0]); + + int rank = 0; + for(int i = 1; i < argc; ++i) + { + auto _arg = std::string{argv[i]}; + if(_arg == "?" 
|| _arg == "-h" || _arg == "--help") + { + fprintf(stderr, + "usage: %s [NUM_THREADS_PER_DEVICE (%zu)] [NUM_ITERATION (%zu)] " + "[SYNC_EVERY_N_ITERATIONS (%zu)]\n", + exe_name, + nthread_per_device, + nitr, + nsync); + exit(EXIT_SUCCESS); + } + } + if(argc > 1) nthread_per_device = atoll(argv[1]); + if(argc > 2) nitr = atoll(argv[2]); + if(argc > 3) nsync = atoll(argv[3]); + + int ndevice = 0; + HIP_API_CALL(hipGetDeviceCount(&ndevice)); + + auto nthreads = (ndevice * nthread_per_device); + + printf("[%s] Number of devices found: %i\n", exe_name, ndevice); + printf("[%s] Number of threads (per device): %zu\n", exe_name, nthread_per_device); + printf("[%s] Number of threads (total): %zu\n", exe_name, nthreads); + printf("[%s] Number of iterations: %zu\n", exe_name, nitr); + printf("[%s] Syncing every %zu iterations\n", exe_name, nsync); + + { + auto _threads = std::vector<std::thread>{}; + for(size_t i = 0; i < nthreads; ++i) + _threads.emplace_back(run, rank, i, i % ndevice, argc, argv); + for(auto& itr : _threads) + itr.join(); + } + + HIP_API_CALL(hipDeviceSynchronize()); + HIP_API_CALL(hipDeviceReset()); + + return 0; +} + +__global__ void +transpose(const int* in, int* out, int M, int N) +{ + __shared__ int tile[shared_mem_tile_dim][shared_mem_tile_dim]; + + int idx = (blockIdx.y * blockDim.y + threadIdx.y) * M + blockIdx.x * blockDim.x + threadIdx.x; + tile[threadIdx.y][threadIdx.x] = in[idx]; + __syncthreads(); + idx = (blockIdx.x * blockDim.x + threadIdx.y) * N + blockIdx.y * blockDim.y + threadIdx.x; + out[idx] = tile[threadIdx.x][threadIdx.y]; +} + +template <typename Tp> +__global__ void +test_page_migrate(Tp* data, Tp val) +{ + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + data[idx] += val; +} + +__global__ void +test_kern_large(uint64_t* output) +{ + uint64_t result = 0; + int test[4000]; + memset(test, 5, 4000); + for(int& i : test) + { + i = i + 7; + *output += i; + result += i; + } + *output ^= result; + *output ^= result; +} + +__global__ void 
+test_kern_medium(uint64_t* output) +{ + uint64_t result = 0; + int test[175]; + memset(test, 5, 175); + for(int& i : test) + { + i = i + 7; + *output += i; + result += i; + } + *output ^= result; + *output ^= result; +} + +__global__ void +test_kern_small(uint64_t* output) +{ + uint64_t result = 0; + int test[2]; + for(int& i : test) + { + i = i + 7; + *output += i; + result += i; + } + *output ^= result; + *output ^= result; +} + +void +run(int rank, int tid, int devid, int argc, char** argv) +{ + auto* stream = hipStream_t{}; + HIP_API_CALL(hipSetDevice(devid)); + HIP_API_CALL(hipStreamCreate(&stream)); + + run_migrate(rank, tid, stream, argc, argv); + run_scratch(rank, tid, stream, argc, argv); + run_transpose(rank, tid, stream, argc, argv); + + HIP_API_CALL(hipStreamSynchronize(stream)); + HIP_API_CALL(hipStreamDestroy(stream)); +} + +void +run_transpose(int rank, int tid, hipStream_t stream, int argc, char** argv) +{ + auto* exe_name = ::basename(argv[0]); + + unsigned int M = 4960 * 2; + unsigned int N = 4960 * 2; + if(argc > 2) nitr = atoll(argv[2]); + if(argc > 3) nsync = atoll(argv[3]); + + auto_lock_t _lk{print_lock}; + std::cout << "[" << exe_name << "][transpose][" << rank << "][" << tid << "] M: " << M + << " N: " << N << std::endl; + _lk.unlock(); + + std::default_random_engine _engine{std::random_device{}() * (rank + 1) * (tid + 1)}; + std::uniform_int_distribution _dist{0, 1000}; + + size_t size = sizeof(int) * M * N; + int* inp_matrix = new int[size]; + int* out_matrix = new int[size]; + for(size_t i = 0; i < M * N; i++) + { + inp_matrix[i] = _dist(_engine); + out_matrix[i] = 0; + } + int* in = nullptr; + int* out = nullptr; + + HIP_API_CALL(hipMalloc(&in, size)); + HIP_API_CALL(hipMalloc(&out, size)); + HIP_API_CALL(hipMemsetAsync(in, 0, size, stream)); + HIP_API_CALL(hipMemsetAsync(out, 0, size, stream)); + HIP_API_CALL(hipMemcpyAsync(in, inp_matrix, size, hipMemcpyHostToDevice, stream)); + HIP_API_CALL(hipStreamSynchronize(stream)); + + dim3 
grid(M / 32, N / 32, 1); + dim3 block(32, 32, 1); // transpose + + print_lock.lock(); + printf("[%s][transpose][%i][%i] grid=(%i,%i,%i), block=(%i,%i,%i)\n", + exe_name, + rank, + tid, + grid.x, + grid.y, + grid.z, + block.x, + block.y, + block.z); + print_lock.unlock(); + + auto t1 = std::chrono::high_resolution_clock::now(); + for(size_t i = 0; i < nitr; ++i) + { + transpose<<<grid, block, 0, stream>>>(in, out, M, N); + check_hip_error(); + if(i % nsync == (nsync - 1)) HIP_API_CALL(hipStreamSynchronize(stream)); + } + auto t2 = std::chrono::high_resolution_clock::now(); + HIP_API_CALL(hipStreamSynchronize(stream)); + HIP_API_CALL(hipMemcpyAsync(out_matrix, out, size, hipMemcpyDeviceToHost, stream)); + double time = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count(); + float GB = (float) size * nitr * 2 / (1 << 30); + + print_lock.lock(); + std::cout << "[" << exe_name << "][transpose][" << rank << "][" << tid + << "] Runtime of transpose is " << time << " sec\n"; + std::cout << "[" << exe_name << "][transpose][" << rank << "][" << tid + << "] The average performance of transpose is " << GB / time << " GBytes/sec" + << std::endl; + print_lock.unlock(); + + HIP_API_CALL(hipStreamSynchronize(stream)); + + // cpu_transpose(matrix, out_matrix, M, N); + verify(inp_matrix, out_matrix, M, N); + + HIP_API_CALL(hipFree(in)); + HIP_API_CALL(hipFree(out)); + + delete[] inp_matrix; + delete[] out_matrix; +} + +void +run_scratch(int rank, int tid, hipStream_t stream, int, char** argv) +{ + auto t1 = std::chrono::high_resolution_clock::now(); + + HIP_API_CALL(hipStreamSynchronize(stream)); + + const auto* exe_name = ::basename(argv[0]); + + uint64_t* data_ptr = nullptr; + HIP_API_CALL(hipHostMalloc(&data_ptr, sizeof(uint64_t), 0)); + *data_ptr = 0; + + test_kern_small<<<1000, 1, 0, stream>>>(data_ptr); + test_kern_medium<<<1000, 1, 0, stream>>>(data_ptr); + test_kern_small<<<1000, 1, 0, stream>>>(data_ptr); + test_kern_large<<<1100, 1, 0, stream>>>(data_ptr); + HIP_API_CALL(hipStreamSynchronize(stream)); + 
+ + test_kern_small<<<1000, 1, 0, stream>>>(data_ptr); + HIP_API_CALL(hipStreamSynchronize(stream)); + + test_kern_medium<<<1000, 1, 0, stream>>>(data_ptr); + HIP_API_CALL(hipStreamSynchronize(stream)); + + test_kern_small<<<1000, 1, 0, stream>>>(data_ptr); + HIP_API_CALL(hipStreamSynchronize(stream)); + + test_kern_large<<<1100, 1, 0, stream>>>(data_ptr); + HIP_API_CALL(hipStreamSynchronize(stream)); + + auto t2 = std::chrono::high_resolution_clock::now(); + double time = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count(); + + print_lock.lock(); + std::cout << "[" << exe_name << "][scratch][" << rank << "][" << tid + << "] Runtime of scratch is " << time << " sec\n"; + print_lock.unlock(); +} + +void +run_migrate(int rank, int tid, hipStream_t stream, int, char** argv) +{ + using data_type = uint64_t; + constexpr data_type init_v = 1; + constexpr data_type incr_v = 1; + + auto t1 = std::chrono::high_resolution_clock::now(); + + HIP_API_CALL(hipStreamSynchronize(stream)); + + const auto* exe_name = ::basename(argv[0]); + auto page_data = std::vector<data_type>(1024, 0); + + HIP_API_CALL(hipHostRegister( + page_data.data(), page_data.size() * sizeof(data_type), hipHostRegisterDefault)); + + for(auto& itr : page_data) + itr = init_v; + + test_page_migrate<<<1, 1024, 0, stream>>>(page_data.data(), incr_v); + + HIP_API_CALL(hipStreamSynchronize(stream)); + + for(auto& itr : page_data) + { + auto diff = (itr - incr_v); + if(diff != init_v) + { + auto msg = std::stringstream{}; + msg << "invalid diff: " << diff << ". 
expected: " << init_v; + throw std::runtime_error{msg.str()}; + } + } + + HIP_API_CALL(hipHostUnregister(page_data.data())); + + auto t2 = std::chrono::high_resolution_clock::now(); + double time = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count(); + + print_lock.lock(); + std::cout << "[" << exe_name << "][migrate][" << rank << "][" << tid + << "] Runtime of migrate is " << time << " sec\n"; + print_lock.unlock(); +} + +namespace +{ +void +check_hip_error(void) +{ + hipError_t err = hipGetLastError(); + if(err != hipSuccess) + { + auto_lock_t _lk{print_lock}; + std::cerr << "Error: " << hipGetErrorString(err) << std::endl; + throw std::runtime_error("hip_api_call"); + } +} + +void +verify(int* in, int* out, int M, int N) +{ + for(int i = 0; i < 10; i++) + { + int row = rand() % M; + int col = rand() % N; + if(in[row * N + col] != out[col * M + row]) + { + auto_lock_t _lk{print_lock}; + std::cout << "mismatch: " << row << ", " << col << " : " << in[row * N + col] << " | " + << out[col * M + row] << "\n"; + } + } +} +} // namespace diff --git a/samples/pc_sampling/pcs.cpp b/samples/pc_sampling/pcs.cpp new file mode 100644 index 0000000000..6ec6369526 --- /dev/null +++ b/samples/pc_sampling/pcs.cpp @@ -0,0 +1,317 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. 
+// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +// undefine NDEBUG so asserts are implemented +#ifdef NDEBUG +# undef NDEBUG +#endif + +#include "pcs.hpp" +#include "utils.hpp" + +#include "common/defines.hpp" + +#include +#include +#include +#include +#include + +namespace client +{ +namespace pcs +{ +tool_agent_info_vec_t gpu_agents; +std::vector<rocprofiler_buffer_id_t> buffer_ids; + +rocprofiler_status_t +find_all_gpu_agents_supporting_pc_sampling_impl(rocprofiler_agent_version_t version, + const void** agents, + size_t num_agents, + void* user_data) +{ + assert(version == ROCPROFILER_AGENT_INFO_VERSION_0); + // user_data represents the pointer to the vector where the GPU agents will be stored + if(!user_data) return ROCPROFILER_STATUS_ERROR; + + std::stringstream ss; + + auto* _out_agents = static_cast<tool_agent_info_vec_t*>(user_data); + auto* _agents = reinterpret_cast<const rocprofiler_agent_t**>(agents); + for(size_t i = 0; i < num_agents; i++) + { + if(_agents[i]->type == ROCPROFILER_AGENT_TYPE_GPU) + { + // Instantiate the tool_agent_info. + // Store pointer to the rocprofiler_agent_t and instantiate a vector of + // available configurations. + // Move the ownership to the _out_agents + auto tool_gpu_agent = std::make_unique<tool_agent_info>(); + tool_gpu_agent->agent_id = _agents[i]->id; + tool_gpu_agent->avail_configs = std::make_unique<avail_configs_vec_t>(); + tool_gpu_agent->agent = _agents[i]; + // Check if the GPU agent supports PC sampling. If so, add it to the + // output list `_out_agents`. 
+ if(query_avail_configs_for_agent(tool_gpu_agent.get())) + _out_agents->push_back(std::move(tool_gpu_agent)); + } + + ss << "[" << __FUNCTION__ << "] " << _agents[i]->name << " :: " + << "id=" << _agents[i]->id.handle << ", " + << "type=" << _agents[i]->type << "\n"; + } + + *utils::get_output_stream() << ss.str() << std::endl; + + return ROCPROFILER_STATUS_SUCCESS; +} + +void +find_all_gpu_agents_supporting_pc_sampling() +{ + // This function finds all GPU agents supporting some kind of PC sampling + ROCPROFILER_CHECK( + rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0, + &find_all_gpu_agents_supporting_pc_sampling_impl, + sizeof(rocprofiler_agent_t), + static_cast<void*>(&gpu_agents))); +} + +/** + * @brief The function queries available PC sampling configurations. + * If there is at least one available configuration, it returns true. + * Otherwise, this function returns false to indicate the agent does + * not support PC sampling. + */ +bool +query_avail_configs_for_agent(tool_agent_info* agent_info) +{ + // Clear the available configurations vector + agent_info->avail_configs->clear(); + + auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + auto* avail_configs = static_cast<avail_configs_vec_t*>(user_data); + for(size_t i = 0; i < num_config; i++) + { + avail_configs->emplace_back(configs[i]); + } + return ROCPROFILER_STATUS_SUCCESS; + }; + + auto status = rocprofiler_query_pc_sampling_agent_configurations( + agent_info->agent_id, cb, agent_info->avail_configs.get()); + + std::stringstream ss; + + if(status != ROCPROFILER_STATUS_SUCCESS) + { + // The query operation failed, so consider PC sampling unsupported on the agent. + // This can happen if the PC sampling service is invoked within the ROCgdb. 
+ ss << "Querying PC sampling capabilities failed with status: " << status << std::endl; + *utils::get_output_stream() << ss.str() << std::endl; + return false; + } + else if(agent_info->avail_configs->size() == 0) + { + // No available configuration at the moment, so mark the PC sampling as unsupported. + return false; + } + + ss << "The agent with the id: " << agent_info->agent_id.handle << " supports the " + << agent_info->avail_configs->size() << " configurations: " << std::endl; + size_t ind = 0; + for(auto& cfg : *agent_info->avail_configs) + { + ss << "(" << ++ind << ".) " + << "method: " << cfg.method << ", " + << "unit: " << cfg.unit << ", " + << "min_interval: " << cfg.min_interval << ", " + << "max_interval: " << cfg.max_interval << ", " + << "flags: " << std::hex << cfg.flags << std::dec << std::endl; + } + + *utils::get_output_stream() << ss.str() << std::flush; + + return true; +} + +void +configure_pc_sampling_prefer_stochastic(tool_agent_info* agent_info, + rocprofiler_context_id_t context_id, + rocprofiler_buffer_id_t buffer_id) +{ + int failures = 10; + size_t interval = 0; + do + { + // Update the list of available configurations + auto success = query_avail_configs_for_agent(agent_info); + if(!success) + { + // An error occurred while querying PC sampling capabilities, + // so do not try to configure the PC sampling service. + // Instead, report the failure. + ROCPROFILER_CHECK(ROCPROFILER_STATUS_ERROR); + } + + const rocprofiler_pc_sampling_configuration_t* first_host_trap_config = nullptr; + const rocprofiler_pc_sampling_configuration_t* first_stochastic_config = nullptr; + // Search until encountering the first stochastic configuration, if any. 
+ // Otherwise, use the host trap config + for(auto const& cfg : *agent_info->avail_configs) + { + if(cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC) + { + first_stochastic_config = &cfg; + break; + } + else if(!first_host_trap_config && + cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP) + { + first_host_trap_config = &cfg; + } + } + + // Check if the stochastic config is found. Use host trap config otherwise. + const rocprofiler_pc_sampling_configuration_t* picked_cfg = + (first_stochastic_config != nullptr) ? first_stochastic_config : first_host_trap_config; + + if(picked_cfg->min_interval == picked_cfg->max_interval) + { + // Another process already configured PC sampling, so use the interval it set up. + interval = picked_cfg->min_interval; + } + else + { + interval = 10000; + } + + auto status = rocprofiler_configure_pc_sampling_service(context_id, + agent_info->agent_id, + picked_cfg->method, + picked_cfg->unit, + interval, + buffer_id); + if(status == ROCPROFILER_STATUS_SUCCESS) + { + *utils::get_output_stream() + << ">>> We chose PC sampling interval: " << interval + << " on the agent: " << agent_info->agent->id.handle << std::endl; + return; + } + else if(status != ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE) + { + ROCPROFILER_CHECK(status); + } + // status == ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE + // means another process P2 already configured PC sampling. + // Query available configurations again and receive the configurations picked by P2. + // However, if P2 destroys PC sampling service after query function finished, + // but before the `rocprofiler_configure_pc_sampling_service` is called, + // then the `rocprofiler_configure_pc_sampling_service` will fail again. 
+ // The process P1 executing this loop can spin wait (starve) if it is unlucky enough + // to always be interrupted by some other process P2 that creates/destroys + // PC sampling service on the same device while P1 is executing the code + // after the `query_avail_configs_for_agent` and + // before the `rocprofiler_configure_pc_sampling_service`. + // This should happen very rarely, but just to be sure, we introduce a counter `failures` + // that allows process P1 a limited number of failures. + } while(--failures); + + // The process failed too many times configuring PC sampling, + // report this to the user. + ROCPROFILER_CHECK(ROCPROFILER_STATUS_ERROR); +} + +void +rocprofiler_pc_sampling_callback(rocprofiler_context_id_t /*context_id*/, + rocprofiler_buffer_id_t /*buffer_id*/, + rocprofiler_record_header_t** headers, + size_t num_headers, + void* /*data*/, + uint64_t drop_count) +{ + std::stringstream ss; + ss << "The number of delivered samples is: " << num_headers << ", " + << "while the number of dropped samples is: " << drop_count << std::endl; + + for(size_t i = 0; i < num_headers; i++) + { + auto* cur_header = headers[i]; + + if(cur_header == nullptr) + { + throw std::runtime_error{ + "rocprofiler provided a null pointer to header. 
this should never happen"}; + } + else if(cur_header->hash != + rocprofiler_record_header_compute_hash(cur_header->category, cur_header->kind)) + { + throw std::runtime_error{"rocprofiler_record_header_t (category | kind) != hash"}; + } + else if(cur_header->category == ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING) + { + if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_SAMPLE) + { + auto* pc_sample = + static_cast<rocprofiler_pc_sampling_record_t*>(cur_header->payload); + + ss << "pc: " << std::hex << pc_sample->pc << ", " + << "timestamp: " << std::dec << pc_sample->timestamp << ", " + << "exec: " << std::hex << std::setw(16) << pc_sample->exec_mask << ", " + << "workgroup_id_(x=" << std::dec << std::setw(5) << pc_sample->workgroup_id.x + << ", " + << "y=" << std::setw(5) << pc_sample->workgroup_id.y << ", " + << "z=" << std::setw(5) << pc_sample->workgroup_id.z << "), " + << "wave_id: " << std::setw(2) << static_cast<unsigned int>(pc_sample->wave_id) + << ", " + << "cu_id: " << pc_sample->hw_id << ", " + << "correlation: {internal=" << std::setw(7) + << pc_sample->correlation_id.internal << ", " + << "external=" << std::setw(5) << pc_sample->correlation_id.external.value << "}" + << std::endl; + } + else if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_LOAD_MARKER) + { + auto* marker = static_cast<rocprofiler_pc_sampling_code_object_load_marker_t*>( + cur_header->payload); + ss << "code object loading: " << marker->code_object_id << std::endl; + } + else if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_UNLOAD_MARKER) + { + auto* marker = static_cast<rocprofiler_pc_sampling_code_object_unload_marker_t*>( + cur_header->payload); + ss << "code object unloading: " << marker->code_object_id << std::endl; + } + } + else + { + throw std::runtime_error{"unexpected rocprofiler_record_header_t category + kind"}; + } + } + + *utils::get_output_stream() << ss.str() << std::endl; +} +} // namespace pcs +} // namespace client diff --git a/samples/pc_sampling/pcs.hpp b/samples/pc_sampling/pcs.hpp new file mode 100644 index 0000000000..6ddb8851dd --- /dev/null +++ b/samples/pc_sampling/pcs.hpp @@ 
-0,0 +1,79 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#pragma once + +#include +#include + +#include + +namespace client +{ +namespace pcs +{ +constexpr size_t BUFFER_SIZE_BYTES = 8192; +constexpr size_t WATERMARK = (BUFFER_SIZE_BYTES / 4); + +struct tool_agent_info; +using avail_configs_vec_t = std::vector<rocprofiler_pc_sampling_configuration_t>; +using tool_agent_info_vec_t = std::vector<std::unique_ptr<tool_agent_info>>; + +struct tool_agent_info +{ + rocprofiler_agent_id_t agent_id; + std::unique_ptr<avail_configs_vec_t> avail_configs; + const rocprofiler_agent_t* agent; +}; + +// GPU agents supporting some kind of PC sampling. +// Note that for some of these agents, the corresponding context might be invalid, +// meaning we were not able to enable PC sampling service. +// Check the `tool_init` for more information. 
+extern tool_agent_info_vec_t gpu_agents; +// Ids of the buffers used as containers for PC sampling records +extern std::vector<rocprofiler_buffer_id_t> buffer_ids; + +void +find_all_gpu_agents_supporting_pc_sampling(); + +/** + * @brief The return value indicates if the agent supports PC sampling. + * Check the implementation for more info. + */ +bool +query_avail_configs_for_agent(tool_agent_info* agent_info); + +void +configure_pc_sampling_prefer_stochastic(tool_agent_info* agent_info, + rocprofiler_context_id_t context_id, + rocprofiler_buffer_id_t buffer_id); + +void +rocprofiler_pc_sampling_callback(rocprofiler_context_id_t context_id, + rocprofiler_buffer_id_t buffer_id, + rocprofiler_record_header_t** headers, + size_t num_headers, + void* data, + uint64_t drop_count); +} // namespace pcs +} // namespace client diff --git a/samples/pc_sampling/utils.cpp b/samples/pc_sampling/utils.cpp new file mode 100644 index 0000000000..958ca8e17f --- /dev/null +++ b/samples/pc_sampling/utils.cpp @@ -0,0 +1,37 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include "utils.hpp" + +namespace client +{ +namespace utils +{ +std::ostream*& +get_output_stream() +{ + // The output stream is initially uninitialized + static std::ostream* _v = nullptr; + return _v; +} +} // namespace utils +} // namespace client diff --git a/samples/pc_sampling/utils.hpp b/samples/pc_sampling/utils.hpp new file mode 100644 index 0000000000..1eedd6bd6b --- /dev/null +++ b/samples/pc_sampling/utils.hpp @@ -0,0 +1,36 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#pragma once + +#include + +#include + +namespace client +{ +namespace utils +{ +std::ostream*& +get_output_stream(); +} +} // namespace client diff --git a/source/include/rocprofiler-sdk/agent.h b/source/include/rocprofiler-sdk/agent.h index 7f56eca59e..641ed793bf 100644 --- a/source/include/rocprofiler-sdk/agent.h +++ b/source/include/rocprofiler-sdk/agent.h @@ -187,17 +187,10 @@ typedef struct rocprofiler_agent_v0_t const char* vendor_name; ///< Vendor of agent (will be AMD) const char* product_name; ///< Marketing name const char* model_name; ///< GPU only. Will be something like vega20, mi200, etc. - uint64_t num_pc_sampling_configs; ///< GPU only. Number of PC sampling modes available for this - ///< device type. Note: if another process is currently using - ///< PC sampling on this agent, this value will be zero so - ///< do not assume the number of PC sampling configurations - ///< based on the device type. - const rocprofiler_pc_sampling_configuration_t* - pc_sampling_configs; ///< GPU only. Array of PC sampling configuration types. - uint32_t node_id; ///< Node sequence number. This will be equivalent to the HSA-runtime - ///< HSA_AMD_AGENT_INFO_DRIVER_NODE_ID property - int32_t logical_node_id; ///< Logical sequence number. This will always be [0..N) where N is - ///< the total number of agents + uint32_t node_id; ///< Node sequence number. This will be equivalent to the HSA-runtime + ///< HSA_AMD_AGENT_INFO_DRIVER_NODE_ID property + int32_t logical_node_id; ///< Logical sequence number. 
This will always be [0..N) where N is + ///< the total number of agents } rocprofiler_agent_v0_t; typedef rocprofiler_agent_v0_t rocprofiler_agent_t; diff --git a/source/include/rocprofiler-sdk/cxx/serialization.hpp b/source/include/rocprofiler-sdk/cxx/serialization.hpp index 16523adc55..4fd458db22 100644 --- a/source/include/rocprofiler-sdk/cxx/serialization.hpp +++ b/source/include/rocprofiler-sdk/cxx/serialization.hpp @@ -720,7 +720,6 @@ save(ArchiveT& ar, const rocprofiler_agent_v0_t& data) ROCP_SDK_SAVE_DATA_CSTR(vendor_name); ROCP_SDK_SAVE_DATA_CSTR(product_name); ROCP_SDK_SAVE_DATA_CSTR(model_name); - ROCP_SDK_SAVE_DATA_FIELD(num_pc_sampling_configs); ROCP_SDK_SAVE_DATA_FIELD(node_id); ROCP_SDK_SAVE_DATA_FIELD(logical_node_id); diff --git a/source/include/rocprofiler-sdk/fwd.h b/source/include/rocprofiler-sdk/fwd.h index 2e8a5b22d6..fd8c159b06 100644 --- a/source/include/rocprofiler-sdk/fwd.h +++ b/source/include/rocprofiler-sdk/fwd.h @@ -102,6 +102,9 @@ typedef enum // NOLINT(performance-enum-size) ROCPROFILER_STATUS_ERROR_NO_PROFILE_QUEUE, ///< Profile queue creation failed ROCPROFILER_STATUS_ERROR_NO_HARDWARE_COUNTERS, ///< No hardware counters were specified ROCPROFILER_STATUS_ERROR_AGENT_MISMATCH, ///< Agent mismatch between profile and context. + ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE, ///< The service is not available. + ///< Please refer to API functions that return this + ///< status code for more information. 
ROCPROFILER_STATUS_LAST, } rocprofiler_status_t; @@ -400,6 +403,19 @@ typedef enum ROCPROFILER_COUNTER_FLAG_LAST, } rocprofiler_counter_flag_t; +/** + * @brief Enumeration for distinguishing different buffer record kinds within the + * ::ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING category + */ +typedef enum +{ + ROCPROFILER_PC_SAMPLING_RECORD_NONE = 0, + ROCPROFILER_PC_SAMPLING_RECORD_SAMPLE, ///< ::rocprofiler_pc_sampling_record_t + ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_LOAD_MARKER, ///< ::rocprofiler_pc_sampling_code_object_load_marker_t + ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_UNLOAD_MARKER, ///< ::rocprofiler_pc_sampling_code_object_unload_marker_t + ROCPROFILER_PC_SAMPLING_RECORD_LAST, +} rocprofiler_pc_sampling_record_kind_t; + //--------------------------------------------------------------------------------------// // // ALIASES @@ -442,10 +458,6 @@ typedef uint64_t rocprofiler_kernel_id_t; // */ typedef uint64_t rocprofiler_dispatch_id_t; -// forward declaration of struct -typedef struct rocprofiler_pc_sampling_configuration_s rocprofiler_pc_sampling_configuration_t; -typedef struct rocprofiler_pc_sampling_record_s rocprofiler_pc_sampling_record_t; - /** * @brief Unique record id encoding both the counter * and dimensional values (positions) for the record. diff --git a/source/include/rocprofiler-sdk/pc_sampling.h b/source/include/rocprofiler-sdk/pc_sampling.h index 2dabdc77b7..74a6a7252f 100644 --- a/source/include/rocprofiler-sdk/pc_sampling.h +++ b/source/include/rocprofiler-sdk/pc_sampling.h @@ -38,53 +38,57 @@ ROCPROFILER_EXTERN_C_INIT * @brief Function used to configure the PC sampling service on the GPU agent with @p agent_id. * * Prerequisites are the following: - * - The user must create a context and supply its @p context_id. By using this context, - * - The user must create a context and supply its @p context_id. By using this context, - * the user can start/stop PC sampling on the agent. 
For more information, - * please @see `rocprofiler_start_context`/`rocprofiler_stop_context`. - * - The user must create a buffer and supply its @p buffer_id. Rocprofiler uses the buffer - * - The user must create a buffer and supply its @p buffer_id. Rocprofiler uses the buffer - * to deliver the PC samples to the user. For more information about the data delivery, - * please @see `rocprofiler_create_buffer` and `rocprofiler_buffer_tracing_cb_t`. + * - The client must create a context and supply its @p context_id. By using this context, + * the client can start/stop PC sampling on the agent. For more information, + * please @see rocprofiler_start_context/rocprofiler_stop_context. + * - The user must create a buffer and supply its @p buffer_id. Rocprofiler-SDK uses the buffer + * to deliver the PC samples to the client. For more information about the data delivery, + * please @see rocprofiler_create_buffer and @see rocprofiler_buffer_tracing_cb_t. * * Before calling this function, we recommend querying PC sampling configurations - * supported by the GPU agent via the `rocprofiler_query_pc_sampling_agent_configurations`. - * The user then chooses the @p method, @p unit, and @p interval to match one of the - * available configurations. Note that the @p interval must belong to the range of values - * The user then chooses the @p method, @p unit, and @p interval to match one of the + * supported by the GPU agent via the @see rocprofiler_query_pc_sampling_agent_configurations. + * The client chooses the @p method, @p unit, and @p interval to match one of the * available configurations. Note that the @p interval must belong to the range of values * [available_config.min_interval, available_config.max_interval], - * where available_config is the instance of the `rocprofiler_pc_sampling_configuration_s` - * supported at the moment. + * where available_config is the instance of the @see rocprofiler_pc_sampling_configuration_s + * supported/available at the moment. 
* - * Rocprofiler checks whether the requsted configuration is actually supported + * Rocprofiler-SDK checks whether the requested configuration is actually supported * at the moment of calling this function. If the answer is yes, it returns - * the ROCPROFILER_STATUS_SUCCESS. Otherwise, notifies the caller about the + * the @see ROCPROFILER_STATUS_SUCCESS. Otherwise, it notifies the client about the * rejection reason via the returned status code. For more information * about the status codes, please @see rocprofiler_status_t. * + * There are a few constraints a client's code needs to be aware of. + * * Constraint1: A GPU agent can be configured to support at most one running PC sampling * configuration at any time, which implies some of the consequences described below. * After the tool configures the PC sampling with one of the available configurations, - * rocprofiler guarantees that this configuration will be valid for the tool's + * rocprofiler-SDK guarantees that this configuration will be valid for the tool's * lifetime. The tool can start and stop the configured PC sampling service whenever convenient. * * Constraint2: Since the same GPU agent can be used by multiple processes concurrently, - * Rocprofiler cannot guarantee the exclusive access to the PC sampling capability. + * Rocprofiler-SDK cannot guarantee the exclusive access to the PC sampling capability. * The consequence is the following scenario. The tool TA that belongs to the process PA, - * calls the `rocprofiler_query_pc_sampling_agent_configurations` that returns the - * two supported configurations CA and CB by the agent. Then the toolb TB of the process PB, + * calls the @see rocprofiler_query_pc_sampling_agent_configurations that returns the + * two supported configurations CA and CB by the agent. Then the tool TB of the process PB, * configures the PC sampling on the same agent by using the configuration CB. * Subsequently, the TA tries configuring the CA on the agent, and it fails.
- * To point out that this case happened, we introduce a special status code (TODO: ARE WE)? - * When this status code is observed by the tool TA, it queties all available configurations again - * by calling `rocprofiler_query_pc_sampling_agent_configurations`, + * To point out that this case happened, we introduce a special status code + * @see ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE. + * When this status code is observed by the tool TA, it queries all available configurations again + * by calling @see rocprofiler_query_pc_sampling_agent_configurations, * that returns only CB this time. The tool TA can choose CB, so that both * TA and TB use the PC sampling capability in the separate processes. + * Both TA and TB receive samples generated by the kernels launched by the + * corresponding processes PA and PB, respectively. * - * Constraints3: We allow only one context to contain the configured PC sampling service - * within the process, that implies that at most one of the loaded tools can use PC sampling. - * One context can contains multiple PC sampling services configured for different GPU agents. + * Constraint3: Rocprofiler-SDK allows only one context to contain the configured PC sampling + * service within the process, which implies that at most one of the loaded tools can use PC + * sampling. One context can contain multiple PC sampling services configured for different GPU + * agents. + * + * Constraint4: PC sampling feature is not available within the ROCgdb.
* * @param [in] context_id - id of the context used for starting/stopping PC sampling service * @param [in] agent_id - id of the agent on which caller tries using PC sampling capability @@ -93,6 +97,14 @@ ROCPROFILER_EXTERN_C_INIT * @param [in] interval - frequency at which PC samples are generated * @param [in] buffer_id - id of the buffer used for delivering PC samples * @return ::rocprofiler_status_t + * @retval ::ROCPROFILER_STATUS_SUCCESS PC sampling service configured successfully + * @retval ::ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE One of the scenarios is present: + * 1. PC sampling is already configured with configuration different than requested, + * 2. PC sampling is requested from a process that runs within the ROCgdb. + * 3. HSA runtime does not support PC sampling. + * @retval ::ROCPROFILER_STATUS_ERROR_INCOMPATIBLE_KERNEL the amdgpu driver installed on the system + * does not support the PC sampling feature + * @retval ::ROCPROFILER_STATUS_ERROR a general error caused by the amdgpu driver * */ rocprofiler_status_t ROCPROFILER_API @@ -105,45 +117,45 @@ rocprofiler_configure_pc_sampling_service(rocprofiler_context_id_t conte /** * @brief PC sampling configuration supported by a GPU agent. - * @var rocprofiler_pc_sampling_configuration_s::method - * Sampling method supported by the GPU - * agent. Currenlty, it can take one of the following two values: - * - ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP: a background host thread - * periodically interrupts waves execution on the GPU to generate PC samples - * - ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC: performance monitoring hardware - * on the GPU periodically interrupts waves to generate PC samples. - * @var rocprofiler_pc_sampling_configuration_s::unit - * A unit used to specify the period of the - * @ref method for samples generation. - * @var rocprofiler_pc_sampling_configuration_s::min_interval - * the highest possible frequencey for - * generating samples using @ref method. 
- * @var rocprofiler_pc_sampling_configuration_s::max_interval - * the lowest possible frequency for - * generating samples using @ref method - * @var rocprofiler_pc_sampling_configuration_s::flags - * TODO: ??? */ -struct rocprofiler_pc_sampling_configuration_s +typedef struct { + uint64_t size; ///< Size of this struct rocprofiler_pc_sampling_method_t method; rocprofiler_pc_sampling_unit_t unit; size_t min_interval; size_t max_interval; - uint64_t flags; -}; + uint64_t flags; /// for future use + + /// @var method + /// @brief Sampling method supported by the GPU agent. + /// Currently, it can take one of the following two values: + /// - ::ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP: a background host kernel thread + /// periodically interrupts waves execution on the GPU to generate PC samples + /// - ::ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC: performance monitoring hardware + /// on the GPU periodically interrupts waves to generate PC samples. + /// @var unit + /// @brief A unit used to specify the interval of the @ref method for samples generation. + /// @var min_interval + /// @brief the highest possible frequency for generating samples using @ref method. + /// @var max_interval + /// @brief the lowest possible frequency for generating samples using @ref method + +} rocprofiler_pc_sampling_configuration_t; /** - * @brief The rocprofiler calls the tool's callback to deliver the list - * of available configurations upon the calls to the @ref - * rocprofiler_query_pc_sampling_agent_configurations. + * @brief Rocprofiler SDK's callback function to deliver the list of available PC + * sampling configurations upon the call to the + * @ref rocprofiler_query_pc_sampling_agent_configurations. * - * @param[out] configs - The list of PC sampling configurations supported by the agent of the - * moment of invoking @ref rocprofiler_query_pc_sampling_agent_configurations.
- * @param[out] num_config - The number of configuration contained in the underlying + * @param[out] configs - The array of PC sampling configurations supported by the agent + * at the moment of invoking @ref rocprofiler_query_pc_sampling_agent_configurations. + * @param[out] num_config - The number of configurations contained in the underlying array + * @p configs. * In case the GPU agent does not support PC sampling, the value is 0. - * @param[in] user_data - A pointer passed as the last argument of the + * @param[in] user_data - client's private data passed via * @ref rocprofiler_query_pc_sampling_agent_configurations + * @return ::rocprofiler_status_t */ typedef rocprofiler_status_t (*rocprofiler_available_pc_sampling_configurations_cb_t)( const rocprofiler_pc_sampling_configuration_t* configs, @@ -153,10 +165,24 @@ typedef rocprofiler_status_t (*rocprofiler_available_pc_sampling_configurations_ /** * @brief Query PC Sampling Configuration. * - * @param [in] agent_id - id of the agent for which available configuration will be listed + * Lists PC sampling configurations a GPU agent with @p agent_id supports at the moment + * of invoking the function. Delivers configurations via @p cb. + * In case the PC sampling is configured on the GPU agent, the @p cb delivers information + * about the active PC sampling configuration. + * In case the GPU agent does not support PC sampling capability, + * the @p cb delivers no PC sampling configurations. + * + * @param [in] agent_id - id of the agent for which available configurations will be listed * @param [in] cb - User callback that delivers the available PC sampling configurations * @param [in] user_data - passed to the @p cb * @return ::rocprofiler_status_t + * @retval ::ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE One of the scenarios is present: + * 1. PC sampling is requested from a process that runs within the ROCgdb. + * 2. HSA runtime does not support PC sampling.
+ * @retval ::ROCPROFILER_STATUS_ERROR_INCOMPATIBLE_KERNEL the amdgpu driver installed on the system + * does not support the PC sampling feature. + * @retval ::ROCPROFILER_STATUS_ERROR a general error caused by the amdgpu driver + * @retval ::ROCPROFILER_STATUS_SUCCESS @p cb successfully finished */ rocprofiler_status_t ROCPROFILER_API rocprofiler_query_pc_sampling_agent_configurations( @@ -165,36 +191,31 @@ rocprofiler_query_pc_sampling_agent_configurations( void* user_data) ROCPROFILER_NONNULL(2, 3); /** - * @brief The header of the @ref rocprofiler_pc_sampling_record_s, indicating - * what fields of the @ref rocprofiler_pc_sampling_record_s instance are meaningful + * @brief The header of the @ref rocprofiler_pc_sampling_record_t, indicating + * what fields of the @ref rocprofiler_pc_sampling_record_t instance are meaningful * for the sample. - * @var rocprofiler_pc_sampling_header_v1_t::valid - * the sample is valid - * @var rocprofiler_pc_sampling_header_v1_t::type - * The following values are possible: - * - 0 - reserved - * - 1 - host trap pc sample - * - 2 - stochastic pc sample - * - 3 - perfcounter (unsupported at the moment) - * - other values does not mean anything at the moment - * @var rocprofiler_pc_sampling_header_v1_t::has_stall_reason - * whether the sample contains - * information about the stall reason. If so, please @see rocprofiler_pc_sampling_snapshot_v1_t.
- * @var rocprofiler_pc_sampling_header_v1_t::has_wave_cnt - * whether the @ref rocprofiler_pc_sampling_record_s::wave_count contains - * meaningful value - * @var rocprofiler_pc_sampling_header_v1_t::reserved - * for future use */ typedef struct { - uint8_t valid : 1; - uint8_t type : 4; // 0=reserved, 1=hosttrap, 2=stochastic + uint8_t valid : 1; /// sample is valid + uint8_t type : 4; uint8_t has_stall_reason : 1; uint8_t has_wave_cnt : 1; - uint8_t reserved : 1; + uint8_t reserved : 1; /// for future use + + /// @var type + /// @brief The following values are possible: + /// - 0 - reserved + /// - 1 - host trap pc sample + /// - 2 - stochastic pc sample + /// - 3 - perfcounter (unsupported at the moment) + /// - other values do not mean anything at the moment + /// @var has_stall_reason + /// @brief whether the sample contains information about the stall reason. + /// If so, please @see rocprofiler_pc_sampling_snapshot_v1_t. + /// @var has_wave_cnt + /// @brief whether the @ref rocprofiler_pc_sampling_record_t::wave_count + /// contains a meaningful value } rocprofiler_pc_sampling_header_v1_t; /** @@ -213,65 +234,73 @@ typedef struct // to reduce the space needed to represent a single sample. /** * @brief ROCProfiler PC Sampling Record corresponding to the interrupted wave. - * @var rocprofiler_pc_sampling_record_s::flags - * header that indicates what fields are meaningful - * for the PC sample. The values depend on what the underlying GPU agent architecture supports. - * @var rocprofiler_pc_sampling_record_s::chiplet - * chiplet index - * @var rocprofiler_pc_sampling_record_s::wave_id - * wave identifier within the workgroup - * @var rocprofiler_pc_sampling_record_s::wave_issued - * a flags indicated whether the wave is - * issueing the instruction' represented by the @ref pc at the moment of interruption. - * @var rocprofiler_pc_sampling_record_s::reserved - * FIXME: reserved 7 bits, must be zero.
- * @var rocprofiler_pc_sampling_record_s::hw_id - * compute unit identifier - * @var rocprofiler_pc_sampling_record_s::pc - * The current program counter of the wave at the moment - * of interruption - * @var rocprofiler_pc_sampling_record_s::exec_mask - * shows how many SIMD lanes of the wave were - * executing the instruction represented by the @ref pc. Useful to understand thread-divergance - * within the wave - * @var rocprofiler_pc_sampling_record_s::workgroup_id_x - * the x coordinate of the wave within the workgroup - * @var rocprofiler_pc_sampling_record_s::workgroup_id_y - * the y coordinate of the wave within the workgroup - * @var rocprofiler_pc_sampling_record_s::workgroup_id_z - * the y coordinate of the wave within the workgroup - * @var rocprofiler_pc_sampling_record_s::wave_count - * FIXME: number of waves active at the CU at the moment of sample generation??? - * @var rocprofiler_pc_sampling_record_s::timestamp - * represents the GPU timestamp when the sample is generated - * @var rocprofiler_pc_sampling_record_s::correlation_id - * correlation id of the API call that - * initiated kernel laucnh. The interrupted wave is executed as part of the kernel. 
- * @var rocprofiler_pc_sampling_record_s::snapshot - * TODO: - * @var rocprofiler_pc_sampling_record_s::reserved2 - * for future use */ -struct rocprofiler_pc_sampling_record_s +typedef struct { - rocprofiler_pc_sampling_header_v1_t flags; - uint8_t chiplet; - uint8_t wave_id; - uint8_t wave_issued : 1; - uint8_t reserved : 7; - uint32_t hw_id; - uint64_t pc; - uint64_t exec_mask; - uint32_t workgroup_id_x; - uint32_t workgroup_id_y; - uint32_t workgroup_id_z; - uint32_t wave_count; - uint64_t timestamp; - rocprofiler_correlation_id_t correlation_id; - rocprofiler_pc_sampling_snapshot_v1_t snapshot; - uint32_t reserved2; -}; + uint64_t size; ///< Size of this struct + rocprofiler_pc_sampling_header_v1_t flags; + uint8_t chiplet; /// chiplet index + uint8_t wave_id; /// wave identifier within the workgroup + uint8_t wave_issued : 1; + uint8_t reserved : 7; /// reserved 7 bits, must be zero + uint32_t hw_id; /// compute unit identifier + uint64_t pc; /// Program counter of the wave at the moment of interruption + uint64_t exec_mask; + rocprofiler_dim3_t workgroup_id; /// wave coordinates within the workgroup + uint32_t wave_count; + uint64_t timestamp; /// timestamp when sample is generated + rocprofiler_correlation_id_t correlation_id; + rocprofiler_pc_sampling_snapshot_v1_t + snapshot; /// @see ::rocprofiler_pc_sampling_snapshot_v1_t + uint32_t reserved2; /// for future use + + /// @var flags + /// @brief indicates what fields of this struct are meaningful for the represented sample. + /// The values depend on what the underlying GPU agent architecture supports. + /// @var wave_issued + /// @brief indicates whether the wave is issuing the instruction represented by the @ref pc + /// @var exec_mask + /// @brief shows which SIMD lanes of the wave were executing the instruction + /// represented by the @ref pc.
Useful to understand thread divergence within the wave + /// @var wave_count + /// @brief number of active waves on the CU at the moment of sample generation + /// @var correlation_id + /// @brief correlation id of the API call that initiated kernel launch. + /// The interrupted wave is executed as part of the kernel. +} rocprofiler_pc_sampling_record_t; + +/** + * @brief Marker representing code object loading event. + * + * @see rocprofiler_callback_tracing_code_object_load_data_t + * for more information + */ +typedef struct +{ + uint64_t size; ///< Size of this struct + uint64_t code_object_id; /// unique code object identifier +} rocprofiler_pc_sampling_code_object_load_marker_t; + +/** + * @brief Marker representing code object unloading event. + * + * @see rocprofiler_callback_tracing_code_object_load_data_t + * for more information + */ +typedef struct +{ + uint64_t size; ///< Size of this struct + uint64_t code_object_id; /// unique code object identifier +} rocprofiler_pc_sampling_code_object_unload_marker_t; /** @} */ ROCPROFILER_EXTERN_C_FINI + +ROCPROFILER_CXX_CODE( + static_assert(sizeof(rocprofiler_pc_sampling_record_t) == 80, + "Increasing the size of the pc sampling record is not permitted.")); + +ROCPROFILER_CXX_CODE(static_assert(offsetof(rocprofiler_pc_sampling_record_t, chiplet) == 9 && + offsetof(rocprofiler_pc_sampling_record_t, reserved2) == 76, + "PC sampling record layout changed.")); diff --git a/source/lib/rocprofiler-sdk/CMakeLists.txt b/source/lib/rocprofiler-sdk/CMakeLists.txt index e63cbbcb92..4f8a53544c 100644 --- a/source/lib/rocprofiler-sdk/CMakeLists.txt +++ b/source/lib/rocprofiler-sdk/CMakeLists.txt @@ -48,6 +48,7 @@ add_subdirectory(thread_trace) add_subdirectory(tracing) add_subdirectory(kernel_dispatch) add_subdirectory(page_migration) +add_subdirectory(details) target_link_libraries( rocprofiler-object-library diff --git a/source/lib/rocprofiler-sdk/agent.cpp b/source/lib/rocprofiler-sdk/agent.cpp index
df37b3eef3..2862ed6d3f 100644 --- a/source/lib/rocprofiler-sdk/agent.cpp +++ b/source/lib/rocprofiler-sdk/agent.cpp @@ -355,12 +355,6 @@ read_property(const MapT& data, const std::string& label, Tp& value) } } -constexpr auto -compute_version(uint32_t major_v, uint32_t minor_v, uint32_t patch_v) -{ - return (major_v * 10000) + (minor_v * 100) + patch_v; -} - auto read_topology() { @@ -371,15 +365,6 @@ read_topology() throw std::runtime_error{ fmt::format("sysfs nodes path '{}' does not exist", sysfs_nodes_path.string())}; - using pc_sampling_config_vec_t = std::vector; - - static auto mi200_pc_sampling_config = pc_sampling_config_vec_t{ - rocprofiler_pc_sampling_configuration_t{ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP, - ROCPROFILER_PC_SAMPLING_UNIT_TIME, - 1UL, - 1000000000UL, - 0}}; - const auto& cpu_info_v = get_cpu_info(); auto data = std::vector{}; uint64_t idcount = 0; @@ -513,18 +498,6 @@ read_topology() } drmClose(drm_fd); } - - // TODO(jomadsen): make contingent on whether this process acquired the PC sampling - // device lock - { - constexpr auto gfx90a_version = compute_version(9, 0, 10); - - if(agent_info.gfx_target_version >= gfx90a_version) - { - agent_info.pc_sampling_configs = mi200_pc_sampling_config.data(); - agent_info.num_pc_sampling_configs = mi200_pc_sampling_config.size(); - } - } } else if(agent_info.type == ROCPROFILER_AGENT_TYPE_CPU) { diff --git a/source/lib/rocprofiler-sdk/buffer.cpp b/source/lib/rocprofiler-sdk/buffer.cpp index 8e017709e5..0cf0edf3b2 100644 --- a/source/lib/rocprofiler-sdk/buffer.cpp +++ b/source/lib/rocprofiler-sdk/buffer.cpp @@ -29,6 +29,7 @@ #include "lib/rocprofiler-sdk/context/domain.hpp" #include "lib/rocprofiler-sdk/hsa/hsa.hpp" #include "lib/rocprofiler-sdk/internal_threading.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" #include "lib/rocprofiler-sdk/registration.hpp" #include @@ -271,6 +272,10 @@ rocprofiler_create_buffer(rocprofiler_context_id_t context, rocprofiler_status_t 
rocprofiler_flush_buffer(rocprofiler_buffer_id_t buffer_id) { + // Drain internal PC sampling buffers, if needed. + auto status = rocprofiler::pc_sampling::flush_internal_agent_buffers(buffer_id); + if(status != ROCPROFILER_STATUS_SUCCESS) return status; + return rocprofiler::buffer::flush(buffer_id, true); } diff --git a/source/lib/rocprofiler-sdk/context/context.cpp b/source/lib/rocprofiler-sdk/context/context.cpp index 9ca3797f54..f0054f8a42 100644 --- a/source/lib/rocprofiler-sdk/context/context.cpp +++ b/source/lib/rocprofiler-sdk/context/context.cpp @@ -32,6 +32,7 @@ #include "lib/rocprofiler-sdk/buffer.hpp" #include "lib/rocprofiler-sdk/context/context.hpp" #include "lib/rocprofiler-sdk/counters/core.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" #include "lib/rocprofiler-sdk/thread_trace/att_core.hpp" #include @@ -322,6 +323,7 @@ start_context(rocprofiler_context_id_t context_id) if(cfg->counter_collection) rocprofiler::counters::start_context(cfg); if(cfg->thread_trace) cfg->thread_trace->start_context(); if(cfg->agent_counter_collection) status = rocprofiler::counters::start_agent_ctx(cfg); + if(cfg->pc_sampler) status = rocprofiler::pc_sampling::start_service(cfg); return status; } @@ -357,6 +359,12 @@ stop_context(rocprofiler_context_id_t idx) { rocprofiler::counters::stop_agent_ctx(const_cast(_expected)); } + + if(_expected->pc_sampler) + { + rocprofiler::pc_sampling::stop_service(_expected); + } + return ROCPROFILER_STATUS_SUCCESS; } } diff --git a/source/lib/rocprofiler-sdk/context/context.hpp b/source/lib/rocprofiler-sdk/context/context.hpp index cb5286a0d7..e23a4ba616 100644 --- a/source/lib/rocprofiler-sdk/context/context.hpp +++ b/source/lib/rocprofiler-sdk/context/context.hpp @@ -33,13 +33,16 @@ #include "lib/rocprofiler-sdk/counters/agent_profiling.hpp" #include "lib/rocprofiler-sdk/counters/core.hpp" #include "lib/rocprofiler-sdk/external_correlation.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/types.hpp" #include 
"lib/rocprofiler-sdk/thread_trace/att_core.hpp" #include "rocprofiler-sdk/agent.h" #include #include #include +#include #include +#include namespace rocprofiler { @@ -112,6 +115,16 @@ struct agent_counter_collection_service common::Synchronized enabled{false}; }; +struct pc_sampling_service +{ + // Contains a map with pairs (rocprofiler_agent_id_t, PCSAgentSession*). + // The PCSAgentSession encapsulates the information about the configured PC sampling session + // used on the agent with `rocprofiler_agent_id_t`. + std::unordered_map> + agent_sessions; +}; + struct context { // size is used to ensure that we never read past the end of the version @@ -125,6 +138,7 @@ struct context std::unique_ptr counter_collection = {}; std::unique_ptr agent_counter_collection = {}; std::shared_ptr thread_trace = {}; + std::unique_ptr pc_sampler = {}; }; // set the client index needs to be called before allocate_context() diff --git a/source/lib/rocprofiler-sdk/details/CMakeLists.txt b/source/lib/rocprofiler-sdk/details/CMakeLists.txt new file mode 100644 index 0000000000..c7886f5e26 --- /dev/null +++ b/source/lib/rocprofiler-sdk/details/CMakeLists.txt @@ -0,0 +1,8 @@ +# +# +# +set(ROCPROFILER_DETAILS_SOURCES) +set(ROCPROFILER_DETAILS_HEADERS kfd_ioctl.h) + +target_sources(rocprofiler-object-library PRIVATE ${ROCPROFILER_DETAILS_SOURCES} + ${ROCPROFILER_DETAILS_HEADERS}) diff --git a/source/lib/rocprofiler-sdk/details/kfd_ioctl.h b/source/lib/rocprofiler-sdk/details/kfd_ioctl.h new file mode 100644 index 0000000000..447a7c6f7a --- /dev/null +++ b/source/lib/rocprofiler-sdk/details/kfd_ioctl.h @@ -0,0 +1,1822 @@ +/* + * Copyright 2014 Advanced Micro Devices, Inc. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ */ + +#ifndef KFD_IOCTL_H_INCLUDED +#define KFD_IOCTL_H_INCLUDED + +#include +#include + +/* + * - 1.1 - initial version + * - 1.3 - Add SMI events support + * - 1.4 - Indicate new SRAM EDC bit in device properties + * - 1.5 - Add SVM API + * - 1.6 - Query clear flags in SVM get_attr API + * - 1.7 - Checkpoint Restore (CRIU) API + * - 1.8 - CRIU - Support for SDMA transfers with GTT BOs + * - 1.9 - Add available memory ioctl + * - 1.10 - Add SMI profiler event log + * - 1.11 - Add unified memory for ctx save/restore area + * - 1.12 - Add DMA buf export ioctl + * - 1.13 - Add debugger API + * - 1.14 - Update kfd_event_data + * - 1.15 - Enable managing mappings in compute VMs with GEM_VA ioctl + * - 1.16 - Add PC Sampling ioctl + */ +#define KFD_IOCTL_MAJOR_VERSION 1 +#define KFD_IOCTL_MINOR_VERSION 16 + +struct kfd_ioctl_get_version_args +{ + __u32 major_version; /* from KFD */ + __u32 minor_version; /* from KFD */ +}; + +/* For kfd_ioctl_create_queue_args.queue_type. */ +#define KFD_IOC_QUEUE_TYPE_COMPUTE 0x0 +#define KFD_IOC_QUEUE_TYPE_SDMA 0x1 +#define KFD_IOC_QUEUE_TYPE_COMPUTE_AQL 0x2 +#define KFD_IOC_QUEUE_TYPE_SDMA_XGMI 0x3 + +#define KFD_MAX_QUEUE_PERCENTAGE 100 +#define KFD_MAX_QUEUE_PRIORITY 15 + +struct kfd_ioctl_create_queue_args +{ + __u64 ring_base_address; /* to KFD */ + __u64 write_pointer_address; /* from KFD */ + __u64 read_pointer_address; /* from KFD */ + __u64 doorbell_offset; /* from KFD */ + + __u32 ring_size; /* to KFD */ + __u32 gpu_id; /* to KFD */ + __u32 queue_type; /* to KFD */ + __u32 queue_percentage; /* to KFD */ + __u32 queue_priority; /* to KFD */ + __u32 queue_id; /* from KFD */ + + __u64 eop_buffer_address; /* to KFD */ + __u64 eop_buffer_size; /* to KFD */ + __u64 ctx_save_restore_address; /* to KFD */ + __u32 ctx_save_restore_size; /* to KFD */ + __u32 ctl_stack_size; /* to KFD */ +}; + +struct kfd_ioctl_destroy_queue_args +{ + __u32 queue_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_update_queue_args +{ + __u64 
ring_base_address; /* to KFD */ + + __u32 queue_id; /* to KFD */ + __u32 ring_size; /* to KFD */ + __u32 queue_percentage; /* to KFD */ + __u32 queue_priority; /* to KFD */ +}; + +struct kfd_ioctl_set_cu_mask_args +{ + __u32 queue_id; /* to KFD */ + __u32 num_cu_mask; /* to KFD */ + __u64 cu_mask_ptr; /* to KFD */ +}; + +struct kfd_ioctl_get_queue_wave_state_args +{ + __u64 ctl_stack_address; /* to KFD */ + __u32 ctl_stack_used_size; /* from KFD */ + __u32 save_area_used_size; /* from KFD */ + __u32 queue_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_get_available_memory_args +{ + __u64 available; /* from KFD */ + __u32 gpu_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_dbg_device_info_entry +{ + __u64 exception_status; + __u64 lds_base; + __u64 lds_limit; + __u64 scratch_base; + __u64 scratch_limit; + __u64 gpuvm_base; + __u64 gpuvm_limit; + __u32 gpu_id; + __u32 location_id; + __u32 vendor_id; + __u32 device_id; + __u32 revision_id; + __u32 subsystem_vendor_id; + __u32 subsystem_device_id; + __u32 fw_version; + __u32 gfx_target_version; + __u32 simd_count; + __u32 max_waves_per_simd; + __u32 array_count; + __u32 simd_arrays_per_engine; + __u32 num_xcc; + __u32 capability; + __u32 debug_prop; +}; + +/* For kfd_ioctl_set_memory_policy_args.default_policy and alternate_policy */ +#define KFD_IOC_CACHE_POLICY_COHERENT 0 +#define KFD_IOC_CACHE_POLICY_NONCOHERENT 1 + +struct kfd_ioctl_set_memory_policy_args +{ + __u64 alternate_aperture_base; /* to KFD */ + __u64 alternate_aperture_size; /* to KFD */ + + __u32 gpu_id; /* to KFD */ + __u32 default_policy; /* to KFD */ + __u32 alternate_policy; /* to KFD */ + __u32 pad; +}; + +/* + * All counters are monotonic. They are used for profiling of compute jobs. + * The profiling is done by userspace. + * + * In case of GPU reset, the counter should not be affected. 
+ */ + +struct kfd_ioctl_get_clock_counters_args +{ + __u64 gpu_clock_counter; /* from KFD */ + __u64 cpu_clock_counter; /* from KFD */ + __u64 system_clock_counter; /* from KFD */ + __u64 system_clock_freq; /* from KFD */ + + __u32 gpu_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_process_device_apertures +{ + __u64 lds_base; /* from KFD */ + __u64 lds_limit; /* from KFD */ + __u64 scratch_base; /* from KFD */ + __u64 scratch_limit; /* from KFD */ + __u64 gpuvm_base; /* from KFD */ + __u64 gpuvm_limit; /* from KFD */ + __u32 gpu_id; /* from KFD */ + __u32 pad; +}; + +/* + * AMDKFD_IOC_GET_PROCESS_APERTURES is deprecated. Use + * AMDKFD_IOC_GET_PROCESS_APERTURES_NEW instead, which supports an + * unlimited number of GPUs. + */ +#define NUM_OF_SUPPORTED_GPUS 7 +struct kfd_ioctl_get_process_apertures_args +{ + struct kfd_process_device_apertures process_apertures[NUM_OF_SUPPORTED_GPUS]; /* from KFD */ + + /* from KFD, should be in the range [1 - NUM_OF_SUPPORTED_GPUS] */ + __u32 num_of_nodes; + __u32 pad; +}; + +struct kfd_ioctl_get_process_apertures_new_args +{ + /* User allocated. Pointer to struct kfd_process_device_apertures + * filled in by Kernel + */ + __u64 kfd_process_device_apertures_ptr; + /* to KFD - indicates amount of memory present in + * kfd_process_device_apertures_ptr + * from KFD - Number of entries filled by KFD. 
+ */ + __u32 num_of_nodes; + __u32 pad; +}; + +#define MAX_ALLOWED_NUM_POINTS 100 +#define MAX_ALLOWED_AW_BUFF_SIZE 4096 +#define MAX_ALLOWED_WAC_BUFF_SIZE 128 + +struct kfd_ioctl_dbg_register_args +{ + __u32 gpu_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_dbg_unregister_args +{ + __u32 gpu_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_dbg_address_watch_args +{ + __u64 content_ptr; /* a pointer to the actual content */ + __u32 gpu_id; /* to KFD */ + __u32 buf_size_in_bytes; /*including gpu_id and buf_size */ +}; + +struct kfd_ioctl_dbg_wave_control_args +{ + __u64 content_ptr; /* a pointer to the actual content */ + __u32 gpu_id; /* to KFD */ + __u32 buf_size_in_bytes; /*including gpu_id and buf_size */ +}; + +#define KFD_INVALID_FD 0xffffffff + +struct kfd_ioctl_dbg_trap_args_deprecated +{ + __u64 exception_mask; /* to KFD */ + __u64 ptr; /* to KFD -- used for pointer arguments: queue arrays */ + __u32 pid; /* to KFD */ + __u32 op; /* to KFD */ + __u32 data1; /* to KFD */ + __u32 data2; /* to KFD */ + __u32 data3; /* to KFD */ + __u32 data4; /* to KFD */ +}; + +/* Matching HSA_EVENTTYPE */ +#define KFD_IOC_EVENT_SIGNAL 0 +#define KFD_IOC_EVENT_NODECHANGE 1 +#define KFD_IOC_EVENT_DEVICESTATECHANGE 2 +#define KFD_IOC_EVENT_HW_EXCEPTION 3 +#define KFD_IOC_EVENT_SYSTEM_EVENT 4 +#define KFD_IOC_EVENT_DEBUG_EVENT 5 +#define KFD_IOC_EVENT_PROFILE_EVENT 6 +#define KFD_IOC_EVENT_QUEUE_EVENT 7 +#define KFD_IOC_EVENT_MEMORY 8 + +#define KFD_IOC_WAIT_RESULT_COMPLETE 0 +#define KFD_IOC_WAIT_RESULT_TIMEOUT 1 +#define KFD_IOC_WAIT_RESULT_FAIL 2 + +#define KFD_SIGNAL_EVENT_LIMIT 4096 + +/* For kfd_event_data.hw_exception_data.reset_type. */ +#define KFD_HW_EXCEPTION_WHOLE_GPU_RESET 0 +#define KFD_HW_EXCEPTION_PER_ENGINE_RESET 1 + +/* For kfd_event_data.hw_exception_data.reset_cause. 
*/ +#define KFD_HW_EXCEPTION_GPU_HANG 0 +#define KFD_HW_EXCEPTION_ECC 1 + +/* For kfd_hsa_memory_exception_data.ErrorType */ +#define KFD_MEM_ERR_NO_RAS 0 +#define KFD_MEM_ERR_SRAM_ECC 1 +#define KFD_MEM_ERR_POISON_CONSUMED 2 +#define KFD_MEM_ERR_GPU_HANG 3 + +struct kfd_ioctl_create_event_args +{ + __u64 event_page_offset; /* from KFD */ + __u32 event_trigger_data; /* from KFD - signal events only */ + __u32 event_type; /* to KFD */ + __u32 auto_reset; /* to KFD */ + __u32 node_id; /* to KFD - only valid for certain + event types */ + __u32 event_id; /* from KFD */ + __u32 event_slot_index; /* from KFD */ +}; + +struct kfd_ioctl_destroy_event_args +{ + __u32 event_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_set_event_args +{ + __u32 event_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_reset_event_args +{ + __u32 event_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_memory_exception_failure +{ + __u32 NotPresent; /* Page not present or supervisor privilege */ + __u32 ReadOnly; /* Write access to a read-only page */ + __u32 NoExecute; /* Execute access to a page marked NX */ + __u32 imprecise; /* Can't determine the exact fault address */ +}; + +/* memory exception data */ +struct kfd_hsa_memory_exception_data +{ + struct kfd_memory_exception_failure failure; + __u64 va; + __u32 gpu_id; + __u32 ErrorType; /* 0 = no RAS error, + * 1 = ECC_SRAM, + * 2 = Link_SYNFLOOD (poison), + * 3 = GPU hang (not attributable to a specific cause), + * other values reserved + */ +}; + +/* hw exception data */ +struct kfd_hsa_hw_exception_data +{ + __u32 reset_type; + __u32 reset_cause; + __u32 memory_lost; + __u32 gpu_id; +}; + +/* hsa signal event data */ +struct kfd_hsa_signal_event_data +{ + __u64 last_event_age; /* to and from KFD */ +}; + +/* Event data */ +struct kfd_event_data +{ + union + { + /* From KFD */ + struct kfd_hsa_memory_exception_data memory_exception_data; + struct kfd_hsa_hw_exception_data hw_exception_data; + /* To and From KFD */ + struct 
kfd_hsa_signal_event_data signal_event_data; + }; + __u64 kfd_event_data_ext; /* pointer to an extension structure + for future exception types */ + __u32 event_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_wait_events_args +{ + __u64 events_ptr; /* pointed to struct + kfd_event_data array, to KFD */ + __u32 num_events; /* to KFD */ + __u32 wait_for_all; /* to KFD */ + __u32 timeout; /* to KFD */ + __u32 wait_result; /* from KFD */ +}; + +struct kfd_ioctl_set_scratch_backing_va_args +{ + __u64 va_addr; /* to KFD */ + __u32 gpu_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_get_tile_config_args +{ + /* to KFD: pointer to tile array */ + __u64 tile_config_ptr; + /* to KFD: pointer to macro tile array */ + __u64 macro_tile_config_ptr; + /* to KFD: array size allocated by user mode + * from KFD: array size filled by kernel + */ + __u32 num_tile_configs; + /* to KFD: array size allocated by user mode + * from KFD: array size filled by kernel + */ + __u32 num_macro_tile_configs; + + __u32 gpu_id; /* to KFD */ + __u32 gb_addr_config; /* from KFD */ + __u32 num_banks; /* from KFD */ + __u32 num_ranks; /* from KFD */ + /* struct size can be extended later if needed + * without breaking ABI compatibility + */ +}; + +struct kfd_ioctl_set_trap_handler_args +{ + __u64 tba_addr; /* to KFD */ + __u64 tma_addr; /* to KFD */ + __u32 gpu_id; /* to KFD */ + __u32 pad; +}; + +struct kfd_ioctl_acquire_vm_args +{ + __u32 drm_fd; /* to KFD */ + __u32 gpu_id; /* to KFD */ +}; + +/* Allocation flags: memory types */ +#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM (1 << 0) +#define KFD_IOC_ALLOC_MEM_FLAGS_GTT (1 << 1) +#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR (1 << 2) +#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL (1 << 3) +#define KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP (1 << 4) +/* Allocation flags: attributes/access options */ +#define KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE (1 << 31) +#define KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE (1 << 30) +#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC (1 << 29) +#define 
KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE (1 << 28) +#define KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM (1 << 27) +#define KFD_IOC_ALLOC_MEM_FLAGS_COHERENT (1 << 26) +#define KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED (1 << 25) +#define KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT (1 << 24) + +/* Allocate memory for later SVM (shared virtual memory) mapping. + * + * @va_addr: virtual address of the memory to be allocated + * all later mappings on all GPUs will use this address + * @size: size in bytes + * @handle: buffer handle returned to user mode, used to refer to + * this allocation for mapping, unmapping and freeing + * @mmap_offset: for CPU-mapping the allocation by mmapping a render node + * for userptrs this is overloaded to specify the CPU address + * @gpu_id: device identifier + * @flags: memory type and attributes. See KFD_IOC_ALLOC_MEM_FLAGS above + */ +struct kfd_ioctl_alloc_memory_of_gpu_args +{ + __u64 va_addr; /* to KFD */ + __u64 size; /* to KFD */ + __u64 handle; /* from KFD */ + __u64 mmap_offset; /* to KFD (userptr), from KFD (mmap offset) */ + __u32 gpu_id; /* to KFD */ + __u32 flags; +}; + +/* Free memory allocated with kfd_ioctl_alloc_memory_of_gpu + * + * @handle: memory handle returned by alloc + */ +struct kfd_ioctl_free_memory_of_gpu_args +{ + __u64 handle; /* to KFD */ +}; + +/* Map memory to one or more GPUs + * + * @handle: memory handle returned by alloc + * @device_ids_array_ptr: array of gpu_ids (__u32 per device) + * @n_devices: number of devices in the array + * @n_success: number of devices mapped successfully + * + * @n_success returns information to the caller how many devices from + * the start of the array have mapped the buffer successfully. It can + * be passed into a subsequent retry call to skip those devices. For + * the first call the caller should initialize it to 0. + * + * If the ioctl completes with return code 0 (success), n_success == + * n_devices. 
+ */ +struct kfd_ioctl_map_memory_to_gpu_args +{ + __u64 handle; /* to KFD */ + __u64 device_ids_array_ptr; /* to KFD */ + __u32 n_devices; /* to KFD */ + __u32 n_success; /* to/from KFD */ +}; + +/* Unmap memory from one or more GPUs + * + * same arguments as for mapping + */ +struct kfd_ioctl_unmap_memory_from_gpu_args +{ + __u64 handle; /* to KFD */ + __u64 device_ids_array_ptr; /* to KFD */ + __u32 n_devices; /* to KFD */ + __u32 n_success; /* to/from KFD */ +}; + +/* Allocate GWS for specific queue + * + * @queue_id: queue's id that GWS is allocated for + * @num_gws: how many GWS to allocate + * @first_gws: index of the first GWS allocated. + * only support contiguous GWS allocation + */ +struct kfd_ioctl_alloc_queue_gws_args +{ + __u32 queue_id; /* to KFD */ + __u32 num_gws; /* to KFD */ + __u32 first_gws; /* from KFD */ + __u32 pad; +}; + +struct kfd_ioctl_get_dmabuf_info_args +{ + __u64 size; /* from KFD */ + __u64 metadata_ptr; /* to KFD */ + __u32 metadata_size; /* to KFD (space allocated by user) + * from KFD (actual metadata size) + */ + __u32 gpu_id; /* from KFD */ + __u32 flags; /* from KFD (KFD_IOC_ALLOC_MEM_FLAGS) */ + __u32 dmabuf_fd; /* to KFD */ +}; + +struct kfd_ioctl_import_dmabuf_args +{ + __u64 va_addr; /* to KFD */ + __u64 handle; /* from KFD */ + __u32 gpu_id; /* to KFD */ + __u32 dmabuf_fd; /* to KFD */ +}; + +struct kfd_ioctl_export_dmabuf_args +{ + __u64 handle; /* to KFD */ + __u32 flags; /* to KFD */ + __u32 dmabuf_fd; /* from KFD */ +}; + +/* + * KFD SMI(System Management Interface) events + */ +enum kfd_smi_event +{ + KFD_SMI_EVENT_NONE = 0, /* not used */ + KFD_SMI_EVENT_VMFAULT = 1, /* event start counting at 1 */ + KFD_SMI_EVENT_THERMAL_THROTTLE = 2, + KFD_SMI_EVENT_GPU_PRE_RESET = 3, + KFD_SMI_EVENT_GPU_POST_RESET = 4, + KFD_SMI_EVENT_MIGRATE_START = 5, + KFD_SMI_EVENT_MIGRATE_END = 6, + KFD_SMI_EVENT_PAGE_FAULT_START = 7, + KFD_SMI_EVENT_PAGE_FAULT_END = 8, + KFD_SMI_EVENT_QUEUE_EVICTION = 9, + KFD_SMI_EVENT_QUEUE_RESTORE = 10, 
+ KFD_SMI_EVENT_UNMAP_FROM_GPU = 11, + + /* + * max event number, used as a flag bit to get events from all processes. + * This requires super user permission; without it, no events are + * received from any process. Without this flag, only events from the + * same process are received. + */ + KFD_SMI_EVENT_ALL_PROCESS = 64 +}; + +enum KFD_MIGRATE_TRIGGERS +{ + KFD_MIGRATE_TRIGGER_PREFETCH, + KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, + KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, + KFD_MIGRATE_TRIGGER_TTM_EVICTION +}; + +enum KFD_QUEUE_EVICTION_TRIGGERS +{ + KFD_QUEUE_EVICTION_TRIGGER_SVM, + KFD_QUEUE_EVICTION_TRIGGER_USERPTR, + KFD_QUEUE_EVICTION_TRIGGER_TTM, + KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, + KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, + KFD_QUEUE_EVICTION_CRIU_RESTORE +}; + +enum KFD_SVM_UNMAP_TRIGGERS +{ + KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, + KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE, + KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU +}; + +#define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) -1)) +#define KFD_SMI_EVENT_MSG_SIZE 96 + +struct kfd_ioctl_smi_events_args +{ + __u32 gpuid; /* to KFD */ + __u32 anon_fd; /* from KFD */ +}; + +/** + * kfd_ioctl_spm_op - SPM ioctl operations + * + * @KFD_IOCTL_SPM_OP_ACQUIRE: acquire exclusive access to SPM + * @KFD_IOCTL_SPM_OP_RELEASE: release exclusive access to SPM + * @KFD_IOCTL_SPM_OP_SET_DEST_BUF: set or unset destination buffer for SPM streaming + */ +enum kfd_ioctl_spm_op +{ + KFD_IOCTL_SPM_OP_ACQUIRE, + KFD_IOCTL_SPM_OP_RELEASE, + KFD_IOCTL_SPM_OP_SET_DEST_BUF +}; + +/** + * kfd_ioctl_spm_args - Arguments for SPM ioctl + * + * @op[in]: specifies the operation to perform + * @gpu_id[in]: GPU ID of the GPU to profile + * @dest_buf[in]: used for the address of the destination buffer + * in @KFD_IOCTL_SPM_OP_SET_DEST_BUF + * @buf_size[in]: size of the destination buffer + * @timeout[in/out]: [in]: timeout in milliseconds, [out]: amount of time left + * in the timeout window + * @bytes_copied[out]: amount of data that was copied to the previous dest_buf
+ * @has_data_loss: boolean indicating whether data was lost + * (e.g. due to a ring-buffer overflow) + * + * This ioctl performs different functions depending on the @op parameter. + * + * KFD_IOCTL_SPM_OP_ACQUIRE + * ------------------------ + * + * Acquires exclusive access of SPM on the specified @gpu_id for the calling process. + * This must be called before using KFD_IOCTL_SPM_OP_SET_DEST_BUF. + * + * KFD_IOCTL_SPM_OP_RELEASE + * ------------------------ + * + * Releases exclusive access of SPM on the specified @gpu_id for the calling process, + * which allows another process to acquire it in the future. + * + * KFD_IOCTL_SPM_OP_SET_DEST_BUF + * ----------------------------- + * + * If @dest_buf is NULL, the destination buffer address is unset and copying of counters + * is stopped. + * + * If @dest_buf is not NULL, it specifies the pointer to a new destination buffer. + * @buf_size specifies the size of the buffer. + * + * If @timeout is non-0, the call will wait for up to @timeout ms for the previous + * buffer to be filled. If the previous buffer is filled before the timeout, @timeout + * is updated with the time remaining. If the timeout is exceeded, the function + * copies any partial data available into the previous user buffer and returns success. + * The amount of valid data in the previous user buffer is indicated by @bytes_copied. + * + * If @timeout is 0, the function immediately replaces the previous destination buffer + * without waiting for the previous buffer to be filled. That means the previous buffer + * may only be partially filled, and @bytes_copied will indicate how much data has been + * copied to it. + * + * If data was lost, e.g. due to a ring buffer overflow, @has_data_loss will be non-0. + * + * Returns negative error code on failure, 0 on success.
+ */ +struct kfd_ioctl_spm_args +{ + __u64 dest_buf; + __u32 buf_size; + __u32 op; + __u32 timeout; + __u32 gpu_id; + __u32 bytes_copied; + __u32 has_data_loss; +}; + +/************************************************************************************************** + * CRIU IOCTLs (Checkpoint Restore In Userspace) + * + * When checkpointing a process, the userspace application will perform: + * 1. PROCESS_INFO op to determine current process information. This pauses execution and evicts + * all the queues. + * 2. CHECKPOINT op to checkpoint process contents (BOs, queues, events, svm-ranges) + * 3. UNPAUSE op to un-evict all the queues + * + * When restoring a process, the CRIU userspace application will perform: + * + * 1. RESTORE op to restore process contents + * 2. RESUME op to start the process + * + * Note: Queues are forced into an evicted state after a successful PROCESS_INFO. User + * application needs to perform an UNPAUSE operation after calling PROCESS_INFO. + */ + +enum kfd_criu_op +{ + KFD_CRIU_OP_PROCESS_INFO, + KFD_CRIU_OP_CHECKPOINT, + KFD_CRIU_OP_UNPAUSE, + KFD_CRIU_OP_RESTORE, + KFD_CRIU_OP_RESUME, +}; + +/** + * kfd_ioctl_criu_args - Arguments to perform a CRIU operation + * @devices: [in/out] User pointer to memory location for devices information. + * This is an array of type kfd_criu_device_bucket. + * @bos: [in/out] User pointer to memory location for BOs information. + * This is an array of type kfd_criu_bo_bucket. + * @priv_data: [in/out] User pointer to memory location for private data + * @priv_data_size: [in/out] Size of priv_data in bytes + * @num_devices: [in/out] Number of GPUs used by process. Size of @devices array. + * @num_bos: [in/out] Number of BOs used by process. Size of @bos array. + * @num_objects: [in/out] Number of objects used by process. Objects are opaque to + * user application.
+ * @pid: [in/out] PID of the process being checkpointed + * @op [in] Type of operation (kfd_criu_op) + * + * Return: 0 on success, -errno on failure + */ +struct kfd_ioctl_criu_args +{ + __u64 devices; /* Used during ops: CHECKPOINT, RESTORE */ + __u64 bos; /* Used during ops: CHECKPOINT, RESTORE */ + __u64 priv_data; /* Used during ops: CHECKPOINT, RESTORE */ + __u64 priv_data_size; /* Used during ops: PROCESS_INFO, RESTORE */ + __u32 num_devices; /* Used during ops: PROCESS_INFO, RESTORE */ + __u32 num_bos; /* Used during ops: PROCESS_INFO, RESTORE */ + __u32 num_objects; /* Used during ops: PROCESS_INFO, RESTORE */ + __u32 pid; /* Used during ops: PROCESS_INFO, RESUME */ + __u32 op; +}; + +struct kfd_criu_device_bucket +{ + __u32 user_gpu_id; + __u32 actual_gpu_id; + __u32 drm_fd; + __u32 pad; +}; + +struct kfd_criu_bo_bucket +{ + __u64 addr; + __u64 size; + __u64 offset; + __u64 restored_offset; /* During restore, updated offset for BO */ + __u32 gpu_id; /* This is the user_gpu_id */ + __u32 alloc_flags; + __u32 dmabuf_fd; + __u32 pad; +}; + +/* CRIU IOCTLs - END */ +/**************************************************************************************************/ +/* Register offset inside the remapped mmio page + */ +enum kfd_mmio_remap +{ + KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL = 0, + KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL = 4, +}; + +struct kfd_ioctl_ipc_export_handle_args +{ + __u64 handle; /* to KFD */ + __u32 share_handle[4]; /* from KFD */ + __u32 gpu_id; /* to KFD */ + __u32 flags; /* to KFD */ +}; + +struct kfd_ioctl_ipc_import_handle_args +{ + __u64 handle; /* from KFD */ + __u64 va_addr; /* to KFD */ + __u64 mmap_offset; /* from KFD */ + __u32 share_handle[4]; /* to KFD */ + __u32 gpu_id; /* to KFD */ + __u32 flags; /* from KFD */ +}; + +struct kfd_ioctl_cross_memory_copy_deprecated_args +{ + /* to KFD: Process ID of the remote process */ + __u32 pid; + /* to KFD: See above definition */ + __u32 flags; + /* to KFD: Source GPU VM range */ + __u64 
src_mem_range_array; + /* to KFD: Size of above array */ + __u64 src_mem_array_size; + /* to KFD: Destination GPU VM range */ + __u64 dst_mem_range_array; + /* to KFD: Size of above array */ + __u64 dst_mem_array_size; + /* from KFD: Total amount of bytes copied */ + __u64 bytes_copied; +}; + +/* Guarantee host access to memory */ +#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x00000001 +/* Fine grained coherency between all devices with access */ +#define KFD_IOCTL_SVM_FLAG_COHERENT 0x00000002 +/* Use any GPU in same hive as preferred device */ +#define KFD_IOCTL_SVM_FLAG_HIVE_LOCAL 0x00000004 +/* GPUs only read, allows replication */ +#define KFD_IOCTL_SVM_FLAG_GPU_RO 0x00000008 +/* Allow execution on GPU */ +#define KFD_IOCTL_SVM_FLAG_GPU_EXEC 0x00000010 +/* GPUs mostly read, may allow similar optimizations as RO, but writes fault */ +#define KFD_IOCTL_SVM_FLAG_GPU_READ_MOSTLY 0x00000020 +/* Keep GPU memory mapping always valid as if XNACK is disabled */ +#define KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED 0x00000040 +/* Fine grained coherency between all devices using device-scope atomics */ +#define KFD_IOCTL_SVM_FLAG_EXT_COHERENT 0x00000080 + +/** + * kfd_ioctl_svm_op - SVM ioctl operations + * + * @KFD_IOCTL_SVM_OP_SET_ATTR: Modify one or more attributes + * @KFD_IOCTL_SVM_OP_GET_ATTR: Query one or more attributes + */ +enum kfd_ioctl_svm_op +{ + KFD_IOCTL_SVM_OP_SET_ATTR, + KFD_IOCTL_SVM_OP_GET_ATTR +}; + +/** kfd_ioctl_svm_location - Enum for preferred and prefetch locations + * + * GPU IDs are used to specify GPUs as preferred and prefetch locations. + * Below definitions are used for system memory or for leaving the preferred + * location unspecified.
+ */ +enum kfd_ioctl_svm_location +{ + KFD_IOCTL_SVM_LOCATION_SYSMEM = 0, + KFD_IOCTL_SVM_LOCATION_UNDEFINED = 0xffffffff +}; + +/** + * kfd_ioctl_svm_attr_type - SVM attribute types + * + * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: gpuid of the preferred location, 0 for + * system memory + * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: gpuid of the prefetch location, 0 for + * system memory. Setting this triggers an + * immediate prefetch (migration). + * @KFD_IOCTL_SVM_ATTR_ACCESS: + * @KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE: + * @KFD_IOCTL_SVM_ATTR_NO_ACCESS: specify memory access for the gpuid given + * by the attribute value + * @KFD_IOCTL_SVM_ATTR_SET_FLAGS: bitmask of flags to set (see + * KFD_IOCTL_SVM_FLAG_...) + * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear + * @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity + * (log2 num pages) + */ +enum kfd_ioctl_svm_attr_type +{ + KFD_IOCTL_SVM_ATTR_PREFERRED_LOC, + KFD_IOCTL_SVM_ATTR_PREFETCH_LOC, + KFD_IOCTL_SVM_ATTR_ACCESS, + KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE, + KFD_IOCTL_SVM_ATTR_NO_ACCESS, + KFD_IOCTL_SVM_ATTR_SET_FLAGS, + KFD_IOCTL_SVM_ATTR_CLR_FLAGS, + KFD_IOCTL_SVM_ATTR_GRANULARITY +}; + +/** + * kfd_ioctl_svm_attribute - Attributes as pairs of type and value + * + * The meaning of the @value depends on the attribute type. + * + * @type: attribute type (see enum @kfd_ioctl_svm_attr_type) + * @value: attribute value + */ +struct kfd_ioctl_svm_attribute +{ + __u32 type; + __u32 value; +}; + +/** + * kfd_ioctl_svm_args - Arguments for SVM ioctl + * + * @op specifies the operation to perform (see enum + * @kfd_ioctl_svm_op). @start_addr and @size are common for all + * operations. + * + * A variable number of attributes can be given in @attrs. + * @nattr specifies the number of attributes. New attributes can be + * added in the future without breaking the ABI. If unknown attributes + * are given, the function returns -EINVAL. + * + * @KFD_IOCTL_SVM_OP_SET_ATTR sets attributes for a virtual address + * range. 
It may overlap existing virtual address ranges. If it does, + * the existing ranges will be split such that the attribute changes + * only apply to the specified address range. + * + * @KFD_IOCTL_SVM_OP_GET_ATTR returns the intersection of attributes + * over all memory in the given range and returns the result as the + * attribute value. If different pages have different preferred or + * prefetch locations, 0xffffffff will be returned for + * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC or + * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC respectively. For + * @KFD_IOCTL_SVM_ATTR_SET_FLAGS, flags of all pages will be + * aggregated by bitwise AND. That means, a flag will be set in the + * output, if that flag is set for all pages in the range. For + * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS, flags of all pages will be + * aggregated by bitwise NOR. That means, a flag will be set in the + * output, if that flag is clear for all pages in the range. + * The minimum migration granularity throughout the range will be + * returned for @KFD_IOCTL_SVM_ATTR_GRANULARITY. + * + * Querying of accessibility attributes works by initializing the + * attribute type to @KFD_IOCTL_SVM_ATTR_ACCESS and the value to the + * GPUID being queried. Multiple attributes can be given to allow + * querying multiple GPUIDs. The ioctl function overwrites the + * attribute type to indicate the access for the specified GPU. + */ +struct kfd_ioctl_svm_args +{ + __u64 start_addr; + __u64 size; + __u32 op; + __u32 nattr; + /* Variable length array of attributes */ + struct kfd_ioctl_svm_attribute attrs[]; +}; + +/** + * kfd_ioctl_set_xnack_mode_args - Arguments for set_xnack_mode + * + * @xnack_enabled: [in/out] Whether to enable XNACK mode for this process + * + * @xnack_enabled indicates whether recoverable page faults should be + * enabled for the current process. 0 means disabled, positive means + * enabled, negative means leave unchanged.
If enabled, virtual address + * translations on GFXv9 and later AMD GPUs can return XNACK and retry + * the access until a valid PTE is available. This is used to implement + * device page faults. + * + * On output, @xnack_enabled returns the (new) current mode (0 or + * positive). Therefore, a negative input value can be used to query + * the current mode without changing it. + * + * The XNACK mode fundamentally changes the way SVM managed memory works + * in the driver, with subtle effects on application performance and + * functionality. + * + * Enabling XNACK mode requires shader programs to be compiled + * differently. Furthermore, not all GPUs support changing the mode + * per-process. Therefore changing the mode is only allowed while no + * user mode queues exist in the process. This ensures that no shader + * code is running that may be compiled for the wrong mode. And GPUs + * that cannot change to the requested mode will prevent the XNACK + * mode change from occurring. All GPUs used by the process must be in the + * same XNACK mode. + * + * GFXv8 or older GPUs do not support 48 bit virtual addresses or SVM. + * Therefore those GPUs are not considered for the XNACK mode switch.
+ * + * Return: 0 on success, -errno on failure + */ +struct kfd_ioctl_set_xnack_mode_args +{ + __s32 xnack_enabled; +}; + +/* Wave launch override modes */ +enum kfd_dbg_trap_override_mode +{ + KFD_DBG_TRAP_OVERRIDE_OR = 0, + KFD_DBG_TRAP_OVERRIDE_REPLACE = 1 +}; + +/* Wave launch overrides */ +enum kfd_dbg_trap_mask +{ + KFD_DBG_TRAP_MASK_FP_INVALID = 1, + KFD_DBG_TRAP_MASK_FP_INPUT_DENORMAL = 2, + KFD_DBG_TRAP_MASK_FP_DIVIDE_BY_ZERO = 4, + KFD_DBG_TRAP_MASK_FP_OVERFLOW = 8, + KFD_DBG_TRAP_MASK_FP_UNDERFLOW = 16, + KFD_DBG_TRAP_MASK_FP_INEXACT = 32, + KFD_DBG_TRAP_MASK_INT_DIVIDE_BY_ZERO = 64, + KFD_DBG_TRAP_MASK_DBG_ADDRESS_WATCH = 128, + KFD_DBG_TRAP_MASK_DBG_MEMORY_VIOLATION = 256, + KFD_DBG_TRAP_MASK_TRAP_ON_WAVE_START = (1 << 30), + KFD_DBG_TRAP_MASK_TRAP_ON_WAVE_END = (1 << 31) +}; + +/* Wave launch modes */ +enum kfd_dbg_trap_wave_launch_mode +{ + KFD_DBG_TRAP_WAVE_LAUNCH_MODE_NORMAL = 0, + KFD_DBG_TRAP_WAVE_LAUNCH_MODE_HALT = 1, + KFD_DBG_TRAP_WAVE_LAUNCH_MODE_DEBUG = 3 +}; + +/* Address watch modes */ +enum kfd_dbg_trap_address_watch_mode +{ + KFD_DBG_TRAP_ADDRESS_WATCH_MODE_READ = 0, + KFD_DBG_TRAP_ADDRESS_WATCH_MODE_NONREAD = 1, + KFD_DBG_TRAP_ADDRESS_WATCH_MODE_ATOMIC = 2, + KFD_DBG_TRAP_ADDRESS_WATCH_MODE_ALL = 3 +}; + +/* Additional wave settings */ +enum kfd_dbg_trap_flags +{ + KFD_DBG_TRAP_FLAG_SINGLE_MEM_OP = 1, +}; + +/* Trap exceptions */ +enum kfd_dbg_trap_exception_code +{ + EC_NONE = 0, + /* per queue */ + EC_QUEUE_WAVE_ABORT = 1, + EC_QUEUE_WAVE_TRAP = 2, + EC_QUEUE_WAVE_MATH_ERROR = 3, + EC_QUEUE_WAVE_ILLEGAL_INSTRUCTION = 4, + EC_QUEUE_WAVE_MEMORY_VIOLATION = 5, + EC_QUEUE_WAVE_APERTURE_VIOLATION = 6, + EC_QUEUE_PACKET_DISPATCH_DIM_INVALID = 16, + EC_QUEUE_PACKET_DISPATCH_GROUP_SEGMENT_SIZE_INVALID = 17, + EC_QUEUE_PACKET_DISPATCH_CODE_INVALID = 18, + EC_QUEUE_PACKET_RESERVED = 19, + EC_QUEUE_PACKET_UNSUPPORTED = 20, + EC_QUEUE_PACKET_DISPATCH_WORK_GROUP_SIZE_INVALID = 21, + EC_QUEUE_PACKET_DISPATCH_REGISTER_INVALID = 22, + 
EC_QUEUE_PACKET_VENDOR_UNSUPPORTED = 23, + EC_QUEUE_PREEMPTION_ERROR = 30, + EC_QUEUE_NEW = 31, + /* per device */ + EC_DEVICE_QUEUE_DELETE = 32, + EC_DEVICE_MEMORY_VIOLATION = 33, + EC_DEVICE_RAS_ERROR = 34, + EC_DEVICE_FATAL_HALT = 35, + EC_DEVICE_NEW = 36, + /* per process */ + EC_PROCESS_RUNTIME = 48, + EC_PROCESS_DEVICE_REMOVE = 49, + EC_MAX +}; + +/* Mask generated by ecode in kfd_dbg_trap_exception_code */ +#define KFD_EC_MASK(ecode) (1ULL << (ecode - 1)) + +/* Masks for exception code type checks below */ +#define KFD_EC_MASK_QUEUE \ + (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | KFD_EC_MASK(EC_QUEUE_WAVE_TRAP) | \ + KFD_EC_MASK(EC_QUEUE_WAVE_MATH_ERROR) | KFD_EC_MASK(EC_QUEUE_WAVE_ILLEGAL_INSTRUCTION) | \ + KFD_EC_MASK(EC_QUEUE_WAVE_MEMORY_VIOLATION) | KFD_EC_MASK(EC_QUEUE_WAVE_APERTURE_VIOLATION) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_DIM_INVALID) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_GROUP_SEGMENT_SIZE_INVALID) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_CODE_INVALID) | KFD_EC_MASK(EC_QUEUE_PACKET_RESERVED) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_UNSUPPORTED) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_WORK_GROUP_SIZE_INVALID) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_REGISTER_INVALID) | \ + KFD_EC_MASK(EC_QUEUE_PACKET_VENDOR_UNSUPPORTED) | KFD_EC_MASK(EC_QUEUE_PREEMPTION_ERROR) | \ + KFD_EC_MASK(EC_QUEUE_NEW)) +#define KFD_EC_MASK_DEVICE \ + (KFD_EC_MASK(EC_DEVICE_QUEUE_DELETE) | KFD_EC_MASK(EC_DEVICE_RAS_ERROR) | \ + KFD_EC_MASK(EC_DEVICE_FATAL_HALT) | KFD_EC_MASK(EC_DEVICE_MEMORY_VIOLATION) | \ + KFD_EC_MASK(EC_DEVICE_NEW)) +#define KFD_EC_MASK_PROCESS \ + (KFD_EC_MASK(EC_PROCESS_RUNTIME) | KFD_EC_MASK(EC_PROCESS_DEVICE_REMOVE)) + +/* Checks for exception code types for KFD search */ +#define KFD_DBG_EC_TYPE_IS_QUEUE(ecode) (!!(KFD_EC_MASK(ecode) & KFD_EC_MASK_QUEUE)) +#define KFD_DBG_EC_TYPE_IS_DEVICE(ecode) (!!(KFD_EC_MASK(ecode) & KFD_EC_MASK_DEVICE)) +#define KFD_DBG_EC_TYPE_IS_PROCESS(ecode) (!!(KFD_EC_MASK(ecode) & KFD_EC_MASK_PROCESS)) + +/* Runtime 
enable states */ +enum kfd_dbg_runtime_state +{ + DEBUG_RUNTIME_STATE_DISABLED = 0, + DEBUG_RUNTIME_STATE_ENABLED = 1, + DEBUG_RUNTIME_STATE_ENABLED_BUSY = 2, + DEBUG_RUNTIME_STATE_ENABLED_ERROR = 3 +}; + +/* Runtime enable status */ +struct kfd_runtime_info +{ + __u64 r_debug; + __u32 runtime_state; + __u32 ttmp_setup; +}; + +/* Enable modes for runtime enable */ +#define KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK 1 +#define KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK 2 + +/** + * kfd_ioctl_runtime_enable_args - Arguments for runtime enable + * + * Coordinates debug exception signalling and debug device enablement with runtime. + * + * @r_debug - pointer to user struct for sharing information between ROCr and the debugger + * @mode_mask - mask to set mode + * KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK - enable runtime for debugging, otherwise disable + * KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK - enable trap temporary setup (ignore on disable) + * @capabilities_mask - mask to notify runtime on what KFD supports + * + * Return - 0 on SUCCESS. + * - EBUSY if runtime enable call already pending. + * - EEXIST if user queues already active prior to call. + * If process is debug enabled, runtime enable will enable debug devices and + * wait for debugger process to send runtime exception EC_PROCESS_RUNTIME + * to unblock - see kfd_ioctl_dbg_trap_args.
+ * + */ +struct kfd_ioctl_runtime_enable_args +{ + __u64 r_debug; + __u32 mode_mask; + __u32 capabilities_mask; +}; + +/* Queue information */ +struct kfd_queue_snapshot_entry +{ + __u64 exception_status; + __u64 ring_base_address; + __u64 write_pointer_address; + __u64 read_pointer_address; + __u64 ctx_save_restore_address; + __u32 queue_id; + __u32 gpu_id; + __u32 ring_size; + __u32 queue_type; + __u32 ctx_save_restore_area_size; + __u32 reserved; +}; + +/* Queue status return for suspend/resume */ +#define KFD_DBG_QUEUE_ERROR_BIT 30 +#define KFD_DBG_QUEUE_INVALID_BIT 31 +#define KFD_DBG_QUEUE_ERROR_MASK (1 << KFD_DBG_QUEUE_ERROR_BIT) +#define KFD_DBG_QUEUE_INVALID_MASK (1 << KFD_DBG_QUEUE_INVALID_BIT) + +/* Context save area header information */ +struct kfd_context_save_area_header +{ + struct + { + __u32 control_stack_offset; + __u32 control_stack_size; + __u32 wave_state_offset; + __u32 wave_state_size; + } wave_state; + __u32 debug_offset; + __u32 debug_size; + __u64 err_payload_addr; + __u32 err_event_id; + __u32 reserved1; +}; + +/* + * Debug operations + * + * For specifics on usage and return values, see documentation per operation + * below. Otherwise, generic error returns apply: + * - ESRCH if the process to debug does not exist. + * + * - EINVAL (with KFD_IOC_DBG_TRAP_ENABLE exempt) if operation + * KFD_IOC_DBG_TRAP_ENABLE has not succeeded prior. + * Also returns this error if GPU hardware scheduling is not supported. + * + * - EPERM (with KFD_IOC_DBG_TRAP_DISABLE exempt) if target process is not + * PTRACE_ATTACHED. KFD_IOC_DBG_TRAP_DISABLE is exempt to allow + * clean up of debug mode as long as process is debug enabled. + * + * - EACCES if any DBG_HW_OP (debug hardware operation) is requested when + * AMDKFD_IOC_RUNTIME_ENABLE has not succeeded prior. + * + * - ENODEV if any GPU does not support debugging on a DBG_HW_OP call. + * + * - Other errors may be returned when a DBG_HW_OP occurs while the GPU + * is in a fatal state. 
+ * + */ +enum kfd_dbg_trap_operations +{ + KFD_IOC_DBG_TRAP_ENABLE = 0, + KFD_IOC_DBG_TRAP_DISABLE = 1, + KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT = 2, + KFD_IOC_DBG_TRAP_SET_EXCEPTIONS_ENABLED = 3, + KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE = 4, /* DBG_HW_OP */ + KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE = 5, /* DBG_HW_OP */ + KFD_IOC_DBG_TRAP_SUSPEND_QUEUES = 6, /* DBG_HW_OP */ + KFD_IOC_DBG_TRAP_RESUME_QUEUES = 7, /* DBG_HW_OP */ + KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH = 8, /* DBG_HW_OP */ + KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH = 9, /* DBG_HW_OP */ + KFD_IOC_DBG_TRAP_SET_FLAGS = 10, + KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT = 11, + KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO = 12, + KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT = 13, + KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT = 14 +}; + +/** + * kfd_ioctl_dbg_trap_enable_args + * + * Arguments for KFD_IOC_DBG_TRAP_ENABLE. + * + * Enables debug session for target process. Call @op KFD_IOC_DBG_TRAP_DISABLE in + * kfd_ioctl_dbg_trap_args to disable debug session. + * + * @exception_mask (IN) - exceptions to raise to the debugger + * @rinfo_ptr (IN) - pointer to runtime info buffer (see kfd_runtime_info) + * @rinfo_size (IN/OUT) - size of runtime info buffer in bytes + * @dbg_fd (IN) - fd the KFD will use to notify the debugger of raised + * exceptions set in exception_mask. + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * Copies KFD saved kfd_runtime_info to @rinfo_ptr on enable. + * Size of kfd_runtime_info saved by the KFD is returned to @rinfo_size. + * - EBADF if KFD cannot get a reference to dbg_fd. + * - EFAULT if KFD cannot copy runtime info to rinfo_ptr. + * - EINVAL if target process is already debug enabled. + * + */ +struct kfd_ioctl_dbg_trap_enable_args +{ + __u64 exception_mask; + __u64 rinfo_ptr; + __u32 rinfo_size; + __u32 dbg_fd; +}; + +/** + * kfd_ioctl_dbg_trap_send_runtime_event_args + * + * + * Arguments for KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT. + * Raises exceptions to runtime.
+ * + * @exception_mask (IN) - exceptions to raise to runtime + * @gpu_id (IN) - target device id + * @queue_id (IN) - target queue id + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * - ENODEV if gpu_id not found. + * If exception_mask contains EC_PROCESS_RUNTIME, unblocks pending + * AMDKFD_IOC_RUNTIME_ENABLE call - see kfd_ioctl_runtime_enable_args. + * All other exceptions are raised to runtime through err_payload_addr. + * See kfd_context_save_area_header. + */ +struct kfd_ioctl_dbg_trap_send_runtime_event_args +{ + __u64 exception_mask; + __u32 gpu_id; + __u32 queue_id; +}; + +/** + * kfd_ioctl_dbg_trap_set_exceptions_enabled_args + * + * Arguments for KFD_IOC_DBG_TRAP_SET_EXCEPTIONS_ENABLED + * Set new exceptions to be raised to the debugger. + * + * @exception_mask (IN) - new exceptions to raise to the debugger + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + */ +struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args +{ + __u64 exception_mask; +}; + +/** + * kfd_ioctl_dbg_trap_set_wave_launch_override_args + * + * Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE + * Enable HW exceptions to raise trap. + * + * @override_mode (IN) - see kfd_dbg_trap_override_mode + * @enable_mask (IN/OUT) - reference kfd_dbg_trap_mask. + * IN is the override modes requested to be enabled. + * OUT is referenced in Return below. + * @support_request_mask (IN/OUT) - reference kfd_dbg_trap_mask. + * IN is the override modes requested for support check. + * OUT is referenced in Return below. + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * Previous enablement is returned in @enable_mask. + * Actual override support is returned in @support_request_mask. + * - EINVAL if override mode is not supported. + * - EACCES if trap support requested is not actually supported. + * i.e. enable_mask (IN) is not a subset of support_request_mask (OUT).
+ * Otherwise it is considered a generic error (see kfd_dbg_trap_operations). + */ +struct kfd_ioctl_dbg_trap_set_wave_launch_override_args +{ + __u32 override_mode; + __u32 enable_mask; + __u32 support_request_mask; + __u32 pad; +}; + +/** + * kfd_ioctl_dbg_trap_set_wave_launch_mode_args + * + * Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE + * Set wave launch mode. + * + * @launch_mode (IN) - see kfd_dbg_trap_wave_launch_mode + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + */ +struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args +{ + __u32 launch_mode; + __u32 pad; +}; + +/** + * kfd_ioctl_dbg_trap_suspend_queues_args + * + * Arguments for KFD_IOC_DBG_TRAP_SUSPEND_QUEUES + * Suspend queues. + * + * @exception_mask (IN) - raised exceptions to clear + * @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id) + * to suspend + * @num_queues (IN) - number of queues to suspend in @queue_array_ptr + * @grace_period (IN) - wave time allowance before preemption + * per 1K GPU clock cycle unit + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Destruction of a suspended queue is blocked until the queue is + * resumed. This allows the debugger to access queue information and + * its context save area without running into a race condition on + * queue destruction. + * Automatically copies per queue context save area header information + * into the save area base + * (see kfd_queue_snapshot_entry and kfd_context_save_area_header). + * + * Return - Number of queues suspended on SUCCESS. + * KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK masked + * for each queue id in @queue_array_ptr array reports unsuccessful + * suspend reason. + * KFD_DBG_QUEUE_ERROR_MASK = HW failure. + * KFD_DBG_QUEUE_INVALID_MASK = queue does not exist, is new or + * is being destroyed.
+ */ +struct kfd_ioctl_dbg_trap_suspend_queues_args +{ + __u64 exception_mask; + __u64 queue_array_ptr; + __u32 num_queues; + __u32 grace_period; +}; + +/** + * kfd_ioctl_dbg_trap_resume_queues_args + * + * Arguments for KFD_IOC_DBG_TRAP_RESUME_QUEUES + * Resume queues. + * + * @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id) + * to resume + * @num_queues (IN) - number of queues to resume in @queue_array_ptr + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - Number of queues resumed on SUCCESS. + * KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK mask + * for each queue id in @queue_array_ptr array reports unsuccessful + * resume reason. + * KFD_DBG_QUEUE_ERROR_MASK = HW failure. + * KFD_DBG_QUEUE_INVALID_MASK = queue does not exist. + */ +struct kfd_ioctl_dbg_trap_resume_queues_args +{ + __u64 queue_array_ptr; + __u32 num_queues; + __u32 pad; +}; + +/** + * kfd_ioctl_dbg_trap_set_node_address_watch_args + * + * Arguments for KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH + * Sets address watch for device. + * + * @address (IN) - watch address to set + * @mode (IN) - see kfd_dbg_trap_address_watch_mode + * @mask (IN) - watch address mask + * @gpu_id (IN) - target gpu to set watch point + * @id (OUT) - watch id allocated + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * Allocated watch ID returned to @id. + * - ENODEV if gpu_id not found. + * - ENOMEM if watch IDs cannot be allocated + */ +struct kfd_ioctl_dbg_trap_set_node_address_watch_args +{ + __u64 address; + __u32 mode; + __u32 mask; + __u32 gpu_id; + __u32 id; +}; + +/** + * kfd_ioctl_dbg_trap_clear_node_address_watch_args + * + * Arguments for KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH + * Clear address watch for device. + * + * @gpu_id (IN) - target device to clear watch point + * @id (IN) - allocated watch id to clear + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS.
+ * - ENODEV if gpu_id not found. + * - EINVAL if watch ID has not been allocated. + */ +struct kfd_ioctl_dbg_trap_clear_node_address_watch_args +{ + __u32 gpu_id; + __u32 id; +}; + +/** + * kfd_ioctl_dbg_trap_set_flags_args + * + * Arguments for KFD_IOC_DBG_TRAP_SET_FLAGS + * Sets flags for wave behaviour. + * + * @flags (IN/OUT) - IN = flags to enable, OUT = flags previously enabled + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * - EACCES if any debug device does not allow flag options. + */ +struct kfd_ioctl_dbg_trap_set_flags_args +{ + __u32 flags; + __u32 pad; +}; + +/** + * kfd_ioctl_dbg_trap_query_debug_event_args + * + * Arguments for KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT + * + * Find one or more raised exceptions. This function can return multiple + * exceptions from a single queue or a single device with one call. To find + * all raised exceptions, this function must be called repeatedly until it + * returns -EAGAIN. Returned exceptions can optionally be cleared by + * setting the corresponding bit in the @exception_mask input parameter. + * However, clearing an exception prevents retrieving further information + * about it with KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO. + * + * @exception_mask (IN/OUT) - exception to clear (IN) and raised (OUT) + * @gpu_id (OUT) - gpu id of exceptions raised + * @queue_id (OUT) - queue id of exceptions raised + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on raised exception found + * Raised exceptions found are returned in @exception_mask + * with reported source id returned in @gpu_id or @queue_id. + * - EAGAIN if no raised exception has been found + */ +struct kfd_ioctl_dbg_trap_query_debug_event_args +{ + __u64 exception_mask; + __u32 gpu_id; + __u32 queue_id; +}; + +/** + * kfd_ioctl_dbg_trap_query_exception_info_args + * + * Arguments for KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO + * Get additional info on raised exception.
+ * + * @info_ptr (IN) - pointer to exception info buffer to copy to + * @info_size (IN/OUT) - exception info buffer size (bytes) + * @source_id (IN) - target gpu or queue id + * @exception_code (IN) - target exception + * @clear_exception (IN) - clear raised @exception_code exception + * (0 = false, 1 = true) + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * If @exception_code is EC_DEVICE_MEMORY_VIOLATION, copy @info_size(OUT) + * bytes of memory exception data to @info_ptr. + * If @exception_code is EC_PROCESS_RUNTIME, copy saved + * kfd_runtime_info to @info_ptr. + * Actual required @info_ptr size (bytes) is returned in @info_size. + */ +struct kfd_ioctl_dbg_trap_query_exception_info_args +{ + __u64 info_ptr; + __u32 info_size; + __u32 source_id; + __u32 exception_code; + __u32 clear_exception; +}; + +/** + * kfd_ioctl_dbg_trap_get_queue_snapshot_args + * + * Arguments KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT + * Get queue information. + * + * @exception_mask (IN) - exceptions raised to clear + * @snapshot_buf_ptr (IN) - queue snapshot entry buffer (see kfd_queue_snapshot_entry) + * @num_queues (IN/OUT) - number of queue snapshot entries + * The debugger specifies the size of the array allocated in @num_queues. + * KFD returns the number of queues that actually existed. If this is + * larger than the size specified by the debugger, KFD will not overflow + * the array allocated by the debugger. + * + * @entry_size (IN/OUT) - size per entry in bytes + * The debugger specifies sizeof(struct kfd_queue_snapshot_entry) in + * @entry_size. KFD returns the number of bytes actually populated per + * entry. The debugger should use the KFD_IOCTL_MINOR_VERSION to determine, + * which fields in struct kfd_queue_snapshot_entry are valid. This allows + * growing the ABI in a backwards compatible manner. 
+ * Note that entry_size(IN) should still be used to stride the snapshot buffer in the + * event that it's larger than actual kfd_queue_snapshot_entry. + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * Copies @num_queues(IN) queue snapshot entries of size @entry_size(IN) + * into @snapshot_buf_ptr if @num_queues(IN) > 0. + * Otherwise return @num_queues(OUT) queue snapshot entries that exist. + */ +struct kfd_ioctl_dbg_trap_queue_snapshot_args +{ + __u64 exception_mask; + __u64 snapshot_buf_ptr; + __u32 num_queues; + __u32 entry_size; +}; + +/** + * kfd_ioctl_dbg_trap_get_device_snapshot_args + * + * Arguments for KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT + * Get device information. + * + * @exception_mask (IN) - exceptions raised to clear + * @snapshot_buf_ptr (IN) - pointer to snapshot buffer (see kfd_dbg_device_info_entry) + * @num_devices (IN/OUT) - number of debug devices to snapshot + * The debugger specifies the size of the array allocated in @num_devices. + * KFD returns the number of devices that actually existed. If this is + * larger than the size specified by the debugger, KFD will not overflow + * the array allocated by the debugger. + * + * @entry_size (IN/OUT) - size per entry in bytes + * The debugger specifies sizeof(struct kfd_dbg_device_info_entry) in + * @entry_size. KFD returns the number of bytes actually populated. The + * debugger should use KFD_IOCTL_MINOR_VERSION to determine, which fields + * in struct kfd_dbg_device_info_entry are valid. This allows growing the + * ABI in a backwards compatible manner. + * Note that entry_size(IN) should still be used to stride the snapshot buffer in the + * event that it's larger than actual kfd_dbg_device_info_entry. + * + * Generic errors apply (see kfd_dbg_trap_operations). + * Return - 0 on SUCCESS. + * Copies @num_devices(IN) device snapshot entries of size @entry_size(IN) + * into @snapshot_buf_ptr if @num_devices(IN) > 0. 
+ * Otherwise return @num_devices(OUT) device snapshot entries that exist. + */ +struct kfd_ioctl_dbg_trap_device_snapshot_args +{ + __u64 exception_mask; + __u64 snapshot_buf_ptr; + __u32 num_devices; + __u32 entry_size; +}; + +/** + * kfd_ioctl_dbg_trap_args + * + * Arguments to debug target process. + * + * @pid - target process to debug + * @op - debug operation (see kfd_dbg_trap_operations) + * + * @op determines which union struct args to use. + * Refer to kern docs for each kfd_ioctl_dbg_trap_*_args struct. + */ +struct kfd_ioctl_dbg_trap_args +{ + __u32 pid; + __u32 op; + + union + { + struct kfd_ioctl_dbg_trap_enable_args enable; + struct kfd_ioctl_dbg_trap_send_runtime_event_args send_runtime_event; + struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args set_exceptions_enabled; + struct kfd_ioctl_dbg_trap_set_wave_launch_override_args launch_override; + struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args launch_mode; + struct kfd_ioctl_dbg_trap_suspend_queues_args suspend_queues; + struct kfd_ioctl_dbg_trap_resume_queues_args resume_queues; + struct kfd_ioctl_dbg_trap_set_node_address_watch_args set_node_address_watch; + struct kfd_ioctl_dbg_trap_clear_node_address_watch_args clear_node_address_watch; + struct kfd_ioctl_dbg_trap_set_flags_args set_flags; + struct kfd_ioctl_dbg_trap_query_debug_event_args query_debug_event; + struct kfd_ioctl_dbg_trap_query_exception_info_args query_exception_info; + struct kfd_ioctl_dbg_trap_queue_snapshot_args queue_snapshot; + struct kfd_ioctl_dbg_trap_device_snapshot_args device_snapshot; + }; +}; + +/** + * kfd_ioctl_pc_sample_op - PC Sampling ioctl operations + * + * @KFD_IOCTL_PCS_OP_QUERY_CAPABILITIES: Query device PC Sampling capabilities + * @KFD_IOCTL_PCS_OP_CREATE: Register this process with a per-device PC sampler instance + * @KFD_IOCTL_PCS_OP_DESTROY: Unregister from a previously registered PC sampler instance + * @KFD_IOCTL_PCS_OP_START: Process begins taking samples from a previously registered + * PC
sampler instance + * @KFD_IOCTL_PCS_OP_STOP: Process stops taking samples from a previously registered + * PC sampler instance + */ +enum kfd_ioctl_pc_sample_op +{ + KFD_IOCTL_PCS_OP_QUERY_CAPABILITIES, + KFD_IOCTL_PCS_OP_CREATE, + KFD_IOCTL_PCS_OP_DESTROY, + KFD_IOCTL_PCS_OP_START, + KFD_IOCTL_PCS_OP_STOP, +}; + +/* Values have to be a power of 2 */ +#define KFD_IOCTL_PCS_FLAG_POWER_OF_2 0x00000001 + +enum kfd_ioctl_pc_sample_method +{ + KFD_IOCTL_PCS_METHOD_HOSTTRAP = 1, + KFD_IOCTL_PCS_METHOD_STOCHASTIC, +}; + +enum kfd_ioctl_pc_sample_type +{ + KFD_IOCTL_PCS_TYPE_TIME_US, + KFD_IOCTL_PCS_TYPE_CLOCK_CYCLES, + KFD_IOCTL_PCS_TYPE_INSTRUCTIONS +}; + +struct kfd_pc_sample_info +{ + __u64 interval; /* [IN] if PCS_TYPE_TIME_US: sample interval in us + * if PCS_TYPE_CLOCK_CYCLES: sample interval in graphics core clk cycles + * if PCS_TYPE_INSTRUCTIONS: sample interval in instructions issued by + * graphics compute units + */ + __u64 interval_min; /* [OUT] */ + __u64 interval_max; /* [OUT] */ + __u64 flags; /* [OUT] indicate potential restrictions e.g FLAG_POWER_OF_2 */ + __u32 method; /* [IN/OUT] kfd_ioctl_pc_sample_method */ + __u32 type; /* [IN/OUT] kfd_ioctl_pc_sample_type */ +}; + +#define KFD_IOCTL_PCS_QUERY_TYPE_FULL (1 << 0) /* If not set, return current */ + +struct kfd_ioctl_pc_sample_args +{ + __u64 sample_info_ptr; /* array of kfd_pc_sample_info */ + __u32 num_sample_info; + __u32 op; /* kfd_ioctl_pc_sample_op */ + __u32 gpu_id; + __u32 trace_id; + __u32 flags; /* kfd_ioctl_pcs_query flags */ + __u32 reserved; +}; + +#define AMDKFD_IOCTL_BASE 'K' +#define AMDKFD_IO(nr) _IO(AMDKFD_IOCTL_BASE, nr) +#define AMDKFD_IOR(nr, type) _IOR(AMDKFD_IOCTL_BASE, nr, type) +#define AMDKFD_IOW(nr, type) _IOW(AMDKFD_IOCTL_BASE, nr, type) +#define AMDKFD_IOWR(nr, type) _IOWR(AMDKFD_IOCTL_BASE, nr, type) + +#define AMDKFD_IOC_GET_VERSION AMDKFD_IOR(0x01, struct kfd_ioctl_get_version_args) + +#define AMDKFD_IOC_CREATE_QUEUE AMDKFD_IOWR(0x02, struct
kfd_ioctl_create_queue_args) + +#define AMDKFD_IOC_DESTROY_QUEUE AMDKFD_IOWR(0x03, struct kfd_ioctl_destroy_queue_args) + +#define AMDKFD_IOC_SET_MEMORY_POLICY AMDKFD_IOW(0x04, struct kfd_ioctl_set_memory_policy_args) + +#define AMDKFD_IOC_GET_CLOCK_COUNTERS AMDKFD_IOWR(0x05, struct kfd_ioctl_get_clock_counters_args) + +#define AMDKFD_IOC_GET_PROCESS_APERTURES \ + AMDKFD_IOR(0x06, struct kfd_ioctl_get_process_apertures_args) + +#define AMDKFD_IOC_UPDATE_QUEUE AMDKFD_IOW(0x07, struct kfd_ioctl_update_queue_args) + +#define AMDKFD_IOC_CREATE_EVENT AMDKFD_IOWR(0x08, struct kfd_ioctl_create_event_args) + +#define AMDKFD_IOC_DESTROY_EVENT AMDKFD_IOW(0x09, struct kfd_ioctl_destroy_event_args) + +#define AMDKFD_IOC_SET_EVENT AMDKFD_IOW(0x0A, struct kfd_ioctl_set_event_args) + +#define AMDKFD_IOC_RESET_EVENT AMDKFD_IOW(0x0B, struct kfd_ioctl_reset_event_args) + +#define AMDKFD_IOC_WAIT_EVENTS AMDKFD_IOWR(0x0C, struct kfd_ioctl_wait_events_args) + +#define AMDKFD_IOC_DBG_REGISTER_DEPRECATED AMDKFD_IOW(0x0D, struct kfd_ioctl_dbg_register_args) + +#define AMDKFD_IOC_DBG_UNREGISTER_DEPRECATED AMDKFD_IOW(0x0E, struct kfd_ioctl_dbg_unregister_args) + +#define AMDKFD_IOC_DBG_ADDRESS_WATCH_DEPRECATED \ + AMDKFD_IOW(0x0F, struct kfd_ioctl_dbg_address_watch_args) + +#define AMDKFD_IOC_DBG_WAVE_CONTROL_DEPRECATED \ + AMDKFD_IOW(0x10, struct kfd_ioctl_dbg_wave_control_args) + +#define AMDKFD_IOC_SET_SCRATCH_BACKING_VA \ + AMDKFD_IOWR(0x11, struct kfd_ioctl_set_scratch_backing_va_args) + +#define AMDKFD_IOC_GET_TILE_CONFIG AMDKFD_IOWR(0x12, struct kfd_ioctl_get_tile_config_args) + +#define AMDKFD_IOC_SET_TRAP_HANDLER AMDKFD_IOW(0x13, struct kfd_ioctl_set_trap_handler_args) + +#define AMDKFD_IOC_GET_PROCESS_APERTURES_NEW \ + AMDKFD_IOWR(0x14, struct kfd_ioctl_get_process_apertures_new_args) + +#define AMDKFD_IOC_ACQUIRE_VM AMDKFD_IOW(0x15, struct kfd_ioctl_acquire_vm_args) + +#define AMDKFD_IOC_ALLOC_MEMORY_OF_GPU AMDKFD_IOWR(0x16, struct kfd_ioctl_alloc_memory_of_gpu_args) + +#define 
AMDKFD_IOC_FREE_MEMORY_OF_GPU AMDKFD_IOW(0x17, struct kfd_ioctl_free_memory_of_gpu_args) + +#define AMDKFD_IOC_MAP_MEMORY_TO_GPU AMDKFD_IOWR(0x18, struct kfd_ioctl_map_memory_to_gpu_args) + +#define AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU \ + AMDKFD_IOWR(0x19, struct kfd_ioctl_unmap_memory_from_gpu_args) + +#define AMDKFD_IOC_SET_CU_MASK AMDKFD_IOW(0x1A, struct kfd_ioctl_set_cu_mask_args) + +#define AMDKFD_IOC_GET_QUEUE_WAVE_STATE \ + AMDKFD_IOWR(0x1B, struct kfd_ioctl_get_queue_wave_state_args) + +#define AMDKFD_IOC_GET_DMABUF_INFO AMDKFD_IOWR(0x1C, struct kfd_ioctl_get_dmabuf_info_args) + +#define AMDKFD_IOC_IMPORT_DMABUF AMDKFD_IOWR(0x1D, struct kfd_ioctl_import_dmabuf_args) + +#define AMDKFD_IOC_ALLOC_QUEUE_GWS AMDKFD_IOWR(0x1E, struct kfd_ioctl_alloc_queue_gws_args) + +#define AMDKFD_IOC_SMI_EVENTS AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args) + +#define AMDKFD_IOC_SVM AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args) + +#define AMDKFD_IOC_SET_XNACK_MODE AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args) + +#define AMDKFD_IOC_CRIU_OP AMDKFD_IOWR(0x22, struct kfd_ioctl_criu_args) + +#define AMDKFD_IOC_AVAILABLE_MEMORY AMDKFD_IOWR(0x23, struct kfd_ioctl_get_available_memory_args) + +#define AMDKFD_IOC_EXPORT_DMABUF AMDKFD_IOWR(0x24, struct kfd_ioctl_export_dmabuf_args) + +#define AMDKFD_IOC_RUNTIME_ENABLE AMDKFD_IOWR(0x25, struct kfd_ioctl_runtime_enable_args) + +#define AMDKFD_IOC_DBG_TRAP AMDKFD_IOWR(0x26, struct kfd_ioctl_dbg_trap_args) + +#define AMDKFD_COMMAND_START 0x01 +#define AMDKFD_COMMAND_END 0x27 + +/* non-upstream ioctls */ +#define AMDKFD_IOC_IPC_IMPORT_HANDLE AMDKFD_IOWR(0x80, struct kfd_ioctl_ipc_import_handle_args) + +#define AMDKFD_IOC_IPC_EXPORT_HANDLE AMDKFD_IOWR(0x81, struct kfd_ioctl_ipc_export_handle_args) + +#define AMDKFD_IOC_DBG_TRAP_DEPRECATED AMDKFD_IOWR(0x82, struct kfd_ioctl_dbg_trap_args_deprecated) + +#define AMDKFD_IOC_CROSS_MEMORY_COPY_DEPRECATED \ + AMDKFD_IOWR(0x83, struct kfd_ioctl_cross_memory_copy_deprecated_args) + +#define 
AMDKFD_IOC_RLC_SPM AMDKFD_IOWR(0x84, struct kfd_ioctl_spm_args) + +#define AMDKFD_IOC_PC_SAMPLE AMDKFD_IOWR(0x85, struct kfd_ioctl_pc_sample_args) + +#define AMDKFD_COMMAND_START_2 0x80 +#define AMDKFD_COMMAND_END_2 0x86 + +#endif diff --git a/source/lib/rocprofiler-sdk/hsa/queue.cpp b/source/lib/rocprofiler-sdk/hsa/queue.cpp index da5384e9dc..cb3a666cf8 100644 --- a/source/lib/rocprofiler-sdk/hsa/queue.cpp +++ b/source/lib/rocprofiler-sdk/hsa/queue.cpp @@ -29,6 +29,8 @@ #include "lib/rocprofiler-sdk/hsa/hsa.hpp" #include "lib/rocprofiler-sdk/hsa/queue_controller.hpp" #include "lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" #include "lib/rocprofiler-sdk/registration.hpp" #include "lib/rocprofiler-sdk/tracing/tracing.hpp" @@ -369,6 +371,12 @@ WriteInterceptor(const void* packets, CreateBarrierPacket(nullptr, nullptr, transformed_packets); } + if(pc_sampling::is_pc_sample_service_configured(queue.get_agent().get_rocp_agent()->id)) + { + transformed_packets.emplace_back(pc_sampling::hsa::generate_marker_packet_for_kernel( + corr_id, tracing_data_v.external_correlation_ids)); + } + transformed_packets.emplace_back(kernel_pkt); // Make a copy of the original packet, adding its signal to a barrier diff --git a/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp b/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp index e4c8a32883..f83773ac81 100644 --- a/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp +++ b/source/lib/rocprofiler-sdk/hsa/queue_controller.cpp @@ -136,8 +136,6 @@ constexpr rocprofiler_agent_t default_agent = .vendor_name = nullptr, .product_name = nullptr, .model_name = nullptr, - .num_pc_sampling_configs = 0, - .pc_sampling_configs = nullptr, .node_id = 0, .logical_node_id = 0}; } // namespace @@ -256,26 +254,24 @@ QueueController::init(CoreApiTable& core_table, AmdExtTable& ext_table) auto enable_intercepter = false; for(const auto& 
itr : context::get_registered_contexts()) { - constexpr auto expected_context_size = 192UL; + constexpr auto expected_context_size = 200UL; static_assert( sizeof(context::context) == expected_context_size + sizeof(std::shared_ptr), "If you added a new field to context struct, make sure there is a check here if it " "requires queue interception. Once you have done so, increment expected_context_size"); - if(itr->counter_collection) + bool has_kernel_tracing = + (itr->callback_tracer && + itr->callback_tracer->domains(ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH)) || + (itr->buffered_tracer && + itr->buffered_tracer->domains(ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH)); + + if(itr->counter_collection || itr->pc_sampler || has_kernel_tracing) { enable_intercepter = true; break; } - else if(itr->buffered_tracer) - { - if(itr->buffered_tracer->domains(ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH)) - { - enable_intercepter = true; - break; - } - } else if(itr->thread_trace) { enable_intercepter = true; @@ -288,15 +284,6 @@ QueueController::init(CoreApiTable& core_table, AmdExtTable& ext_table) [trace](const AgentCache& cache, const CoreApiTable&, const AmdExtTable&) { if(auto locked = trace.lock()) locked->resource_deinit(cache); }); - break; - } - else if(itr->callback_tracer) - { - if(itr->callback_tracer->domains(ROCPROFILER_CALLBACK_TRACING_KERNEL_DISPATCH)) - { - enable_intercepter = true; - break; - } } } diff --git a/source/lib/rocprofiler-sdk/hsa/rocprofiler_packet.hpp b/source/lib/rocprofiler-sdk/hsa/rocprofiler_packet.hpp index ac0267a81a..4750c3fabb 100644 --- a/source/lib/rocprofiler-sdk/hsa/rocprofiler_packet.hpp +++ b/source/lib/rocprofiler-sdk/hsa/rocprofiler_packet.hpp @@ -45,6 +45,7 @@ union rocprofiler_packet hsa_kernel_dispatch_packet_t kernel_dispatch; hsa_barrier_and_packet_t barrier_and; hsa_barrier_or_packet_t barrier_or; + amd_aql_intercept_marker_t marker; rocprofiler_packet() : ext_amd_aql_pm4{null_amd_aql_pm4_packet} @@ -66,6 +67,10 @@ union 
rocprofiler_packet : barrier_or{val} {} + rocprofiler_packet(amd_aql_intercept_marker_t val) + : marker{val} + {} + ~rocprofiler_packet() = default; rocprofiler_packet(const rocprofiler_packet&) = default; rocprofiler_packet(rocprofiler_packet&&) noexcept = default; diff --git a/source/lib/rocprofiler-sdk/hsa/types.hpp b/source/lib/rocprofiler-sdk/hsa/types.hpp index c06f4f3ca9..aa38b1406e 100644 --- a/source/lib/rocprofiler-sdk/hsa/types.hpp +++ b/source/lib/rocprofiler-sdk/hsa/types.hpp @@ -152,8 +152,10 @@ struct table_size // TODO(jomadsen): come up with a better way of handling this # if HSA_AMD_EXT_API_TABLE_STEP_VERSION == 0x00 static constexpr size_t amd_ext = 552; -# else +# elif HSA_AMD_EXT_API_TABLE_STEP_VERSION == 0x1 static constexpr size_t amd_ext = 560; +# else + static constexpr size_t amd_ext = 568; # endif }; diff --git a/source/lib/rocprofiler-sdk/page_migration/CMakeLists.txt b/source/lib/rocprofiler-sdk/page_migration/CMakeLists.txt index 874d8ae1a8..242fe0b752 100644 --- a/source/lib/rocprofiler-sdk/page_migration/CMakeLists.txt +++ b/source/lib/rocprofiler-sdk/page_migration/CMakeLists.txt @@ -5,5 +5,3 @@ set(ROCPROFILER_LIB_UVM_HEADERS defines.hpp page_migration.hpp utils.hpp) target_sources(rocprofiler-object-library PRIVATE ${ROCPROFILER_LIB_UVM_SOURCES} ${ROCPROFILER_LIB_UVM_HEADERS}) - -add_subdirectory(details) diff --git a/source/lib/rocprofiler-sdk/page_migration/details/CMakeLists.txt b/source/lib/rocprofiler-sdk/page_migration/details/CMakeLists.txt deleted file mode 100644 index 7e06f953c1..0000000000 --- a/source/lib/rocprofiler-sdk/page_migration/details/CMakeLists.txt +++ /dev/null @@ -1,7 +0,0 @@ -# -# -set(ROCPROFILER_LIB_UVM_DETAILS_SOURCES) -set(ROCPROFILER_LIB_UVM_DETAILS_HEADERS kfd_ioctl.h) - -target_sources(rocprofiler-object-library PRIVATE ${ROCPROFILER_LIB_UVM_DETAILS_SOURCES} - ${ROCPROFILER_LIB_UVM_DETAILS_HEADERS}) diff --git a/source/lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h 
b/source/lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h deleted file mode 100644 index f39ba54702..0000000000 --- a/source/lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h +++ /dev/null @@ -1,1711 +0,0 @@ -// clang-format off -/* - * Copyright 2014 Advanced Micro Devices, Inc. - * - * Permission is hereby granted, free of charge, to any person obtaining a - * copy of this software and associated documentation files (the "Software"), - * to deal in the Software without restriction, including without limitation - * the rights to use, copy, modify, merge, publish, distribute, sublicense, - * and/or sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in - * all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL - * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR - * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, - * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. 
- */ - -#ifndef KFD_IOCTL_H_INCLUDED -#define KFD_IOCTL_H_INCLUDED - -#include -#include - -/* - * - 1.1 - initial version - * - 1.3 - Add SMI events support - * - 1.4 - Indicate new SRAM EDC bit in device properties - * - 1.5 - Add SVM API - * - 1.6 - Query clear flags in SVM get_attr API - * - 1.7 - Checkpoint Restore (CRIU) API - * - 1.8 - CRIU - Support for SDMA transfers with GTT BOs - * - 1.9 - Add available memory ioctl - * - 1.10 - Add SMI profiler event log - * - 1.11 - Add unified memory for ctx save/restore area - * - 1.12 - Add DMA buf export ioctl - * - 1.13 - Add debugger API - * - 1.14 - Update kfd_event_data - */ -#define KFD_IOCTL_MAJOR_VERSION 1 -#define KFD_IOCTL_MINOR_VERSION 14 - -struct kfd_ioctl_get_version_args { - __u32 major_version; /* from KFD */ - __u32 minor_version; /* from KFD */ -}; - -/* For kfd_ioctl_create_queue_args.queue_type. */ -#define KFD_IOC_QUEUE_TYPE_COMPUTE 0x0 -#define KFD_IOC_QUEUE_TYPE_SDMA 0x1 -#define KFD_IOC_QUEUE_TYPE_COMPUTE_AQL 0x2 -#define KFD_IOC_QUEUE_TYPE_SDMA_XGMI 0x3 - -#define KFD_MAX_QUEUE_PERCENTAGE 100 -#define KFD_MAX_QUEUE_PRIORITY 15 - -struct kfd_ioctl_create_queue_args { - __u64 ring_base_address; /* to KFD */ - __u64 write_pointer_address; /* from KFD */ - __u64 read_pointer_address; /* from KFD */ - __u64 doorbell_offset; /* from KFD */ - - __u32 ring_size; /* to KFD */ - __u32 gpu_id; /* to KFD */ - __u32 queue_type; /* to KFD */ - __u32 queue_percentage; /* to KFD */ - __u32 queue_priority; /* to KFD */ - __u32 queue_id; /* from KFD */ - - __u64 eop_buffer_address; /* to KFD */ - __u64 eop_buffer_size; /* to KFD */ - __u64 ctx_save_restore_address; /* to KFD */ - __u32 ctx_save_restore_size; /* to KFD */ - __u32 ctl_stack_size; /* to KFD */ -}; - -struct kfd_ioctl_destroy_queue_args { - __u32 queue_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_update_queue_args { - __u64 ring_base_address; /* to KFD */ - - __u32 queue_id; /* to KFD */ - __u32 ring_size; /* to KFD */ - __u32 
queue_percentage; /* to KFD */ - __u32 queue_priority; /* to KFD */ -}; - -struct kfd_ioctl_set_cu_mask_args { - __u32 queue_id; /* to KFD */ - __u32 num_cu_mask; /* to KFD */ - __u64 cu_mask_ptr; /* to KFD */ -}; - -struct kfd_ioctl_get_queue_wave_state_args { - __u64 ctl_stack_address; /* to KFD */ - __u32 ctl_stack_used_size; /* from KFD */ - __u32 save_area_used_size; /* from KFD */ - __u32 queue_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_get_available_memory_args { - __u64 available; /* from KFD */ - __u32 gpu_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_dbg_device_info_entry { - __u64 exception_status; - __u64 lds_base; - __u64 lds_limit; - __u64 scratch_base; - __u64 scratch_limit; - __u64 gpuvm_base; - __u64 gpuvm_limit; - __u32 gpu_id; - __u32 location_id; - __u32 vendor_id; - __u32 device_id; - __u32 revision_id; - __u32 subsystem_vendor_id; - __u32 subsystem_device_id; - __u32 fw_version; - __u32 gfx_target_version; - __u32 simd_count; - __u32 max_waves_per_simd; - __u32 array_count; - __u32 simd_arrays_per_engine; - __u32 num_xcc; - __u32 capability; - __u32 debug_prop; -}; - -/* For kfd_ioctl_set_memory_policy_args.default_policy and alternate_policy */ -#define KFD_IOC_CACHE_POLICY_COHERENT 0 -#define KFD_IOC_CACHE_POLICY_NONCOHERENT 1 - -struct kfd_ioctl_set_memory_policy_args { - __u64 alternate_aperture_base; /* to KFD */ - __u64 alternate_aperture_size; /* to KFD */ - - __u32 gpu_id; /* to KFD */ - __u32 default_policy; /* to KFD */ - __u32 alternate_policy; /* to KFD */ - __u32 pad; -}; - -/* - * All counters are monotonic. They are used for profiling of compute jobs. - * The profiling is done by userspace. - * - * In case of GPU reset, the counter should not be affected. 
- */ - -struct kfd_ioctl_get_clock_counters_args { - __u64 gpu_clock_counter; /* from KFD */ - __u64 cpu_clock_counter; /* from KFD */ - __u64 system_clock_counter; /* from KFD */ - __u64 system_clock_freq; /* from KFD */ - - __u32 gpu_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_process_device_apertures { - __u64 lds_base; /* from KFD */ - __u64 lds_limit; /* from KFD */ - __u64 scratch_base; /* from KFD */ - __u64 scratch_limit; /* from KFD */ - __u64 gpuvm_base; /* from KFD */ - __u64 gpuvm_limit; /* from KFD */ - __u32 gpu_id; /* from KFD */ - __u32 pad; -}; - -/* - * AMDKFD_IOC_GET_PROCESS_APERTURES is deprecated. Use - * AMDKFD_IOC_GET_PROCESS_APERTURES_NEW instead, which supports an - * unlimited number of GPUs. - */ -#define NUM_OF_SUPPORTED_GPUS 7 -struct kfd_ioctl_get_process_apertures_args { - struct kfd_process_device_apertures - process_apertures[NUM_OF_SUPPORTED_GPUS];/* from KFD */ - - /* from KFD, should be in the range [1 - NUM_OF_SUPPORTED_GPUS] */ - __u32 num_of_nodes; - __u32 pad; -}; - -struct kfd_ioctl_get_process_apertures_new_args { - /* User allocated. Pointer to struct kfd_process_device_apertures - * filled in by Kernel - */ - __u64 kfd_process_device_apertures_ptr; - /* to KFD - indicates amount of memory present in - * kfd_process_device_apertures_ptr - * from KFD - Number of entries filled by KFD. 
- */ - __u32 num_of_nodes; - __u32 pad; -}; - -#define MAX_ALLOWED_NUM_POINTS 100 -#define MAX_ALLOWED_AW_BUFF_SIZE 4096 -#define MAX_ALLOWED_WAC_BUFF_SIZE 128 - -struct kfd_ioctl_dbg_register_args { - __u32 gpu_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_dbg_unregister_args { - __u32 gpu_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_dbg_address_watch_args { - __u64 content_ptr; /* a pointer to the actual content */ - __u32 gpu_id; /* to KFD */ - __u32 buf_size_in_bytes; /*including gpu_id and buf_size */ -}; - -struct kfd_ioctl_dbg_wave_control_args { - __u64 content_ptr; /* a pointer to the actual content */ - __u32 gpu_id; /* to KFD */ - __u32 buf_size_in_bytes; /*including gpu_id and buf_size */ -}; - -#define KFD_INVALID_FD 0xffffffff - -struct kfd_ioctl_dbg_trap_args_deprecated { - __u64 exception_mask; /* to KFD */ - __u64 ptr; /* to KFD */ - __u32 pid; /* to KFD */ - __u32 op; /* to KFD */ - __u32 data1; /* to KFD */ - __u32 data2; /* to KFD */ - __u32 data3; /* to KFD */ - __u32 data4; /* to KFD */ -}; - -/* Matching HSA_EVENTTYPE */ -#define KFD_IOC_EVENT_SIGNAL 0 -#define KFD_IOC_EVENT_NODECHANGE 1 -#define KFD_IOC_EVENT_DEVICESTATECHANGE 2 -#define KFD_IOC_EVENT_HW_EXCEPTION 3 -#define KFD_IOC_EVENT_SYSTEM_EVENT 4 -#define KFD_IOC_EVENT_DEBUG_EVENT 5 -#define KFD_IOC_EVENT_PROFILE_EVENT 6 -#define KFD_IOC_EVENT_QUEUE_EVENT 7 -#define KFD_IOC_EVENT_MEMORY 8 - -#define KFD_IOC_WAIT_RESULT_COMPLETE 0 -#define KFD_IOC_WAIT_RESULT_TIMEOUT 1 -#define KFD_IOC_WAIT_RESULT_FAIL 2 - -#define KFD_SIGNAL_EVENT_LIMIT 4096 - -/* For kfd_event_data.hw_exception_data.reset_type. */ -#define KFD_HW_EXCEPTION_WHOLE_GPU_RESET 0 -#define KFD_HW_EXCEPTION_PER_ENGINE_RESET 1 - -/* For kfd_event_data.hw_exception_data.reset_cause. 
*/ -#define KFD_HW_EXCEPTION_GPU_HANG 0 -#define KFD_HW_EXCEPTION_ECC 1 - -/* For kfd_hsa_memory_exception_data.ErrorType */ -#define KFD_MEM_ERR_NO_RAS 0 -#define KFD_MEM_ERR_SRAM_ECC 1 -#define KFD_MEM_ERR_POISON_CONSUMED 2 -#define KFD_MEM_ERR_GPU_HANG 3 - -struct kfd_ioctl_create_event_args { - __u64 event_page_offset; /* from KFD */ - __u32 event_trigger_data; /* from KFD - signal events only */ - __u32 event_type; /* to KFD */ - __u32 auto_reset; /* to KFD */ - __u32 node_id; /* to KFD - only valid for certain - event types */ - __u32 event_id; /* from KFD */ - __u32 event_slot_index; /* from KFD */ -}; - -struct kfd_ioctl_destroy_event_args { - __u32 event_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_set_event_args { - __u32 event_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_reset_event_args { - __u32 event_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_memory_exception_failure { - __u32 NotPresent; /* Page not present or supervisor privilege */ - __u32 ReadOnly; /* Write access to a read-only page */ - __u32 NoExecute; /* Execute access to a page marked NX */ - __u32 imprecise; /* Can't determine the exact fault address */ -}; - -/* memory exception data */ -struct kfd_hsa_memory_exception_data { - struct kfd_memory_exception_failure failure; - __u64 va; - __u32 gpu_id; - __u32 ErrorType; /* 0 = no RAS error, - * 1 = ECC_SRAM, - * 2 = Link_SYNFLOOD (poison), - * 3 = GPU hang (not attributable to a specific cause), - * other values reserved - */ -}; - -/* hw exception data */ -struct kfd_hsa_hw_exception_data { - __u32 reset_type; - __u32 reset_cause; - __u32 memory_lost; - __u32 gpu_id; -}; - -/* hsa signal event data */ -struct kfd_hsa_signal_event_data { - __u64 last_event_age; /* to and from KFD */ -}; - -/* Event data */ -struct kfd_event_data { - union { - /* From KFD */ - struct kfd_hsa_memory_exception_data memory_exception_data; - struct kfd_hsa_hw_exception_data hw_exception_data; - /* To and From KFD */ - struct 
kfd_hsa_signal_event_data signal_event_data; - }; - __u64 kfd_event_data_ext; /* pointer to an extension structure - for future exception types */ - __u32 event_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_wait_events_args { - __u64 events_ptr; /* pointed to struct - kfd_event_data array, to KFD */ - __u32 num_events; /* to KFD */ - __u32 wait_for_all; /* to KFD */ - __u32 timeout; /* to KFD */ - __u32 wait_result; /* from KFD */ -}; - -struct kfd_ioctl_set_scratch_backing_va_args { - __u64 va_addr; /* to KFD */ - __u32 gpu_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_get_tile_config_args { - /* to KFD: pointer to tile array */ - __u64 tile_config_ptr; - /* to KFD: pointer to macro tile array */ - __u64 macro_tile_config_ptr; - /* to KFD: array size allocated by user mode - * from KFD: array size filled by kernel - */ - __u32 num_tile_configs; - /* to KFD: array size allocated by user mode - * from KFD: array size filled by kernel - */ - __u32 num_macro_tile_configs; - - __u32 gpu_id; /* to KFD */ - __u32 gb_addr_config; /* from KFD */ - __u32 num_banks; /* from KFD */ - __u32 num_ranks; /* from KFD */ - /* struct size can be extended later if needed - * without breaking ABI compatibility - */ -}; - -struct kfd_ioctl_set_trap_handler_args { - __u64 tba_addr; /* to KFD */ - __u64 tma_addr; /* to KFD */ - __u32 gpu_id; /* to KFD */ - __u32 pad; -}; - -struct kfd_ioctl_acquire_vm_args { - __u32 drm_fd; /* to KFD */ - __u32 gpu_id; /* to KFD */ -}; - -/* Allocation flags: memory types */ -#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM (1 << 0) -#define KFD_IOC_ALLOC_MEM_FLAGS_GTT (1 << 1) -#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR (1 << 2) -#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL (1 << 3) -#define KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP (1 << 4) -/* Allocation flags: attributes/access options */ -#define KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE (1 << 31) -#define KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE (1 << 30) -#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC (1 << 29) -#define 
KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE (1 << 28) -#define KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM (1 << 27) -#define KFD_IOC_ALLOC_MEM_FLAGS_COHERENT (1 << 26) -#define KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED (1 << 25) -#define KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT (1 << 24) - -/* Allocate memory for later SVM (shared virtual memory) mapping. - * - * @va_addr: virtual address of the memory to be allocated - * all later mappings on all GPUs will use this address - * @size: size in bytes - * @handle: buffer handle returned to user mode, used to refer to - * this allocation for mapping, unmapping and freeing - * @mmap_offset: for CPU-mapping the allocation by mmapping a render node - * for userptrs this is overloaded to specify the CPU address - * @gpu_id: device identifier - * @flags: memory type and attributes. See KFD_IOC_ALLOC_MEM_FLAGS above - */ -struct kfd_ioctl_alloc_memory_of_gpu_args { - __u64 va_addr; /* to KFD */ - __u64 size; /* to KFD */ - __u64 handle; /* from KFD */ - __u64 mmap_offset; /* to KFD (userptr), from KFD (mmap offset) */ - __u32 gpu_id; /* to KFD */ - __u32 flags; -}; - -/* Free memory allocated with kfd_ioctl_alloc_memory_of_gpu - * - * @handle: memory handle returned by alloc - */ -struct kfd_ioctl_free_memory_of_gpu_args { - __u64 handle; /* to KFD */ -}; - -/* Map memory to one or more GPUs - * - * @handle: memory handle returned by alloc - * @device_ids_array_ptr: array of gpu_ids (__u32 per device) - * @n_devices: number of devices in the array - * @n_success: number of devices mapped successfully - * - * @n_success returns information to the caller how many devices from - * the start of the array have mapped the buffer successfully. It can - * be passed into a subsequent retry call to skip those devices. For - * the first call the caller should initialize it to 0. - * - * If the ioctl completes with return code 0 (success), n_success == - * n_devices. 
- */ -struct kfd_ioctl_map_memory_to_gpu_args { - __u64 handle; /* to KFD */ - __u64 device_ids_array_ptr; /* to KFD */ - __u32 n_devices; /* to KFD */ - __u32 n_success; /* to/from KFD */ -}; - -/* Unmap memory from one or more GPUs - * - * same arguments as for mapping - */ -struct kfd_ioctl_unmap_memory_from_gpu_args { - __u64 handle; /* to KFD */ - __u64 device_ids_array_ptr; /* to KFD */ - __u32 n_devices; /* to KFD */ - __u32 n_success; /* to/from KFD */ -}; - -/* Allocate GWS for specific queue - * - * @queue_id: queue's id that GWS is allocated for - * @num_gws: how many GWS to allocate - * @first_gws: index of the first GWS allocated. - * only support contiguous GWS allocation - */ -struct kfd_ioctl_alloc_queue_gws_args { - __u32 queue_id; /* to KFD */ - __u32 num_gws; /* to KFD */ - __u32 first_gws; /* from KFD */ - __u32 pad; -}; - -struct kfd_ioctl_get_dmabuf_info_args { - __u64 size; /* from KFD */ - __u64 metadata_ptr; /* to KFD */ - __u32 metadata_size; /* to KFD (space allocated by user) - * from KFD (actual metadata size) - */ - __u32 gpu_id; /* from KFD */ - __u32 flags; /* from KFD (KFD_IOC_ALLOC_MEM_FLAGS) */ - __u32 dmabuf_fd; /* to KFD */ -}; - -struct kfd_ioctl_import_dmabuf_args { - __u64 va_addr; /* to KFD */ - __u64 handle; /* from KFD */ - __u32 gpu_id; /* to KFD */ - __u32 dmabuf_fd; /* to KFD */ -}; - -struct kfd_ioctl_export_dmabuf_args { - __u64 handle; /* to KFD */ - __u32 flags; /* to KFD */ - __u32 dmabuf_fd; /* from KFD */ -}; - -/* - * KFD SMI(System Management Interface) events - */ -enum kfd_smi_event { - KFD_SMI_EVENT_NONE = 0, /* not used */ - KFD_SMI_EVENT_VMFAULT = 1, /* event start counting at 1 */ - KFD_SMI_EVENT_THERMAL_THROTTLE = 2, - KFD_SMI_EVENT_GPU_PRE_RESET = 3, - KFD_SMI_EVENT_GPU_POST_RESET = 4, - KFD_SMI_EVENT_MIGRATE_START = 5, - KFD_SMI_EVENT_MIGRATE_END = 6, - KFD_SMI_EVENT_PAGE_FAULT_START = 7, - KFD_SMI_EVENT_PAGE_FAULT_END = 8, - KFD_SMI_EVENT_QUEUE_EVICTION = 9, - KFD_SMI_EVENT_QUEUE_RESTORE = 10, - 
KFD_SMI_EVENT_UNMAP_FROM_GPU = 11, - - /* - * max event number, as a flag bit to get events from all processes, - * this requires super user permission, otherwise will not be able to - * receive event from any process. Without this flag to receive events - * from same process. - */ - KFD_SMI_EVENT_ALL_PROCESS = 64 -}; - -enum KFD_MIGRATE_TRIGGERS { - KFD_MIGRATE_TRIGGER_PREFETCH, - KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, - KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, - KFD_MIGRATE_TRIGGER_TTM_EVICTION -}; - -enum KFD_QUEUE_EVICTION_TRIGGERS { - KFD_QUEUE_EVICTION_TRIGGER_SVM, - KFD_QUEUE_EVICTION_TRIGGER_USERPTR, - KFD_QUEUE_EVICTION_TRIGGER_TTM, - KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, - KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, - KFD_QUEUE_EVICTION_CRIU_RESTORE -}; - -enum KFD_SVM_UNMAP_TRIGGERS { - KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, - KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE, - KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU -}; - -#define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1)) -#define KFD_SMI_EVENT_MSG_SIZE 96 - -struct kfd_ioctl_smi_events_args { - __u32 gpuid; /* to KFD */ - __u32 anon_fd; /* from KFD */ -}; - -/** - * kfd_ioctl_spm_op - SPM ioctl operations - * - * @KFD_IOCTL_SPM_OP_ACQUIRE: acquire exclusive access to SPM - * @KFD_IOCTL_SPM_OP_RELEASE: release exclusive access to SPM - * @KFD_IOCTL_SPM_OP_SET_DEST_BUF: set or unset destination buffer for SPM streaming - */ -enum kfd_ioctl_spm_op { - KFD_IOCTL_SPM_OP_ACQUIRE, - KFD_IOCTL_SPM_OP_RELEASE, - KFD_IOCTL_SPM_OP_SET_DEST_BUF -}; - -/** - * kfd_ioctl_spm_args - Arguments for SPM ioctl - * - * @op[in]: specifies the operation to perform - * @gpu_id[in]: GPU ID of the GPU to profile - * @dst_buf[in]: used for the address of the destination buffer - * in @KFD_IOCTL_SPM_SET_DEST_BUFFER - * @buf_size[in]: size of the destination buffer - * @timeout[in/out]: [in]: timeout in milliseconds, [out]: amount of time left - * `in the timeout window - * @bytes_copied[out]: amount of data that was copied to the previous dest_buf - * 
@has_data_loss: boolean indicating whether data was lost - * (e.g. due to a ring-buffer overflow) - * - * This ioctl performs different functions depending on the @op parameter. - * - * KFD_IOCTL_SPM_OP_ACQUIRE - * ------------------------ - * - * Acquires exclusive access of SPM on the specified @gpu_id for the calling process. - * This must be called before using KFD_IOCTL_SPM_OP_SET_DEST_BUF. - * - * KFD_IOCTL_SPM_OP_RELEASE - * ------------------------ - * - * Releases exclusive access of SPM on the specified @gpu_id for the calling process, - * which allows another process to acquire it in the future. - * - * KFD_IOCTL_SPM_OP_SET_DEST_BUF - * ----------------------------- - * - * If @dst_buf is NULL, the destination buffer address is unset and copying of counters - * is stopped. - * - * If @dst_buf is not NULL, it specifies the pointer to a new destination buffer. - * @buf_size specifies the size of the buffer. - * - * If @timeout is non-0, the call will wait for up to @timeout ms for the previous - * buffer to be filled. If previous buffer to be filled before timeout, the @timeout - * will be updated value with the time remaining. If the timeout is exceeded, the function - * copies any partial data available into the previous user buffer and returns success. - * The amount of valid data in the previous user buffer is indicated by @bytes_copied. - * - * If @timeout is 0, the function immediately replaces the previous destination buffer - * without waiting for the previous buffer to be filled. That means the previous buffer - * may only be partially filled, and @bytes_copied will indicate how much data has been - * copied to it. - * - * If data was lost, e.g. due to a ring buffer overflow, @has_data_loss will be non-0. - * - * Returns negative error code on failure, 0 on success. 
- */ -struct kfd_ioctl_spm_args { - __u64 dest_buf; - __u32 buf_size; - __u32 op; - __u32 timeout; - __u32 gpu_id; - __u32 bytes_copied; - __u32 has_data_loss; -}; - -/************************************************************************************************** - * CRIU IOCTLs (Checkpoint Restore In Userspace) - * - * When checkpointing a process, the userspace application will perform: - * 1. PROCESS_INFO op to determine current process information. This pauses execution and evicts - * all the queues. - * 2. CHECKPOINT op to checkpoint process contents (BOs, queues, events, svm-ranges) - * 3. UNPAUSE op to un-evict all the queues - * - * When restoring a process, the CRIU userspace application will perform: - * - * 1. RESTORE op to restore process contents - * 2. RESUME op to start the process - * - * Note: Queues are forced into an evicted state after a successful PROCESS_INFO. User - * application needs to perform an UNPAUSE operation after calling PROCESS_INFO. - */ - -enum kfd_criu_op { - KFD_CRIU_OP_PROCESS_INFO, - KFD_CRIU_OP_CHECKPOINT, - KFD_CRIU_OP_UNPAUSE, - KFD_CRIU_OP_RESTORE, - KFD_CRIU_OP_RESUME, -}; - -/** - * kfd_ioctl_criu_args - Arguments perform CRIU operation - * @devices: [in/out] User pointer to memory location for devices information. - * This is an array of type kfd_criu_device_bucket. - * @bos: [in/out] User pointer to memory location for BOs information - * This is an array of type kfd_criu_bo_bucket. - * @priv_data: [in/out] User pointer to memory location for private data - * @priv_data_size: [in/out] Size of priv_data in bytes - * @num_devices: [in/out] Number of GPUs used by process. Size of @devices array. - * @num_bos [in/out] Number of BOs used by process. Size of @bos array. - * @num_objects: [in/out] Number of objects used by process. Objects are opaque to - * user application. 
- * @pid: [in/out] PID of the process being checkpointed - * @op [in] Type of operation (kfd_criu_op) - * - * Return: 0 on success, -errno on failure - */ -struct kfd_ioctl_criu_args { - __u64 devices; /* Used during ops: CHECKPOINT, RESTORE */ - __u64 bos; /* Used during ops: CHECKPOINT, RESTORE */ - __u64 priv_data; /* Used during ops: CHECKPOINT, RESTORE */ - __u64 priv_data_size; /* Used during ops: PROCESS_INFO, RESTORE */ - __u32 num_devices; /* Used during ops: PROCESS_INFO, RESTORE */ - __u32 num_bos; /* Used during ops: PROCESS_INFO, RESTORE */ - __u32 num_objects; /* Used during ops: PROCESS_INFO, RESTORE */ - __u32 pid; /* Used during ops: PROCESS_INFO, RESUME */ - __u32 op; -}; - -struct kfd_criu_device_bucket { - __u32 user_gpu_id; - __u32 actual_gpu_id; - __u32 drm_fd; - __u32 pad; -}; - -struct kfd_criu_bo_bucket { - __u64 addr; - __u64 size; - __u64 offset; - __u64 restored_offset; /* During restore, updated offset for BO */ - __u32 gpu_id; /* This is the user_gpu_id */ - __u32 alloc_flags; - __u32 dmabuf_fd; - __u32 pad; -}; - -/* CRIU IOCTLs - END */ -/**************************************************************************************************/ -/* Register offset inside the remapped mmio page - */ -enum kfd_mmio_remap { - KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL = 0, - KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL = 4, -}; - -struct kfd_ioctl_ipc_export_handle_args { - __u64 handle; /* to KFD */ - __u32 share_handle[4]; /* from KFD */ - __u32 gpu_id; /* to KFD */ - __u32 flags; /* to KFD */ -}; - -struct kfd_ioctl_ipc_import_handle_args { - __u64 handle; /* from KFD */ - __u64 va_addr; /* to KFD */ - __u64 mmap_offset; /* from KFD */ - __u32 share_handle[4]; /* to KFD */ - __u32 gpu_id; /* to KFD */ - __u32 flags; /* from KFD */ -}; - -struct kfd_ioctl_cross_memory_copy_deprecated_args { - /* to KFD: Process ID of the remote process */ - __u32 pid; - /* to KFD: See above definition */ - __u32 flags; - /* to KFD: Source GPU VM range */ - __u64 
src_mem_range_array; - /* to KFD: Size of above array */ - __u64 src_mem_array_size; - /* to KFD: Destination GPU VM range */ - __u64 dst_mem_range_array; - /* to KFD: Size of above array */ - __u64 dst_mem_array_size; - /* from KFD: Total amount of bytes copied */ - __u64 bytes_copied; -}; - -/* Guarantee host access to memory */ -#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x00000001 -/* Fine grained coherency between all devices with access */ -#define KFD_IOCTL_SVM_FLAG_COHERENT 0x00000002 -/* Use any GPU in same hive as preferred device */ -#define KFD_IOCTL_SVM_FLAG_HIVE_LOCAL 0x00000004 -/* GPUs only read, allows replication */ -#define KFD_IOCTL_SVM_FLAG_GPU_RO 0x00000008 -/* Allow execution on GPU */ -#define KFD_IOCTL_SVM_FLAG_GPU_EXEC 0x00000010 -/* GPUs mostly read, may allow similar optimizations as RO, but writes fault */ -#define KFD_IOCTL_SVM_FLAG_GPU_READ_MOSTLY 0x00000020 -/* Keep GPU memory mapping always valid as if XNACK is disable */ -#define KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED 0x00000040 -/* Fine grained coherency between all devices using device-scope atomics */ -#define KFD_IOCTL_SVM_FLAG_EXT_COHERENT 0x00000080 - -/** - * kfd_ioctl_svm_op - SVM ioctl operations - * - * @KFD_IOCTL_SVM_OP_SET_ATTR: Modify one or more attributes - * @KFD_IOCTL_SVM_OP_GET_ATTR: Query one or more attributes - */ -enum kfd_ioctl_svm_op { - KFD_IOCTL_SVM_OP_SET_ATTR, - KFD_IOCTL_SVM_OP_GET_ATTR -}; - -/** kfd_ioctl_svm_location - Enum for preferred and prefetch locations - * - * GPU IDs are used to specify GPUs as preferred and prefetch locations. - * Below definitions are used for system memory or for leaving the preferred - * location unspecified. 
- */ -enum kfd_ioctl_svm_location { - KFD_IOCTL_SVM_LOCATION_SYSMEM = 0, - KFD_IOCTL_SVM_LOCATION_UNDEFINED = 0xffffffff -}; - -/** - * kfd_ioctl_svm_attr_type - SVM attribute types - * - * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: gpuid of the preferred location, 0 for - * system memory - * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: gpuid of the prefetch location, 0 for - * system memory. Setting this triggers an - * immediate prefetch (migration). - * @KFD_IOCTL_SVM_ATTR_ACCESS: - * @KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE: - * @KFD_IOCTL_SVM_ATTR_NO_ACCESS: specify memory access for the gpuid given - * by the attribute value - * @KFD_IOCTL_SVM_ATTR_SET_FLAGS: bitmask of flags to set (see - * KFD_IOCTL_SVM_FLAG_...) - * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear - * @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity - * (log2 num pages) - */ -enum kfd_ioctl_svm_attr_type { - KFD_IOCTL_SVM_ATTR_PREFERRED_LOC, - KFD_IOCTL_SVM_ATTR_PREFETCH_LOC, - KFD_IOCTL_SVM_ATTR_ACCESS, - KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE, - KFD_IOCTL_SVM_ATTR_NO_ACCESS, - KFD_IOCTL_SVM_ATTR_SET_FLAGS, - KFD_IOCTL_SVM_ATTR_CLR_FLAGS, - KFD_IOCTL_SVM_ATTR_GRANULARITY -}; - -/** - * kfd_ioctl_svm_attribute - Attributes as pairs of type and value - * - * The meaning of the @value depends on the attribute type. - * - * @type: attribute type (see enum @kfd_ioctl_svm_attr_type) - * @value: attribute value - */ -struct kfd_ioctl_svm_attribute { - __u32 type; - __u32 value; -}; - -/** - * kfd_ioctl_svm_args - Arguments for SVM ioctl - * - * @op specifies the operation to perform (see enum - * @kfd_ioctl_svm_op). @start_addr and @size are common for all - * operations. - * - * A variable number of attributes can be given in @attrs. - * @nattr specifies the number of attributes. New attributes can be - * added in the future without breaking the ABI. If unknown attributes - * are given, the function returns -EINVAL. - * - * @KFD_IOCTL_SVM_OP_SET_ATTR sets attributes for a virtual address - * range. 
It may overlap existing virtual address ranges. If it does, - * the existing ranges will be split such that the attribute changes - * only apply to the specified address range. - * - * @KFD_IOCTL_SVM_OP_GET_ATTR returns the intersection of attributes - * over all memory in the given range and returns the result as the - * attribute value. If different pages have different preferred or - * prefetch locations, 0xffffffff will be returned for - * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC or - * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC resepctively. For - * @KFD_IOCTL_SVM_ATTR_SET_FLAGS, flags of all pages will be - * aggregated by bitwise AND. That means, a flag will be set in the - * output, if that flag is set for all pages in the range. For - * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS, flags of all pages will be - * aggregated by bitwise NOR. That means, a flag will be set in the - * output, if that flag is clear for all pages in the range. - * The minimum migration granularity throughout the range will be - * returned for @KFD_IOCTL_SVM_ATTR_GRANULARITY. - * - * Querying of accessibility attributes works by initializing the - * attribute type to @KFD_IOCTL_SVM_ATTR_ACCESS and the value to the - * GPUID being queried. Multiple attributes can be given to allow - * querying multiple GPUIDs. The ioctl function overwrites the - * attribute type to indicate the access for the specified GPU. - */ -struct kfd_ioctl_svm_args { - __u64 start_addr; - __u64 size; - __u32 op; - __u32 nattr; - /* Variable length array of attributes */ - struct kfd_ioctl_svm_attribute attrs[]; -}; - -/** - * kfd_ioctl_set_xnack_mode_args - Arguments for set_xnack_mode - * - * @xnack_enabled: [in/out] Whether to enable XNACK mode for this process - * - * @xnack_enabled indicates whether recoverable page faults should be - * enabled for the current process. 0 means disabled, positive means - * enabled, negative means leave unchanged. 
If enabled, virtual address - * translations on GFXv9 and later AMD GPUs can return XNACK and retry - * the access until a valid PTE is available. This is used to implement - * device page faults. - * - * On output, @xnack_enabled returns the (new) current mode (0 or - * positive). Therefore, a negative input value can be used to query - * the current mode without changing it. - * - * The XNACK mode fundamentally changes the way SVM managed memory works - * in the driver, with subtle effects on application performance and - * functionality. - * - * Enabling XNACK mode requires shader programs to be compiled - * differently. Furthermore, not all GPUs support changing the mode - * per-process. Therefore changing the mode is only allowed while no - * user mode queues exist in the process. This ensure that no shader - * code is running that may be compiled for the wrong mode. And GPUs - * that cannot change to the requested mode will prevent the XNACK - * mode from occurring. All GPUs used by the process must be in the - * same XNACK mode. - * - * GFXv8 or older GPUs do not support 48 bit virtual addresses or SVM. - * Therefore those GPUs are not considered for the XNACK mode switch. 
- * - * Return: 0 on success, -errno on failure - */ -struct kfd_ioctl_set_xnack_mode_args { - __s32 xnack_enabled; -}; - -/* Wave launch override modes */ -enum kfd_dbg_trap_override_mode { - KFD_DBG_TRAP_OVERRIDE_OR = 0, - KFD_DBG_TRAP_OVERRIDE_REPLACE = 1 -}; - -/* Wave launch overrides */ -enum kfd_dbg_trap_mask { - KFD_DBG_TRAP_MASK_FP_INVALID = 1, - KFD_DBG_TRAP_MASK_FP_INPUT_DENORMAL = 2, - KFD_DBG_TRAP_MASK_FP_DIVIDE_BY_ZERO = 4, - KFD_DBG_TRAP_MASK_FP_OVERFLOW = 8, - KFD_DBG_TRAP_MASK_FP_UNDERFLOW = 16, - KFD_DBG_TRAP_MASK_FP_INEXACT = 32, - KFD_DBG_TRAP_MASK_INT_DIVIDE_BY_ZERO = 64, - KFD_DBG_TRAP_MASK_DBG_ADDRESS_WATCH = 128, - KFD_DBG_TRAP_MASK_DBG_MEMORY_VIOLATION = 256, - KFD_DBG_TRAP_MASK_TRAP_ON_WAVE_START = (1 << 30), - KFD_DBG_TRAP_MASK_TRAP_ON_WAVE_END = (1 << 31) -}; - -/* Wave launch modes */ -enum kfd_dbg_trap_wave_launch_mode { - KFD_DBG_TRAP_WAVE_LAUNCH_MODE_NORMAL = 0, - KFD_DBG_TRAP_WAVE_LAUNCH_MODE_HALT = 1, - KFD_DBG_TRAP_WAVE_LAUNCH_MODE_DEBUG = 3 -}; - -/* Address watch modes */ -enum kfd_dbg_trap_address_watch_mode { - KFD_DBG_TRAP_ADDRESS_WATCH_MODE_READ = 0, - KFD_DBG_TRAP_ADDRESS_WATCH_MODE_NONREAD = 1, - KFD_DBG_TRAP_ADDRESS_WATCH_MODE_ATOMIC = 2, - KFD_DBG_TRAP_ADDRESS_WATCH_MODE_ALL = 3 -}; - -/* Additional wave settings */ -enum kfd_dbg_trap_flags { - KFD_DBG_TRAP_FLAG_SINGLE_MEM_OP = 1, -}; - -/* Trap exceptions */ -enum kfd_dbg_trap_exception_code { - EC_NONE = 0, - /* per queue */ - EC_QUEUE_WAVE_ABORT = 1, - EC_QUEUE_WAVE_TRAP = 2, - EC_QUEUE_WAVE_MATH_ERROR = 3, - EC_QUEUE_WAVE_ILLEGAL_INSTRUCTION = 4, - EC_QUEUE_WAVE_MEMORY_VIOLATION = 5, - EC_QUEUE_WAVE_APERTURE_VIOLATION = 6, - EC_QUEUE_PACKET_DISPATCH_DIM_INVALID = 16, - EC_QUEUE_PACKET_DISPATCH_GROUP_SEGMENT_SIZE_INVALID = 17, - EC_QUEUE_PACKET_DISPATCH_CODE_INVALID = 18, - EC_QUEUE_PACKET_RESERVED = 19, - EC_QUEUE_PACKET_UNSUPPORTED = 20, - EC_QUEUE_PACKET_DISPATCH_WORK_GROUP_SIZE_INVALID = 21, - EC_QUEUE_PACKET_DISPATCH_REGISTER_INVALID = 22, - 
EC_QUEUE_PACKET_VENDOR_UNSUPPORTED = 23, - EC_QUEUE_PREEMPTION_ERROR = 30, - EC_QUEUE_NEW = 31, - /* per device */ - EC_DEVICE_QUEUE_DELETE = 32, - EC_DEVICE_MEMORY_VIOLATION = 33, - EC_DEVICE_RAS_ERROR = 34, - EC_DEVICE_FATAL_HALT = 35, - EC_DEVICE_NEW = 36, - /* per process */ - EC_PROCESS_RUNTIME = 48, - EC_PROCESS_DEVICE_REMOVE = 49, - EC_MAX -}; - -/* Mask generated by ecode in kfd_dbg_trap_exception_code */ -#define KFD_EC_MASK(ecode) (1ULL << (ecode - 1)) - -/* Masks for exception code type checks below */ -#define KFD_EC_MASK_QUEUE (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | \ - KFD_EC_MASK(EC_QUEUE_WAVE_TRAP) | \ - KFD_EC_MASK(EC_QUEUE_WAVE_MATH_ERROR) | \ - KFD_EC_MASK(EC_QUEUE_WAVE_ILLEGAL_INSTRUCTION) | \ - KFD_EC_MASK(EC_QUEUE_WAVE_MEMORY_VIOLATION) | \ - KFD_EC_MASK(EC_QUEUE_WAVE_APERTURE_VIOLATION) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_DIM_INVALID) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_GROUP_SEGMENT_SIZE_INVALID) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_CODE_INVALID) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_RESERVED) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_UNSUPPORTED) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_WORK_GROUP_SIZE_INVALID) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_DISPATCH_REGISTER_INVALID) | \ - KFD_EC_MASK(EC_QUEUE_PACKET_VENDOR_UNSUPPORTED) | \ - KFD_EC_MASK(EC_QUEUE_PREEMPTION_ERROR) | \ - KFD_EC_MASK(EC_QUEUE_NEW)) -#define KFD_EC_MASK_DEVICE (KFD_EC_MASK(EC_DEVICE_QUEUE_DELETE) | \ - KFD_EC_MASK(EC_DEVICE_RAS_ERROR) | \ - KFD_EC_MASK(EC_DEVICE_FATAL_HALT) | \ - KFD_EC_MASK(EC_DEVICE_MEMORY_VIOLATION) | \ - KFD_EC_MASK(EC_DEVICE_NEW)) -#define KFD_EC_MASK_PROCESS (KFD_EC_MASK(EC_PROCESS_RUNTIME) | \ - KFD_EC_MASK(EC_PROCESS_DEVICE_REMOVE)) - -/* Checks for exception code types for KFD search */ -#define KFD_DBG_EC_TYPE_IS_QUEUE(ecode) \ - (!!(KFD_EC_MASK(ecode) & KFD_EC_MASK_QUEUE)) -#define KFD_DBG_EC_TYPE_IS_DEVICE(ecode) \ - (!!(KFD_EC_MASK(ecode) & KFD_EC_MASK_DEVICE)) -#define KFD_DBG_EC_TYPE_IS_PROCESS(ecode) \ - (!!(KFD_EC_MASK(ecode) & 
KFD_EC_MASK_PROCESS)) - - -/* Runtime enable states */ -enum kfd_dbg_runtime_state { - DEBUG_RUNTIME_STATE_DISABLED = 0, - DEBUG_RUNTIME_STATE_ENABLED = 1, - DEBUG_RUNTIME_STATE_ENABLED_BUSY = 2, - DEBUG_RUNTIME_STATE_ENABLED_ERROR = 3 -}; - -/* Runtime enable status */ -struct kfd_runtime_info { - __u64 r_debug; - __u32 runtime_state; - __u32 ttmp_setup; -}; - -/* Enable modes for runtime enable */ -#define KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK 1 -#define KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK 2 - -/** - * kfd_ioctl_runtime_enable_args - Arguments for runtime enable - * - * Coordinates debug exception signalling and debug device enablement with runtime. - * - * @r_debug - pointer to user struct for sharing information between ROCr and the debuggger - * @mode_mask - mask to set mode - * KFD_RUNTIME_ENABLE_MODE_ENABLE_MASK - enable runtime for debugging, otherwise disable - * KFD_RUNTIME_ENABLE_MODE_TTMP_SAVE_MASK - enable trap temporary setup (ignore on disable) - * @capabilities_mask - mask to notify runtime on what KFD supports - * - * Return - 0 on SUCCESS. - * - EBUSY if runtime enable call already pending. - * - EEXIST if user queues already active prior to call. - * If process is debug enabled, runtime enable will enable debug devices and - * wait for debugger process to send runtime exception EC_PROCESS_RUNTIME - * to unblock - see kfd_ioctl_dbg_trap_args. 
- * - */ -struct kfd_ioctl_runtime_enable_args { - __u64 r_debug; - __u32 mode_mask; - __u32 capabilities_mask; -}; - -/* Queue information */ -struct kfd_queue_snapshot_entry { - __u64 exception_status; - __u64 ring_base_address; - __u64 write_pointer_address; - __u64 read_pointer_address; - __u64 ctx_save_restore_address; - __u32 queue_id; - __u32 gpu_id; - __u32 ring_size; - __u32 queue_type; - __u32 ctx_save_restore_area_size; - __u32 reserved; -}; - -/* Queue status return for suspend/resume */ -#define KFD_DBG_QUEUE_ERROR_BIT 30 -#define KFD_DBG_QUEUE_INVALID_BIT 31 -#define KFD_DBG_QUEUE_ERROR_MASK (1 << KFD_DBG_QUEUE_ERROR_BIT) -#define KFD_DBG_QUEUE_INVALID_MASK (1 << KFD_DBG_QUEUE_INVALID_BIT) - -/* Context save area header information */ -struct kfd_context_save_area_header { - struct { - __u32 control_stack_offset; - __u32 control_stack_size; - __u32 wave_state_offset; - __u32 wave_state_size; - } wave_state; - __u32 debug_offset; - __u32 debug_size; - __u64 err_payload_addr; - __u32 err_event_id; - __u32 reserved1; -}; - -/* - * Debug operations - * - * For specifics on usage and return values, see documentation per operation - * below. Otherwise, generic error returns apply: - * - ESRCH if the process to debug does not exist. - * - * - EINVAL (with KFD_IOC_DBG_TRAP_ENABLE exempt) if operation - * KFD_IOC_DBG_TRAP_ENABLE has not succeeded prior. - * Also returns this error if GPU hardware scheduling is not supported. - * - * - EPERM (with KFD_IOC_DBG_TRAP_DISABLE exempt) if target process is not - * PTRACE_ATTACHED. KFD_IOC_DBG_TRAP_DISABLE is exempt to allow - * clean up of debug mode as long as process is debug enabled. - * - * - EACCES if any DBG_HW_OP (debug hardware operation) is requested when - * AMDKFD_IOC_RUNTIME_ENABLE has not succeeded prior. - * - * - ENODEV if any GPU does not support debugging on a DBG_HW_OP call. - * - * - Other errors may be returned when a DBG_HW_OP occurs while the GPU - * is in a fatal state. 
- * - */ -enum kfd_dbg_trap_operations { - KFD_IOC_DBG_TRAP_ENABLE = 0, - KFD_IOC_DBG_TRAP_DISABLE = 1, - KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT = 2, - KFD_IOC_DBG_TRAP_SET_EXCEPTIONS_ENABLED = 3, - KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE = 4, /* DBG_HW_OP */ - KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE = 5, /* DBG_HW_OP */ - KFD_IOC_DBG_TRAP_SUSPEND_QUEUES = 6, /* DBG_HW_OP */ - KFD_IOC_DBG_TRAP_RESUME_QUEUES = 7, /* DBG_HW_OP */ - KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH = 8, /* DBG_HW_OP */ - KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH = 9, /* DBG_HW_OP */ - KFD_IOC_DBG_TRAP_SET_FLAGS = 10, - KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT = 11, - KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO = 12, - KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT = 13, - KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT = 14 -}; - -/** - * kfd_ioctl_dbg_trap_enable_args - * - * Arguments for KFD_IOC_DBG_TRAP_ENABLE. - * - * Enables debug session for target process. Call @op KFD_IOC_DBG_TRAP_DISABLE in - * kfd_ioctl_dbg_trap_args to disable debug session. - * - * @exception_mask (IN) - exceptions to raise to the debugger - * @rinfo_ptr (IN) - pointer to runtime info buffer (see kfd_runtime_info) - * @rinfo_size (IN/OUT) - size of runtime info buffer in bytes - * @dbg_fd (IN) - fd the KFD will nofify the debugger with of raised - * exceptions set in exception_mask. - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * Copies KFD saved kfd_runtime_info to @rinfo_ptr on enable. - * Size of kfd_runtime saved by the KFD returned to @rinfo_size. - * - EBADF if KFD cannot get a reference to dbg_fd. - * - EFAULT if KFD cannot copy runtime info to rinfo_ptr. - * - EINVAL if target process is already debug enabled. - * - */ -struct kfd_ioctl_dbg_trap_enable_args { - __u64 exception_mask; - __u64 rinfo_ptr; - __u32 rinfo_size; - __u32 dbg_fd; -}; - -/** - * kfd_ioctl_dbg_trap_send_runtime_event_args - * - * - * Arguments for KFD_IOC_DBG_TRAP_SEND_RUNTIME_EVENT. - * Raises exceptions to runtime. 
- * - * @exception_mask (IN) - exceptions to raise to runtime - * @gpu_id (IN) - target device id - * @queue_id (IN) - target queue id - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * - ENODEV if gpu_id not found. - * If exception_mask contains EC_PROCESS_RUNTIME, unblocks pending - * AMDKFD_IOC_RUNTIME_ENABLE call - see kfd_ioctl_runtime_enable_args. - * All other exceptions are raised to runtime through err_payload_addr. - * See kfd_context_save_area_header. - */ -struct kfd_ioctl_dbg_trap_send_runtime_event_args { - __u64 exception_mask; - __u32 gpu_id; - __u32 queue_id; -}; - -/** - * kfd_ioctl_dbg_trap_set_exceptions_enabled_args - * - * Arguments for KFD_IOC_SET_EXCEPTIONS_ENABLED - * Set new exceptions to be raised to the debugger. - * - * @exception_mask (IN) - new exceptions to raise the debugger - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - */ -struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args { - __u64 exception_mask; -}; - -/** - * kfd_ioctl_dbg_trap_set_wave_launch_override_args - * - * Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_OVERRIDE - * Enable HW exceptions to raise trap. - * - * @override_mode (IN) - see kfd_dbg_trap_override_mode - * @enable_mask (IN/OUT) - reference kfd_dbg_trap_mask. - * IN is the override modes requested to be enabled. - * OUT is referenced in Return below. - * @support_request_mask (IN/OUT) - reference kfd_dbg_trap_mask. - * IN is the override modes requested for support check. - * OUT is referenced in Return below. - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * Previous enablement is returned in @enable_mask. - * Actual override support is returned in @support_request_mask. - * - EINVAL if override mode is not supported. - * - EACCES if trap support requested is not actually supported. - * i.e. enable_mask (IN) is not a subset of support_request_mask (OUT). 
- * Otherwise it is considered a generic error (see kfd_dbg_trap_operations). - */ -struct kfd_ioctl_dbg_trap_set_wave_launch_override_args { - __u32 override_mode; - __u32 enable_mask; - __u32 support_request_mask; - __u32 pad; -}; - -/** - * kfd_ioctl_dbg_trap_set_wave_launch_mode_args - * - * Arguments for KFD_IOC_DBG_TRAP_SET_WAVE_LAUNCH_MODE - * Set wave launch mode. - * - * @mode (IN) - see kfd_dbg_trap_wave_launch_mode - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - */ -struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args { - __u32 launch_mode; - __u32 pad; -}; - -/** - * kfd_ioctl_dbg_trap_suspend_queues_ags - * - * Arguments for KFD_IOC_DBG_TRAP_SUSPEND_QUEUES - * Suspend queues. - * - * @exception_mask (IN) - raised exceptions to clear - * @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id) - * to suspend - * @num_queues (IN) - number of queues to suspend in @queue_array_ptr - * @grace_period (IN) - wave time allowance before preemption - * per 1K GPU clock cycle unit - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Destruction of a suspended queue is blocked until the queue is - * resumed. This allows the debugger to access queue information and - * the its context save area without running into a race condition on - * queue destruction. - * Automatically copies per queue context save area header information - * into the save area base - * (see kfd_queue_snapshot_entry and kfd_context_save_area_header). - * - * Return - Number of queues suspended on SUCCESS. - * . KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK masked - * for each queue id in @queue_array_ptr array reports unsuccessful - * suspend reason. - * KFD_DBG_QUEUE_ERROR_MASK = HW failure. - * KFD_DBG_QUEUE_INVALID_MASK = queue does not exist, is new or - * is being destroyed. 
- */ -struct kfd_ioctl_dbg_trap_suspend_queues_args { - __u64 exception_mask; - __u64 queue_array_ptr; - __u32 num_queues; - __u32 grace_period; -}; - -/** - * kfd_ioctl_dbg_trap_resume_queues_args - * - * Arguments for KFD_IOC_DBG_TRAP_RESUME_QUEUES - * Resume queues. - * - * @queue_array_ptr (IN) - pointer to array of queue ids (u32 per queue id) - * to resume - * @num_queues (IN) - number of queues to resume in @queue_array_ptr - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - Number of queues resumed on SUCCESS. - * KFD_DBG_QUEUE_ERROR_MASK and KFD_DBG_QUEUE_INVALID_MASK mask - * for each queue id in @queue_array_ptr array reports unsuccessful - * resume reason. - * KFD_DBG_QUEUE_ERROR_MASK = HW failure. - * KFD_DBG_QUEUE_INVALID_MASK = queue does not exist. - */ -struct kfd_ioctl_dbg_trap_resume_queues_args { - __u64 queue_array_ptr; - __u32 num_queues; - __u32 pad; -}; - -/** - * kfd_ioctl_dbg_trap_set_node_address_watch_args - * - * Arguments for KFD_IOC_DBG_TRAP_SET_NODE_ADDRESS_WATCH - * Sets address watch for device. - * - * @address (IN) - watch address to set - * @mode (IN) - see kfd_dbg_trap_address_watch_mode - * @mask (IN) - watch address mask - * @gpu_id (IN) - target gpu to set watch point - * @id (OUT) - watch id allocated - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * Allocated watch ID returned to @id. - * - ENODEV if gpu_id not found. - * - ENOMEM if watch IDs can be allocated - */ -struct kfd_ioctl_dbg_trap_set_node_address_watch_args { - __u64 address; - __u32 mode; - __u32 mask; - __u32 gpu_id; - __u32 id; -}; - -/** - * kfd_ioctl_dbg_trap_clear_node_address_watch_args - * - * Arguments for KFD_IOC_DBG_TRAP_CLEAR_NODE_ADDRESS_WATCH - * Clear address watch for device. - * - * @gpu_id (IN) - target device to clear watch point - * @id (IN) - allocated watch id to clear - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. 
- * - ENODEV if gpu_id not found. - * - EINVAL if watch ID has not been allocated. - */ -struct kfd_ioctl_dbg_trap_clear_node_address_watch_args { - __u32 gpu_id; - __u32 id; -}; - -/** - * kfd_ioctl_dbg_trap_set_flags_args - * - * Arguments for KFD_IOC_DBG_TRAP_SET_FLAGS - * Sets flags for wave behaviour. - * - * @flags (IN/OUT) - IN = flags to enable, OUT = flags previously enabled - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * - EACCESS if any debug device does not allow flag options. - */ -struct kfd_ioctl_dbg_trap_set_flags_args { - __u32 flags; - __u32 pad; -}; - -/** - * kfd_ioctl_dbg_trap_query_debug_event_args - * - * Arguments for KFD_IOC_DBG_TRAP_QUERY_DEBUG_EVENT - * - * Find one or more raised exceptions. This function can return multiple - * exceptions from a single queue or a single device with one call. To find - * all raised exceptions, this function must be called repeatedly until it - * returns -EAGAIN. Returned exceptions can optionally be cleared by - * setting the corresponding bit in the @exception_mask input parameter. - * However, clearing an exception prevents retrieving further information - * about it with KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO. - * - * @exception_mask (IN/OUT) - exception to clear (IN) and raised (OUT) - * @gpu_id (OUT) - gpu id of exceptions raised - * @queue_id (OUT) - queue id of exceptions raised - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on raised exception found - * Raised exceptions found are returned in @exception mask - * with reported source id returned in @gpu_id or @queue_id. - * - EAGAIN if no raised exception has been found - */ -struct kfd_ioctl_dbg_trap_query_debug_event_args { - __u64 exception_mask; - __u32 gpu_id; - __u32 queue_id; -}; - -/** - * kfd_ioctl_dbg_trap_query_exception_info_args - * - * Arguments KFD_IOC_DBG_TRAP_QUERY_EXCEPTION_INFO - * Get additional info on raised exception. 
- * - * @info_ptr (IN) - pointer to exception info buffer to copy to - * @info_size (IN/OUT) - exception info buffer size (bytes) - * @source_id (IN) - target gpu or queue id - * @exception_code (IN) - target exception - * @clear_exception (IN) - clear raised @exception_code exception - * (0 = false, 1 = true) - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * If @exception_code is EC_DEVICE_MEMORY_VIOLATION, copy @info_size(OUT) - * bytes of memory exception data to @info_ptr. - * If @exception_code is EC_PROCESS_RUNTIME, copy saved - * kfd_runtime_info to @info_ptr. - * Actual required @info_ptr size (bytes) is returned in @info_size. - */ -struct kfd_ioctl_dbg_trap_query_exception_info_args { - __u64 info_ptr; - __u32 info_size; - __u32 source_id; - __u32 exception_code; - __u32 clear_exception; -}; - -/** - * kfd_ioctl_dbg_trap_get_queue_snapshot_args - * - * Arguments KFD_IOC_DBG_TRAP_GET_QUEUE_SNAPSHOT - * Get queue information. - * - * @exception_mask (IN) - exceptions raised to clear - * @snapshot_buf_ptr (IN) - queue snapshot entry buffer (see kfd_queue_snapshot_entry) - * @num_queues (IN/OUT) - number of queue snapshot entries - * The debugger specifies the size of the array allocated in @num_queues. - * KFD returns the number of queues that actually existed. If this is - * larger than the size specified by the debugger, KFD will not overflow - * the array allocated by the debugger. - * - * @entry_size (IN/OUT) - size per entry in bytes - * The debugger specifies sizeof(struct kfd_queue_snapshot_entry) in - * @entry_size. KFD returns the number of bytes actually populated per - * entry. The debugger should use the KFD_IOCTL_MINOR_VERSION to determine, - * which fields in struct kfd_queue_snapshot_entry are valid. This allows - * growing the ABI in a backwards compatible manner. 
- * Note that entry_size(IN) should still be used to stride the snapshot buffer in the - * event that it's larger than actual kfd_queue_snapshot_entry. - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * Copies @num_queues(IN) queue snapshot entries of size @entry_size(IN) - * into @snapshot_buf_ptr if @num_queues(IN) > 0. - * Otherwise return @num_queues(OUT) queue snapshot entries that exist. - */ -struct kfd_ioctl_dbg_trap_queue_snapshot_args { - __u64 exception_mask; - __u64 snapshot_buf_ptr; - __u32 num_queues; - __u32 entry_size; -}; - -/** - * kfd_ioctl_dbg_trap_get_device_snapshot_args - * - * Arguments for KFD_IOC_DBG_TRAP_GET_DEVICE_SNAPSHOT - * Get device information. - * - * @exception_mask (IN) - exceptions raised to clear - * @snapshot_buf_ptr (IN) - pointer to snapshot buffer (see kfd_dbg_device_info_entry) - * @num_devices (IN/OUT) - number of debug devices to snapshot - * The debugger specifies the size of the array allocated in @num_devices. - * KFD returns the number of devices that actually existed. If this is - * larger than the size specified by the debugger, KFD will not overflow - * the array allocated by the debugger. - * - * @entry_size (IN/OUT) - size per entry in bytes - * The debugger specifies sizeof(struct kfd_dbg_device_info_entry) in - * @entry_size. KFD returns the number of bytes actually populated. The - * debugger should use KFD_IOCTL_MINOR_VERSION to determine, which fields - * in struct kfd_dbg_device_info_entry are valid. This allows growing the - * ABI in a backwards compatible manner. - * Note that entry_size(IN) should still be used to stride the snapshot buffer in the - * event that it's larger than actual kfd_dbg_device_info_entry. - * - * Generic errors apply (see kfd_dbg_trap_operations). - * Return - 0 on SUCCESS. - * Copies @num_devices(IN) device snapshot entries of size @entry_size(IN) - * into @snapshot_buf_ptr if @num_devices(IN) > 0. 
- * Otherwise return @num_devices(OUT) queue snapshot entries that exist. - */ -struct kfd_ioctl_dbg_trap_device_snapshot_args { - __u64 exception_mask; - __u64 snapshot_buf_ptr; - __u32 num_devices; - __u32 entry_size; -}; - -/** - * kfd_ioctl_dbg_trap_args - * - * Arguments to debug target process. - * - * @pid - target process to debug - * @op - debug operation (see kfd_dbg_trap_operations) - * - * @op determines which union struct args to use. - * Refer to kern docs for each kfd_ioctl_dbg_trap_*_args struct. - */ -struct kfd_ioctl_dbg_trap_args { - __u32 pid; - __u32 op; - - union { - struct kfd_ioctl_dbg_trap_enable_args enable; - struct kfd_ioctl_dbg_trap_send_runtime_event_args send_runtime_event; - struct kfd_ioctl_dbg_trap_set_exceptions_enabled_args set_exceptions_enabled; - struct kfd_ioctl_dbg_trap_set_wave_launch_override_args launch_override; - struct kfd_ioctl_dbg_trap_set_wave_launch_mode_args launch_mode; - struct kfd_ioctl_dbg_trap_suspend_queues_args suspend_queues; - struct kfd_ioctl_dbg_trap_resume_queues_args resume_queues; - struct kfd_ioctl_dbg_trap_set_node_address_watch_args set_node_address_watch; - struct kfd_ioctl_dbg_trap_clear_node_address_watch_args clear_node_address_watch; - struct kfd_ioctl_dbg_trap_set_flags_args set_flags; - struct kfd_ioctl_dbg_trap_query_debug_event_args query_debug_event; - struct kfd_ioctl_dbg_trap_query_exception_info_args query_exception_info; - struct kfd_ioctl_dbg_trap_queue_snapshot_args queue_snapshot; - struct kfd_ioctl_dbg_trap_device_snapshot_args device_snapshot; - }; -}; - -#define AMDKFD_IOCTL_BASE 'K' -#define AMDKFD_IO(nr) _IO(AMDKFD_IOCTL_BASE, nr) -#define AMDKFD_IOR(nr, type) _IOR(AMDKFD_IOCTL_BASE, nr, type) -#define AMDKFD_IOW(nr, type) _IOW(AMDKFD_IOCTL_BASE, nr, type) -#define AMDKFD_IOWR(nr, type) _IOWR(AMDKFD_IOCTL_BASE, nr, type) - -#define AMDKFD_IOC_GET_VERSION \ - AMDKFD_IOR(0x01, struct kfd_ioctl_get_version_args) - -#define AMDKFD_IOC_CREATE_QUEUE \ - AMDKFD_IOWR(0x02, struct 
kfd_ioctl_create_queue_args) - -#define AMDKFD_IOC_DESTROY_QUEUE \ - AMDKFD_IOWR(0x03, struct kfd_ioctl_destroy_queue_args) - -#define AMDKFD_IOC_SET_MEMORY_POLICY \ - AMDKFD_IOW(0x04, struct kfd_ioctl_set_memory_policy_args) - -#define AMDKFD_IOC_GET_CLOCK_COUNTERS \ - AMDKFD_IOWR(0x05, struct kfd_ioctl_get_clock_counters_args) - -#define AMDKFD_IOC_GET_PROCESS_APERTURES \ - AMDKFD_IOR(0x06, struct kfd_ioctl_get_process_apertures_args) - -#define AMDKFD_IOC_UPDATE_QUEUE \ - AMDKFD_IOW(0x07, struct kfd_ioctl_update_queue_args) - -#define AMDKFD_IOC_CREATE_EVENT \ - AMDKFD_IOWR(0x08, struct kfd_ioctl_create_event_args) - -#define AMDKFD_IOC_DESTROY_EVENT \ - AMDKFD_IOW(0x09, struct kfd_ioctl_destroy_event_args) - -#define AMDKFD_IOC_SET_EVENT \ - AMDKFD_IOW(0x0A, struct kfd_ioctl_set_event_args) - -#define AMDKFD_IOC_RESET_EVENT \ - AMDKFD_IOW(0x0B, struct kfd_ioctl_reset_event_args) - -#define AMDKFD_IOC_WAIT_EVENTS \ - AMDKFD_IOWR(0x0C, struct kfd_ioctl_wait_events_args) - -#define AMDKFD_IOC_DBG_REGISTER_DEPRECATED \ - AMDKFD_IOW(0x0D, struct kfd_ioctl_dbg_register_args) - -#define AMDKFD_IOC_DBG_UNREGISTER_DEPRECATED \ - AMDKFD_IOW(0x0E, struct kfd_ioctl_dbg_unregister_args) - -#define AMDKFD_IOC_DBG_ADDRESS_WATCH_DEPRECATED \ - AMDKFD_IOW(0x0F, struct kfd_ioctl_dbg_address_watch_args) - -#define AMDKFD_IOC_DBG_WAVE_CONTROL_DEPRECATED \ - AMDKFD_IOW(0x10, struct kfd_ioctl_dbg_wave_control_args) - -#define AMDKFD_IOC_SET_SCRATCH_BACKING_VA \ - AMDKFD_IOWR(0x11, struct kfd_ioctl_set_scratch_backing_va_args) - -#define AMDKFD_IOC_GET_TILE_CONFIG \ - AMDKFD_IOWR(0x12, struct kfd_ioctl_get_tile_config_args) - -#define AMDKFD_IOC_SET_TRAP_HANDLER \ - AMDKFD_IOW(0x13, struct kfd_ioctl_set_trap_handler_args) - -#define AMDKFD_IOC_GET_PROCESS_APERTURES_NEW \ - AMDKFD_IOWR(0x14, \ - struct kfd_ioctl_get_process_apertures_new_args) - -#define AMDKFD_IOC_ACQUIRE_VM \ - AMDKFD_IOW(0x15, struct kfd_ioctl_acquire_vm_args) - -#define AMDKFD_IOC_ALLOC_MEMORY_OF_GPU \ - 
AMDKFD_IOWR(0x16, struct kfd_ioctl_alloc_memory_of_gpu_args) - -#define AMDKFD_IOC_FREE_MEMORY_OF_GPU \ - AMDKFD_IOW(0x17, struct kfd_ioctl_free_memory_of_gpu_args) - -#define AMDKFD_IOC_MAP_MEMORY_TO_GPU \ - AMDKFD_IOWR(0x18, struct kfd_ioctl_map_memory_to_gpu_args) - -#define AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU \ - AMDKFD_IOWR(0x19, struct kfd_ioctl_unmap_memory_from_gpu_args) - -#define AMDKFD_IOC_SET_CU_MASK \ - AMDKFD_IOW(0x1A, struct kfd_ioctl_set_cu_mask_args) - -#define AMDKFD_IOC_GET_QUEUE_WAVE_STATE \ - AMDKFD_IOWR(0x1B, struct kfd_ioctl_get_queue_wave_state_args) - -#define AMDKFD_IOC_GET_DMABUF_INFO \ - AMDKFD_IOWR(0x1C, struct kfd_ioctl_get_dmabuf_info_args) - -#define AMDKFD_IOC_IMPORT_DMABUF \ - AMDKFD_IOWR(0x1D, struct kfd_ioctl_import_dmabuf_args) - -#define AMDKFD_IOC_ALLOC_QUEUE_GWS \ - AMDKFD_IOWR(0x1E, struct kfd_ioctl_alloc_queue_gws_args) - -#define AMDKFD_IOC_SMI_EVENTS \ - AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args) - -#define AMDKFD_IOC_SVM AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args) - -#define AMDKFD_IOC_SET_XNACK_MODE \ - AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args) - -#define AMDKFD_IOC_CRIU_OP \ - AMDKFD_IOWR(0x22, struct kfd_ioctl_criu_args) - -#define AMDKFD_IOC_AVAILABLE_MEMORY \ - AMDKFD_IOWR(0x23, struct kfd_ioctl_get_available_memory_args) - -#define AMDKFD_IOC_EXPORT_DMABUF \ - AMDKFD_IOWR(0x24, struct kfd_ioctl_export_dmabuf_args) - -#define AMDKFD_IOC_RUNTIME_ENABLE \ - AMDKFD_IOWR(0x25, struct kfd_ioctl_runtime_enable_args) - -#define AMDKFD_IOC_DBG_TRAP \ - AMDKFD_IOWR(0x26, struct kfd_ioctl_dbg_trap_args) - -#define AMDKFD_COMMAND_START 0x01 -#define AMDKFD_COMMAND_END 0x27 - -/* non-upstream ioctls */ -#define AMDKFD_IOC_IPC_IMPORT_HANDLE \ - AMDKFD_IOWR(0x80, struct kfd_ioctl_ipc_import_handle_args) - -#define AMDKFD_IOC_IPC_EXPORT_HANDLE \ - AMDKFD_IOWR(0x81, struct kfd_ioctl_ipc_export_handle_args) - -#define AMDKFD_IOC_DBG_TRAP_DEPRECATED \ - AMDKFD_IOWR(0x82, struct kfd_ioctl_dbg_trap_args_deprecated) 
- -#define AMDKFD_IOC_CROSS_MEMORY_COPY_DEPRECATED \ - AMDKFD_IOWR(0x83, struct kfd_ioctl_cross_memory_copy_deprecated_args) - -#define AMDKFD_IOC_RLC_SPM \ - AMDKFD_IOWR(0x84, struct kfd_ioctl_spm_args) - -#define AMDKFD_COMMAND_START_2 0x80 -#define AMDKFD_COMMAND_END_2 0x85 - -#endif -// clang-format on diff --git a/source/lib/rocprofiler-sdk/page_migration/page_migration.cpp b/source/lib/rocprofiler-sdk/page_migration/page_migration.cpp index 14fc5d7277..07d2fe39f4 100644 --- a/source/lib/rocprofiler-sdk/page_migration/page_migration.cpp +++ b/source/lib/rocprofiler-sdk/page_migration/page_migration.cpp @@ -27,8 +27,8 @@ #include "lib/rocprofiler-sdk/agent.hpp" #include "lib/rocprofiler-sdk/buffer.hpp" #include "lib/rocprofiler-sdk/context/context.hpp" +#include "lib/rocprofiler-sdk/details/kfd_ioctl.h" #include "lib/rocprofiler-sdk/internal_threading.hpp" -#include "lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h" #include "lib/rocprofiler-sdk/page_migration/utils.hpp" #include diff --git a/source/lib/rocprofiler-sdk/page_migration/utils.hpp b/source/lib/rocprofiler-sdk/page_migration/utils.hpp index 589e31ca4e..65efda40ee 100644 --- a/source/lib/rocprofiler-sdk/page_migration/utils.hpp +++ b/source/lib/rocprofiler-sdk/page_migration/utils.hpp @@ -22,7 +22,7 @@ #pragma once -#include "lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h" +#include "lib/rocprofiler-sdk/details/kfd_ioctl.h" #include #include diff --git a/source/lib/rocprofiler-sdk/pc_sampling.cpp b/source/lib/rocprofiler-sdk/pc_sampling.cpp index 291dc8d0d1..a9e238173f 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling.cpp +++ b/source/lib/rocprofiler-sdk/pc_sampling.cpp @@ -23,10 +23,38 @@ #include #include -#include "lib/common/utility.hpp" +#include "lib/common/environment.hpp" +#include "lib/rocprofiler-sdk/agent.hpp" +#include "lib/rocprofiler-sdk/buffer.hpp" +#include "lib/rocprofiler-sdk/context/context.hpp" +#include "lib/rocprofiler-sdk/hsa/hsa.hpp" +#include 
"lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/types.hpp" #include "lib/rocprofiler-sdk/registration.hpp" -using ::rocprofiler::common::consume_args; +namespace +{ +/** + * @brief The functions checks if the `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` is set. + * If so, it will enable PC sampling API. Otherwise, the API is reported + * as not implemented. + * + * The PC sampling is in experimental phase and its usage may hang the machine + * requiring the reboot. By enabling the `ROCPROFILER_PC_SAMPLING_BETA_ENABLED`, + * user accepts all consequences of using early implementation of PC sampling API. + */ +bool +is_pc_sampling_explicitly_enabled() +{ + auto pc_sampling_enabled = + rocprofiler::common::get_env("ROCPROFILER_PC_SAMPLING_BETA_ENABLED", false); + + if(!pc_sampling_enabled) LOG(ERROR) << "PC sampling unavailable\n"; + + return pc_sampling_enabled; +} +} // namespace extern "C" { rocprofiler_status_t @@ -37,20 +65,61 @@ rocprofiler_configure_pc_sampling_service(rocprofiler_context_id_t conte uint64_t interval, rocprofiler_buffer_id_t buffer_id) { - if(rocprofiler::registration::get_init_status() > 0) + if(!is_pc_sampling_explicitly_enabled()) return ROCPROFILER_STATUS_ERROR_NOT_IMPLEMENTED; + +#if ROCPROFILER_SDK_HSA_PC_SAMPLING > 0 + if(rocprofiler::registration::get_init_status() > -1) return ROCPROFILER_STATUS_ERROR_CONFIGURATION_LOCKED; - consume_args(context_id, agent_id, method, unit, interval, buffer_id); - return ROCPROFILER_STATUS_ERROR_NOT_IMPLEMENTED; + const auto* agent = rocprofiler::agent::get_agent(agent_id); + if(!agent) return ROCPROFILER_STATUS_ERROR_AGENT_NOT_FOUND; + + // checking if the registered context exists + auto* ctx = rocprofiler::context::get_mutable_registered_context(context_id); + if(!ctx) return ROCPROFILER_STATUS_ERROR_CONTEXT_NOT_FOUND; + + // checking if the buffer is registered + auto const* buff = 
rocprofiler::buffer::get_buffer(buffer_id); + if(!buff) return ROCPROFILER_STATUS_ERROR_BUFFER_NOT_FOUND; + + return rocprofiler::pc_sampling::configure_pc_sampling_service( + ctx, agent, method, unit, interval, buffer_id); +#else + (void) context_id; + (void) agent_id; + (void) method; + (void) unit; + (void) interval; + (void) buffer_id; + + // ROCr runtime is missing PC sampling. + return ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE; +#endif } -rocprofiler_status_t ROCPROFILER_API +rocprofiler_status_t rocprofiler_query_pc_sampling_agent_configurations( rocprofiler_agent_id_t agent_id, rocprofiler_available_pc_sampling_configurations_cb_t cb, void* user_data) { - consume_args(agent_id, cb, user_data); - return ROCPROFILER_STATUS_ERROR_NOT_IMPLEMENTED; + if(!is_pc_sampling_explicitly_enabled()) return ROCPROFILER_STATUS_ERROR_NOT_IMPLEMENTED; + +#if ROCPROFILER_SDK_HSA_PC_SAMPLING > 0 + const auto* agent = rocprofiler::agent::get_agent(agent_id); + if(!agent) return ROCPROFILER_STATUS_ERROR_AGENT_NOT_FOUND; + + std::vector configs; + auto status = rocprofiler::pc_sampling::ioctl::ioctl_query_pcs_configs(agent, configs); + return (status == ROCPROFILER_STATUS_SUCCESS) ? cb(configs.data(), configs.size(), user_data) + : status; +#else + (void) agent_id; + (void) cb; + (void) user_data; + + // ROCr runtime is missing PC sampling. 
+ return ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE; +#endif } } diff --git a/source/lib/rocprofiler-sdk/pc_sampling/CMakeLists.txt b/source/lib/rocprofiler-sdk/pc_sampling/CMakeLists.txt index 3bacb12a40..0c17bc9c30 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/CMakeLists.txt +++ b/source/lib/rocprofiler-sdk/pc_sampling/CMakeLists.txt @@ -1 +1,14 @@ +set(ROCPROFILER_PC_SAMPLING_SOURCES hsa_adapter.cpp utils.cpp service.cpp cid_manager.cpp + code_object.cpp) +set(ROCPROFILER_PC_SAMPLING_HEADERS hsa_adapter.hpp utils.hpp service.hpp types.hpp + cid_manager.hpp code_object.hpp) + +target_sources(rocprofiler-object-library PRIVATE ${ROCPROFILER_PC_SAMPLING_SOURCES} + ${ROCPROFILER_PC_SAMPLING_HEADERS}) + add_subdirectory(parser) +add_subdirectory(ioctl) + +if(ROCPROFILER_BUILD_TESTS) + add_subdirectory(tests) +endif() diff --git a/source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp b/source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp new file mode 100644 index 0000000000..015b519cc8 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/cid_manager.cpp @@ -0,0 +1,142 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include "lib/rocprofiler-sdk/pc_sampling/cid_manager.hpp" + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +void +PCSCIDManager::cid_async_activity_completed(context::correlation_id* cid) +{ + // Hold the lock while updating the state of PCSCIDManager + std::unique_lock lock(m); + // The kernel of the `cid` completed, so add cid to `q1`. + q1.emplace_back(cid); +} + +void +PCSCIDManager::manage_cids_implicit(const pc_samples_copy_fn_t& pc_samples_copy_fn) +{ + std::vector q3; + { + // To manipulate the contents of q1 and q2 and change the state of PCSCIDManager, + // acquire the lock. + std::unique_lock lock(m); + // Move all CIDs from q2 to q3, which is local to this function. + // Note: two buffer flushes happened since kernels of q3's CIDs completed. + q3 = std::move(q2); + // Move all CIDs from q1 to q2. + // Note: exactly one buffer flush occurred since kernels of q2's CIDs completed. + q2 = std::move(q1); + + // We move CIDs from one queue to another to reflect that an implicit ROCr buffer flush + // occurred. Moving from q1 to q2 reflects the first buffer flush since kernels of q1's CIDs + // completed; moving from q2 to local q3 reflects the second buffer flush since kernels of + // q2's CIDs completed. + + // Empty q1 to indicate that there are no CIDs with the following property: + // no buffer flush occurred since the kernel of the CID was marked completed. + q1.clear(); + + // The code that follows does not change the state of the PCSCIDManager, so release the lock + // implicitly. + } + + // Copy PC samples from the ROCr's buffer to the SDK's buffer by invoking the passed function.
+ pc_samples_copy_fn(); + + // Exactly two implicit buffer flushes occurred since kernels of q3's CIDs completed. + // Since all PC samples corresponding to these CIDs are placed in the SDK's buffer, + // decrement their reference counters to indicate that PC sampling service will not use + // these CIDs anymore. + // Eventually, CIDs retirement service will report retirement of these CIDs + // to the client tool. + // Note: the q3 is local to the function, so there is no need for inter-thread synchronization. + retire_cids_of(q3); +} + +void +PCSCIDManager::manage_cids_explicit(const pc_samples_copy_fn_t& pc_samples_explicit_flush_fn) +{ + std::vector q1_copy; + std::vector q2_copy; + { + // To manipulate the contents of q1 and q2 and change the state of PCSCIDManager, + // acquire the lock. + std::unique_lock lock(m); + + // Move all CIDs from q1 and q2 to local q1_copy and q2_copy, respectively + q1_copy = std::move(q1); + q2_copy = std::move(q2); + + // Drop CIDs from q1 and q2, because the following explicit flush + // will deliver corresponding samples. + q1.clear(); + q2.clear(); + + // The code that follows does not change the state of the PCSCIDManager, so release the lock + // implicitly. + } + + // Call the passed lambda function to initiate an explicit flush of ROCr buffer by leveraging + // `hsa_ven_amd_pcs_flush`. The latter function guarantees delivery of all samples + // generated (sequenced) before the call to the `hsa_ven_amd_pcs_flush`. + // Thus, all samples corresponding to CIDs of `q1_copy` and `q2_copy` will be copied + // from the ROCr's buffer to the SDK's buffer, + // meaning CIDs of `q1_copy` and `q2_copy` will not be used anymore by the PC sampling service. + pc_samples_explicit_flush_fn(); + + // The PC sampling service will not use q1_copy's and q2_copy's CIDs anymore, so it decrements + // their reference counters. Eventually, CIDs retirement service will report retirement of these CIDs to the + // client tool.
Note: both `q1_copy` and `q2_copy` are local to the function, so there is no + // need for inter-thread synchronization. + retire_cids_of(q1_copy); + retire_cids_of(q2_copy); +} + +/** + * @brief A helper function used to notify that the correlation IDs of @p q + * are ready to be retired by decrementing their ref_counters. + * Furthermore, this function notifies the PC sampling parser that + * kernels matching these CIDs are completed and can be removed from parser's + * internal maps. + */ +void +PCSCIDManager::retire_cids_of(std::vector& q) +{ + // This function does not change the local state of the manager, + // so it does not need synchronization. + for(auto* cid : q) + { + // Notify the parser that the kernel has completed. + pcs_parser->completeDispatch(cid->internal); + // Decrement the ref_counter. Eventually, the CID is retired. + cid->sub_ref_count(); + } +} + +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/cid_manager.hpp b/source/lib/rocprofiler-sdk/pc_sampling/cid_manager.hpp new file mode 100644 index 0000000000..3b956f14bc --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/cid_manager.hpp @@ -0,0 +1,119 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. 
+// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#pragma once + +#include "lib/rocprofiler-sdk/context/correlation_id.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" + +#include +#include +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +/** + * @brief A class that encapsulates the logic for marking the correlation IDs retired + * by PC sampling service. + * + * To reduce the overhead, SDK's PC sampling service tries to avoid flushing the ROCr's buffer + * explicitly. Instead, it waits for the ROCr to deliver the PC samples once the buffer's watermark + * is crossed. + * + * There are some subtleties we need to consider when implementing the PC sampling service. + * Currently, the 2nd level trap handler uses the double-buffering scheme, meaning the following + * scenario can occur. Assume that one of the buffers (referred to as A) is full and is reported to + * the PC sampling service via `data_ready_callback`. In the meantime, the 2nd level trap handler is + * filling the buffer B with samples of currently active kernel K that is about to finish. Let's + * mark the thread executing the `data_ready_callback` as TA. Before TA accesses the information + * about all completed correlation IDs, it might be intercepted by another thread TB that receives + * the kernel completion callback for the kernel K. While executing this callback, the thread TB + * marks the K's correlation ID as completed. 
After TB finishes executing the callback, the TA + * continues executing the `data_ready_callback` and observes that the K's CID has been marked as + * completed. The TA drains the buffer A and decrements ref counts of all completed CIDs including + * K's CID. If the count reaches zero, then the K's CID might be reported as retired. However, the + * buffer B might still contain samples generated by the kernel K. To be sure that PC sampling + * service drains all samples generated by the kernel K, we require one of the following + * two scenarios to happen: + * + * 1. two implicit buffer flushes happened after the kernel of the correlation ID has completed, + * 2. one explicit buffer flush initiated via `hsa_ven_amd_pcs_flush` happened after the kernel + * of the correlation ID has completed. A single explicit flush suffices because + * the `hsa_ven_amd_pcs_flush` guarantees that all samples generated prior to (sequenced-before) the + * call to the `hsa_ven_amd_pcs_flush` will be delivered. + * + * This way, we can guarantee that all samples are + * drained from both buffers filled by 2nd level trap handler. + * + * To know if all samples produced by a kernel are drained from the ROCr's and 2nd level trap + * handler's buffers and placed in the SDK's buffer, the PC sampling service employs the CID + * retirement protocol implemented in the PCSCIDManager class. Refer to the comments of the + * PCSCIDManager's attributes and methods for more details about the CID retirement protocol. + * + * PCSCIDManager is a singleton per PCSAgentSession. + */ +class PCSCIDManager +{ + /// A lock that must be held while updating the state of PCSCIDManager.
+ std::mutex m; + /// Correlation IDs with the following property: no ROCr's buffer flush happened + /// since a corresponding kernel completed + std::vector q1; + /// Correlation IDs with the following property: exactly one ROCr's buffer flush occurred + /// since a corresponding kernel completed + std::vector q2; + /// A pointer to the PC sampling parser to be notified when the CID is retired. + PCSamplingParserContext* pcs_parser = nullptr; + + /// Prepare the CIDs of q to be retired. Refer to the implementation for more information. + void retire_cids_of(std::vector& q); + +public: + PCSCIDManager(PCSamplingParserContext* parser) + : pcs_parser(parser) + {} + + /// Called by the `kernel_completion_callback` to mark the kernel matching @p cid completed. + void cid_async_activity_completed(context::correlation_id* cid); + + /// A callback function for copying PC samples from ROCr's buffer to the SDK's buffer + using pc_samples_copy_fn_t = std::function; + + /// Called by the @p data_ready_callback. + /// Encapsulates the logic for verifying that two implicit ROCr's buffer flushes + /// happened after a kernel of the CID is marked completed (scenario 1 from above), + /// before retiring that CID. + /// @p manage_cids_implicit calls @p pc_samples_copy_fn to copy samples from + /// ROCr's buffer to the SDK's buffer. + void manage_cids_implicit(const pc_samples_copy_fn_t& pc_samples_copy_fn); + + /// Called by the PC sampling service prior to initiating an explicit ROCr's buffer flush. + /// The explicit flush is initiated by the @p pc_samples_explicit_flush_fn callback. + /// @p manage_cids_explicit retires all CIDs whose corresponding kernels completed + /// (sequenced) before the call to the @p manage_cids_explicit (scenario 2 from above).
+ void manage_cids_explicit(const pc_samples_copy_fn_t& pc_samples_explicit_flush_fn); +}; + +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/code_object.cpp b/source/lib/rocprofiler-sdk/pc_sampling/code_object.cpp new file mode 100644 index 0000000000..f9d7fb6440 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/code_object.cpp @@ -0,0 +1,190 @@ +// MIT License +// +// Copyright (c) 2024 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include "lib/rocprofiler-sdk/pc_sampling/code_object.hpp" + +#include "lib/common/container/operators.hpp" +#include "lib/common/logging.hpp" +#include "lib/rocprofiler-sdk/code_object/code_object.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" + +#include +#include +#include + +#include +#include +#include +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace code_object +{ +namespace +{ +auto& +get_freeze_function() +{ + static decltype(::hsa_executable_freeze)* _v = nullptr; + return _v; +} + +auto& +get_destroy_function() +{ + static decltype(::hsa_executable_destroy)* _v = nullptr; + return _v; +} + +/** + * @brief Flush internal PC sampling buffers and generate a marker record + * for the code object load/unload event. + * + * By using the @p code_object, the function finds the corresponding agent. + * Then, it drains internal (ROCr + 2nd level trap) buffers of this agent + * and places all samples in the SDK PC sampling buffer. + * Finally, it places the marker record representing code object load/unload event + * in the SDK PC sampling buffer. + * + * @param [in] phase - loading/unloading phase + * @param [in] code_object - loaded/unloaded code object. + */ +void +flush_buffers_generate_marker_record(rocprofiler_callback_phase_t phase, + const rocprofiler::code_object::hsa::code_object& code_object) +{ + auto agent_id = code_object.rocp_data.rocp_agent; + if(!is_pc_sample_service_configured(agent_id)) return; + + // The PC sampling service is configured on the agent. + // Find the agent's buffer and place marker record. + // TODO: Creating a function that gives the buffer_id based on the agent_id? 
+ const auto* pcs_service = get_configured_pc_sampling_service().load(); + const auto* agent_session = pcs_service->agent_sessions.at(agent_id).get(); + auto agent_buffer_id = agent_session->buffer_id; + + // flush internal PC sampling buffers + flush_internal_agent_buffers(agent_buffer_id); + + auto* buff = rocprofiler::buffer::get_buffer(agent_buffer_id); + + // create code object load/unload marker record and emplace it into the SDK's PC SAMPLING + // buffer. + if(phase == ROCPROFILER_CALLBACK_PHASE_LOAD) + { + auto marker = + common::init_public_api_struct(rocprofiler_pc_sampling_code_object_load_marker_t{}); + marker.code_object_id = code_object.rocp_data.code_object_id; + // emplace marker to the SDK's PC sampling buffer + buff->emplace(ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING, + ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_LOAD_MARKER, + marker); + } + else + { + auto marker = + common::init_public_api_struct(rocprofiler_pc_sampling_code_object_unload_marker_t{}); + marker.code_object_id = code_object.rocp_data.code_object_id; + // emplace marker to the SDK's PC sampling buffer + buff->emplace(ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING, + ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_UNLOAD_MARKER, + marker); + } + + // Assuming that the `rocprofiler_pc_sampling_code_object_load_marker_t` and + // `rocprofiler_pc_sampling_code_object_unload_marker_t` share the same content, + // we could replace the previous if else with the following + /* + auto marker = + common::init_public_api_struct(rocprofiler_pc_sampling_code_object_load_marker_t{}); + marker.code_object_id = code_object.rocp_data.code_object_id; + // emplace marker to the SDK's PC sampling buffer + buff->emplace(ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING, + (phase == ROCPROFILER_CALLBACK_PHASE_LOAD) ? 
+ ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_LOAD_MARKER + : ROCPROFILER_PC_SAMPLING_RECORD_CODE_OBJECT_UNLOAD_MARKER, + marker); + */ +} + +hsa_status_t +executable_freeze(hsa_executable_t executable, const char* options) +{ + // Call underlying function + hsa_status_t status = CHECK_NOTNULL(get_freeze_function())(executable, options); + if(status != HSA_STATUS_SUCCESS) return status; + + rocprofiler::code_object::iterate_loaded_code_objects( + [&](const rocprofiler::code_object::hsa::code_object& code_object) { + if(code_object.hsa_executable == executable) + flush_buffers_generate_marker_record(ROCPROFILER_CALLBACK_PHASE_LOAD, code_object); + }); + + return HSA_STATUS_SUCCESS; +} + +hsa_status_t +executable_destroy(hsa_executable_t executable) +{ + rocprofiler::code_object::iterate_loaded_code_objects( + [&](const rocprofiler::code_object::hsa::code_object& code_object) { + if(code_object.hsa_executable == executable) + flush_buffers_generate_marker_record(ROCPROFILER_CALLBACK_PHASE_UNLOAD, + code_object); + }); + + // Call underlying function + return CHECK_NOTNULL(get_destroy_function())(executable); +} +} // namespace + +void +initialize(HsaApiTable* table) +{ + (void) table; + auto& core_table = *table->core_; + + get_freeze_function() = CHECK_NOTNULL(core_table.hsa_executable_freeze_fn); + get_destroy_function() = CHECK_NOTNULL(core_table.hsa_executable_destroy_fn); + core_table.hsa_executable_freeze_fn = executable_freeze; + core_table.hsa_executable_destroy_fn = executable_destroy; + LOG_IF(FATAL, get_freeze_function() == core_table.hsa_executable_freeze_fn) + << "infinite recursion"; + LOG_IF(FATAL, get_destroy_function() == core_table.hsa_executable_destroy_fn) + << "infinite recursion"; +} + +void +finalize() +{ + rocprofiler::code_object::iterate_loaded_code_objects( + [&](const rocprofiler::code_object::hsa::code_object& code_object) { + flush_buffers_generate_marker_record(ROCPROFILER_CALLBACK_PHASE_UNLOAD, code_object); + }); +} + +} // namespace 
code_object +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/code_object.hpp b/source/lib/rocprofiler-sdk/pc_sampling/code_object.hpp new file mode 100644 index 0000000000..562cf8b7f5 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/code_object.hpp @@ -0,0 +1,40 @@ +// MIT License +// +// Copyright (c) 2024 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#pragma once + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace code_object +{ +void +initialize(HsaApiTable* table); + +void +finalize(); +} // namespace code_object +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp b/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp new file mode 100644 index 0000000000..0f7df075d0 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp @@ -0,0 +1,378 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include "lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp" + +#include "lib/common/logging.hpp" +#include "lib/rocprofiler-sdk/context/context.hpp" +#include "lib/rocprofiler-sdk/hsa/hsa.hpp" +#include "lib/rocprofiler-sdk/hsa/queue_controller.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/types.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/utils.hpp" + +#include +#include +#include + +#include +#include +#include +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace hsa +{ +namespace +{ +const PCSAgentSession* +get_pcs_session_of(hsa_agent_t hsa_agent) +{ + // TODO: optimize this + auto* service = get_configured_pc_sampling_service().load(); + for(const auto& [_, agent_session] : service->agent_sessions) + { + if(agent_session->hsa_agent->handle == hsa_agent.handle) + { + return agent_session.get(); + } + } + return nullptr; +} + +// Called just before the dispatch packet is put inside the real hardware queue. 
+void +amd_intercept_marker_handler_callback(const struct amd_aql_intercept_marker_s* packet, + hsa_queue_t* queue, + uint64_t packet_id) +{ + auto* ext_table_ = rocprofiler::hsa::get_table().amd_ext_; + hsa_agent_t hsa_agent; + if(ext_table_->hsa_amd_queue_get_info_fn(queue, HSA_AMD_QUEUE_INFO_AGENT, &hsa_agent) != + HSA_STATUS_SUCCESS) + { + throw std::runtime_error("Cannot map hsa_queue_t* to hsa_agent_t"); + } + + uint64_t doorbell_id = 0; + if(ext_table_->hsa_amd_queue_get_info_fn(queue, HSA_AMD_QUEUE_INFO_DOORBELL_ID, &doorbell_id) != + HSA_STATUS_SUCCESS) + { + throw std::runtime_error("Cannot map hsa_queue_t* to doorbell_id"); + } + + auto internal_correlation = packet->user_data[0]; + auto external_correlation = rocprofiler_user_data_t{.value = packet->user_data[1]}; + + auto const* pcs_session = get_pcs_session_of(hsa_agent); + assert(pcs_session); + + dispatch_pkt_id_t dispatch_pkt; + dispatch_pkt.type = (pcs_session->method == ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP) + ? AMD_HOST_TRAP_V1 + : AMD_SNAPSHOT_V1; + // Use rocp_agent handle to uniquely identify the GPU device + dispatch_pkt.device = device_handle{static_cast(pcs_session->agent->id.handle)}; + dispatch_pkt.doorbell_id = doorbell_id; + dispatch_pkt.queue_size = queue->size; + dispatch_pkt.write_index = packet_id; + dispatch_pkt.correlation_id = {.internal = internal_correlation, + .external = external_correlation}; + + auto* parser = pcs_session->parser.get(); + if(parser->shouldFlipRocrBuffer(dispatch_pkt)) + { + rocprofiler::hsa::get_table().pc_sampling_ext_->hsa_ven_amd_pcs_flush_fn( + pcs_session->hsa_pc_sampling); + } + + parser->newDispatch(dispatch_pkt); +} + +/** + * Callback called by HSA interceptor when the kernel has completed. 
+ */ +void +kernel_completion_cb(const rocprofiler_agent_t* rocp_agent, + rocprofiler::hsa::rocprofiler_packet& /*kernel_pkt*/, + const rocprofiler::hsa::Queue::queue_info_session_t& session) +{ + // No internal correlation IDs, meaning there is no need to call CID manager. + if(!session.correlation_id) return; + + // Check if the PC sampling service is configured on this agent. + if(!is_pc_sample_service_configured(rocp_agent->id)) return; + + auto* service = get_configured_pc_sampling_service().load(); + assert(service); + auto* agent_session = service->agent_sessions.at(rocp_agent->id).get(); + // Mark the correlation ID as completed + agent_session->cid_manager->cid_async_activity_completed(session.correlation_id); +} + +void +data_ready_callback(void* client_callback_data, + size_t data_size, + size_t lost_sample_count, + hsa_ven_amd_pcs_data_copy_callback_t data_copy_callback, + void* hsa_callback_data) +{ + (void) lost_sample_count; // TODO: How is this exposed to the tool? + + auto* agent_session = static_cast(client_callback_data); + + // Wrap around the logic for copying PC samples from ROCr's buffer to the SDK's + // PC sampling buffer inside the lambda function called by the CID manager, + // a component responsible for managing the PC sampling related part of the + // process of retiring correlation IDs. + agent_session->cid_manager->manage_cids_implicit([&]() { + size_t samples_num = data_size / sizeof(packet_union_t); + // allocate a temporary buffer for copying PC samples + // TODO: think about how to optimize this (e.g., introduce a buffer pool) + auto buff = std::make_unique(samples_num); + + // copy all the data + data_copy_callback(hsa_callback_data, data_size, buff.get()); + + upcoming_samples_t upc; + // rocp_agent handle uniquely identifies the device + upc.device = device_handle{static_cast(agent_session->agent->id.handle)}; + upc.which_sample_type = (agent_session->method == ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP) + ? 
AMD_HOST_TRAP_V1 + : AMD_SNAPSHOT_V1; + upc.num_samples = samples_num; + + // TODO: how about using std::future + std::condition_variable cv; + + auto gfx_major = ((agent_session->agent->gfx_target_version / 10000) % 100); + auto pcs_parser_status = agent_session->parser->parse( + upc, reinterpret_cast(buff.get()), gfx_major, cv, false); + + if(pcs_parser_status != PCSAMPLE_STATUS_SUCCESS) + { + // TODO: should we end program here or somehow report an error to the user and continue? + throw std::runtime_error("Error while parsing PC samples"); + } + }); +} +} // namespace + +rocprofiler::hsa::rocprofiler_packet +generate_marker_packet_for_kernel( + context::correlation_id* correlation_id, + const tracing::external_correlation_id_map_t& external_correlation_ids) +{ + // This function executes for each kernel dispatched to the agent on which + // the PC sampling service is configured. + // By doing this, we allow the following scenario to happen: + // A tool configures PC sampling on an agent and offloads some kernels on that agent. + // In the middle of the kernel execution, a tool starts/activates PC sampling service + // to collect samples. Although the PC sampling service was not started/activated + // at the moment of dispatching kernels, the configured PC sampling service is aware of all + // kernels dispatched on the agent and can recreate their correlation IDs. + // The disadvantage of this approach is that it introduces overhead when PC sampling + // service is inactive/stopped. + amd_aql_intercept_marker_t marker_pkt; + marker_pkt.header = HSA_PACKET_TYPE_VENDOR_SPECIFIC; + marker_pkt.format = AMD_AQL_FORMAT_INTERCEPT_MARKER; + marker_pkt.callback = amd_intercept_marker_handler_callback; + + if(correlation_id != nullptr) + { + correlation_id->add_ref_count(); + // Use the internal correlation ID generated by the tracing service. + marker_pkt.user_data[0] = correlation_id->internal; + + // Find a context that holds PC sampling service. 
+ auto contexts = context::get_registered_contexts( + [](const auto* ctx) { return ctx->pc_sampler != nullptr; }); + assert(contexts.size() == 1); + const auto* pcs_context = contexts.at(0); + + // Get an external correlation that corresponds to the context + // enclosing PC sampling service. + auto external_corr = tracing::empty_user_data; + auto external_corr_it = external_correlation_ids.find(pcs_context); + if(external_corr_it != external_correlation_ids.end()) + external_corr = external_corr_it->second; + marker_pkt.user_data[1] = external_corr.value; + } + else + { + marker_pkt.user_data[0] = 0; + // No external correlation ID + marker_pkt.user_data[1] = 0; + } + + return rocprofiler::hsa::rocprofiler_packet(marker_pkt); +} + +void +pc_sampling_service_start(context::pc_sampling_service* service) +{ + auto* pc_sampling_table_ = rocprofiler::hsa::get_table().pc_sampling_ext_; + for(const auto& [_, agent_session] : service->agent_sessions) + { + // If the agent has been hidden by the ROCR_VISIBLE_DEVICES, no need to start PC sampling. + // Please check `pc_sampling_service_finish_configuration` for more information. + if(!agent_session->hsa_agent.has_value()) continue; + + if(pc_sampling_table_->hsa_ven_amd_pcs_start_fn(agent_session->hsa_pc_sampling) != + HSA_STATUS_SUCCESS) + { + // Two concurrent calls to the pc_sampling::start_service are invoked on the same + // service. The "faster" one succeeds and starts the PC sampling service on the HSA + // level. Although the "slower fails", the service is started. + ROCP_ERROR << "HSA runtime failed to start PC sampling on the agent " + << agent_session->agent->id.handle << "\n"; + } + } +} + +void +pc_sampling_service_stop(context::pc_sampling_service* service) +{ + auto* pc_sampling_table_ = rocprofiler::hsa::get_table().pc_sampling_ext_; + for(const auto& [_, agent_session] : service->agent_sessions) + { + // If the agent has been hidden by the ROCR_VISIBLE_DEVICES, no need to stop PC sampling. 
+ // Please check `pc_sampling_service_finish_configuration` for more information. + if(!agent_session->hsa_agent.has_value()) continue; + + if(pc_sampling_table_->hsa_ven_amd_pcs_stop_fn(agent_session->hsa_pc_sampling) != + HSA_STATUS_SUCCESS) + { + // Two concurrent calls to the pc_sampling::stop_service are invoked on the same + // service. The "faster" one succeeds and stops the PC sampling service on the HSA + // level. Although the "slower" one fails, the service is stopped. The "slower" continues, + // while the "faster" tries flushing the ROCr's buffer below. + ROCP_ERROR << "HSA runtime failed to stop PC sampling on the agent " + << agent_session->agent->id.handle << "\n"; + continue; + } + + // Flush internal PC sampling buffers (ROCr + 2nd level trap handler buffers) + flush_internal_agent_buffers(agent_session.get()); + } +} + +void +pc_sampling_service_finish_configuration(context::pc_sampling_service* service) +{ + // This function is executed once by a single thread. + // No synchronization needed. + auto* pc_sampling_table_ = rocprofiler::hsa::get_table().pc_sampling_ext_; + + for(const auto& [_, agent_session] : service->agent_sessions) + { + // Get the HSA agent handle + agent_session->hsa_agent = rocprofiler::agent::get_hsa_agent(agent_session->agent); + + // Check if HSA agent corresponding to the KFD node id is hidden via ROCR_VISIBLE_DEVICES. + // If so, we cannot finish the configuration on the ROCr level. + // Consequently, no PC samples will be delivered for this device. + if(!agent_session->hsa_agent.has_value()) continue; + + // Create PC sampling session on the ROCr level. + // ROCr reuses IOCTL session with `agent_session->ioctl_pcs_id`. 
+ hsa_status_t status = pc_sampling_table_->hsa_ven_amd_pcs_create_from_id_fn( + agent_session->ioctl_pcs_id, + agent_session->hsa_agent.value(), + pc_sampling::utils::get_matching_hsa_pcs_method(agent_session->method), + pc_sampling::utils::get_matching_hsa_pcs_units(agent_session->unit), + agent_session->interval, + pc_sampling::utils::get_hsa_pcs_latency(), + pc_sampling::utils::get_hsa_pcs_buffer_size(), + data_ready_callback, + agent_session.get(), + &agent_session->hsa_pc_sampling); + + if(status != HSA_STATUS_SUCCESS) + { + ROCP_ERROR << "HSA runtime failed to finish configuring PC sampling service" + << " on the agent with id: " << agent_session->agent->id.handle << "\n"; + throw std::runtime_error("PC sampling config on the HSA/ROCr level failed"); + } + + // TODO: any better way of informing the parser about what buffer is used for a + // specific agent? + if(!agent_session->parser->register_buffer_for_agent(agent_session->buffer_id, + agent_session->agent->id)) + { + throw std::runtime_error("PCS parser does not accept buffer"); + } + } + + // Register callbacks for the HSA's queue interceptor. + // TODO: should we store callback ID in the service? 
+ rocprofiler::hsa::get_queue_controller()->add_callback( + std::nullopt, + [](const rocprofiler::hsa::Queue&, + const rocprofiler::hsa::rocprofiler_packet&, + rocprofiler_kernel_id_t /*kernel_id*/, + rocprofiler_dispatch_id_t /*dispatch_id*/, + rocprofiler_user_data_t*, + const rocprofiler::hsa::Queue::queue_info_session_t::external_corr_id_map_t&, + const context::correlation_id*) { return nullptr; }, + // Completion CB + [](const rocprofiler::hsa::Queue& q, + rocprofiler::hsa::rocprofiler_packet kern_pkt, + const rocprofiler::hsa::Queue::queue_info_session_t& session, + rocprofiler::hsa::inst_pkt_t&) { + kernel_completion_cb(q.get_agent().get_rocp_agent(), kern_pkt, session); + }); +} + +rocprofiler_status_t +flush_internal_agent_buffers(const PCSAgentSession* agent_session) +{ + // If the agent has been hidden by the ROCR_VISIBLE_DEVICES, + // there are no ROCr internal buffers to flush. + if(!agent_session->hsa_agent.has_value()) return ROCPROFILER_STATUS_SUCCESS; + + auto* pc_sampling_table_ = rocprofiler::hsa::get_table().pc_sampling_ext_; + + // HSA table has not been loaded, so ROCr buffers do not exist yet. + if(!pc_sampling_table_->hsa_ven_amd_pcs_flush_fn) + return ROCPROFILER_STATUS_ERROR_HSA_NOT_LOADED; + + auto hsa_pcs_handle = agent_session->hsa_pc_sampling; + // Explicitly flush ROCr's buffers and sync completed CIDs. + agent_session->cid_manager->manage_cids_explicit([=]() { + // TODO: investigate whether the ROCr should maintain an extra buffer + // beyond the 2nd level trap handler buffers. + if(pc_sampling_table_->hsa_ven_amd_pcs_flush_fn(hsa_pcs_handle) != HSA_STATUS_SUCCESS) + { + // TODO: Think if it is possible to recover from this error. 
+ throw std::runtime_error("Fail to flush ROCr's buffer explicitly"); + } + }); + return ROCPROFILER_STATUS_SUCCESS; +} +} // namespace hsa +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp b/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp new file mode 100644 index 0000000000..94689eca5d --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp @@ -0,0 +1,56 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#pragma once + +#include "lib/rocprofiler-sdk/context/context.hpp" +#include "lib/rocprofiler-sdk/hsa/queue.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/types.hpp" +#include "lib/rocprofiler-sdk/tracing/fwd.hpp" + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace hsa +{ +rocprofiler::hsa::rocprofiler_packet +generate_marker_packet_for_kernel( + context::correlation_id* correlation_id, + const tracing::external_correlation_id_map_t& external_correlation_ids); + +void +pc_sampling_service_start(context::pc_sampling_service* service); + +void +pc_sampling_service_stop(context::pc_sampling_service* service); + +void +pc_sampling_service_finish_configuration(context::pc_sampling_service* service); + +rocprofiler_status_t +flush_internal_agent_buffers(const PCSAgentSession* agent_session); +} // namespace hsa +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/ioctl/CMakeLists.txt b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/CMakeLists.txt new file mode 100644 index 0000000000..569ee5b87e --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/CMakeLists.txt @@ -0,0 +1,6 @@ +set(ROCPROFILER_PC_SAMPLING_IOCTL_SOURCES ioctl_adapter.cpp) +set(ROCPROFILER_PC_SAMPLING_IOCTL_HEADERS ioctl_adapter.hpp ioctl_adapter_types.hpp) + +target_sources( + rocprofiler-object-library PRIVATE ${ROCPROFILER_PC_SAMPLING_IOCTL_SOURCES} + ${ROCPROFILER_PC_SAMPLING_IOCTL_HEADERS}) diff --git a/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp new file mode 100644 index 0000000000..e02adc59de --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.cpp @@ -0,0 +1,383 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// 
in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include "lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp" + +#include "lib/rocprofiler-sdk/details/kfd_ioctl.h" + +#include "lib/common/logging.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp" + +#include + +#include +#include +#include +#include +#include +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace ioctl +{ +// forward declaration +rocprofiler_ioctl_version_info_t& +get_ioctl_version(); + +// IOCTL 1.16 is the first one supporting PC sampling. 
+#define CHECK_IOCTL_VERSION \ + do \ + { \ + auto ioctl_version = get_ioctl_version(); \ + if(ioctl_version.major_version < 1 || (ioctl_version.major_version == 1 && ioctl_version.minor_version < 16)) \ + { \ + LOG(ERROR) << "PC sampling unavailable\n"; \ + return ROCPROFILER_STATUS_ERROR_INCOMPATIBLE_KERNEL; \ + } \ + } while(0) + +int +kfd_open() +{ + int fd = -1; + static const char kfd_device_name[] = "/dev/kfd"; + + fd = open(kfd_device_name, O_RDWR | O_CLOEXEC); + + if(fd == -1) + { + throw std::runtime_error("Cannot open /dev/kfd"); + } + + return fd; +} + +int +get_kfd_fd() +{ + static auto _v = kfd_open(); + return _v; +} + +/** Call ioctl, restarting if it is interrupted + * Taken from libhsakmt.c + */ +int +ioctl(int fd, unsigned long request, void* arg) +{ + int ret; + + do + { + ret = ::ioctl(fd, request, arg); + } while(ret == -1 && (errno == EINTR || errno == EAGAIN)); + + if(ret == -1 && errno == EBADF) + { + /* In case pthread_atfork didn't catch it, this will + * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN. + */ + printf("Invalid KFD descriptor: %d\n", fd); + } + + return (ret == -1) ? -errno : ret; +} + +// More or less taken from the HsaKmt +rocprofiler_ioctl_version_info_t +query_ioctl_version(void) +{ + rocprofiler_ioctl_version_info_t ioctl_version; + ioctl_version.minor_version = 0; + ioctl_version.major_version = 0; + + // If querying the IOCTL version fails, return major_version/minor_version = 0; + struct kfd_ioctl_get_version_args args = {.major_version = 0, .minor_version = 0}; + + if(ioctl(get_kfd_fd(), AMDKFD_IOC_GET_VERSION, &args) == 0) + { + ioctl_version.major_version = args.major_version; + ioctl_version.minor_version = args.minor_version; + } + + return ioctl_version; +} + +rocprofiler_ioctl_version_info_t& +get_ioctl_version() +{ + static auto v = query_ioctl_version(); + return v; +} + +/** + * @kfd_gpu_id represents the gpu identifier read from the content of the + * /sys/class/kfd/kfd/topology/nodes//gpu_id. 
+ */ +ROCPROFILER_IOCTL_STATUS +ioctl_query_pc_sampling_capabilities(uint32_t kfd_gpu_id, + void* sample_info, + uint32_t sample_info_sz, + uint32_t* size) +{ + int ret; + struct kfd_ioctl_pc_sample_args args; + + assert(sizeof(rocprofiler_ioctl_pc_sampling_info_t) == sizeof(struct kfd_pc_sample_info)); + + ret = ROCPROFILER_IOCTL_STATUS_SUCCESS; + args.op = KFD_IOCTL_PCS_OP_QUERY_CAPABILITIES; + args.gpu_id = kfd_gpu_id; + args.sample_info_ptr = (uint64_t) sample_info; + args.num_sample_info = sample_info_sz; + args.flags = 0; + + ret = ioctl(get_kfd_fd(), AMDKFD_IOC_PC_SAMPLE, &args); + + if(ret != 0) + { + if(ret == -EBUSY) + { + // Querying PC sampling capabilities is requested from within the ROCgdb + // which is not supported. + return ROCPROFILER_IOCTL_STATUS_UNAVAILABLE; + } + ROCP_ERROR << "IOCTL failed to query PC sampling configs: " << ret << "\n"; + } + *size = args.num_sample_info; + + return (ret == -ENOSPC) ? ROCPROFILER_IOCTL_STATUS_BUFFER_TOO_SMALL + : (ret != 0) ? ROCPROFILER_IOCTL_STATUS_ERROR + : ROCPROFILER_IOCTL_STATUS_SUCCESS; +} + +rocprofiler_status_t +convert_ioctl_pcs_config_to_rocp(const rocprofiler_ioctl_pc_sampling_info_t& ioctl_pcs_config, + rocprofiler_pc_sampling_configuration_t& rocp_pcs_config) +{ + // Sometimes, the KFD returns 0 for `method` and `units` as an error. + // Note: 0 is not a member of the matching enumeration. + // Thus, the default case remains here to handle that KFD edge case + // and prevent failures inside rocprofiler. 
+ + switch(ioctl_pcs_config.method) + { + case ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND_HOSTTRAP_V1: + rocp_pcs_config.method = ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP; + break; + case ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND_STOCHASTIC_V1: + rocp_pcs_config.method = ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC; + break; + default: + // Sampling method unsupported, return the error + return ROCPROFILER_STATUS_ERROR; + } + + switch(ioctl_pcs_config.units) + { + case ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_MICROSECONDS: + rocp_pcs_config.unit = ROCPROFILER_PC_SAMPLING_UNIT_TIME; + break; + case ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_CYCLES: + rocp_pcs_config.unit = ROCPROFILER_PC_SAMPLING_UNIT_CYCLES; + break; + case ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_INSTRUCTIONS: + rocp_pcs_config.unit = ROCPROFILER_PC_SAMPLING_UNIT_INSTRUCTIONS; + break; + default: + // Sampling unit unsupported, return error + return ROCPROFILER_STATUS_ERROR; + } + + if(ioctl_pcs_config.interval != 0) + { + // The pc sampling is configured on the corresponding device. + // The `interval` contains the value of the interval used for delivering samples. + // Values of `interval_min` and `interval_max` are irrelevant. + rocp_pcs_config.min_interval = ioctl_pcs_config.interval; + rocp_pcs_config.max_interval = ioctl_pcs_config.interval; + } + else + { + // No one configured PC sampling on the corresponding device. 
+ // Read the values of min and max interval provided by the KFD + rocp_pcs_config.min_interval = ioctl_pcs_config.interval_min; + rocp_pcs_config.max_interval = ioctl_pcs_config.interval_max; + } + + rocp_pcs_config.flags = ioctl_pcs_config.flags; + + return ROCPROFILER_STATUS_SUCCESS; +} + +rocprofiler_status_t +ioctl_query_pcs_configs(const rocprofiler_agent_t* agent, rocp_pcs_cfgs_vec_t& rocp_configs) +{ + // Assert the IOCTL version + CHECK_IOCTL_VERSION; + + uint32_t kfd_gpu_id = agent->gpu_id; + + const size_t ioctl_configs_num = 10; + uint32_t size = 0; + + std::vector ioctl_configs(ioctl_configs_num); + + auto ret = ioctl_query_pc_sampling_capabilities( + kfd_gpu_id, ioctl_configs.data(), ioctl_configs.size(), &size); + if(ret == ROCPROFILER_IOCTL_STATUS_BUFFER_TOO_SMALL) + { + ioctl_configs.resize(size); + ret = ioctl_query_pc_sampling_capabilities( + kfd_gpu_id, ioctl_configs.data(), ioctl_configs.size(), &size); + } + + if(ret == ROCPROFILER_IOCTL_STATUS_UNAVAILABLE) + { + // The PC sampling is accessed from within the ROCgdb which is not supported. + return ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE; + } + else if(ret != ROCPROFILER_IOCTL_STATUS_SUCCESS) + { + ROCP_ERROR << "......... Failed while iterating over PC sampling configurations\n"; + return ROCPROFILER_STATUS_ERROR; + } + + for(auto const& ioctl_cfg : ioctl_configs) + { + // FIXME: Why does this happen? + if(ioctl_cfg.method == 0) continue; + auto rocp_cfg = common::init_public_api_struct(rocprofiler_pc_sampling_configuration_t{}); + auto rocp_ret = convert_ioctl_pcs_config_to_rocp(ioctl_cfg, rocp_cfg); + if(rocp_ret != ROCPROFILER_STATUS_SUCCESS) + { + // This should never happen, unless the KFD is broken. 
+ continue; + } + rocp_configs.emplace_back(rocp_cfg); + } + + return ROCPROFILER_STATUS_SUCCESS; +} + +rocprofiler_status_t +create_ioctl_pcs_config_from_rocp(rocprofiler_ioctl_pc_sampling_info_t& ioctl_cfg, + rocprofiler_pc_sampling_method_t method, + rocprofiler_pc_sampling_unit_t unit, + uint64_t interval) +{ + switch(method) + { + case ROCPROFILER_PC_SAMPLING_METHOD_NONE: return ROCPROFILER_STATUS_ERROR_INVALID_ARGUMENT; + case ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC: + ioctl_cfg.method = ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND_STOCHASTIC_V1; + break; + case ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP: + ioctl_cfg.method = ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND_HOSTTRAP_V1; + break; + case ROCPROFILER_PC_SAMPLING_METHOD_LAST: return ROCPROFILER_STATUS_ERROR_INVALID_ARGUMENT; + } + + switch(unit) + { + case ROCPROFILER_PC_SAMPLING_UNIT_NONE: return ROCPROFILER_STATUS_ERROR_INVALID_ARGUMENT; + case ROCPROFILER_PC_SAMPLING_UNIT_INSTRUCTIONS: + ioctl_cfg.units = ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_INSTRUCTIONS; + break; + case ROCPROFILER_PC_SAMPLING_UNIT_CYCLES: + ioctl_cfg.units = ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_CYCLES; + break; + case ROCPROFILER_PC_SAMPLING_UNIT_TIME: + ioctl_cfg.units = ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_MICROSECONDS; + break; + case ROCPROFILER_PC_SAMPLING_UNIT_LAST: return ROCPROFILER_STATUS_ERROR_INVALID_ARGUMENT; + } + + ioctl_cfg.interval = interval; + // TODO: Is it possible to use flags for interval values that are power of 2 + // when specifying stochastic on MI300? + ioctl_cfg.flags = 0; + ioctl_cfg.interval_min = 0; + ioctl_cfg.interval_max = 0; + + return ROCPROFILER_STATUS_SUCCESS; +} + +/** + * @brief Reserve PC sampling service on the device + * @param[out] ioctl_pcs_id - If the return value is ROCPROFILER_STATUS_SUCCESS, + * contains the id that uniquely identifies PC sampling session within IOCTL. 
+ */ +rocprofiler_status_t +ioctl_pcs_create(const rocprofiler_agent_t* agent, + rocprofiler_pc_sampling_method_t method, + rocprofiler_pc_sampling_unit_t unit, + uint64_t interval, + uint32_t* ioctl_pcs_id) +{ + // Assert the IOCTL version + CHECK_IOCTL_VERSION; + + rocprofiler_ioctl_pc_sampling_info_t ioctl_cfg; + auto ret = create_ioctl_pcs_config_from_rocp(ioctl_cfg, method, unit, interval); + if(ret != ROCPROFILER_STATUS_SUCCESS) + { + return ret; + } + + struct kfd_ioctl_pc_sample_args args; + + if(!ioctl_pcs_id) return ROCPROFILER_STATUS_ERROR_INVALID_ARGUMENT; + + *ioctl_pcs_id = INVALID_TRACE_ID; + + args.op = KFD_IOCTL_PCS_OP_CREATE; + args.gpu_id = agent->gpu_id; + args.sample_info_ptr = (uint64_t)(&ioctl_cfg); + args.num_sample_info = 1; + args.trace_id = INVALID_TRACE_ID; + + auto ioctl_ret = ioctl(get_kfd_fd(), AMDKFD_IOC_PC_SAMPLE, &args); + *ioctl_pcs_id = args.trace_id; + + if(ioctl_ret != 0 && (errno == EBUSY || errno == EEXIST)) + { + // Currently, KFD returns EBUSY when, e.g., PC sampling creation is requested from + // within ROCgdb. + // On the other hand, EEXIST is returned when one tries to create a PC sampling + // session with a configuration different from the one already active.
+ return ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE; + } + else if(ioctl_ret != 0) + { + return ROCPROFILER_STATUS_ERROR; + } + + return ROCPROFILER_STATUS_SUCCESS; +} + +} // namespace ioctl +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp new file mode 100644 index 0000000000..2a0e91fadc --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp @@ -0,0 +1,50 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include + +#include "lib/rocprofiler-sdk/context/context.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/types.hpp" + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace ioctl +{ +using rocp_pcs_cfgs_vec_t = std::vector; + +rocprofiler_status_t +ioctl_query_pcs_configs(const rocprofiler_agent_t* agent, rocp_pcs_cfgs_vec_t& rocp_configs); + +rocprofiler_status_t +ioctl_pcs_create(const rocprofiler_agent_t* agent, + rocprofiler_pc_sampling_method_t method, + rocprofiler_pc_sampling_unit_t unit, + uint64_t interval, + uint32_t* ioctl_pcs_id); + +} // namespace ioctl +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp new file mode 100644 index 0000000000..023dbb6187 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter_types.hpp @@ -0,0 +1,108 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include + +#include "lib/rocprofiler-sdk/context/context.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/types.hpp" + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace ioctl +{ +#define INVALID_TRACE_ID 0x0 + +// The data structure copied from the HsaKmt +// Currently, we are using the following status codes: +// 1. ROCPROFILER_IOCTL_STATUS_SUCCESS +// 2. ROCPROFILER_IOCTL_STATUS_ERROR +// 3. ROCPROFILER_IOCTL_STATUS_BUFFER_TOO_SMALL +// 4. ROCPROFILER_IOCTL_STATUS_UNAVAILABLE +// We might replace 1, 2, and 4 with rocprofiler_status_t, but we still lack a counterpart +// for the ROCPROFILER_IOCTL_STATUS_BUFFER_TOO_SMALL +typedef enum _ROCPROFILER_IOCTL_STATUS +{ + ROCPROFILER_IOCTL_STATUS_SUCCESS = 0, /// Operation successful // USED + ROCPROFILER_IOCTL_STATUS_ERROR = 1, /// General error return if not otherwise specified // USED + ROCPROFILER_IOCTL_STATUS_DRIVER_MISMATCH = + 2, /// User mode component is not compatible with kernel HSA driver + ROCPROFILER_IOCTL_STATUS_INVALID_NODE_UNIT = + 5, /// KFD identifies node or unit parameter invalid + ROCPROFILER_IOCTL_STATUS_NO_MEMORY = + 6, /// No memory available (when allocating queues or memory) + ROCPROFILER_IOCTL_STATUS_BUFFER_TOO_SMALL = + 7, /// A buffer needed to handle a request is too small // USED + ROCPROFILER_IOCTL_STATUS_NOT_IMPLEMENTED = + 10, /// KFD function is not implemented for this set of parameters + ROCPROFILER_IOCTL_STATUS_UNAVAILABLE = 12, /// KFD function is not available currently on this + /// node (but may be at a later time) // USED + ROCPROFILER_IOCTL_STATUS_OUT_OF_RESOURCES = + 13, /// KFD function request exceeds the resources currently available.
+ ROCPROFILER_IOCTL_STATUS_KERNEL_COMMUNICATION_ERROR = + 21, /// user-kernel mode communication failure + ROCPROFILER_IOCTL_STATUS_KERNEL_ALREADY_OPENED = 22, /// KFD driver path already opened + ROCPROFILER_IOCTL_STATUS_HSAMMU_UNAVAILABLE = + 23, /// ATS/PRI 1.1 (Address Translation Services) not available + /// (IOMMU driver not installed or not-available) + ROCPROFILER_IOCTL_STATUS_WAIT_FAILURE = 30, /// The wait operation failed + ROCPROFILER_IOCTL_STATUS_WAIT_TIMEOUT = 31, /// The wait operation timed out + ROCPROFILER_IOCTL_STATUS_MEMORY_ALREADY_REGISTERED = 35, /// Memory buffer already registered + ROCPROFILER_IOCTL_STATUS_MEMORY_NOT_REGISTERED = 36, /// Memory buffer not registered + ROCPROFILER_IOCTL_STATUS_MEMORY_ALIGNMENT = 37, /// Memory parameter not aligned +} ROCPROFILER_IOCTL_STATUS; + +typedef struct rocprofiler_ioctl_version_info_s +{ + uint32_t major_version; /// supported IOCTL interface major version + uint32_t minor_version; /// supported IOCTL interface minor version +} rocprofiler_ioctl_version_info_t; + +typedef enum _ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND +{ + ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND_HOSTTRAP_V1 = 1, + ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND_STOCHASTIC_V1, +} ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND; + +typedef enum _ROCPROFILER_IOCTL_PC_SAMPLING_UNITS +{ + ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_MICROSECONDS, + ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_CYCLES, + ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL_INSTRUCTIONS, +} ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL; + +typedef struct rocprofiler_ioctl_pc_sampling_info_s +{ + uint64_t interval; + uint64_t interval_min; + uint64_t interval_max; + uint64_t flags; + ROCPROFILER_IOCTL_PC_SAMPLING_METHOD_KIND method; + ROCPROFILER_IOCTL_PC_SAMPLING_UNIT_INTERVAL units; +} rocprofiler_ioctl_pc_sampling_info_t; + +} // namespace ioctl +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/correlation.hpp 
b/source/lib/rocprofiler-sdk/pc_sampling/parser/correlation.hpp index a7a1648f3c..166e6eb10e 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/correlation.hpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/correlation.hpp @@ -184,17 +184,20 @@ add_upcoming_samples(const device_handle device, const generic_sample_t* buffer, const size_t available_samples, Parser::CorrelationMap* corr_map, - rocprofiler_pc_sampling_record_s* samples) + rocprofiler_pc_sampling_record_t* samples) { pcsample_status_t status = PCSAMPLE_STATUS_SUCCESS; for(uint64_t p = 0; p < available_samples; p++) { const auto* snap = reinterpret_cast(buffer + p); samples[p] = copySample((const void*) (buffer + p)); + samples[p].size = 0; // pc sampling record with size 0 will indicate invalid sample try { Parser::trap_correlation_id_t trap{.raw = snap->correlation_id}; samples[p].correlation_id = corr_map->get(device, trap); + samples[p].size = sizeof(rocprofiler_pc_sampling_record_t); + // set size after corr_map->get which may throw } catch(std::exception& e) { status = PCSAMPLE_STATUS_PARSER_ERROR; @@ -240,7 +243,7 @@ _parse_buffer(generic_sample_t* buffer, while(pkt_counter > 0) { - rocprofiler_pc_sampling_record_s* samples = nullptr; + rocprofiler_pc_sampling_record_t* samples = nullptr; uint64_t available_samples = callback(&samples, pkt_counter, userdata); if(available_samples == 0 || available_samples > pkt_counter) diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/parser_types.h b/source/lib/rocprofiler-sdk/pc_sampling/parser/parser_types.h index 83229a1f77..dde0d139c8 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/parser_types.h +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/parser_types.h @@ -87,7 +87,7 @@ union pcsample_header_v1_t uint8_t raw; }; -typedef uint64_t (*user_callback_t)(rocprofiler_pc_sampling_record_s**, uint64_t, void*); +typedef uint64_t (*user_callback_t)(rocprofiler_pc_sampling_record_t**, uint64_t, void*); /** * The types of errors 
to be returned by parse_buffer. diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.cpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.cpp index 944fca47f3..abaaac846f 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.cpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.cpp @@ -23,7 +23,7 @@ #include "lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" uint64_t -PCSamplingParserContext::alloc(rocprofiler_pc_sampling_record_s** buffer, uint64_t size) +PCSamplingParserContext::alloc(rocprofiler_pc_sampling_record_t** buffer, uint64_t size) { std::unique_lock lock(mut); assert(buffer != nullptr); @@ -97,3 +97,21 @@ PCSamplingParserContext::shouldFlipRocrBuffer(const dispatch_pkt_id_t& pkt) cons std::shared_lock lock(mut); return corr_map->checkDispatch(pkt); } + +void +PCSamplingParserContext::generate_upcoming_pc_record( + uint64_t agent_id_handle, + const rocprofiler_pc_sampling_record_t* samples, + size_t num_samples) +{ + auto buff_id = _agent_buffers.at(rocprofiler_agent_id_t{agent_id_handle}); + rocprofiler::buffer::instance* buff = rocprofiler::buffer::get_buffer(buff_id); + + if(!buff) + throw std::runtime_error(fmt::format("Buffer with id: {} does not exist", buff_id.handle)); + + for(size_t i = 0; i < num_samples; i++) + buff->emplace(ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING, + ROCPROFILER_PC_SAMPLING_RECORD_SAMPLE, + samples[i]); +}; \ No newline at end of file diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp index 3566fe5500..d9b91e073c 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp @@ -22,23 +22,32 @@ #pragma once +#include "lib/rocprofiler-sdk/buffer.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/correlation.hpp" +#include
"lib/rocprofiler-sdk/pc_sampling/parser/parser_types.h" + +#include +#include +#include + +#include +#include #include #include +#include +#include #include #include #include #include -#include "lib/rocprofiler-sdk/pc_sampling/parser/correlation.hpp" -#include "lib/rocprofiler-sdk/pc_sampling/parser/parser_types.h" - struct PCSamplingData { PCSamplingData(size_t size) : samples(size){}; PCSamplingData& operator=(PCSamplingData&) = delete; - std::vector samples; + std::vector samples; }; class PCSamplingParserContext @@ -52,7 +61,7 @@ public: * @param[in] size Number of samples requested. * @returns Number of samples actually allocated on *buffer. */ - uint64_t alloc(rocprofiler_pc_sampling_record_s** buffer, uint64_t size); + uint64_t alloc(rocprofiler_pc_sampling_record_t** buffer, uint64_t size); /** * @brief Parses a chunk of samples. @@ -95,6 +104,24 @@ public: */ bool shouldFlipRocrBuffer(const dispatch_pkt_id_t& pkt) const; + bool register_buffer_for_agent(rocprofiler_buffer_id_t buffer_id, + rocprofiler_agent_id_t agent_id) + { + std::unique_lock lock(mut); + // Single buffer per agent is allowed + if(_agent_buffers.count(agent_id) > 0) return false; + + _agent_buffers.emplace(agent_id, buffer_id); + return true; + } + + void unregister_buffer_from_agent(rocprofiler_agent_id_t agent_id) + { + std::unique_lock lock(mut); + + _agent_buffers.erase(agent_id); + } + protected: /** * @brief Parses the given input data and generates pc sampling records. 
@@ -103,7 +130,7 @@ protected: template pcsample_status_t _parse(const upcoming_samples_t& upcoming, const generic_sample_t* data_) { - std::shared_lock lock(mut); + // std::shared_lock lock(mut); pcsample_status_t status = PCSAMPLE_STATUS_SUCCESS; uint64_t pkt_counter = upcoming.num_samples; @@ -112,7 +139,7 @@ protected: while(pkt_counter > 0) { - rocprofiler_pc_sampling_record_s* samples = nullptr; + rocprofiler_pc_sampling_record_t* samples = nullptr; uint64_t memsize = alloc(&samples, pkt_counter); if(memsize == 0 || memsize > pkt_counter) return PCSAMPLE_STATUS_CALLBACK_ERROR; @@ -125,7 +152,7 @@ protected: data_ += memsize; pkt_counter -= memsize; - generate_upcoming_pc_record(samples, memsize); + generate_upcoming_pc_record(dev.handle, samples, memsize); } return status; @@ -137,12 +164,9 @@ protected: */ pcsample_status_t flushForgetList(); static void generate_id_completion_record(const dispatch_pkt_id_t& pkt) { (void) pkt; }; - static void generate_upcoming_pc_record(const rocprofiler_pc_sampling_record_s* samples, - size_t num_samples) - { - (void) samples; - (void) num_samples; - }; + void generate_upcoming_pc_record(uint64_t agent_id_handle, + const rocprofiler_pc_sampling_record_t* samples, + size_t num_samples); //! 
Maps doorbells and dispatch_index to correlation_id std::unique_ptr corr_map; @@ -156,4 +180,7 @@ protected: std::unordered_set forget_list; mutable std::shared_mutex mut; + +private: + std::unordered_map _agent_buffers; }; diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/benchmark_test.cpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/benchmark_test.cpp index eb6920d449..3dc49f6a52 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/benchmark_test.cpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/benchmark_test.cpp @@ -56,8 +56,8 @@ Benchmark(bool bWarmup) for(size_t i = 0; i < SAMPLE_PER_DISPATCH; i++) MockWave(dispatch).genPCSample(); - std::pair userdata; - userdata.first = new rocprofiler_pc_sampling_record_s[TOTAL_NUM_SAMPLES]; + std::pair userdata; + userdata.first = new rocprofiler_pc_sampling_record_t[TOTAL_NUM_SAMPLES]; userdata.second = TOTAL_NUM_SAMPLES; auto t0 = std::chrono::system_clock::now(); @@ -65,9 +65,9 @@ Benchmark(bool bWarmup) (generic_sample_t*) buffer->packets.data(), buffer->packets.size(), GFXIP_MAJOR, - [](rocprofiler_pc_sampling_record_s** sample, uint64_t size, void* userdata_) { + [](rocprofiler_pc_sampling_record_t** sample, uint64_t size, void* userdata_) { auto* pair = - reinterpret_cast*>(userdata_); + reinterpret_cast*>(userdata_); assert(TOTAL_NUM_SAMPLES == pair->second); *sample = pair->first; return size; @@ -80,7 +80,7 @@ Benchmark(bool bWarmup) { std::cout << "Benchmark: Parsed " << int(samples_per_us * 1E3f + 0.5f) * 1E-3f << " Msample/s ("; - std::cout << int(sizeof(rocprofiler_pc_sampling_record_s) * samples_per_us) << " MB/s)" + std::cout << int(sizeof(rocprofiler_pc_sampling_record_t) * samples_per_us) << " MB/s)" << std::endl; } diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/correlation_id_test.cpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/correlation_id_test.cpp index 019821248c..e2ddcfcc50 100644 --- 
a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/correlation_id_test.cpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/correlation_id_test.cpp @@ -33,14 +33,14 @@ std::mt19937 rdgen(1); /** * Sample user memory allocation callback. * It expects userdata to be cast-able to a pointer to - * std::vector> + * std::vector> */ static uint64_t -alloc_callback(rocprofiler_pc_sampling_record_s** buffer, uint64_t size, void* userdata) +alloc_callback(rocprofiler_pc_sampling_record_t** buffer, uint64_t size, void* userdata) { - *buffer = new rocprofiler_pc_sampling_record_s[size]; + *buffer = new rocprofiler_pc_sampling_record_t[size]; auto& vector = - *reinterpret_cast>*>( + *reinterpret_cast>*>( userdata); vector.push_back({*buffer, size}); return size; @@ -51,7 +51,7 @@ alloc_callback(rocprofiler_pc_sampling_record_s** buffer, uint64_t size, void* u * the reconstructed correlation_id. */ static bool -check_samples(rocprofiler_pc_sampling_record_s* samples, uint64_t size) +check_samples(rocprofiler_pc_sampling_record_t* samples, uint64_t size) { for(size_t i = 0; i < size; i++) if(samples[i].correlation_id.internal != samples[i].pc) return false; @@ -71,7 +71,7 @@ TEST(pcs_parser, hello_world) MockWave(dispatch).genPCSample(); MockWave(dispatch).genPCSample(); - std::vector> all_allocations; + std::vector> all_allocations; CHECK_PARSER(parse_buffer((generic_sample_t*) buffer->packets.data(), buffer->packets.size(), @@ -114,7 +114,7 @@ TEST(pcs_parser, reverse_wave_order) for(auto it = dispatches.begin(); it != dispatches.end(); it++) MockWave(*it).genPCSample(); - std::vector> all_allocations; + std::vector> all_allocations; CHECK_PARSER(parse_buffer((generic_sample_t*) buffer->packets.data(), buffer->packets.size(), @@ -150,7 +150,7 @@ TEST(pcs_parser, dispatch_wrapping) MockWave(dispatch).genPCSample(); } - std::vector> all_allocations; + std::vector> all_allocations; CHECK_PARSER(parse_buffer((generic_sample_t*) buffer->packets.data(), 
buffer->packets.size(), @@ -197,7 +197,7 @@ TEST(pcs_parser, random_samples) for(int i = 0; i < num_samples; i++) MockWave(dispatches[rdgen() % dispatches.size()]).genPCSample(); - std::vector> all_allocations; + std::vector> all_allocations; CHECK_PARSER(parse_buffer((generic_sample_t*) buffer->packets.data(), buffer->packets.size(), @@ -290,7 +290,7 @@ TEST(pcs_parser, queue_hammer) << std::endl; std::cout << "Max queue occupancy: " << max_q_occupancy << "\n\n" << std::endl; - std::vector> all_allocations; + std::vector> all_allocations; CHECK_PARSER(parse_buffer((generic_sample_t*) buffer->packets.data(), buffer->packets.size(), @@ -302,7 +302,7 @@ TEST(pcs_parser, queue_hammer) NUM_ACTIONS); // QueueHammer test: Incorrect number of callbacks for(auto sb = 0ul; sb < all_allocations.size(); sb++) { - rocprofiler_pc_sampling_record_s* samples = all_allocations[sb].first; + rocprofiler_pc_sampling_record_t* samples = all_allocations[sb].first; size_t num_samples = all_allocations[sb].second; EXPECT_EQ(num_samples, NUM_QUEUES); // QueueHammer: Incorrect number of samples @@ -329,7 +329,7 @@ TEST(pcs_parser, multi_buffer) const auto& packets = firstBuffer->packets; secondBuffer->packets = std::vector(packets.begin() + 2, packets.end()); - std::vector> all_allocations; + std::vector> all_allocations; CHECK_PARSER(parse_buffer((generic_sample_t*) firstBuffer->packets.data(), firstBuffer->packets.size(), diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/gfx9test.cpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/gfx9test.cpp index a17a0bc48b..e69905ede3 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/gfx9test.cpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/gfx9test.cpp @@ -24,13 +24,15 @@ # undef NDEBUG #endif +#include "lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/tests/mocks.hpp" + +#include + #include #include #include -#include 
"lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" -#include "lib/rocprofiler-sdk/pc_sampling/parser/tests/mocks.hpp" - #define GFXIP_MAJOR 9 #define TYPECHECK(x) \ @@ -295,7 +297,7 @@ class WaveIssueAndErrorTest : public WaveSnapTest void genPCSample(bool valid, bool issued, bool dual, bool error) { - rocprofiler_pc_sampling_record_s sample; + rocprofiler_pc_sampling_record_t sample; ::memset(&sample, 0, sizeof(sample)); sample.pc = dispatch->unique_id; sample.correlation_id.internal = dispatch->getMockId().raw; @@ -320,7 +322,7 @@ class WaveIssueAndErrorTest : public WaveSnapTest dispatch->submit(std::move(pss)); }; - std::vector compare; + std::vector compare; }; class WaveOtherFieldsTest : public WaveSnapTest @@ -347,9 +349,7 @@ class WaveOtherFieldsTest : public WaveSnapTest assert(parsed[0][i].flags.reserved == false); assert(compare[i].exec_mask == parsed[0][i].exec_mask); - assert(compare[i].workgroup_id_x == parsed[0][i].workgroup_id_x); - assert(compare[i].workgroup_id_y == parsed[0][i].workgroup_id_y); - assert(compare[i].workgroup_id_z == parsed[0][i].workgroup_id_z); + assert(compare[i].workgroup_id == parsed[0][i].workgroup_id); assert(compare[i].chiplet == parsed[0][i].chiplet); assert(compare[i].wave_id == parsed[0][i].wave_id); @@ -360,13 +360,13 @@ class WaveOtherFieldsTest : public WaveSnapTest void genPCSample(int pc, int exec, int blkx, int blky, int blkz, int chip, int wave, int hwid) { - rocprofiler_pc_sampling_record_s sample; + rocprofiler_pc_sampling_record_t sample; ::memset(&sample, 0, sizeof(sample)); sample.exec_mask = exec; - sample.workgroup_id_x = blkx; - sample.workgroup_id_y = blky; - sample.workgroup_id_z = blkz; + sample.workgroup_id.x = blkx; + sample.workgroup_id.y = blky; + sample.workgroup_id.z = blkz; sample.chiplet = chip; sample.wave_id = wave; @@ -392,7 +392,7 @@ class WaveOtherFieldsTest : public WaveSnapTest (void) pc; }; - std::vector compare; + std::vector compare; }; TEST(pcs_parser, gfx9_test) diff 
--git a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/mocks.hpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/mocks.hpp index 0952de4659..26e6f8e78c 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/mocks.hpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/tests/mocks.hpp @@ -65,7 +65,7 @@ public: submit(uni); } - std::vector> get_parsed_buffer(int GFXIP_MAJOR) + std::vector> get_parsed_buffer(int GFXIP_MAJOR) { parsed_data = {}; @@ -78,18 +78,18 @@ public: return parsed_data; } - static uint64_t alloc_parse_memory(rocprofiler_pc_sampling_record_s** sample, + static uint64_t alloc_parse_memory(rocprofiler_pc_sampling_record_t** sample, uint64_t req_size, void* userdata) { auto* buffer = reinterpret_cast(userdata); - buffer->parsed_data.push_back(std::vector(req_size)); + buffer->parsed_data.push_back(std::vector(req_size)); *sample = buffer->parsed_data.back().data(); return req_size; } std::vector packets; - std::vector> parsed_data; + std::vector> parsed_data; }; /** diff --git a/source/lib/rocprofiler-sdk/pc_sampling/parser/translation.hpp b/source/lib/rocprofiler-sdk/pc_sampling/parser/translation.hpp index 97d9e2d7ec..3ebcdd0b0a 100644 --- a/source/lib/rocprofiler-sdk/pc_sampling/parser/translation.hpp +++ b/source/lib/rocprofiler-sdk/pc_sampling/parser/translation.hpp @@ -32,18 +32,18 @@ #include "lib/rocprofiler-sdk/pc_sampling/parser/rocr.h" template -inline rocprofiler_pc_sampling_record_s +inline rocprofiler_pc_sampling_record_t copySampleHeader(const SType& sample) { - rocprofiler_pc_sampling_record_s ret; + rocprofiler_pc_sampling_record_t ret; ret.flags = pcsample_header_v1_t{.raw = 0}.flags; ret.flags.type = AMD_SNAPSHOT_V1; ret.pc = sample.pc; ret.exec_mask = sample.exec_mask; - ret.workgroup_id_x = sample.workgroup_id_x; - ret.workgroup_id_y = sample.workgroup_id_y; - ret.workgroup_id_z = sample.workgroup_id_z; + ret.workgroup_id.x = sample.workgroup_id_x; + ret.workgroup_id.y = sample.workgroup_id_y; + 
ret.workgroup_id.z = sample.workgroup_id_z; ret.chiplet = sample.chiplet_and_wave_id >> 8; ret.wave_id = sample.chiplet_and_wave_id & 0x3F; @@ -52,23 +52,23 @@ copySampleHeader(const SType& sample) return ret; } -inline rocprofiler_pc_sampling_record_s +inline rocprofiler_pc_sampling_record_t copyHostTrapSample(const perf_sample_host_trap_v1& sample) { - rocprofiler_pc_sampling_record_s ret = copySampleHeader(sample); + rocprofiler_pc_sampling_record_t ret = copySampleHeader(sample); ret.flags.type = AMD_HOST_TRAP_V1; return ret; } template -inline rocprofiler_pc_sampling_record_s +inline rocprofiler_pc_sampling_record_t copyStochasticSample(const perf_sample_snapshot_v1& sample); template <> -inline rocprofiler_pc_sampling_record_s +inline rocprofiler_pc_sampling_record_t copyStochasticSample(const perf_sample_snapshot_v1& sample) { - rocprofiler_pc_sampling_record_s ret = copySampleHeader(sample); + rocprofiler_pc_sampling_record_t ret = copySampleHeader(sample); ret.flags.valid = sample.perf_snapshot_data & (~sample.perf_snapshot_data >> 26) & 0x1; // Check wave_id matches snapshot_wave_id @@ -88,10 +88,10 @@ copyStochasticSample(const perf_sample_snapshot_v1& sample) } template <> -inline rocprofiler_pc_sampling_record_s +inline rocprofiler_pc_sampling_record_t copyStochasticSample(const perf_sample_snapshot_v1& sample) { - rocprofiler_pc_sampling_record_s ret = copySampleHeader(sample); + rocprofiler_pc_sampling_record_t ret = copySampleHeader(sample); ret.flags.valid = sample.perf_snapshot_data & (~sample.perf_snapshot_data >> 23) & 0x1; // Check wave_id matches snapshot_wave_id @@ -195,12 +195,12 @@ translate_inst(int in) #undef LUTOVERLOAD template -inline rocprofiler_pc_sampling_record_s +inline rocprofiler_pc_sampling_record_t copySample(const void* sample) { if(HostTrap) return copyHostTrapSample(*(const perf_sample_host_trap_v1*) sample); - rocprofiler_pc_sampling_record_s ret = + rocprofiler_pc_sampling_record_t ret = copyStochasticSample(*(const 
perf_sample_snapshot_v1*) sample); ret.snapshot.inst_type = translate_inst(ret.snapshot.inst_type); diff --git a/source/lib/rocprofiler-sdk/pc_sampling/service.cpp b/source/lib/rocprofiler-sdk/pc_sampling/service.cpp new file mode 100644 index 0000000000..59616f9412 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/service.cpp @@ -0,0 +1,268 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" + +#include "lib/common/logging.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/ioctl/ioctl_adapter.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/utils.hpp" + +namespace rocprofiler +{ +namespace pc_sampling +{ +using hsa_initialized_t = std::atomic; + +hsa_initialized_t& +is_hsa_initialized() +{ + static auto _v = hsa_initialized_t{false}; + return _v; +} + +// The function returns the atomic pointer to the active PC sampling service. +// The nullptr means the PC sampling service is inactive. +atomic_pc_sampling_service_t& +get_active_pc_sampling_service() +{ + static auto _v = atomic_pc_sampling_service_t{nullptr}; + return _v; +} + +// The function returns the atomic pointer to the configured pc sampling service. +// The nullptr means the PC sampling service is not configured. +atomic_pc_sampling_service_t& +get_configured_pc_sampling_service() +{ + static auto _v = atomic_pc_sampling_service_t{nullptr}; + return _v; +} + +rocprofiler_status_t +start_service(const context::context* ctx) +{ + auto* service = ctx->pc_sampler.get(); + + context::pc_sampling_service* _expected = nullptr; + // If there is no active pc_sampling_service, mark `service` as activated. + bool success = get_active_pc_sampling_service().compare_exchange_strong(_expected, service); + + if(!success) + { + // Some other context is active at the moment. + return ROCPROFILER_STATUS_ERROR; + } + + if(is_hsa_initialized().load()) + { + hsa::pc_sampling_service_start(service); + } + + return ROCPROFILER_STATUS_SUCCESS; +} + +rocprofiler_status_t +stop_service(const context::context* ctx) +{ + auto* service = ctx->pc_sampler.get(); + + if(get_active_pc_sampling_service().load() != service) + { + // Some other service is activated at the moment. 
+ return ROCPROFILER_STATUS_ERROR; + } + + if(is_hsa_initialized().load()) + { + hsa::pc_sampling_service_stop(service); + } + + // No active PC sampling services + bool success = get_active_pc_sampling_service().compare_exchange_strong(service, nullptr); + + return (success) ? ROCPROFILER_STATUS_SUCCESS : ROCPROFILER_STATUS_ERROR; +} + +void +post_hsa_init_start_active_service() +{ + // Called as part of the registration of the HSA table + if(is_hsa_initialized().load()) + { + // If there is a guarantee that the `rocprofiler_set_api_table` + // can be called only once for the HSA, then this condition is redundant. + return; + } + + // If the PC sampling service is not configured on any of the agents, return. + if(!get_configured_pc_sampling_service().load()) return; + + static auto _once = std::once_flag{}; + std::call_once(_once, []() { + // Configure PC sampling on the ROCr level only once. + hsa::pc_sampling_service_finish_configuration(get_configured_pc_sampling_service().load()); + }); + + // Theoretically, the remainder of the function + // can execute concurrently with start_context/stop_context. + + context::pc_sampling_service* _expected = nullptr; + void* invalid_ptr = reinterpret_cast(0xDEADBEEF); + context::pc_sampling_service* pseudo_sevice = + static_cast(invalid_ptr); + + if(get_active_pc_sampling_service().compare_exchange_strong(_expected, pseudo_sevice)) + { + // At this point, we have prevented any `start_context` instance from activating the service. + is_hsa_initialized().store(true); + // Now, allow `start_context` to activate the service. + get_active_pc_sampling_service().compare_exchange_strong(pseudo_sevice, nullptr); + } + else + { + // Someone already called `start_context`, which activated the service. + // The pointer to this service is written inside `_expected`. + // Start PC sampling service on the HSA level on behalf of the + // `start_context` caller.
+ hsa::pc_sampling_service_start(_expected); + // Although the caller of the `start_context` might try calling the hsa_start, + // it will fail, which is fine, since the service is eventually started. + is_hsa_initialized().store(true); + } +} + +rocprofiler_status_t +configure_pc_sampling_service(context::context* ctx, + const rocprofiler_agent_t* agent, + rocprofiler_pc_sampling_method_t method, + rocprofiler_pc_sampling_unit_t unit, + uint64_t interval, + rocprofiler_buffer_id_t buffer_id) +{ + if(!ctx->pc_sampler) + { + ctx->pc_sampler = std::make_unique(); + } + + if(ctx->pc_sampler->agent_sessions.count(agent->id) > 0) + { + // The service has already been configured for this agent. + return ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED; + } + + // The restriction we agreed at the moment is that at most one context + // can have PC sampling service configured, meaning + // at most one instance of the `context::pc_sampling_service` can be configured + // This `pc_sampling_service` contains at most one configuration per agent. + context::pc_sampling_service* expected = nullptr; + // Try registering the new instance of the `pc_sampling_service`. + if(!get_configured_pc_sampling_service().compare_exchange_strong(expected, + ctx->pc_sampler.get())) + { + // A `pc_sampling_service` instance has already been configured. + // Note: the `expected` contains the pointer to the configured `pc_sampling_service` + // instance. + if(expected != ctx->pc_sampler.get()) + { + // Someone tried configuring a new `pc_sampling_service instance`, which we do not + // allow. Invalidate the `pc_sampling_service` from the `ctx` and return an error. + ctx->pc_sampler = nullptr; + // TODO: new status code needed + return ROCPROFILER_STATUS_ERROR; + } + // Someone is trying to enable PC sampling on another agent, and we allow registering + // new agent inside `pc_sampling_service` instance. 
+ } + + // Call KFD to check whether the configuration is currently supported + uint32_t ioctl_pcs_id; + auto ioctl_status = ioctl::ioctl_pcs_create(agent, method, unit, interval, &ioctl_pcs_id); + if(ioctl_status != ROCPROFILER_STATUS_SUCCESS) return ioctl_status; + + ctx->pc_sampler->agent_sessions[agent->id] = std::make_unique(); + + auto* session = ctx->pc_sampler->agent_sessions[agent->id].get(); + session->agent = agent; + session->method = method; + session->unit = unit; + session->interval = interval; + session->buffer_id = buffer_id; + session->ioctl_pcs_id = ioctl_pcs_id; + session->parser = std::make_unique(); + session->cid_manager = std::make_unique(session->parser.get()); + + ROCP_ERROR << "PC sampling session with id: " << session->ioctl_pcs_id + << " has been created!\n"; + + return ROCPROFILER_STATUS_SUCCESS; +} + +bool +is_pc_sample_service_configured(rocprofiler_agent_id_t agent_id) +{ + auto* service = get_configured_pc_sampling_service().load(); + if(service) + { + // If the agent_id is in the service->agent_sessions map, + // then the PC sampling service is configured on this agent. 
+ return service->agent_sessions.find(agent_id) != service->agent_sessions.end(); + } + // The PC sampling service is not configured on this agent + return false; +} + +rocprofiler_status_t +flush_internal_agent_buffers(rocprofiler_buffer_id_t buffer_id) +{ + // checking if the buffer is registered + auto const* buff = rocprofiler::buffer::get_buffer(buffer_id); + if(!buff) return ROCPROFILER_STATUS_ERROR_BUFFER_NOT_FOUND; + + // Checking if the context is registered + const auto* ctx = rocprofiler::context::get_registered_context( + rocprofiler_context_id_t{.handle = buff->context_id}); + if(!ctx) return ROCPROFILER_STATUS_ERROR_CONTEXT_NOT_FOUND; + + auto* service = get_configured_pc_sampling_service().load(); + if(service && ctx->pc_sampler.get() == service) + { + // The context `ctx` (that holds the buffer with `buffer_id`) + // is the one containing PC sampling service. + // The HSA interception table is registered. + for(const auto& [_, agent_session] : service->agent_sessions) + { + // Find the agent that fills the buffer with `buffer_id` + if(agent_session->buffer_id.handle == buffer_id.handle) + { + // Flush internal PC sampling buffers filled by the agent + return hsa::flush_internal_agent_buffers(agent_session.get()); + } + } + } + + // PC sampling service not configured. 
+ return ROCPROFILER_STATUS_SUCCESS; +} + +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/service.hpp b/source/lib/rocprofiler-sdk/pc_sampling/service.hpp new file mode 100644 index 0000000000..b5e7022769 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/service.hpp @@ -0,0 +1,66 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#pragma once + +#include "lib/rocprofiler-sdk/context/context.hpp" + +#include +#include + +#include + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +using atomic_pc_sampling_service_t = std::atomic<context::pc_sampling_service*>; + +atomic_pc_sampling_service_t& +get_configured_pc_sampling_service(); + +rocprofiler_status_t +start_service(const context::context* ctx); + +rocprofiler_status_t +stop_service(const context::context* ctx); + +void +post_hsa_init_start_active_service(); + +rocprofiler_status_t +configure_pc_sampling_service(context::context* ctx, + const rocprofiler_agent_t* agent, + rocprofiler_pc_sampling_method_t method, + rocprofiler_pc_sampling_unit_t unit, + uint64_t interval, + rocprofiler_buffer_id_t buffer_id); + +bool +is_pc_sample_service_configured(rocprofiler_agent_id_t agent_id); + +rocprofiler_status_t +flush_internal_agent_buffers(rocprofiler_buffer_id_t buffer_id); +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt b/source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt new file mode 100644 index 0000000000..0f5a0f849c --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/tests/CMakeLists.txt @@ -0,0 +1,29 @@ +rocprofiler_deactivate_clang_tidy() + +include(GoogleTest) + +set(ROCPROFILER_LIB_PC_SAMPLING_TEST_SOURCES + configure_service.cpp + # samples_processing.cpp + query_configuration.cpp) +set(ROCPROFILER_LIB_PC_SAMPLING_TEST_HEADERS pc_sampling_internals.hpp) + +add_executable(pcs-test) + +target_sources(pcs-test PRIVATE ${ROCPROFILER_LIB_PC_SAMPLING_TEST_SOURCES} + ${ROCPROFILER_LIB_PC_SAMPLING_TEST_HEADERS}) + +target_link_libraries( + pcs-test + PRIVATE rocprofiler::rocprofiler-common-library + rocprofiler::rocprofiler-static-library GTest::gtest GTest::gtest_main) + +gtest_add_tests( + TARGET pcs-test + SOURCES ${ROCPROFILER_LIB_PC_SAMPLING_TEST_SOURCES} + TEST_LIST pcs-tests_TESTS + WORKING_DIRECTORY 
${CMAKE_CURRENT_BINARY_DIR}) + +set_tests_properties( 
+ ${pcs-tests_TESTS} PROPERTIES TIMEOUT 45 LABELS "unittests;pc-sampling" + SKIP_REGULAR_EXPRESSION "PC sampling unavailable") diff --git a/source/lib/rocprofiler-sdk/pc_sampling/tests/configure_service.cpp b/source/lib/rocprofiler-sdk/pc_sampling/tests/configure_service.cpp new file mode 100644 index 0000000000..a01cc48145 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/tests/configure_service.cpp @@ -0,0 +1,453 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include +#include +#include +#include +#include + +#include "lib/common/utility.hpp" + +#include +#include + +namespace +{ +constexpr size_t BUFFER_SIZE_BYTES = 8192; +constexpr size_t WATERMARK = (BUFFER_SIZE_BYTES / 4); + +#define ROCPROFILER_CALL(ARG, MSG) \ + { \ + auto _status = (ARG); \ + EXPECT_EQ(_status, ROCPROFILER_STATUS_SUCCESS) << MSG << " :: " << #ARG; \ + } + +struct callback_data +{ + rocprofiler_client_id_t* client_id = nullptr; + rocprofiler_client_finalize_t client_fini_func = nullptr; + rocprofiler_context_id_t client_ctx = {}; + rocprofiler_buffer_id_t client_buffer = {}; + rocprofiler_callback_thread_t client_thread = {}; + uint64_t client_workflow_count = {}; + uint64_t client_callback_count = {}; + int64_t current_depth = 0; + int64_t max_depth = 0; + std::map client_correlation = {}; + std::vector gpu_pcs_agents = {}; +}; + +struct agent_data +{ + uint64_t agent_count = 0; + std::vector agents = {}; +}; + +bool +is_pc_sampling_supported(rocprofiler_agent_id_t agent_id) +{ + auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + auto* avail_configs = + static_cast*>(user_data); + // printf("The agent with the id: %lu supports the %lu configurations: \n", + // agent_id_.handle, num_config); + for(size_t i = 0; i < num_config; i++) + { + avail_configs->emplace_back(configs[i]); + } + return ROCPROFILER_STATUS_SUCCESS; + }; + + std::vector configs; + auto status = rocprofiler_query_pc_sampling_agent_configurations(agent_id, cb, &configs); + + if(status != ROCPROFILER_STATUS_SUCCESS) + { + // PC sampling is not supported + return false; + } + else if(configs.size() > 0) + { + return true; + } + else + { + return false; + } +} + +rocprofiler_status_t +find_all_gpu_agents_supporting_pc_sampling_impl(rocprofiler_agent_version_t version, + const void** agents, + size_t num_agents, + void* user_data) +{ + EXPECT_EQ(version, ROCPROFILER_AGENT_INFO_VERSION_0); + + // user_data represent the 
pointer to the array where gpu_agent will be stored + if(!user_data) return ROCPROFILER_STATUS_ERROR; + + auto* _out_agents = static_cast*>(user_data); + auto* _agents = reinterpret_cast(agents); + for(size_t i = 0; i < num_agents; i++) + { + if(_agents[i]->type == ROCPROFILER_AGENT_TYPE_GPU) + { + if(is_pc_sampling_supported(_agents[i]->id)) _out_agents->push_back(_agents[i]); + + printf("[%s] %s :: id=%zu, type=%i\n", + __FUNCTION__, + _agents[i]->name, + _agents[i]->id.handle, + _agents[i]->type); + } + else + { + printf("[%s] %s :: id=%zu, type=%i\n", + __FUNCTION__, + _agents[i]->name, + _agents[i]->id.handle, + _agents[i]->type); + } + } + + return ROCPROFILER_STATUS_SUCCESS; +} + +const rocprofiler_pc_sampling_configuration_t +extract_pc_sampling_config_prefer_stochastic(rocprofiler_agent_id_t agent_id) +{ + auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + auto* avail_configs = + static_cast*>(user_data); + // printf("The agent with the id: %lu supports the %lu configurations: \n", + // agent_id_.handle, num_config); + for(size_t i = 0; i < num_config; i++) + { + avail_configs->emplace_back(configs[i]); + } + return ROCPROFILER_STATUS_SUCCESS; + }; + std::vector configs; + ROCPROFILER_CALL(rocprofiler_query_pc_sampling_agent_configurations(agent_id, cb, &configs), + "Failed to query available configurations"); + + const rocprofiler_pc_sampling_configuration_t* first_host_trap_config = nullptr; + const rocprofiler_pc_sampling_configuration_t* first_stochastic_config = nullptr; + // Search until encountering on the stochastic configuration, if any. 
+ // Otherwise, use the host trap config + for(auto const& cfg : configs) + { + if(cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC) + { + first_stochastic_config = &cfg; + break; + } + else if(!first_host_trap_config && cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP) + { + first_host_trap_config = &cfg; + } + } + + // Check if the stochastic config is found. Use host trap config otherwise. + const rocprofiler_pc_sampling_configuration_t* picked_cfg = + (first_stochastic_config != nullptr) ? first_stochastic_config : first_host_trap_config; + + return *picked_cfg; +} + +void +rocprofiler_pc_sampling_callback(rocprofiler_context_id_t /*context_id*/, + rocprofiler_buffer_id_t /*buffer_id*/, + rocprofiler_record_header_t** /*headers*/, + size_t /*num_headers*/, + void* /*data*/, + uint64_t /*drop_count*/) +{} + +void +test_fail_because_of_wrong_agent(const callback_data* cb_data, + const rocprofiler_pc_sampling_configuration_t* pcs_config) +{ + auto not_existing_agent = rocprofiler_agent_id_t{0xDEADBEEF}; + + EXPECT_EQ(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + not_existing_agent, + pcs_config->method, + pcs_config->unit, + pcs_config->min_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_ERROR_AGENT_NOT_FOUND); +} + +void +test_fail_because_of_wrong_context(const callback_data* cb_data, + rocprofiler_agent_id_t agent_id, + const rocprofiler_pc_sampling_configuration_t* pcs_config) +{ + auto not_existing_ctx = rocprofiler_context_id_t{0xDEADBEEF}; + + EXPECT_EQ(rocprofiler_configure_pc_sampling_service(not_existing_ctx, + agent_id, + pcs_config->method, + pcs_config->unit, + pcs_config->min_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_ERROR_CONTEXT_NOT_FOUND); +} + +void +test_fail_because_of_wrong_buffer(const callback_data* cb_data, + rocprofiler_agent_id_t agent_id, + const rocprofiler_pc_sampling_configuration_t* pcs_config) +{ + auto not_existing_buffer_id = rocprofiler_buffer_id_t{0xDEADBEEF}; + + 
EXPECT_EQ(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + pcs_config->method, + pcs_config->unit, + pcs_config->min_interval, + not_existing_buffer_id), + ROCPROFILER_STATUS_ERROR_BUFFER_NOT_FOUND); +} + +void +test_fail_because_of_unsupported_configuration( + const callback_data* cb_data, + rocprofiler_agent_id_t agent_id, + const rocprofiler_pc_sampling_configuration_t* pcs_config) +{ + auto less_than_min_interval = pcs_config->min_interval - 1; + auto greater_than_max_interval = pcs_config->max_interval + 1; + auto wrong_method = ROCPROFILER_PC_SAMPLING_METHOD_LAST; + auto wrong_unit = ROCPROFILER_PC_SAMPLING_UNIT_NONE; + + EXPECT_NE(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + pcs_config->method, + pcs_config->unit, + less_than_min_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_SUCCESS); + + EXPECT_NE(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + pcs_config->method, + pcs_config->unit, + greater_than_max_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_SUCCESS); + + EXPECT_NE(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + wrong_method, + pcs_config->unit, + pcs_config->max_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_SUCCESS); + + EXPECT_NE(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + pcs_config->method, + wrong_unit, + pcs_config->max_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_SUCCESS); +} + +void +test_fail_because_service_is_already_configured( + const callback_data* cb_data, + rocprofiler_agent_id_t agent_id, + const rocprofiler_pc_sampling_configuration_t* pcs_config) +{ + EXPECT_EQ(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + pcs_config->method, + pcs_config->unit, + pcs_config->min_interval, + cb_data->client_buffer), + ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED); +} + +} // namespace + +TEST(pc_sampling, 
rocprofiler_configure_pc_sampling_service) +{ + using init_func_t = int (*)(rocprofiler_client_finalize_t, void*); + using fini_func_t = void (*)(void*); + + // using hsa_iterate_agents_cb_t = hsa_status_t (*)(hsa_agent_t, void*); + + auto cmd_line = rocprofiler::common::read_command_line(getpid()); + ASSERT_FALSE(cmd_line.empty()); + + static init_func_t tool_init = [](rocprofiler_client_finalize_t fini_func, + void* client_data) -> int { + auto* cb_data = static_cast(client_data); + + cb_data->client_workflow_count++; + cb_data->client_fini_func = fini_func; + + // This function returns the all gpu agents supporting some kind of PC sampling + ROCPROFILER_CALL( + rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0, + &find_all_gpu_agents_supporting_pc_sampling_impl, + sizeof(rocprofiler_agent_t), + static_cast(&cb_data->gpu_pcs_agents)), + "Failed to find GPU agents"); + + // TODO-VLAINDIC: Can we dynamically skip the test if the underlying + // HW does not support PC sampling + if(cb_data->gpu_pcs_agents.size() == 0) exit(0); + + ROCPROFILER_CALL(rocprofiler_create_context(&cb_data->client_ctx), + "failed to create context"); + + ROCPROFILER_CALL(rocprofiler_create_buffer(cb_data->client_ctx, + BUFFER_SIZE_BYTES, + WATERMARK, + ROCPROFILER_BUFFER_POLICY_LOSSLESS, + rocprofiler_pc_sampling_callback, + client_data, + &cb_data->client_buffer), + "buffer creation failed"); + + // We will create another context and try configuring pc sampling inside it, + // that is supposed to fail. 
+ rocprofiler_context_id_t another_ctx; + ROCPROFILER_CALL(rocprofiler_create_context(&another_ctx), "failed to create context"); + rocprofiler_buffer_id_t another_buff; + ROCPROFILER_CALL(rocprofiler_create_buffer(another_ctx, + 4096, + 2048, + ROCPROFILER_BUFFER_POLICY_LOSSLESS, + rocprofiler_pc_sampling_callback, + nullptr, + &another_buff), + "buffer creation failed"); + + for(const auto* agent : cb_data->gpu_pcs_agents) + { + const auto agent_id = agent->id; + const auto pcs_config = extract_pc_sampling_config_prefer_stochastic(agent_id); + + test_fail_because_of_wrong_agent(cb_data, &pcs_config); + test_fail_because_of_wrong_context(cb_data, agent_id, &pcs_config); + test_fail_because_of_wrong_buffer(cb_data, agent_id, &pcs_config); + test_fail_because_of_unsupported_configuration(cb_data, agent_id, &pcs_config); + + size_t interval = pcs_config.max_interval; + + // This call succeeds + ROCPROFILER_CALL(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent_id, + pcs_config.method, + pcs_config.unit, + interval, + cb_data->client_buffer), + "Failed to configure PC sampling service"); + + test_fail_because_service_is_already_configured(cb_data, agent_id, &pcs_config); + + // Cannot create a PC sampling service in a context other than `cb_data->client_ctx` + EXPECT_EQ(rocprofiler_configure_pc_sampling_service(another_ctx, + agent_id, + pcs_config.method, + pcs_config.unit, + interval, + another_buff), + ROCPROFILER_STATUS_ERROR); + } + + ROCPROFILER_CALL(rocprofiler_create_callback_thread(&cb_data->client_thread), + "failure creating callback thread"); + + ROCPROFILER_CALL( + rocprofiler_assign_callback_thread(cb_data->client_buffer, cb_data->client_thread), + "failed to assign thread for buffer"); + + int valid_ctx = 0; + ROCPROFILER_CALL(rocprofiler_context_is_valid(cb_data->client_ctx, &valid_ctx), + "failure checking context validity"); + + EXPECT_EQ(valid_ctx, 1); + + ROCPROFILER_CALL(rocprofiler_start_context(cb_data->client_ctx), + 
"rocprofiler context start failed"); + + // no errors + return 0; + }; + + static fini_func_t tool_fini = [](void* client_data) -> void { + auto* cb_data = static_cast(client_data); + ROCPROFILER_CALL(rocprofiler_stop_context(cb_data->client_ctx), + "rocprofiler context stop failed"); + + static_cast(client_data)->client_workflow_count++; + }; + + static auto cb_data = callback_data{}; + + static auto cfg_result = + rocprofiler_tool_configure_result_t{sizeof(rocprofiler_tool_configure_result_t), + tool_init, + tool_fini, + static_cast(&cb_data)}; + + static rocprofiler_configure_func_t rocp_init = + [](uint32_t version, + const char* runtime_version, + uint32_t prio, + rocprofiler_client_id_t* client_id) -> rocprofiler_tool_configure_result_t* { + auto expected_version = ROCPROFILER_VERSION; + EXPECT_EQ(expected_version, version); + EXPECT_EQ(std::string_view{runtime_version}, std::string_view{ROCPROFILER_VERSION_STRING}); + EXPECT_EQ(prio, 0); + EXPECT_EQ(client_id->name, nullptr); + cb_data.client_id = client_id; + cb_data.client_id->name = ::testing::UnitTest::GetInstance()->current_test_info()->name(); + + return &cfg_result; + }; + + EXPECT_EQ(rocprofiler_force_configure(rocp_init), ROCPROFILER_STATUS_SUCCESS); + + // Further tests assumes the existence of at least one GPU agent supporting + if(cb_data.gpu_pcs_agents.size() == 0) return; + + const auto* agent = cb_data.gpu_pcs_agents.at(0); + EXPECT_EQ(rocprofiler_configure_pc_sampling_service(cb_data.client_ctx, + agent->id, + ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP, + ROCPROFILER_PC_SAMPLING_UNIT_TIME, + 1, + cb_data.client_buffer), + ROCPROFILER_STATUS_ERROR_CONFIGURATION_LOCKED); +} diff --git a/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp b/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp new file mode 100644 index 0000000000..a752c24903 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp @@ -0,0 +1,63 @@ +// MIT 
License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#pragma once + +#include "lib/rocprofiler-sdk/pc_sampling/hsa_adapter.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" + +namespace rocprofiler +{ +namespace pc_sampling +{ +void +post_hsa_init_start_active_service(); + +namespace hsa +{ +extern void +amd_intercept_marker_handler_callback(const struct amd_aql_intercept_marker_s* packet, + hsa_queue_t* queue, + uint64_t packet_id); + +extern void +kernel_completion_cb(const std::shared_ptr& info, + const rocprofiler_agent_t* rocp_agent, + rocprofiler::hsa::ClientID client_id, + const rocprofiler::hsa::rocprofiler_packet& kernel_pkt, + const rocprofiler::hsa::Queue::queue_info_session_t& session, + std::unique_ptr pkt); + +extern void +data_ready_callback(void* client_callback_data, + size_t data_size, + size_t lost_sample_count, + hsa_ven_amd_pcs_data_copy_callback_t data_copy_callback, + void* hsa_callback_data); + +extern atomic_pc_sampling_service_t& +get_active_pc_sampling_service(); + +} // namespace hsa +} // namespace pc_sampling +} // namespace rocprofiler \ No newline at end of file diff --git a/source/lib/rocprofiler-sdk/pc_sampling/tests/query_configuration.cpp b/source/lib/rocprofiler-sdk/pc_sampling/tests/query_configuration.cpp new file mode 100644 index 0000000000..fd49bbc035 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/tests/query_configuration.cpp @@ -0,0 +1,364 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above 
copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include +#include +#include +#include +#include +#include + +namespace +{ +#define USER_DATA_VAL 33 + +constexpr size_t BUFFER_SIZE_BYTES = 8192; +constexpr size_t WATERMARK = (BUFFER_SIZE_BYTES / 4); + +#define ROCPROFILER_CALL(ARG, MSG) \ + { \ + auto _status = (ARG); \ + EXPECT_EQ(_status, ROCPROFILER_STATUS_SUCCESS) << MSG << " :: " << #ARG; \ + } + +struct callback_data +{ + rocprofiler_client_id_t* client_id = nullptr; + rocprofiler_client_finalize_t client_fini_func = nullptr; + rocprofiler_context_id_t client_ctx = {}; + rocprofiler_buffer_id_t client_buffer = {}; + rocprofiler_callback_thread_t client_thread = {}; + uint64_t client_workflow_count = {}; + uint64_t client_callback_count = {}; + int64_t current_depth = 0; + int64_t max_depth = 0; + std::map client_correlation = {}; + std::vector gpu_pcs_agents = {}; +}; + +struct agent_data +{ + uint64_t agent_count = 0; + std::vector agents = {}; +}; + +bool +is_pc_sampling_supported(rocprofiler_agent_id_t agent_id) +{ + auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + auto* avail_configs = + static_cast*>(user_data); + // printf("The agent with the id: %lu supports the %lu configurations: \n", + // agent_id_.handle, num_config); + for(size_t i = 0; i < num_config; i++) + { + avail_configs->emplace_back(configs[i]); + } + return 
ROCPROFILER_STATUS_SUCCESS; + }; + + std::vector configs; + auto status = rocprofiler_query_pc_sampling_agent_configurations(agent_id, cb, &configs); + + if(status != ROCPROFILER_STATUS_SUCCESS) + { + // PC sampling is not supported + return false; + } + else if(configs.size() > 0) + { + return true; + } + else + { + return false; + } +} + +rocprofiler_status_t +find_all_gpu_agents_supporting_pc_sampling_impl(rocprofiler_agent_version_t version, + const void** agents, + size_t num_agents, + void* user_data) +{ + EXPECT_EQ(version, ROCPROFILER_AGENT_INFO_VERSION_0); + + // user_data represent the pointer to the array where gpu_agent will be stored + if(!user_data) return ROCPROFILER_STATUS_ERROR; + + auto* _out_agents = static_cast*>(user_data); + auto* _agents = reinterpret_cast(agents); + for(size_t i = 0; i < num_agents; i++) + { + if(_agents[i]->type == ROCPROFILER_AGENT_TYPE_GPU) + { + if(is_pc_sampling_supported(_agents[i]->id)) _out_agents->push_back(_agents[i]); + + printf("[%s] %s :: id=%zu, type=%i\n", + __FUNCTION__, + _agents[i]->name, + _agents[i]->id.handle, + _agents[i]->type); + } + else + { + printf("[%s] %s :: id=%zu, type=%i\n", + __FUNCTION__, + _agents[i]->name, + _agents[i]->id.handle, + _agents[i]->type); + } + } + + return ROCPROFILER_STATUS_SUCCESS; +} + +rocprofiler_pc_sampling_configuration_t +extract_pc_sampling_config_prefer_stochastic(rocprofiler_agent_id_t agent_id) +{ + auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + auto* avail_configs = + static_cast*>(user_data); + // printf("The agent with the id: %lu supports the %lu configurations: \n", + // agent_id_.handle, num_config); + for(size_t i = 0; i < num_config; i++) + { + avail_configs->emplace_back(configs[i]); + } + return ROCPROFILER_STATUS_SUCCESS; + }; + std::vector configs; + ROCPROFILER_CALL(rocprofiler_query_pc_sampling_agent_configurations(agent_id, cb, &configs), + "Failed to query available configurations"); 
+ + const rocprofiler_pc_sampling_configuration_t* first_host_trap_config = nullptr; + const rocprofiler_pc_sampling_configuration_t* first_stochastic_config = nullptr; + // Search until encountering on the stochastic configuration, if any. + // Otherwise, use the host trap config + for(auto const& cfg : configs) + { + if(cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC) + { + first_stochastic_config = &cfg; + break; + } + else if(!first_host_trap_config && cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP) + { + first_host_trap_config = &cfg; + } + } + + // Check if the stochastic config is found. Use host trap config otherwise. + const rocprofiler_pc_sampling_configuration_t* picked_cfg = + (first_stochastic_config != nullptr) ? first_stochastic_config : first_host_trap_config; + + return *picked_cfg; +} + +void +rocprofiler_pc_sampling_callback(rocprofiler_context_id_t /*context_id*/, + rocprofiler_buffer_id_t /*buffer_id*/, + rocprofiler_record_header_t** /*headers*/, + size_t /*num_headers*/, + void* /*data*/, + uint64_t /*drop_count*/) +{} + +rocprofiler_status_t +check_all_configs_cb(const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) +{ + auto* val = reinterpret_cast(user_data); + EXPECT_EQ(*val, USER_DATA_VAL); + + if(num_config == 0) return ROCPROFILER_STATUS_ERROR; + + for(size_t i = 0; i < num_config; i++) + { + const auto* cfg = &configs[i]; + EXPECT_LT(ROCPROFILER_PC_SAMPLING_METHOD_NONE, cfg->method); + EXPECT_LT(cfg->method, ROCPROFILER_PC_SAMPLING_METHOD_LAST); + + EXPECT_LT(ROCPROFILER_PC_SAMPLING_UNIT_NONE, cfg->unit); + EXPECT_LT(cfg->unit, ROCPROFILER_PC_SAMPLING_UNIT_LAST); + } + + return ROCPROFILER_STATUS_SUCCESS; +}; + +} // namespace + +// TODO: change according to the actual implementation +TEST(pc_sampling, query_configs_agent_does_not_exists) +{ + int cb_data = USER_DATA_VAL; + // The agent does not exists + EXPECT_EQ(rocprofiler_query_pc_sampling_agent_configurations( + 
rocprofiler_agent_id_t{.handle = 0xDEADBEEF}, check_all_configs_cb, &cb_data), + ROCPROFILER_STATUS_ERROR_AGENT_NOT_FOUND); +} + +TEST(pc_sampling, query_configs_after_service_setup) +{ + using init_func_t = int (*)(rocprofiler_client_finalize_t, void*); + using fini_func_t = void (*)(void*); + + // TODO: configure PC sampling and query if the configuration is listed + static init_func_t tool_init = [](rocprofiler_client_finalize_t fini_func, + void* client_data) -> int { + auto* cb_data = static_cast(client_data); + + cb_data->client_workflow_count++; + cb_data->client_fini_func = fini_func; + + // This function returns the all gpu agents supporting some kind of PC sampling + ROCPROFILER_CALL( + rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0, + &find_all_gpu_agents_supporting_pc_sampling_impl, + sizeof(rocprofiler_agent_t), + static_cast(&cb_data->gpu_pcs_agents)), + "Failed to find GPU agents"); + + // TODO-VLAINDIC: Can we dynamically skip the test if the underlying + // HW does not support PC sampling + if(cb_data->gpu_pcs_agents.size() == 0) exit(0); + + int query_cb_data = USER_DATA_VAL; + const auto* agent = cb_data->gpu_pcs_agents.at(0); + const auto agent_id = agent->id; + auto status = rocprofiler_query_pc_sampling_agent_configurations( + agent_id, check_all_configs_cb, &query_cb_data); + + if(status != ROCPROFILER_STATUS_SUCCESS) + { + // The agent does not support PC sampling + return -1; + } + + ROCPROFILER_CALL(rocprofiler_create_context(&cb_data->client_ctx), + "failed to create context"); + + ROCPROFILER_CALL(rocprofiler_create_buffer(cb_data->client_ctx, + BUFFER_SIZE_BYTES, + WATERMARK, + ROCPROFILER_BUFFER_POLICY_LOSSLESS, + rocprofiler_pc_sampling_callback, + client_data, + &cb_data->client_buffer), + "buffer creation failed"); + + auto pcs_config = extract_pc_sampling_config_prefer_stochastic(agent_id); + + size_t interval = pcs_config.max_interval; + + // This calls succeeds + 
ROCPROFILER_CALL(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent->id, + pcs_config.method, + pcs_config.unit, + interval, + cb_data->client_buffer), + "Failed to configure PC sampling service"); + + // Query the configuration and expect to see `pcs_config.max_interval` as the `interval` + auto post_setup_conf_cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + const rocprofiler_pc_sampling_configuration_t* picked_cfg = + static_cast(user_data); + + EXPECT_EQ(num_config, 1); + + const auto* cfg = &configs[0]; + EXPECT_EQ(cfg->method, picked_cfg->method); + EXPECT_EQ(cfg->unit, picked_cfg->unit); + // Min and max interval are equal when PC sampling is enabled + EXPECT_EQ(cfg->min_interval, cfg->max_interval); + // When setting up PC sampling, we used the max_interval of the picked_cfg + EXPECT_EQ(cfg->max_interval, picked_cfg->max_interval); + + return ROCPROFILER_STATUS_SUCCESS; + }; + + EXPECT_EQ(rocprofiler_query_pc_sampling_agent_configurations( + agent_id, post_setup_conf_cb, &pcs_config), + ROCPROFILER_STATUS_SUCCESS); + + ROCPROFILER_CALL(rocprofiler_create_callback_thread(&cb_data->client_thread), + "failure creating callback thread"); + + ROCPROFILER_CALL( + rocprofiler_assign_callback_thread(cb_data->client_buffer, cb_data->client_thread), + "failed to assign thread for buffer"); + + int valid_ctx = 0; + ROCPROFILER_CALL(rocprofiler_context_is_valid(cb_data->client_ctx, &valid_ctx), + "failure checking context validity"); + + EXPECT_EQ(valid_ctx, 1); + + ROCPROFILER_CALL(rocprofiler_start_context(cb_data->client_ctx), + "rocprofiler context start failed"); + + // no errors + return 0; + }; + + static fini_func_t tool_fini = [](void* client_data) -> void { + auto* cb_data = static_cast(client_data); + ROCPROFILER_CALL(rocprofiler_stop_context(cb_data->client_ctx), + "rocprofiler context stop failed"); + + cb_data->client_workflow_count++; + }; + + static auto cb_data = 
callback_data{}; + + static auto cfg_result = + rocprofiler_tool_configure_result_t{sizeof(rocprofiler_tool_configure_result_t), + tool_init, + tool_fini, + static_cast(&cb_data)}; + + static rocprofiler_configure_func_t rocp_init = + [](uint32_t version, + const char* runtime_version, + uint32_t prio, + rocprofiler_client_id_t* client_id) -> rocprofiler_tool_configure_result_t* { + auto expected_version = ROCPROFILER_VERSION; + EXPECT_EQ(expected_version, version); + EXPECT_EQ(std::string_view{runtime_version}, std::string_view{ROCPROFILER_VERSION_STRING}); + EXPECT_EQ(prio, 0); + EXPECT_EQ(client_id->name, nullptr); + cb_data.client_id = client_id; + cb_data.client_id->name = ::testing::UnitTest::GetInstance()->current_test_info()->name(); + + return &cfg_result; + }; + + EXPECT_EQ(rocprofiler_force_configure(rocp_init), ROCPROFILER_STATUS_SUCCESS); +} diff --git a/source/lib/rocprofiler-sdk/pc_sampling/tests/samples_processing.cpp b/source/lib/rocprofiler-sdk/pc_sampling/tests/samples_processing.cpp new file mode 100644 index 0000000000..3a54f178a6 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/tests/samples_processing.cpp @@ -0,0 +1,437 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. 
+// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include +#include +#include +#include +#include +#include + +#include "lib/common/utility.hpp" +#include "lib/rocprofiler-sdk/hsa/agent_cache.hpp" +#include "lib/rocprofiler-sdk/hsa/hsa.hpp" +#include "lib/rocprofiler-sdk/hsa/queue.hpp" +#include "lib/rocprofiler-sdk/hsa/queue_controller.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/rocr.h" +#include "lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp" +#include "pc_sampling_internals.hpp" + +#include +#include +#include + +constexpr size_t BUFFER_SIZE_BYTES = 8192; +constexpr size_t WATERMARK = (BUFFER_SIZE_BYTES / 4); + +#define ROCPROFILER_CALL(ARG, MSG) \ + { \ + auto _status = (ARG); \ + EXPECT_EQ(_status, ROCPROFILER_STATUS_SUCCESS) << MSG << " :: " << #ARG; \ + } + +namespace +{ +#define NUM_SAMPLES 5 +#define TRAP_ID 0 + +struct callback_data +{ + rocprofiler_client_id_t* client_id = nullptr; + rocprofiler_client_finalize_t client_fini_func = nullptr; + rocprofiler_context_id_t client_ctx = {}; + rocprofiler_buffer_id_t client_buffer = {}; + rocprofiler_callback_thread_t client_thread = {}; + uint64_t client_workflow_count = {}; + uint64_t client_callback_count = {}; + int64_t current_depth = 0; + int64_t max_depth = 0; + std::map client_correlation = {}; + std::vector gpu_pcs_agents = {}; +}; + +struct agent_data +{ + uint64_t agent_count = 0; + std::vector agents = {}; +}; + +rocprofiler_status_t +find_all_gpu_agents_supporting_pc_sampling_impl(const rocprofiler_agent_t** agents, + size_t 
num_agents, + void* user_data) +{ + // user_data points to the vector where the GPU agents will be stored + if(!user_data) return ROCPROFILER_STATUS_ERROR; + + auto* _out_agents = static_cast*>(user_data); + // find all GPU agents that support PC sampling + for(size_t i = 0; i < num_agents; i++) + { + if(agents[i]->type == ROCPROFILER_AGENT_TYPE_GPU) + { + // Skip GPU agents not supporting PC sampling + // Vladimir: The assumption is that if a GPU agent does not support PC sampling, + // the size is 0. + if(agents[i]->num_pc_sampling_configs == 0) continue; + + _out_agents->push_back(agents[i]); + + printf("[%s] %s :: id=%zu, type=%i, num pc sample configs=%zu\n", + __FUNCTION__, + agents[i]->name, + agents[i]->id.handle, + agents[i]->type, + agents[i]->num_pc_sampling_configs); + } + else + { + printf("[%s] %s :: id=%zu, type=%i, num pc sample configs=%zu\n", + __FUNCTION__, + agents[i]->name, + agents[i]->id.handle, + agents[i]->type, + agents[i]->num_pc_sampling_configs); + } + } + + return !_out_agents->empty() ? 
ROCPROFILER_STATUS_SUCCESS : ROCPROFILER_STATUS_ERROR; +} + +const rocprofiler_pc_sampling_configuration_t +extract_pc_sampling_config_prefer_stochastic(rocprofiler_agent_id_t agent_id) +{ + auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs, + size_t num_config, + void* user_data) { + auto* avail_configs = + static_cast*>(user_data); + // printf("The agent with the id: %lu supports the %lu configurations: \n", + // agent_id_.handle, num_config); + for(size_t i = 0; i < num_config; i++) + { + avail_configs->emplace_back(configs[i]); + } + return ROCPROFILER_STATUS_SUCCESS; + }; + std::vector configs; + ROCPROFILER_CALL(rocprofiler_query_pc_sampling_agent_configurations(agent_id, cb, &configs), + "Failed to query available configurations"); + + const rocprofiler_pc_sampling_configuration_t* first_host_trap_config = nullptr; + const rocprofiler_pc_sampling_configuration_t* first_stochastic_config = nullptr; + // Search until encountering a stochastic configuration, if any. + // Otherwise, use the host trap config + for(auto const& cfg : configs) + { + if(cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC) + { + first_stochastic_config = &cfg; + break; + } + else if(!first_host_trap_config && cfg.method == ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP) + { + first_host_trap_config = &cfg; + } + } + + // Check if the stochastic config is found. Use host trap config otherwise. + const rocprofiler_pc_sampling_configuration_t* picked_cfg = + (first_stochastic_config != nullptr) ? 
first_stochastic_config : first_host_trap_config; + + return *picked_cfg; +} + +void +rocprofiler_pc_sampling_callback(rocprofiler_context_id_t /*context_id*/, + rocprofiler_buffer_id_t /*buffer_id*/, + rocprofiler_record_header_t** headers, + size_t num_headers, + void* /*data*/, + uint64_t drop_count) +{ + EXPECT_EQ(drop_count, 0); + + for(size_t i = 0; i < num_headers; i++) + { + auto* cur_header = headers[i]; + + if(cur_header == nullptr) + { + throw std::runtime_error{ + "rocprofiler provided a null pointer to header. this should never happen"}; + } + else if(cur_header->hash != + rocprofiler_record_header_compute_hash(cur_header->category, cur_header->kind)) + { + throw std::runtime_error{"rocprofiler_record_header_t (category | kind) != hash"}; + } + else if(cur_header->category == ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING) + { + auto* pc_sample = static_cast(cur_header->payload); + // FIXME: find the cause why this fails + // EXPECT_EQ(pc_sample->correlation_id.internal, 1); + EXPECT_EQ(pc_sample->pc, i + 1); + EXPECT_EQ(pc_sample->timestamp, i + 33); + EXPECT_EQ(pc_sample->hw_id, 0); + } + else + { + throw std::runtime_error{"unexpected rocprofiler_record_header_t category + kind"}; + } + } +} + +} // namespace + +TEST(pc_sampling, processing_pc_samples) +{ + using init_func_t = int (*)(rocprofiler_client_finalize_t, void*); + using fini_func_t = void (*)(void*); + + // using hsa_iterate_agents_cb_t = hsa_status_t (*)(hsa_agent_t, void*); + + auto cmd_line = rocprofiler::common::read_command_line(getpid()); + ASSERT_FALSE(cmd_line.empty()); + + static init_func_t tool_init = [](rocprofiler_client_finalize_t fini_func, + void* client_data) -> int { + auto* cb_data = static_cast(client_data); + + cb_data->client_workflow_count++; + cb_data->client_fini_func = fini_func; + + // This function returns all GPU agents supporting some kind of PC sampling + ROCPROFILER_CALL( + rocprofiler_query_available_agents(&find_all_gpu_agents_supporting_pc_sampling_impl, + 
sizeof(rocprofiler_agent_t), + static_cast(&cb_data->gpu_pcs_agents)), + "Failed to find GPU agents"); + + ROCPROFILER_CALL(rocprofiler_create_context(&cb_data->client_ctx), + "failed to create context"); + + ROCPROFILER_CALL(rocprofiler_create_buffer(cb_data->client_ctx, + 4096, + 2048, + ROCPROFILER_BUFFER_POLICY_LOSSLESS, + rocprofiler_pc_sampling_callback, + client_data, + &cb_data->client_buffer), + "buffer creation failed"); + + const auto* agent = cb_data->gpu_pcs_agents.at(0); + const auto agent_id = agent->id; + const auto pcs_config = extract_pc_sampling_config_prefer_stochastic(agent_id); + + size_t interval = pcs_config.max_interval; + + // This call succeeds + ROCPROFILER_CALL(rocprofiler_configure_pc_sampling_service(cb_data->client_ctx, + agent->id, + pcs_config.method, + pcs_config.unit, + interval, + cb_data->client_buffer), + "Failed to configure PC sampling service"); + + ROCPROFILER_CALL(rocprofiler_create_callback_thread(&cb_data->client_thread), + "failure creating callback thread"); + + ROCPROFILER_CALL( + rocprofiler_assign_callback_thread(cb_data->client_buffer, cb_data->client_thread), + "failed to assign thread for buffer"); + + int valid_ctx = 0; + ROCPROFILER_CALL(rocprofiler_context_is_valid(cb_data->client_ctx, &valid_ctx), + "failure checking context validity"); + + EXPECT_EQ(valid_ctx, 1); + + ROCPROFILER_CALL(rocprofiler_start_context(cb_data->client_ctx), + "rocprofiler context start failed"); + + // no errors + return 0; + }; + + static fini_func_t tool_fini = [](void* client_data) -> void { + auto* cb_data = static_cast(client_data); + // FIXME: for some reason, this returns context not found + // ROCPROFILER_CALL(rocprofiler_stop_context(cb_data->client_ctx), + // "rocprofiler context stop failed"); + + cb_data->client_workflow_count++; + }; + + static auto cb_data = callback_data{}; + + static auto cfg_result = + rocprofiler_tool_configure_result_t{sizeof(rocprofiler_tool_configure_result_t), + tool_init, + tool_fini, + 
static_cast(&cb_data)}; + + static rocprofiler_configure_func_t rocp_init = + [](uint32_t version, + const char* runtime_version, + uint32_t prio, + rocprofiler_client_id_t* client_id) -> rocprofiler_tool_configure_result_t* { + auto expected_version = ROCPROFILER_VERSION; + EXPECT_EQ(expected_version, version); + EXPECT_EQ(std::string_view{runtime_version}, std::string_view{ROCPROFILER_VERSION_STRING}); + EXPECT_EQ(prio, 0); + EXPECT_EQ(client_id->name, nullptr); + cb_data.client_id = client_id; + cb_data.client_id->name = ::testing::UnitTest::GetInstance()->current_test_info()->name(); + + return &cfg_result; + }; + + EXPECT_EQ(rocprofiler_force_configure(rocp_init), ROCPROFILER_STATUS_SUCCESS); + + // Further tests assume the existence of at least one GPU agent supporting PC sampling + if(cb_data.gpu_pcs_agents.size() == 0) return; + + auto& hsa_table = rocprofiler::hsa::get_table(); + auto* pc_sampling_table_ = hsa_table.pc_sampling_ext_; + EXPECT_NE(pc_sampling_table_, nullptr); + + pc_sampling_table_->hsa_ven_amd_pcs_create_from_id_fn = + [](uint32_t /*ioctl_pcs_id*/, + hsa_agent_t /*agent*/, + hsa_ven_amd_pcs_method_kind_t /*method*/, + hsa_ven_amd_pcs_units_t /*units*/, + size_t /*interval*/, + size_t /*latency*/, + size_t /*buffer_size*/, + hsa_ven_amd_pcs_data_ready_callback_t /*data_ready_callback*/, + void* /*client_callback_data*/, + hsa_ven_amd_pcs_t* /*pc_sampling*/) { return HSA_STATUS_SUCCESS; }; + + pc_sampling_table_->hsa_ven_amd_pcs_flush_fn = [](hsa_ven_amd_pcs_t /*pc_sampling*/) { + return HSA_STATUS_SUCCESS; + }; + + auto* ext_table_ = hsa_table.amd_ext_; + EXPECT_NE(ext_table_, nullptr); + + ext_table_->hsa_amd_queue_get_info_fn = + [](hsa_queue_t* queue, hsa_queue_info_attribute_t attribute, void* value) { + (void) queue; + switch(attribute) + { + case HSA_AMD_QUEUE_INFO_AGENT: + *(reinterpret_cast(value)) = hsa_agent_t{.handle = 1}; + break; + case HSA_AMD_QUEUE_INFO_DOORBELL_ID: + *(reinterpret_cast(value)) = 0; + break; + default: return 
HSA_STATUS_ERROR_INVALID_ARGUMENT; + } + return HSA_STATUS_SUCCESS; + }; + +#if 1 + + // Set the HSA agent for the active PCSamplingConfiguration. + // The reason for setting the HSA agent manually follows. + // The test links against the rocprofiler static library. + // Hence, the rocprofiler_set_api_table is not called. + auto* service = rocprofiler::pc_sampling::get_active_pc_sampling_service().load(); + EXPECT_NE(service, nullptr); + const auto* rocp_agent = cb_data.gpu_pcs_agents.at(0); + auto agent_session = service->agent_sessions.at(rocp_agent->id).get(); + hsa_agent_t pseudo_hsa_agent = {.handle = 1}; + agent_session->hsa_agent = std::make_unique(pseudo_hsa_agent); + + // TODO: We need to register the agent inside the parser + rocprofiler::pc_sampling::hsa::get_pc_sampling_parser().register_buffer_for_agent( + cb_data.client_buffer.handle, rocp_agent->id.handle); + + // The following test calls some segments of internal PC sampling API implementation + // by mimicking the HIP and ROCr + + // Generate dispatch and marker packet + rocprofiler::hsa::rocprofiler_packet dispatch_pkt; + auto marker_pkt = + rocprofiler::pc_sampling::hsa::generate_marker_packet_for_kernel(&dispatch_pkt); + + // create a pseudo hsa queue + hsa_queue_t queue; + queue.size = 1024; + // Mimic ROCr and notify the PC sampling service that the marker packet has been encountered. + rocprofiler::pc_sampling::hsa::amd_intercept_marker_handler_callback( + &marker_pkt.marker, &queue, 0); + + // We need to generate some samples and send them via data_ready_callback. 
+ size_t num_samples = NUM_SAMPLES; + auto samples_data_size = num_samples * sizeof(packet_union_t); + + static hsa_ven_amd_pcs_data_copy_callback_t hsa_mock_data_copy_callback = + [](void* hsa_callback_data, size_t data_size, void* destination) { + (void) hsa_callback_data; + (void) data_size; + using rocr_buffer_t = std::vector; + auto samples_buff = rocr_buffer_t{}; + for(size_t i = 0; i < NUM_SAMPLES; i++) + { + perf_sample_host_trap_v1 hs; + hs.pc = i + 1; + hs.exec_mask = 0xF; + hs.workgroup_id_x = 1; + hs.workgroup_id_y = 2; + hs.workgroup_id_z = 3; + hs.chiplet_and_wave_id = 0; + hs.hw_id = 0; + hs.timestamp = 33 + i; + hs.correlation_id = TRAP_ID; + samples_buff.push_back(packet_union_t{.host = hs}); + } + // copy the data + std::memcpy(destination, samples_buff.data(), NUM_SAMPLES * sizeof(packet_union_t)); + // clear the data + return HSA_STATUS_SUCCESS; + }; + + // calling data_ready_callback that will result in copying the data from above + // to the client buffer via PC sampling parser + rocprofiler::pc_sampling::PCSAgentSession pcs_agent_session; + pcs_agent_session.agent = rocp_agent; + pcs_agent_session.method = ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP; + + size_t lost_samples = 0; + + rocprofiler::pc_sampling::hsa::data_ready_callback( + &pcs_agent_session, samples_data_size, lost_samples, hsa_mock_data_copy_callback, nullptr); + + rocprofiler::pc_sampling::hsa::kernel_completion_cb( + nullptr, rocp_agent, static_cast(1), dispatch_pkt, nullptr); + + // Flush the buffer explicitly + ROCPROFILER_CALL(rocprofiler_flush_buffer(cb_data.client_buffer), + "rocprofiler flush buffer failed"); + // Stop the context + ROCPROFILER_CALL(rocprofiler_stop_context(cb_data.client_ctx), + "rocprofiler context stop failed"); +#endif +} diff --git a/source/lib/rocprofiler-sdk/pc_sampling/types.hpp b/source/lib/rocprofiler-sdk/pc_sampling/types.hpp new file mode 100644 index 0000000000..61f811ce97 --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/types.hpp @@ 
-0,0 +1,44 @@ +#pragma once + +#include "lib/rocprofiler-sdk/hsa/queue.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/cid_manager.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/parser/pc_record_interface.hpp" + +#include +#include + +#include +#include + +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +// forward declaration to avoid circular dependency +class PCSCIDManager; + +struct PCSAgentSession +{ + const rocprofiler_agent_t* agent; + rocprofiler_pc_sampling_method_t method; + rocprofiler_pc_sampling_unit_t unit; + uint64_t interval; + rocprofiler_buffer_id_t buffer_id; + // hsa relevant information + std::optional hsa_agent = std::nullopt; + hsa_ven_amd_pcs_t hsa_pc_sampling; + hsa::ClientID intercept_cb_id{-1}; + // ioctl relevant information + uint32_t ioctl_pcs_id; + // PC sampling parser + std::unique_ptr parser; + // Manager responsible for retiring CIDs + std::unique_ptr cid_manager; +}; + +// TODO static assertions + +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/utils.cpp b/source/lib/rocprofiler-sdk/pc_sampling/utils.cpp new file mode 100644 index 0000000000..36bf597add --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/utils.cpp @@ -0,0 +1,79 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. 
+// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include "lib/rocprofiler-sdk/pc_sampling/utils.hpp" +#include "lib/rocprofiler-sdk/agent.hpp" + +#include +#include + +#include +#include +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace utils +{ +hsa_ven_amd_pcs_method_kind_t +get_matching_hsa_pcs_method(rocprofiler_pc_sampling_method_t method) +{ + switch(method) + { + case ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC: return HSA_VEN_AMD_PCS_METHOD_STOCHASTIC_V1; + case ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP: return HSA_VEN_AMD_PCS_METHOD_HOSTTRAP_V1; + default: throw std::runtime_error("Illegal pc sampling method\n"); + } +} + +hsa_ven_amd_pcs_units_t +get_matching_hsa_pcs_units(rocprofiler_pc_sampling_unit_t unit) +{ + switch(unit) + { + case ROCPROFILER_PC_SAMPLING_UNIT_NONE: break; + case ROCPROFILER_PC_SAMPLING_UNIT_INSTRUCTIONS: + return HSA_VEN_AMD_PCS_INTERVAL_UNITS_INSTRUCTIONS; + case ROCPROFILER_PC_SAMPLING_UNIT_CYCLES: + return HSA_VEN_AMD_PCS_INTERVAL_UNITS_CLOCK_CYCLES; + case ROCPROFILER_PC_SAMPLING_UNIT_TIME: return HSA_VEN_AMD_PCS_INTERVAL_UNITS_MICRO_SECONDS; + case ROCPROFILER_PC_SAMPLING_UNIT_LAST: break; + } + + throw std::runtime_error("Illegal pc sampling units\n"); +} + +uint64_t +get_unique_correlation_id() +{ + // TODO: Remove once we confirmed it is unnecessary. 
+ // Also, update the PC sampling parser not to decode correlation ID + // (or always 0 for both internal/external correlation IDs) + static auto _cnt = std::atomic{0}; + return ++_cnt; +} + +} // namespace utils +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/pc_sampling/utils.hpp b/source/lib/rocprofiler-sdk/pc_sampling/utils.hpp new file mode 100644 index 0000000000..baabebb33c --- /dev/null +++ b/source/lib/rocprofiler-sdk/pc_sampling/utils.hpp @@ -0,0 +1,62 @@ +// MIT License +// +// Copyright (c) 2023 ROCm Developer Tools +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include +#include + +#include +#include + +#include +#include + +namespace rocprofiler +{ +namespace pc_sampling +{ +namespace utils +{ +hsa_ven_amd_pcs_method_kind_t +get_matching_hsa_pcs_method(rocprofiler_pc_sampling_method_t method); + +hsa_ven_amd_pcs_units_t +get_matching_hsa_pcs_units(rocprofiler_pc_sampling_unit_t unit); + +inline constexpr size_t +get_hsa_pcs_latency() +{ + // TODO: Check with David about the default value in the hsa-runtime + return 1000; +} + +inline constexpr size_t +get_hsa_pcs_buffer_size() +{ + // TODO: Find the minimum size of all buffers and use that. + return 1024 * sizeof(perf_sample_hosttrap_v1_t); +} + +uint64_t +get_unique_correlation_id(); +} // namespace utils +} // namespace pc_sampling +} // namespace rocprofiler diff --git a/source/lib/rocprofiler-sdk/registration.cpp b/source/lib/rocprofiler-sdk/registration.cpp index d49be4ec6e..76251c97aa 100644 --- a/source/lib/rocprofiler-sdk/registration.cpp +++ b/source/lib/rocprofiler-sdk/registration.cpp @@ -39,6 +39,8 @@ #include "lib/rocprofiler-sdk/internal_threading.hpp" #include "lib/rocprofiler-sdk/marker/marker.hpp" #include "lib/rocprofiler-sdk/page_migration/page_migration.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/code_object.hpp" +#include "lib/rocprofiler-sdk/pc_sampling/service.hpp" #include #include @@ -601,6 +603,10 @@ finalize() hsa::async_copy_fini(); hsa::queue_controller_fini(); page_migration::finalize(); +#if ROCPROFILER_SDK_HSA_PC_SAMPLING > 0 + // WARNING: this must precede `code_object::finalize()` + pc_sampling::code_object::finalize(); +#endif code_object::finalize(); if(get_init_status() > 0) { @@ -753,6 +759,9 @@ rocprofiler_set_api_table(const char* name, rocprofiler::hsa::async_copy_init(hsa_api_table, lib_instance); rocprofiler::code_object::initialize(hsa_api_table); +#if ROCPROFILER_SDK_HSA_PC_SAMPLING > 0 + rocprofiler::pc_sampling::code_object::initialize(hsa_api_table); +#endif // install rocprofiler API wrappers 
rocprofiler::hsa::update_table(hsa_api_table->core_, lib_instance); @@ -761,6 +770,11 @@ rocprofiler_set_api_table(const char* name, rocprofiler::hsa::update_table(hsa_api_table->finalizer_ext_, lib_instance); rocprofiler::hsa::update_table(hsa_api_table->tools_, lib_instance); +#if ROCPROFILER_SDK_HSA_PC_SAMPLING > 0 + // Initialize PC sampling service if configured + rocprofiler::pc_sampling::post_hsa_init_start_active_service(); +#endif + // allow tools to install API wrappers rocprofiler::intercept_table::notify_intercept_table_registration( ROCPROFILER_HSA_TABLE, lib_version, lib_instance, std::make_tuple(hsa_api_table)); @@ -816,6 +830,7 @@ rocprofiler_set_api_table(const char* name, return 0; } +// #if 0 bool OnLoad(HsaApiTable* table, uint64_t runtime_version, @@ -843,4 +858,5 @@ OnUnload() ::rocprofiler::registration::finalize(); ROCP_INFO << "Finalization complete."; } +// #endif } diff --git a/source/lib/rocprofiler-sdk/rocprofiler.cpp b/source/lib/rocprofiler-sdk/rocprofiler.cpp index 13a2da9ec8..d5a1ed1ec8 100644 --- a/source/lib/rocprofiler-sdk/rocprofiler.cpp +++ b/source/lib/rocprofiler-sdk/rocprofiler.cpp @@ -107,6 +107,11 @@ ROCPROFILER_STATUS_STRING(ROCPROFILER_STATUS_ERROR_NO_HARDWARE_COUNTERS, "Counter set does not include any hardware counters") ROCPROFILER_STATUS_STRING(ROCPROFILER_STATUS_ERROR_AGENT_MISMATCH, "Counter profile agent does not match the agent in the context") +ROCPROFILER_STATUS_STRING(ROCPROFILER_STATUS_ERROR_NOT_AVAILABLE, + "The service is not available." 
+ " Please refer to API functions that return this status code" + " for more information.") + template const char* get_status_name(rocprofiler_status_t status, std::index_sequence) diff --git a/source/lib/rocprofiler-sdk/tests/agent.cpp b/source/lib/rocprofiler-sdk/tests/agent.cpp index 855fe41ec2..538be36b0c 100644 --- a/source/lib/rocprofiler-sdk/tests/agent.cpp +++ b/source/lib/rocprofiler-sdk/tests/agent.cpp @@ -101,13 +101,11 @@ TEST(rocprofiler_lib, agent_abi) EXPECT_EQ(offsetof(rocprofiler_agent_t, vendor_name), 256) << msg; EXPECT_EQ(offsetof(rocprofiler_agent_t, product_name), 264) << msg; EXPECT_EQ(offsetof(rocprofiler_agent_t, model_name), 272) << msg; - EXPECT_EQ(offsetof(rocprofiler_agent_t, num_pc_sampling_configs), 280) << msg; - EXPECT_EQ(offsetof(rocprofiler_agent_t, pc_sampling_configs), 288) << msg; - EXPECT_EQ(offsetof(rocprofiler_agent_t, node_id), 296) << msg; - EXPECT_EQ(offsetof(rocprofiler_agent_t, logical_node_id), 300) << msg; + EXPECT_EQ(offsetof(rocprofiler_agent_t, node_id), 280) << msg; + EXPECT_EQ(offsetof(rocprofiler_agent_t, logical_node_id), 284) << msg; // Add test for offset of new field above this. Do NOT change any existing values! - constexpr auto expected_rocp_agent_size = 304; + constexpr auto expected_rocp_agent_size = 288; // If a new field is added, increase this value by the size of the new field(s) EXPECT_EQ(sizeof(rocprofiler_agent_t), expected_rocp_agent_size) << "ABI break. If you added a new field, make sure that this is the only new check that " diff --git a/source/lib/rocprofiler-sdk/tests/page_migration.cpp b/source/lib/rocprofiler-sdk/tests/page_migration.cpp index 743114c764..696bf03078 100644 --- a/source/lib/rocprofiler-sdk/tests/page_migration.cpp +++ b/source/lib/rocprofiler-sdk/tests/page_migration.cpp @@ -21,7 +21,7 @@ // SOFTWARE. 
#include "lib/common/defines.hpp" -#include "lib/rocprofiler-sdk/page_migration/details/kfd_ioctl.h" +#include "lib/rocprofiler-sdk/details/kfd_ioctl.h" #include "lib/rocprofiler-sdk/page_migration/utils.hpp" #include