168 Commity

Autor SHA1 Wiadomość Data
Venkateshwar Reddy Kandula dea3da3a6f [rocprofiler-sdk][CI] Fix rocprofiler-sdk CI change rocm version to 7.2.0 (#2979)
* Update aqlprofile-continuous_integration.yml to use rocm-7.2.0

* Update rocprofiler-sdk-continuous_integration.yml to use rocm-7.2.0

* Update rocprofiler-sdk-docs.yml to use rocm-7.2.0
2026-01-29 17:28:52 -06:00
Venkateshwar Reddy Kandula a7c3e8392a [rocprofiler-sdk] Use venv for fixing CI docker image workflow (#2955)
* use python virtual env for aws cli

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* use 7.2 amdgpu for ubuntu

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-29 09:53:15 -05:00
Copilot 14f9f2537a Add artifact upload steps to AMDSMI CI workflow for PR builds (#2936) 2026-01-28 18:14:47 -05:00
Jason Bonnell d917259953 Add --verbose to ctest to get more output (#2928) 2026-01-28 22:43:14 +05:30
Benjamin Welton 1517a398bf [rocprofiler-sdk] Buffer finalization fixes and HSA ABI 0x09 support (#2318)
* [rocprofiler-sdk] Fix buffer flush ordering and sanitizer CI improvements

Buffer Pool Design
------------------
Replace the fixed array-based double buffer with a dynamic pool design to
fix race conditions that caused "internal correlation id was retired
prematurely" errors.

The original design had a race where flush callbacks could be delivered
out-of-order: when buffer 0 fills and begins flushing, writes go to
buffer 1. If buffer 1 fills before buffer 0's flush completes, the
buffer index wraps back to 0 (which may still be flushing). Independent
flush tasks submitted to the thread pool can complete out of order.

The new pool design:
- Uses a std::deque of buffer instances that grows as needed
- Allocates buffers from the pool when the current buffer needs to flush
- Serializes flushes with a mutex to ensure FIFO callback ordering
- Returns buffers to the pool after flush completion
- Eliminates the race between buffer selection and write operations

New Unit Tests
--------------
- buffer_correlation_ordering.cpp: Tests that API records are always
  delivered before their corresponding retirement records
- buffer_ordering_stress.cpp: Stress tests buffer flush ordering under
  high contention with multiple threads rapidly filling buffers

HSA Tool Hooks
--------------
Added hsa_tool_hooks.cpp/hpp to register an HSA OnUnload callback that
waits for pending flush tasks before tool finalization, preventing
"retired prematurely" errors during HSA shutdown.

Sanitizer Improvements
----------------------
- LSAN: Set fast_unwind_on_malloc=1 to prevent deadlock in libgcc unwinder
- LSAN: Added suppressions for external tools (liblzma, liblsan, seq, strdup)
- TSAN: Added suppression for false positive on C++11 thread-safe static
  initialization in create_write_functor
- ASAN/UBSAN: Added patterns for known issues in HSA runtime, HIP, perfetto
- Disabled attachment tests for sanitizers due to library preloading issues

Other Fixes
-----------
- Thread-trace agent test: Use heap-allocated callback state
- Correlation ID: Refactored reference counting and finalization ordering

* [rocprofiler-sdk] Revert buffer pool design changes

Revert buffer.cpp and buffer.hpp to the original double-buffer
design from develop branch. The pool-based redesign introduced
concerns about:
- Signal safety (mutex vs atomic_flag)
- API changes (flush() return type)
- Complexity of the new design

This revert removes:
- Dynamic buffer pool with std::deque
- std::mutex/condition_variable synchronization
- buffer_correlation_ordering.cpp test
- buffer_ordering_stress.cpp test

The underlying buffer flush ordering issue will need to be
addressed with a different approach that preserves the original
API and synchronization characteristics.

* [rocprofiler-sdk] Consistent fini_status checks to prevent correlation ID creation during finalization

- Revert TOCTOU CAS loop change in sub_ref_count() - not needed with consistent checks
- Add fini_status check in correlation_tracing_service::construct() with ROCP_CI_LOG warning
- Add nullptr checks at all construct() call sites (queue.cpp, async_copy.cpp, memory_allocation.cpp)
- Change all 'get_fini_status() > 0' to '!= 0' for consistent behavior:
  - hsa/queue.cpp (lines 105, 210)
  - hsa/async_copy.cpp (line 344)
  - hsa/hsa_barrier.cpp (line 43)
  - buffer.cpp (lines 107, 138, 185)

This ensures no correlation IDs are created once finalization starts (fini_status != 0),
preventing races between finalization and ongoing tracing operations.

* [rocprofiler-sdk] Replace arrival-order checks with timestamp-based temporal validation

Buffer records are not guaranteed to arrive in any specific order. Tests and
samples should use timestamps for temporal ordering validation instead.

Changes:
- samples/external_correlation_id_request: Replace 'retired prematurely' arrival
  order check with timestamp-based validation that retirement timestamp >=
  max(end_timestamps) for records with the same correlation ID
- tests/external_correlation.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/registration.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check
- tests/roctx.cpp: Remove EXPECT_GT(corr_id, last_corr_id) check

Correlation IDs are not guaranteed to be monotonically increasing when records
are sorted by timestamp. Temporal ordering should be validated using the
timestamp fields in each record.

* [rocprofiler-sdk] Revert external/CMakeLists.txt SYSTEM keyword removal

Restore the SYSTEM keyword to target_include_directories for
rocprofiler-sdk-fmt to match develop branch.

* [rccl] Remove orphaned rocSHMEM gitlink

Remove orphaned submodule reference that was introduced during a merge
but never had a corresponding .gitmodules entry, causing CI failures
with "fatal: no submodule mapping found in .gitmodules".

* [rocprofiler-sdk] Add HSA ABI version 0x09 support

Add ABI checks for HSA_AMD_EXT_API_TABLE_STEP_VERSION 0x09 which
introduces hsa_amd_counted_queue_acquire and hsa_amd_counted_queue_release
functions (added in rocr-runtime SWDEV-561708).

* [rocprofiler-sdk] Handle finalized status gracefully in buffer flush operations

This commit consolidates fixes for handling the finalization status during
buffer flush operations across the SDK.

Changes:
- Tool and samples: Handle ROCPROFILER_STATUS_ERROR_FINALIZED gracefully
  when flushing buffers, as this indicates buffers were already flushed
  during finalization (not an error condition)
- HSA handlers (queue.cpp, async_copy.cpp, hsa_barrier.cpp): Use > 0 check
  for fini_status to allow operations during finalization process
- buffer.cpp: Revert fini_status checks to use > 0 for consistency
- correlation_id.cpp: Add fini_status > 0 check with ROCP_TRACE logging
  to prevent correlation ID creation after finalization starts

Files modified:
- source/lib/rocprofiler-sdk-tool/tool.cpp
- tests/tools/json-tool.cpp
- source/lib/rocprofiler-sdk/tests/registration.cpp
- source/lib/rocprofiler-sdk/tests/roctx.cpp
- samples/api_buffered_tracing/client.cpp
- samples/counter_collection/buffered_client.cpp
- samples/counter_collection/device_counting_async_client.cpp
- samples/external_correlation_id_request/client.cpp
- samples/pc_sampling/client.cpp
- source/lib/rocprofiler-sdk/buffer.cpp
- source/lib/rocprofiler-sdk/context/correlation_id.cpp
- source/lib/rocprofiler-sdk/hsa/queue.cpp
- source/lib/rocprofiler-sdk/hsa/async_copy.cpp
- source/lib/rocprofiler-sdk/hsa/hsa_barrier.cpp

* [rocprofiler-sdk] Remove hsa_tool_hooks and simplify buffer flush handling

Remove the hsa_tool_hooks infrastructure and simplify buffer flush calls
in samples and tools. The ERROR_FINALIZED handling was overly complex
and the hsa_tool_hooks OnUnload synchronization is no longer needed.

Changes:
- Remove hsa_tool_hooks.cpp/hpp and related registration.cpp code
- Simplify buffer flush calls in samples to use direct ROCPROFILER_CALL
- Simplify buffer flush in tool.cpp and json-tool.cpp
- Remove ERROR_FINALIZED special handling from test files

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Fix output_stream move semantics to null source pointers

The default move constructor and move assignment operator for
output_stream did not null out the source's pointers after the move.
This caused double-close when the moved-from temporary was destroyed,
leading to use-after-free crashes (SIGSEGV in std::ostream::sentry).

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Improve Perfetto trace writer and sanitizer configuration

- generatePerfetto.cpp: Move output_stream into shared_state to prevent
  use-after-free race conditions during Perfetto callback execution
- run-ci.py: Simplify and consolidate sanitizer environment variable
  configuration for better maintainability

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Revert run-ci.py changes that broke sanitizer suppressions

The previous changes removed MEMCHECK_SANITIZER_OPTIONS which is required
for CTest to properly pass suppression files to the sanitizers during
memcheck runs.

Co-Authored-By: Claude <noreply@anthropic.com>

* Revert "[rccl] Remove orphaned rocSHMEM gitlink"

This reverts commit 1ad21003941355658fff8114fa27768f11a948f7.

* [rocprofiler-sdk] Revert registration.cpp changes

Revert changes to registration.cpp to match develop branch.

Co-Authored-By: Claude <noreply@anthropic.com>

* [rocprofiler-sdk] Remove suppression file content printing from run-ci.py

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix output_stream move ctor/assignment operator

* Fix erroneous revert of registration.cpp

* Fix handling of fini status in correlation ID construction

* [rocprofiler-sdk] Fix OMPT segfault during finalization

Add nullptr checks in OMPT tracing code to handle the case where
correlation_tracing_service::construct() returns nullptr during
finalization. This fixes segfaults in openmp-target-sample and
tests.integration.execute.openmp-tools.

The correlation ID construction now returns nullptr when fini_status > 0,
but the OMPT callbacks were not checking for this, causing crashes when
dereferencing the null pointer during OpenMP runtime shutdown.

Changes:
- event_common(): Return nullptr early if correlation ID is null
- event(): Check for nullptr before calling sub_ref_count()
- ompt_task_create_callback(): Return early if correlation ID is null
- ompt_task_schedule_callback(): Return early if correlation ID is null

* [rocprofiler-sdk] Fix HSA API tracing segfault during finalization

Add nullptr check in hsa_api_impl::functor after correlation ID
construction. During finalization, correlation_service::construct()
returns nullptr, and without this check the code would dereference
the null pointer when accessing corr_id->internal.

This fixes the SEGV at address 0x000000000008 (null + 8 byte offset)
that occurs when HSA async event threads call hsa_signal_destroy
during runtime shutdown after finalization has started.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2026-01-27 13:27:54 -05:00
Jason Bonnell 1255ba2bcc rocprofiler-compute Docker Images in GHCR (#1195)
* Initial cleanup of compute workflows and skeleton of ghcr workflow

* Add containers-ci.yml, update opensuse and rhel dockerfiles

* rename id in rocprofiler-compute-ghcr.yml

* Add new line to end of containers-ci.yml

* Update action versions for rocprofiler-compute-ghcr.yml

* Switch back to SHA for action versions

* Add conda set solver classic fix to compute CI dockerfiles

* Update conda install for compute Dockerfiles

* Change opensuse version to 15.6 in containers-ci.yml

* Add fix for ubuntu noble to compute Dockerfile.ubuntu.ci

* Add default distro and version to Dockerfile.ubuntu.ci

* Updated regex for tarball version

* Remove Python3.8 from compute CI Dockerfiles

* Change RHEL 9.4 to 9, add retry for compute workflow

* Revert name change for compute rhel workflow

* update path naming

* Remove binutils-gold from Dockerfile.opensuse.ci

* Remove conda python installs from Dockerfile.ci files in compute

* Change CMake version to 3.21 in compute Dockerfile.ci files

* Update checkout actions from v4 to v5
2026-01-26 17:06:20 -05:00
Jason Bonnell ebc1980768 Add ROCM_VERSION env var, update repo URLs (#2870) 2026-01-26 16:00:16 -05:00
Aravind Ravikumar cc1a7c3c82 Enable CI to work on release branches (#2571)
* Commit to enable CI for release branches

* Pin back to specific SHA in therock

---------

Co-authored-by: Aravind Ravikumar <arravikum@amd.com>
2026-01-26 14:50:05 -05:00
jamessiddeley-amd dbd26a88b4 [rocprof-compute] Fix silent failures in Continuous-Integration CI workflow (#2797)
* fix silent failures in rocprof-compute continuous-integratin CI workflow

* CDash uploads complete before the script fails
2026-01-26 12:16:17 -05:00
jamessiddeley-amd 25090e003f [rocprof-compute] Pin ruff version for consistent formatting (#2680)
* pin ruff versions each to current latest

* Update rocprofiler-compute-formatting.yml

* Downgrade .pre-commit-config.yaml to match develop
2026-01-19 19:10:02 -05:00
Gopesh Bhardwaj b18db05091 [rocprofiler-sdk] Fixing docs build (#2608) 2026-01-14 10:13:17 -05:00
dsclear-amd d5f490fa2f Sets heavy GitHub CI workflows to not trigger on text documentation-only changes. (#2417)
Sets heavy GitHub CI workflows to not trigger on docs-only changes.

Specifically, sets azure-ci-dispatcher.yml and therock-ci.yml, as well as many rocprofiler workflows, to not trigger when the change consists entirely of docs-only files.
2026-01-12 18:31:30 -05:00
Jason Bonnell 95a31b10cd Fix aqlprofile-continuous_integration.yml workflow (#2582)
* Fix typo in matrix definition for aqlprofile-continuous_integration.yml

* Update ROCM_VERSION to 7.1.1

* Minor changes to core-rpm step

* Add working-directory to test steps

* Revert changes

* Add set -v to rpm test step

* Remove Python venv line from rpm test step
2026-01-12 15:53:04 -05:00
Mythreya Kuricheti 36d9d33d90 Users/mkuriche/rocprofiler sdk fmt build fix memory header (#2537)
* [rocprofiler-sdk] Fix fmt::join build errors

- remedy use of fmt::join without include <fmt/ranges.h>

* include memory header

* Disable FMT build for SDK CI

* Add -DROCPROFILER_BUILD_FMT=OFF to sanitizer steps

* Add temporary workaround for rccl.h issue

* Add ROCPROFILER_INTERNAL_RCCL_API_TRACE to SDK CI builds

* disable clang-tidy for vendored includes

---------

Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
Co-authored-by: jbonnell-amd <jason.bonnell@amd.com>
2026-01-12 12:59:47 -05:00
Jason Bonnell 788bcdddd0 Update SDK Dockerfile.ci for Ubuntu (#2539)
* Add verbose output for submods step

* Remove git config setting

* Determine git version

* Try different git install

* Update Dockerfile.ci

* Revert git location in Ubuntu jobs

* Update RHEL and SLES sections to use 2.52 as well

* Add git --version to each step, fix typo in SLES Docker
2026-01-09 12:10:36 -05:00
Jason Bonnell 1d5a6e9bfe Update rocprofiler workflows to use new mi325 runner names (#2467)
* Update rocprofiler workflows to use new runner naming for mi325

* Add input options to workflow_dispatch for rocprofiler-systems CI workflow

* Update runner name on therock-ci-linux.yml as well
2026-01-05 15:41:01 -05:00
Joseph Macaranas 11d9472e5f Bump TheRock SHA for CI 20251230 (#2466)
* Bump TheRock SHA for CI 20251230
* Remove patch and align workflows between OS
2026-01-05 13:00:37 -05:00
ammallya c2c4d4c1f5 Revert "Adding full build capability to theROCK for HIP changes (#2003)" (#2441)
This reverts commit 0a52f5c101.

Reverts #2003

MIOpen build failures on windows causing blockers on unrelated file changes.
2025-12-23 13:01:08 -08:00
ammallya 0a52f5c101 Adding full build capability to theROCK for HIP changes (#2003)
## Add Full Build Capability to theROCK for HIP

### Summary
This PR adds full build support to **theROCK** for HIP-related changes, ensuring that all components are built.

### Changes
- Enabled full build coverage for the following projects:
  - `projects/clr`
  - `projects/hip`
  - `projects/hip-tests`
  - `projects/rocr-runtime`
- Updated build configuration to include all targets for the above projects.
- Ensured rocm-libraries is pulled to build optional components.

### Motivation
These changes are required to support HIP development and testing within theROCK by ensuring all components are built together. This improves reliability, integration testing.
2025-12-22 05:31:32 -08:00
Geo Min 3635953cd8 Revert "Adding org var and dynamic selection of targets (#2317)" (#2416)
This reverts commit c9ac018395.
2025-12-19 14:51:53 -08:00
Jason Bonnell 112b4fd413 [rocprofiler-compute] Add SDK dependency to rocprofiler-compute-tarball.yml workflow (#2329)
* Install rocm-dev in rocprofiler-compute-tarball.yml workflow

* Update paths for push and PR for rocprofiler-compute-tarball.yml

* Add ROCm dependencies to disttest job

* cmake fix binary link creation and fix format

* Use python3 instead of python3.9 in RHEL 8 and RHEL 9 workflows

* set default python3 to python3.9 in rhel8

* Try alternatives setup for python3 in RHEL8 env

* Add pip install cmake to debug RHEL8 issue

* Remove python3.11 in RHEL8 workflow

* Add back comment regarding RHEL8

---------

Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
2025-12-18 11:56:23 -05:00
amd-juwillia 3a3738ad98 Added AMDSMI CI to rocm-systems(#2074)
Signed-off-by: Justin Williams <Justin.Williams@amd.com>
2025-12-16 13:52:42 -06:00
Geo Min c9ac018395 Adding org var and dynamic selection of targets (#2317) 2025-12-16 10:46:59 -08:00
systems-assistant[bot] b002c6a739 SWDEV-538607 - Add SIMDe as a build dependency, remove naked intrinsic use. (#500)
Co-authored-by: Alex Voicu <alexandru.voicu@amd.com>
Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
2025-12-15 17:40:51 +00:00
Ioannis Assiouras 13561fb8cd Bump hash for theRock to 2025-12-12 commit (#2289) 2025-12-13 23:56:08 +00:00
jamessiddeley-amd 81720183ad [rocprof-compute] Merge CDash Nightly and Continuous workflow files (#2279)
* merged code-coverage and continuous workflow files

* fixed runner typos and added build mode

* add actor name to Continuous build

* improve error handling and remove redundant verbose

* fixed workflow file log output

* revert logs output in run_ci.py

* ruff format
2025-12-12 17:04:56 -05:00
David Galiffi fbaeb74107 [rocprof-sys] Update nightly CI workflow (#2263)
Update ROCm version to 7.1.0

Signed-off-by: David Galiffi <David.Galiffi@amd.com>
2025-12-11 12:23:26 -05:00
Mario Limonciello 0c4d08f38d Revert correcting the VGPR size for GFX 11.5.1 (#2268)
Although the value is correct; there is no source of truth between
kernel and userspace.  This leads to problems if the kernel has strict
restrictions (such as kernel 6.17 or earlier). The restrictions were
lifted in 6.17.9 and and 6.18, but there is no guarantee userspace is
using this.

So short term this value will be wrong.  But on newer kernels the kernel
will communicate the right size and rocr-runtime will be adjusted to
use that.

Link: https://github.com/ROCm/TheRock/pull/2505

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-12-11 07:59:19 -06:00
David Galiffi 70562eb854 Add ROCm 7.1 to workflows (#2256)
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
2025-12-10 13:41:22 -05:00
Geo Min 879d010974 Bumping commit hash for TheRock (#2244) 2025-12-09 14:56:20 -08:00
Jason Bonnell 74cf2a85ec Fix rocprofiler-sdk CI workflow tarball install errors (#2225)
* Add ls statement for debugging /opt directory file naming

* Update ROCM_VERSION from 7.0.0 to 7.1.1 in SDK CI

* Update amdgpu debian package for Ubuntu in Dockerfile.ci

* disable HIP/CLR build in codeql (#2242)

---------

Co-authored-by: Venkateshwar Reddy Kandula <Venkateshwarreddy.Kandula@amd.com>
2025-12-09 16:06:35 -05:00
Mario Limonciello 6a899b5f6d Run pre-commit's whitespace related hooks on .github and .azuredevops (#2129)
In order for pre-commit to be useful, everything needs to meet a common
baseline.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-12-08 14:39:42 -06:00
Ioannis Assiouras a101df369c Bumping theROCK submodule 2025-12-04 commit (#2167)
* Bumping theROCK submodule 2025-12-04 commit

* Update container image in therock-ci-linux.yml

---------

Co-authored-by: jbonnell-amd <jason.bonnell@amd.com>
2025-12-05 18:33:32 +00:00
Jason Bonnell 3b875cc0ee [rocprofiler-compute] Add Nightly and CI on MI355/MI325 Runners (#1455)
* Initial work in progress for compute CI workflow

* Update run-ci.py script location, enable test creation

* Add new lines to files

* Add coverage file argument to run-ci.py

* Remove run-ci.py script usage from rocprofiler-compute-continuous-integration.yml workflow

* Add --break-system-packages parameter

* Add --ignore-installed to pip install

* Checkout specific branch until amdclang issue fixed in develop

* Add missing slash to path for cxx compiler

* Remove specific branch from checkout action

* Use run-ci.py in rocprofiler-compute-continuous-integration.yml

* Update install python requirements step

* Fix typo in build-name

* Update run-ci.py to have toggle for code coverage

* Apply ruff formatting

* Ruff again

* Exclude live attach detach and roofline tests in CI

* Add ctest args

* Revert run-ci.py changes

* Try new run-ci-2.py

* Update type of pytest-numprocs argument

* Try casting arg to str

* Fix typo in arg reference

* upgrade pip before running python installs

* Use jammy instead of noble for CI

* Remove python nproc arg from run-ci-2.py

* Switch to MI325 runners for CI

* Fix spacing issue

* Rename run-ci.py to run-code-coverage.py, add new run-ci.py

* Update to ROCm version 7.1.0 to debug sdk issues

* Testing out tarball install again

* Update regex on tarball version

* Update tarball regex on compute

* ruff formatting

* Revert change to systems CI file

* Switch back to rocm-dev install

* ruff formatting again

* Add ld_lib_path for rocm_sysdeps

* Remove excluded tests temporarily

* Add back excluded tests, add timeout for test step

* Address PR feedback

* Add git safe directory lines

* Revert dependencies change to debug new failures

* Exclude roofline again, rework dependencies

* Add in hip-runtime-amd dependency

* Install hip dev package

* Add TEST_FROM_INSTALL cmake arg to compute CI workflow

* Remove test_from_install for now

* Enable roofline tests again
2025-12-05 11:43:47 -05:00
Jason Bonnell 463126770a Update build docker container workflow, opensuse dockerfiles (#1883)
## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve. -->

- __Reduced Code Duplication__: Version parsing logic moved from individual Dockerfiles to the central build script
- __Improved Edge Case Handling__: Better handling of ROCm versions with and without patch numbers (e.g., `6.2` vs `6.2.0`)
- __Easier Maintenance__: Future version-related changes only need to be made in one place
- __Cleaner Dockerfiles__: Simplified Dockerfiles focus on package installation rather than complex shell logic
- __Updated Platform Support__: Refreshed container matrix to reflect current platform/ROCm version combinations
- __Fix OpenSUSE Docker Generation__: OpenSUSE container generation fails due to a change to the `binutils-gold` package
- __Error Handling__: Fix bug where errors in docker image build were being masked, allowing workflow to pass anyway.


## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->
- Updated `Dockerfile.opensuse` and `Dockerfile.opensuse.ci` docker files to remove `binutils-gold`
  - Not needed since we build `binutils` with systems anyways
- Updated `rocprofiler-systems-containers.yml` to remove `pushd/popd` commands and just run the shell scripts
  - There was a silent failure observed here, which I verified in this PR before adding the fix for openSUSE
- Refactor ROCm version parsing. Move this logic to the `build-docker.sh` script to reduce duplication.
  - Fix bug that caused ROCm 7.0 to fail installation. The trailing `.0` was being trimmed.
- Fixed inconsistencies in `containers.yml` that lead to invalid ROCm-OS_VERSION combinations.
- Formatting fixes 
  - Removed trailing whitespace
  - Fix docker build warnings. Use an `=` rather than ` ` when assigning an environment variable.
2025-12-04 23:33:15 -05:00
Mythreya Kuricheti 3e5749ea59 [rocprofiler-sdk][CI] Update codeql rocm version (#1909)
* [rocprofiler-sdk][CI] Update codeql rocm version

* Build HIP from source
2025-12-02 12:17:59 -06:00
cfallows-amd 29a7591791 Update RHEL8/9 workflow with latest rocm 7.1.1 links (#2060)
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
2025-12-01 11:14:33 -05:00
Marius Brehler 2dc32d645b Explicitly load versioned libamdhip64.so (#1872)
* Explicitly load versioned libamdhip64.so

* Fix syntax errors

* Fix when patching happens in Windows workflow

---------

Co-authored-by: Joseph Macaranas <145489236+jayhawk-commits@users.noreply.github.com>
Co-authored-by: ammallya <ameyakeshava.mallya@amd.com>
2025-11-24 10:05:05 -08:00
Jonathan R. Madsen a2288eb50b [rocprofiler-sdk] Install unit tests and helper functions for integration tests (#921)
* [rocprofiler-sdk] Install unit tests and helper functions for integration tests

* Fix rocprofiler-sdk-tests-target export

* Fix handling of cmake policy CMP0174

* Remove -vv from new pytest.ini files

* add unit tests and integration tests.

* add path to ci workflow.

* misc. fixes.

* pc sampling tests.

* bug fixes.

* pc sampling tests fix.

* misc.

* Update CMakeLists.txt

* Update rocprofiler_config_install_tests.cmake, correct license name

* fix units tests install issues.

* fix counters_def file path.

* fix bug, arg shifting.

* vendor pytest-cmake.

* cmake config fix. missing endfunction()

* disable tests, 1.rocprofv3-trace-hip-libs. 2.kernel-tracing. 3.external_correlation 4.rocpd.

* disable buffered-tracing test and remove pytest-cmake from requirements.txt.

* disable hip-graph-tracing test.

* fix building standalone tests to load rocprofiler-sdk cmake package first and then find rocprofiler_sdk_pytest module.

* addressed comments: 1.add local bin path to code cov workflow. 2.add to cmake prefix path local bin. 3.use ROCPROFILER_MEMCHECK_PRELOAD_ENV_VALUE 4.misc. fix

* enabled back tests api_buffered, external_correlation_id, hip-graph, kernel-tracing, rocpd, tracing-hip-in-libraries. and misc fixes(formating, extra fixtures for agent-index tests.)

* cpack to use llvm bin for .hsaco debug symbols.

* psdb tests fixes.

* EOL.

* misc. fixes and Disable api_buffered_tracing, external_correlation_id, hip-graph-tracing, kernel-tracing, rocpd, summary, tracing-hip-libraries, tracing-plus-counter-collection.

* fix incorrect cmakelists file.

* strip smallkernel.bin

* format.

* revert disabled tests commit.

* misc. fix in counter tests.

* misc.

* search codeobj unit test assets in curr bin and install bin.

* refactor newly added rocpd tests.

* modify tests for newly added hip-host-tracing.

* add LD LIB path to units, psdb is failing due to libs not being found.

---------

Co-authored-by: Venkateshwar Reddy Kandula <venkateshwar.kandula1306@gmail.com>
Co-authored-by: Venkateshwar Reddy Kandula <Venkateshwarreddy.Kandula@amd.com>
Co-authored-by: JeniferC99 <150404595+JeniferC99@users.noreply.github.com>
2025-11-21 08:06:56 -06:00
Jason Bonnell 66ea1cdff2 Add workflow to remove old untagged rocprofiler GHCR Docker Images (#1959)
* Add WIP workflow step to delete untagged images older than 1 week

* Formatting fix for rocprofiler-systems-ghcr.yml

* Move step to new workflow

* Remove needs parameter from cleanup-rocprofiler-images

* Remove expand-packages option

* Expand cleanup for every OS

* Revert spacing change to rocprofiler-systems-ghcr.yml

* Turn off dry-run to do an initial clean

* Switch dry-run to be only on PR

* Added comment about schedule
2025-11-21 08:49:29 -05:00
vedithal-amd ae8f72fa79 [rocprofiler-compute] Use native tool for counter collection (#1212)
* Use native tool for counter collection

* Add native counter collection tool which uses rocprofiler-sdk C++
  library public API to get counter collection data
    * This is enabled by default, unless --no-native-tool option is
      provided or ROCPROF=rocprofv3 env. var. is provided
    * This tool is only supported for ROCm version >=7.x.x
    * This tool is not supported for attach/detach scenario
* Build native tool shared object during build time
* If using rocprof-compute without building then runtime compilation of
  t push native tool shared object is performed
* rocprofiler-sdk tools is still used for services other than counter
  collection and data collected by native tool is merged into the
  rocpd/csv output of rocprofiler-sdk tool

* Make `rocpd` choice the default choice for `--format-rocprof-output`
  option
    * If `rocpd` public API from rocprofiler-sdk library is not present,
      then fallback to `csv` choice
    * In this case only `pmc_perf.csv` is written in workload folder
      instead of multiple `csv` files for each profiling run
* Remove `json` choice from `--format-rocprof-output` option since it
  functions identical to `csv` option

* Rename option `--rocprofiler-sdk-library-path` to
  `--rocprofiler-sdk-tool-path` since we LD_PRELOAD the
  rocprofiler-sdk tool shared object and not the rocprofiler-sdk library
shared object

* Fix the meaning of `--dispatch` option in `profile` mode to mention
  dispatch iteration filtering instead of dispatch id filtering
    * --dispatch option in analyze mode does dispatch id filtering

* Move standalone binary creation logic from cmake file to docker file

* fix native counter collection tool during attach/detach

* improve logging

* fix attach detach with native tool

* fix attach detach with native tool

* do not support attach/detach in native tool

* Update changelog

* add standalone binary creation functionality in cmake

* address review comments

* address review comments

* fix formatting

* address review comments

* Adding paths for cmake to search. Also updated min. cmake requirement to 3.21 as this was when hip was supported.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Update hip compiler ID check, sometimes comes up as Clang, sometimes ROCMClang- depends on setup.
Updated formatting.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* RHEL8.10 unable to compile due to defaulting to old c++ version, need to force c++17

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Updating changelog per docs team recommendations

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Apply suggestions from code review to changelog

Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>

* Do not required HIP complier to build native counter collection tool

* fix cmake

* gersemi formatting on latest cmake change

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* ex ci updated dependencies to include rocprofiler-sdk, but cmake was still not capturing the path- there was a commit that added to the cmake_prefix_path entry that specified rocprof-sdk's cmake location ut was too specific for the search paths in find_package's config mode.
removing the cmake_prefix_path var and adding hints to find_package call instead, and specifying config mode so it knows how to construct the search paths

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* gersemi run for formatting

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Still need prefix path, should not have been removed in last commit but does need to be shortened to just the rocm path to allow for find_package config mode to do the job

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* include cstdint for uint32_t

* Run formatting on helper.cpp

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Remove rocm 7.2 release stuff from version and changelog and handle it in separate pr

* fix version

* fix changelog

* fix changelog

* run ruff formatter

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* fix rocprofiler-sdk attach so path

---------

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
2025-11-18 23:34:38 -05:00
Scott Todd fa772be675 Reapply amdgpu-windows-interop revert. (#1893)
## Overview and rationale

This reverts https://github.com/ROCm/rocm-systems/pull/1886, which...
* Re-applies https://github.com/ROCm/rocm-systems/pull/1866
* Reverts https://github.com/ROCm/rocm-systems/pull/1728

(So it restores the [`amdgpu-windows-interop/`](https://github.com/ROCm/rocm-systems/tree/develop/shared/amdgpu-windows-interop) folder back to the state from a few weeks ago)

The rationale for this change is at https://github.com/ROCm/rocm-systems/pull/1866:
> Last PAL update broke applications on gfx12 Windows.

## Cross-repository change details

That PR failed to build but was merged with this explanation:

> TheRock CI Windows build fails as expected with this revert.
> 
> References to these PAL members need to be stripped out in a patch on TheRock.
> 
> ```
> 11.3	C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(152): error C2039: 'RegisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4	C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> 11.4	C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(195): error C2039: 'UnregisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4	C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> ```

The patch in TheRock was updated in https://github.com/ROCm/TheRock/pull/2154. This rolls forward by updating the ref for TheRock.

That original PR could have been sequenced differently to avoid a build break - perhaps by
* Pointing to a branch in TheRock with the patch rebased
* Deleting the patch in the workflows here but holding a local copy of the path to be applied in workflows
* Landing the patch as a normal commit instead of carrying it at all

## Test plan

1. Watch TheRock CI here (https://github.com/ROCm/rocm-systems/actions/runs/19447202693/job/55644411119?pr=1893)
2. Build locally:
    
    ```bash
    # In rocm-systems
    git am --whitespace=nowarn D:\projects\TheRock\patches\amd-mainline\rocm-systems\0001-Revert-SWDEV-543498-Some-compute-Ubertrace-profiles-.patch
    git am --whitespace=nowarn D:\projects\TheRock\patches\amd-mainline\rocm-systems\0003-Use-is_versioned-true-consistently-in-both-Comgr-Loa.patch
    git am --whitespace=nowarn D:\projects\TheRock\patches\amd-mainline\rocm-systems\0006-Explicitly-load-libamdhip64.so.7.patch
    # Note: the build fails with the observed errors if patch 0001 is not applied!
    
    # In TheRock
    cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_COMPILER=cl.exe -DCMAKE_CXX_COMPILER=cl.exe \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DPython3_EXECUTABLE=d:/projects/TheRock/.venv/Scripts/python \
      -DTHEROCK_ROCM_SYSTEMS_SOURCE_DIR=d:/projects/TheRock/../rocm-systems \  # IMPORTANT
      -DTHEROCK_AMDGPU_FAMILIES=gfx110X-all \
      -DBUILD_TESTING=ON \
      -DTHEROCK_ENABLE_ALL=ON \
      -Damd-llvm_BUILD_TYPE=RelWithDebInfo \
      -S D:/projects/TheRock \
      -B D:/projects/TheRock/build \
      -G Ninja
    
    cmake --build D:/projects/TheRock/build --target hip-clr
    # [build] Build finished with exit code 0
    cmake --build D:/projects/TheRock/build --target ocl-clr+dist
    # [build] Build finished with exit code 0
    ```
2025-11-18 07:17:06 -08:00
jamessiddeley-amd d49e2e35fd [rocprof-compute] Automate ctest coverage and test cases on runners with CDash (#1481)
* Add nightly coverage workflow

* ruff formatting

* temp workflow testing

* restore workflow file

* add workflow condition

* update workflow file

* update workflow file

* fix typo in run-ci.py

* edit run-ci.py

* add python deps install

* add python deps install

* add python deps install

* add python deps install

* check if enable coverage is on when using workflow

* remove github CI breakdown and fix enable coverage

* set cache variables must be set before dashboard starts

* Update run-ci.py

* Update run-ci.py to fix ctest cache

* Update rocprofiler-compute-code-coverage.yml to install tests

* Update rocprofiler-compute-code-coverage.yml

* Restore workflow file

* Update run-ci.py

* Simplify workflow build command

* Update run-ci.py to build tests

* edited run-ci script

* edit ctest configure commands

* edit ctest configure commands to be on one line

* edit ctest configure command to include path to amdclang++

* update clang check in tests/cmakelists.txt

* update rocm

* update rocm

* update rocm version 7.0.2

* update tests/CMakeLists.txt

* use tarball instead for rocm install

* apt install rocm-dev instead for 7.0.0 release

* workflow tweaks

* update to use new 'tools' dir

* install rocm-dev

* add CMAKE_CXX_COMPILER as clang

* update tests/cmakelists.txt

* update cdasg site and build names

* remove run automatically on pull requests

* ruff format

* increased timeouts for tests

* add back reruns for workflow testing

* fix typo

* rename workflow "nightly" -> "code"

* added tracks to keep track of gpu (325 vs 355)

* remove test_db_connector.py

* revert build names and tracking

* update workflow pushes

* CMake format

* changed parallel level back to 1
2025-11-17 09:24:24 -05:00
JC c9dd49c48a Fix SignatureDoesNotMatch by aws credential option (#1867)
Replicating https://github.com/ROCm/TheRock/pull/2147#discussion_r2528008441
## Motivation

Fixes https://github.com/ROCm/TheRock/issues/875 which is the issue where Windows builds would fail randomly when uploading to s3 with the `SignatureDoesNotMatch` error as a result of special characters existing in the AWS Access Keys generated by the `configure-aws-credentials` action that is passed through Windows environment variables to `aws-cli`. More details below.

## Technical Details

https://github.com/ROCm/TheRock/issues/875#issuecomment-3530851762
In summary, in Windows workflows, the `special-characters-workaround` option is set to true for the `configure-aws-credentials` action which will regenerate access keys until there are no special characters that may not be passable through windows environment variables correctly.

## Test Plan

Observe CI.

## Test Result

TBD.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2025-11-14 14:04:54 -08:00
cfallows-amd 683a63d9ec Update rocprofiler-compute workflows (#1788)
* Update workflow files to use general public rocm dev build images from dockerhub.
Old method was to borrow rocprofiler-systems images but they do not contain rocm install anymore, so we cannot rely on them.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Add workflow files to paths on push and PR

* Revert change of image for red hat variant because the image offered in official rocm image release is too large for runners.
Going back to using systems team images and installing rocm on them (as they do) as a workaround until we can get a smaller package size docker image with ROCm included.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

Adjusted python3-devel install line with an if else determined by distro version.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

---------

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: jbonnell-amd <jason.bonnell@amd.com>
2025-11-10 20:48:39 -05:00
Aleksandar Djordjevic f39a60ac25 [rocprofiler-systems] Apply new CMake formatting for the latest gersemi version (#1778)
* Fix cmake formatting

* Updated rev. in `.pre-commit-config.yaml`

* Pin the gersemi used in CI to v0.23.1, matching the pre-commit

---------

Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
2025-11-10 13:08:44 -05:00
Milan Radosavljevic d9b00da102 Add clean up of buffered_storage files (#1738)
* Add clean up of buffered_storage files

* Add step to workflows to test for remaining temp files after tests

* Applied suggestions from code review

* add deletion of all cache files

---------

Co-authored-by: David Galiffi <David.Galiffi@amd.com>
2025-11-07 11:51:09 -05:00
Jason Bonnell 6e195ded9b Update rocprofiler_config_interfaces.cmake to use different elf naming (#1722)
* Update rocprofiler_config_interfaces.cmake to use different elf naming

* try out conditional for libelf

* run cmake-format to fix formatting issue

* Remove libelf.patch file from therock-ci-windows.yml

* Remove libelf patch from therock-ci-linux.yml as well
2025-11-06 23:50:02 -05:00
David Galiffi 89cf46eb55 Removing jlumbroso/free-disk-space action from workflows (#1700) 2025-11-06 18:11:09 -05:00
Joseph Macaranas 524f62ae67 TheRock CI Workflow Updates 20251106 (#1743)
- Update the pinned SHA for TheRock in CI workflows.
- Update the version for actions in those same workflows.
- Comment out the rm .patch line and provide details on its use.
2025-11-06 12:06:44 -05:00