* Add WIP workflow step to delete untagged images older than 1 week
* Formatting fix for rocprofiler-systems-ghcr.yml
* Move step to new workflow
* Remove needs parameter from cleanup-rocprofiler-images
* Remove expand-packages option
* Expand cleanup for every OS
* Revert spacing change to rocprofiler-systems-ghcr.yml
* Turn off dry-run to do an initial clean
* Switch dry-run to be only on PR
* Added comment about schedule
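The cleanup rule described above (untagged images older than one week, with dry-run limited to pull requests) can be sketched in shell. This is only an illustration: the function names `select_stale_untagged` and `cleanup_mode` and the `<id> <created_epoch> <tag_count>` input format are assumptions, not taken from the workflow itself.

```bash
# Hypothetical helper: prints the IDs of images that are untagged and
# older than one week. Reads "<id> <created_epoch_seconds> <tag_count>"
# records from stdin; $1 is the current time in epoch seconds.
select_stale_untagged() {
  local now="$1"
  local cutoff=$(( now - 7 * 24 * 60 * 60 ))
  local id created tags
  while read -r id created tags; do
    if [ "$tags" -eq 0 ] && [ "$created" -lt "$cutoff" ]; then
      echo "$id"
    fi
  done
}

# Hypothetical helper: dry-run on pull requests, real deletion otherwise.
# $1 is the GitHub event name (for example "pull_request" or "schedule").
cleanup_mode() {
  if [ "$1" = "pull_request" ]; then
    echo "dry-run"
  else
    echo "delete"
  fi
}
```

Passing the current time in as an argument keeps the sketch independent of any particular `date` implementation.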
* Use native tool for counter collection
* Add native counter collection tool which uses rocprofiler-sdk C++
library public API to get counter collection data
* This is enabled by default, unless --no-native-tool option is
provided or ROCPROF=rocprofv3 env. var. is provided
* This tool is only supported for ROCm version >=7.x.x
* This tool is not supported for attach/detach scenario
* Build native tool shared object during build time
* If using rocprof-compute without building then runtime compilation of
the native tool shared object is performed
* rocprofiler-sdk tools is still used for services other than counter
collection and data collected by native tool is merged into the
rocpd/csv output of rocprofiler-sdk tool
* Make `rocpd` choice the default choice for `--format-rocprof-output`
option
* If `rocpd` public API from rocprofiler-sdk library is not present,
then fallback to `csv` choice
* In this case only `pmc_perf.csv` is written in workload folder
instead of multiple `csv` files for each profiling run
* Remove `json` choice from `--format-rocprof-output` option since it
functions identically to the `csv` option
* Rename option `--rocprofiler-sdk-library-path` to
`--rocprofiler-sdk-tool-path` since we LD_PRELOAD the
rocprofiler-sdk tool shared object and not the rocprofiler-sdk library
shared object
* Fix the meaning of `--dispatch` option in `profile` mode to mention
dispatch iteration filtering instead of dispatch id filtering
* --dispatch option in analyze mode does dispatch id filtering
* Move standalone binary creation logic from cmake file to docker file
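The enablement rules for the native counter collection tool and the `rocpd` to `csv` fallback listed above can be summarised as a small decision sketch. The helper names and argument layout are hypothetical. The commit does not say what happens when the native tool cannot be used, so the sketch only answers `yes` or `no`.

```bash
# Hypothetical decision helper; argument names are illustrative only.
# $1: 1 if --no-native-tool was passed, else 0
# $2: value of the ROCPROF environment variable (may be empty)
# $3: ROCm major version
# $4: 1 for an attach/detach session, else 0
native_tool_enabled() {
  local opt_out="$1" rocprof="$2" rocm_major="$3" attach="$4"
  [ "$opt_out" -eq 0 ] || { echo no; return; }
  [ "$rocprof" != "rocprofv3" ] || { echo no; return; }
  [ "$rocm_major" -ge 7 ] || { echo no; return; }
  [ "$attach" -eq 0 ] || { echo no; return; }
  echo yes
}

# Hypothetical helper for --format-rocprof-output.
# $1: requested format (empty means the default, rocpd)
# $2: 1 if the rocpd public API is available, else 0
output_format() {
  local requested="${1:-rocpd}" rocpd_available="$2"
  if [ "$requested" = "rocpd" ] && [ "$rocpd_available" -ne 1 ]; then
    echo csv
  else
    echo "$requested"
  fi
}
```

For example, `native_tool_enabled 0 "" 7 0` prints `yes`, and `output_format "" 0` prints `csv`.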
* fix native counter collection tool during attach/detach
* improve logging
* fix attach detach with native tool
* fix attach detach with native tool
* do not support attach/detach in native tool
* Update changelog
* add standalone binary creation functionality in cmake
* address review comments
* address review comments
* fix formatting
* address review comments
* Adding paths for cmake to search. Also updated min. cmake requirement to 3.21, the first release with HIP language support.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Update hip compiler ID check: it sometimes comes up as Clang and sometimes as ROCMClang, depending on setup.
Updated formatting.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* RHEL8.10 unable to compile due to defaulting to old c++ version, need to force c++17
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Updating changelog per docs team recommendations
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Apply suggestions from code review to changelog
Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
* Do not require HIP compiler to build native counter collection tool
* fix cmake
* gersemi formatting on latest cmake change
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* ex ci updated dependencies to include rocprofiler-sdk, but cmake was still not capturing the path. An earlier commit added a cmake_prefix_path entry that specified rocprof-sdk's cmake location, but it was too specific for the search paths in find_package's config mode.
Removing the cmake_prefix_path var and adding hints to the find_package call instead, and specifying config mode so it knows how to construct the search paths
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* gersemi run for formatting
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Still need prefix path, should not have been removed in last commit but does need to be shortened to just the rocm path to allow for find_package config mode to do the job
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* include cstdint for uint32_t
* Run formatting on helper.cpp
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Remove rocm 7.2 release stuff from version and changelog and handle it in separate pr
* fix version
* fix changelog
* fix changelog
* run ruff formatter
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* fix rocprofiler-sdk attach so path
---------
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
## Overview and rationale
This reverts https://github.com/ROCm/rocm-systems/pull/1886, which...
* Re-applies https://github.com/ROCm/rocm-systems/pull/1866
* Reverts https://github.com/ROCm/rocm-systems/pull/1728
(So it restores the [`amdgpu-windows-interop/`](https://github.com/ROCm/rocm-systems/tree/develop/shared/amdgpu-windows-interop) folder back to the state from a few weeks ago)
The rationale for this change is at https://github.com/ROCm/rocm-systems/pull/1866:
> Last PAL update broke applications on gfx12 Windows.
## Cross-repository change details
That PR failed to build but was merged with this explanation:
> TheRock CI Windows build fails as expected with this revert.
>
> References to these PAL members need to be stripped out in a patch on TheRock.
>
> ```
> 11.3 C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(152): error C2039: 'RegisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(195): error C2039: 'UnregisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> ```
The patch in TheRock was updated in https://github.com/ROCm/TheRock/pull/2154. This rolls forward by updating the ref for TheRock.
That original PR could have been sequenced differently to avoid a build break - perhaps by
* Pointing to a branch in TheRock with the patch rebased
* Deleting the patch in the workflows here but holding a local copy of the patch to be applied in workflows
* Landing the patch as a normal commit instead of carrying it at all
## Test plan
1. Watch TheRock CI here (https://github.com/ROCm/rocm-systems/actions/runs/19447202693/job/55644411119?pr=1893)
2. Build locally:
```bash
# In rocm-systems
git am --whitespace=nowarn D:\projects\TheRock\patches\amd-mainline\rocm-systems\0001-Revert-SWDEV-543498-Some-compute-Ubertrace-profiles-.patch
git am --whitespace=nowarn D:\projects\TheRock\patches\amd-mainline\rocm-systems\0003-Use-is_versioned-true-consistently-in-both-Comgr-Loa.patch
git am --whitespace=nowarn D:\projects\TheRock\patches\amd-mainline\rocm-systems\0006-Explicitly-load-libamdhip64.so.7.patch
# Note: the build fails with the observed errors if patch 0001 is not applied!
# In TheRock
# IMPORTANT: the THEROCK_ROCM_SYSTEMS_SOURCE_DIR option below
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=cl.exe -DCMAKE_CXX_COMPILER=cl.exe \
-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DPython3_EXECUTABLE=d:/projects/TheRock/.venv/Scripts/python \
-DTHEROCK_ROCM_SYSTEMS_SOURCE_DIR=d:/projects/TheRock/../rocm-systems \
-DTHEROCK_AMDGPU_FAMILIES=gfx110X-all \
-DBUILD_TESTING=ON \
-DTHEROCK_ENABLE_ALL=ON \
-Damd-llvm_BUILD_TYPE=RelWithDebInfo \
-S D:/projects/TheRock \
-B D:/projects/TheRock/build \
-G Ninja
cmake --build D:/projects/TheRock/build --target hip-clr
# [build] Build finished with exit code 0
cmake --build D:/projects/TheRock/build --target ocl-clr+dist
# [build] Build finished with exit code 0
```
Replicating https://github.com/ROCm/TheRock/pull/2147#discussion_r2528008441
## Motivation
Fixes https://github.com/ROCm/TheRock/issues/875, where Windows builds fail randomly with the `SignatureDoesNotMatch` error when uploading to S3. The cause is special characters in the AWS access keys generated by the `configure-aws-credentials` action, which are passed to `aws-cli` through Windows environment variables. More details below.
## Technical Details
https://github.com/ROCm/TheRock/issues/875#issuecomment-3530851762
In summary, Windows workflows now set the `special-characters-workaround` option to true for the `configure-aws-credentials` action. With this option the action regenerates access keys until they contain no special characters, since such characters may not pass through Windows environment variables correctly.
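The "regenerate until clean" behaviour can be illustrated with a minimal shell sketch. It assumes that "special" means any character other than ASCII letters and digits, which may differ from the action's real definition. The function names are hypothetical, and a fixed list of candidate keys stands in for repeated credential requests.

```bash
# Returns 0 (true) if the key contains a character other than letters
# and digits. Assumption: the real action may use a different rule.
has_special_chars() {
  case "$1" in
    *[!A-Za-z0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Walks the candidate keys in order and prints the first one without
# special characters. Returns 1 if every candidate is rejected.
first_clean_key() {
  local key
  for key in "$@"; do
    if ! has_special_chars "$key"; then
      echo "$key"
      return 0
    fi
  done
  return 1
}
```

Calling `first_clean_key 'ab/cd+ef' 'abc123'` skips the first candidate because of `/` and `+` and prints `abc123`.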
## Test Plan
Observe CI.
## Test Result
TBD.
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* Update workflow files to use general public rocm dev build images from dockerhub.
The old method borrowed rocprofiler-systems images, but they no longer contain a rocm install, so we cannot rely on them.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Add workflow files to paths on push and PR
* Revert change of image for red hat variant because the image offered in official rocm image release is too large for runners.
Going back to using systems team images and installing rocm on them (as they do) as a workaround until we can get a smaller package size docker image with ROCm included.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Adjusted python3-devel install line with an if else determined by distro version.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
---------
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: jbonnell-amd <jason.bonnell@amd.com>
* Fix cmake formatting
* Updated rev. in `.pre-commit-config.yaml`
* Pin the gersemi used in CI to v0.23.1, matching the pre-commit
---------
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* Add clean up of buffered_storage files
* Add step to workflows to test for remaining temp files after tests
* Applied suggestions from code review
* add deletion of all cache files
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
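A post-test check for leftover temp files, as added to the workflows above, might look roughly like the following. The function name, the directory argument and the `buffered_storage*` glob are assumptions for illustration; the actual file names and locations are not given in the commit.

```bash
# Hypothetical check: fails if any regular file matching the glob is
# still present directly inside the given directory.
# $1: directory to inspect, $2: file name glob
check_no_leftovers() {
  local dir="$1" pattern="$2" leftovers
  leftovers="$(find "$dir" -maxdepth 1 -type f -name "$pattern")"
  if [ -n "$leftovers" ]; then
    echo "leftover temp files:" >&2
    echo "$leftovers" >&2
    return 1
  fi
}
```

A workflow step could call `check_no_leftovers /tmp 'buffered_storage*'` after the test run and let the non-zero exit status fail the job.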
* Update rocprofiler_config_interfaces.cmake to use different elf naming
* try out conditional for libelf
* run cmake-format to fix formatting issue
* Remove libelf.patch file from therock-ci-windows.yml
* Remove libelf patch from therock-ci-linux.yml as well
- Update the pinned SHA for TheRock in CI workflows.
- Update the version for actions in those same workflows.
- Comment out the rm .patch line and provide details on its use.
Motivation:
Basic runners are frequently running out of space
Technical Details:
Running autoclean after package installations.
Use the jlumbroso/free-disk-space action.
* [ROCProfiler-SDK] Remove 'gfx900' and 'gfx940' from GPU targets
* Remove unsupported GPU targets from workflow
* Remove gfx900 and gfx940 from GPU targets
* Change how cache manager handles child process trace cache
* Sampling and backtrace metrics to cache
* Apply cmake formatting
* Fix parsing of metadata json
* Code clean up
* Fix build nlohmann json from source
* Fix storage parsed finished callback
* Revert sampling for child process
* Change cache file name generating
* Fix thread start stop
* Fix process start end timestamp
* Applied suggestions from code review
* Try with late start of flushing task thread
* Change dockerfiles for ci
* Revert changes on github workflows
* Remove json_fwd.hpp include
* fix dump
* Build nlohmann/json by default
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Update location of build artifacts for nlohmann/json
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Revert use_output_suffix
* Remove unused logs
* Fix cache store inside counter due to structure change
* Remove decode tests from debian ci
* Fix issue where all databases have the same UUID (#1499)
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Removing the cpack and install steps to save space
* Revert "Remove decode tests from debian ci"
This reverts commit ddabf6dd142dcf438e6b8997b8abe86f2c868468.
* Revert "Removing the cpack and install steps to save space"
This reverts commit 973da3a1ba99d99d529af5269d30e177092f9bfa.
* Add prepare-runner job as dependency to clean up the space
* Fix formatting
* Free up even more space
* Remove verbose for workflows
* remove hw_counters from ext_data
* move space clean up inside container
* try to remove external folder to free up space
* Check space
* Refactor Cleanup to its own step
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Djordjevic <aleksandar.djordjevic@amd.com>
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Upgrade min python version from 3.8 to 3.9
* Set min version for textual-fspicker for TUI support
* Update workflows to use python 3.9 instead of 3.8
* fix formatting
* fix bug
---------
Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
* migrate docs update workflow from rocm-libraries
* add test branch to the trigger condition
* modify docs to test workflow
* temporarily rename project folder name to match the test project
* add more content for testing
* test successful, restore test modifications
* Add GHCR retry logic
* Add retries to Install ROCm Packages step in rocprofiler-systems-redhat.yml
* Update containers-ci.yml file to use latest RHEL9/10 releases
* Use build-docker-ci script in rocprofiler-systems-containers
* Remove working-directory from step in rocprofiler-systems-redhat.yml
* Remove shell bash from Install ROCm Packages step
* Revert RHEL version change in rocprofiler-systems-redhat.yml
* Try outputting LastTest.log
* Update if condition for outputting log
* Another attempt
* Only run Ubuntu Noble on MI355 in push/PR
* Try exclude matrix
* Move conditional statement in matrix exclusion
* Create ci-matrix.yml file
* Add needs parameter to ubuntu job
* Fix typo in matrix output variable
* Add back pull_request_template.md
* Add back pull_request_template.md
* Initial steps added for rocprofiler-systems-continuous-integration.yml
* Add new line to end of rocprofiler-systems-continuous-integration.yml
* Fix matrix issue in rocprofiler-systems CI workflow
* Update runner to use mi355
* Remove sudo from ROCm download step
* Add Python venv
* Try to install python venv
* Add -y to pip venv install commands
* Add shell: bash to download ROCm step
* Fix issue in if statement
* Fix typo in mv command
* Fix mv command
* Update paths
* add directory in install step
* Use default runner for now while debugging setup
* Add set -e to steps
* debug build step
* Add amdgpu install step
* remove working-directory from amdgpu install step
* add path/ld lib path, add -S argument to run-ci.py
* Fix typo in DCMAKE_PREFIX_PATH
* Add DGPU_TARGETS to run-ci.py command
* add Docker options, remove GPU_TARGETS
* Install amd-smi-lib
* Add DCMAKE_BUILD_TYPE, update path
* Remove mkdir
* Add build Dyninst cmake arguments
* Update cmake arguments again
* Add missing \ to run-ci.py command
* add libdw dependency
* Add later install step
* Increase timeout of configure/build/test step
* use 16 jobs to try and speed up pipeline time
* Add GHCR image, remove TheRock tarball download step, minor changes for debugging
* Add credentials to container portion of step
* Add package read permissions to ubuntu step
* Update tarball name
* Increase jobs to 16, disable some tests for now due to timeouts
* Modify to only include gpu tests
* Fix configuration
* Enable MPI on run-ci.py run
* Add install MPI step, changed tests to be run
* Enable OMPI flags, enable network counter access
* Use new Docker image names, add privileged option to Docker
* Change cmake build type
* Add fail-fast false option for CI
* Update ROCM_VERSION variable to reflect docker changes
* Specify TARBALL_ROCM_VERSION as separate
* Add MI325 to debug pipeline errors
* Move location of env variables
* Only test on jammy for now, run all tests to assess other issues
* test with branch that contains fix for openmp
* Exclude "ompvv"
We will re-add once ticket is fixed.
* Test: Disable USE_MPI
* Replace TheRock ROCm install with rocm-dev for now
* Try out MI355 noble and MI325 for jammy/noble
* Update amdgpu step to support different ROCm versions
* Remove unused env variables
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
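The retry logic added above for GHCR access and for the Install ROCm Packages step can be sketched as a generic wrapper. This is a minimal sketch, not the code from the workflows: the `retry` function, its argument order and the fixed delay are assumptions.

```bash
# retry <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or max_attempts is reached.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}
```

A step could then wrap a flaky command, for example `retry 3 10 docker pull "$IMAGE"`, where `IMAGE` is a placeholder for the image reference.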
* TheRock CI points to rocm systems
* Fixing depth
* Fixing cache path
* Adding core components
* Adding more packages
* try this for windows building
* Add math libs
* Adding core only
* Attempt with no ccache
* adding patching
* Adding ls test
* adding this
* removing ls test
* changing dir name
* Adding cleanup for patch
* Adding ref
* adding correct no include
* Adding new temp branch for testing
* empty commit
* empty commit
* Adding commit hash bump
* Adding new hash for removed patches
* Adding TheRock submodule bump
* trying with compiler removed test
* Try dvc pull windows
* Update .github/workflows/therock-ci-linux.yml
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
* Adding correct env
* revert to ../
* Adding path
* try new var
* Adding new branch
* Adding correct hash
* Update .github/workflows/therock-ci-linux.yml
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
* Update .github/workflows/therock-ci-windows.yml
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
---------
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
- Rename the GHCR packages for rocprofiler Docker images to reduce the number of packages that will be released on the repository
- Changed package name to only include the OS instead of OS+Version - version moved to the tag instead.
- Updated Dockerfile.*.ci files to specify target ROCm version from tarball in name.
- Fixed 404 Not Found errors when downloading dependencies in the "Get the latest therock build" step by running `sudo apt-get update` first.
- Added `sudo apt-get update` to the rocprofiler-sdk-build-ci-docker-images.yml workflow.