## Overview and rationale
This reverts https://github.com/ROCm/rocm-systems/pull/1886, which...
* Re-applies https://github.com/ROCm/rocm-systems/pull/1866
* Reverts https://github.com/ROCm/rocm-systems/pull/1728
(So it restores the [`amdgpu-windows-interop/`](https://github.com/ROCm/rocm-systems/tree/develop/shared/amdgpu-windows-interop) folder to its state from a few weeks ago.)
The rationale for this change is at https://github.com/ROCm/rocm-systems/pull/1866:
> Last PAL update broke applications on gfx12 Windows.
## Cross-repository change details
That PR failed to build but was merged with this explanation:
> TheRock CI Windows build fails as expected with this revert.
>
> References to these PAL members need to be stripped out in a patch on TheRock.
>
> ```
> 11.3 C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(152): error C2039: 'RegisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(195): error C2039: 'UnregisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> ```
The patch in TheRock was updated in https://github.com/ROCm/TheRock/pull/2154. This rolls forward by updating the ref for TheRock.
That original PR could have been sequenced differently to avoid a build break - perhaps by
* Pointing to a branch in TheRock with the patch rebased
* Deleting the patch in the workflows here but holding a local copy of the patch to be applied in workflows
* Landing the patch as a normal commit instead of carrying it at all
## Test plan
1. Watch TheRock CI here (https://github.com/ROCm/rocm-systems/actions/runs/19447202693/job/55644411119?pr=1893)
2. Build locally:
```bash
# In rocm-systems
# Forward slashes: an unquoted backslash is an escape character in bash.
git am --whitespace=nowarn D:/projects/TheRock/patches/amd-mainline/rocm-systems/0001-Revert-SWDEV-543498-Some-compute-Ubertrace-profiles-.patch
git am --whitespace=nowarn D:/projects/TheRock/patches/amd-mainline/rocm-systems/0003-Use-is_versioned-true-consistently-in-both-Comgr-Loa.patch
git am --whitespace=nowarn D:/projects/TheRock/patches/amd-mainline/rocm-systems/0006-Explicitly-load-libamdhip64.so.7.patch
# Note: the build fails with the observed errors if patch 0001 is not applied!
# In TheRock
# IMPORTANT: THEROCK_ROCM_SYSTEMS_SOURCE_DIR must point at the rocm-systems checkout patched above.
# (A comment cannot follow a trailing backslash, so the note lives here.)
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=cl.exe -DCMAKE_CXX_COMPILER=cl.exe \
-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DPython3_EXECUTABLE=d:/projects/TheRock/.venv/Scripts/python \
-DTHEROCK_ROCM_SYSTEMS_SOURCE_DIR=d:/projects/TheRock/../rocm-systems \
-DTHEROCK_AMDGPU_FAMILIES=gfx110X-all \
-DBUILD_TESTING=ON \
-DTHEROCK_ENABLE_ALL=ON \
-Damd-llvm_BUILD_TYPE=RelWithDebInfo \
-S D:/projects/TheRock \
-B D:/projects/TheRock/build \
-G Ninja
cmake --build D:/projects/TheRock/build --target hip-clr
# [build] Build finished with exit code 0
cmake --build D:/projects/TheRock/build --target ocl-clr+dist
# [build] Build finished with exit code 0
```
Replicating https://github.com/ROCm/TheRock/pull/2147#discussion_r2528008441
## Motivation
Fixes https://github.com/ROCm/TheRock/issues/875, where Windows builds would fail randomly with a `SignatureDoesNotMatch` error when uploading to S3. The cause is special characters in the AWS access keys generated by the `configure-aws-credentials` action, which are passed to `aws-cli` through Windows environment variables. More details below.
## Technical Details
https://github.com/ROCm/TheRock/issues/875#issuecomment-3530851762
In summary: in Windows workflows, the `special-characters-workaround` option of the `configure-aws-credentials` action is set to true. The action then regenerates access keys until they contain no special characters that might not pass through Windows environment variables correctly.
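The "regenerate until clean" idea can be sketched in a few lines of shell. This is only an illustration, not the action's implementation: `is_env_safe` and `fetch_safe_key` are hypothetical helpers, the key generator is whatever command the caller passes in, and treating every non-alphanumeric character as "special" is an assumption.

```bash
# Succeeds only if the key is non-empty and purely alphanumeric.
# ASSUMPTION: "special" means any character outside A-Z, a-z, 0-9.
is_env_safe() {
  case "$1" in
    '' | *[!A-Za-z0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage: fetch_safe_key MAX_ATTEMPTS COMMAND [ARGS...]
# COMMAND is a hypothetical generator that prints one candidate key per call.
# The body runs in a subshell (parentheses) so its variables stay local.
fetch_safe_key() (
  max_attempts="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    key="$("$@")"
    if is_env_safe "$key"; then
      printf '%s\n' "$key"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
)
```

Capping the number of attempts keeps a persistently failing generator from hanging the job.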
## Test Plan
Observe CI.
## Test Result
TBD.
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* Update workflow files to use general public rocm dev build images from dockerhub.
The old method was to borrow rocprofiler-systems images, but they no longer contain a ROCm install, so we cannot rely on them.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Add workflow files to paths on push and PR
* Revert the image change for the Red Hat variant because the image offered in the official ROCm image release is too large for the runners.
Going back to using systems team images and installing rocm on them (as they do) as a workaround until we can get a smaller package size docker image with ROCm included.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Adjusted python3-devel install line with an if else determined by distro version.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
---------
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: jbonnell-amd <jason.bonnell@amd.com>
* Fix cmake formatting
* Updated rev. in `.pre-commit-config.yaml`
* Pin the gersemi used in CI to v0.23.1, matching the pre-commit
---------
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
* Add clean up of buffered_storage files
* Add step to workflows to test for remaining temp files after tests
* Applied suggestions from code review
* add deletion of all cache files
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
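A post-test check for leftover temporary files, as described in the commits above, might look roughly like the sketch below. The helper name `assert_no_leftovers` and the file-name pattern in the usage note are hypothetical; the real workflow step and the actual naming of the buffered storage files may differ.

```bash
# Usage: assert_no_leftovers DIR PATTERN
# Fails (non-zero) and lists the offenders if any regular file under DIR
# matches the glob PATTERN. Runs in a subshell so variables stay local.
assert_no_leftovers() (
  dir="$1"
  pattern="$2"
  leftovers="$(find "$dir" -type f -name "$pattern" 2>/dev/null)"
  if [ -n "$leftovers" ]; then
    printf 'Leftover temporary files:\n%s\n' "$leftovers" >&2
    return 1
  fi
  return 0
)
```

A workflow step could call it after the tests, for example `assert_no_leftovers /tmp 'buffered_storage*'` (pattern assumed for illustration), so that a missed cleanup fails the job instead of silently filling the runner.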
* Update rocprofiler_config_interfaces.cmake to use different elf naming
* try out conditional for libelf
* run cmake-format to fix formatting issue
* Remove libelf.patch file from therock-ci-windows.yml
* Remove libelf patch from therock-ci-linux.yml as well
- Update the pinned SHA for TheRock in CI workflows.
- Update the version for actions in those same workflows.
- Comment out the rm .patch line and provide details on its use.
Motivation:
Basic runners are frequently running out of space
Technical Details:
Running autoclean after package installations.
Use the jlumbroso/free-disk-space action.
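To make the effect of the cleanup visible, a workflow could guard expensive steps with a free-space check. The following is a minimal sketch under assumptions: `free_kib` and `require_free_kib` are hypothetical helpers, and any threshold passed in is a placeholder rather than a value taken from these workflows.

```bash
# Prints the free space, in KiB, of the filesystem holding PATH
# (default: current directory). Relies on POSIX `df -Pk` output, where
# the 4th column of the second line is the available space.
free_kib() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Usage: require_free_kib MIN_KIB [PATH]
# Fails with a message on stderr when less than MIN_KIB is available.
require_free_kib() (
  min="$1"
  avail="$(free_kib "${2:-.}")"
  if [ "$avail" -lt "$min" ]; then
    printf 'Only %s KiB free, need %s KiB\n' "$avail" "$min" >&2
    return 1
  fi
)
```

Running the check before and after the autoclean step shows how much space each cleanup actually recovers.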
* [ROCProfiler-SDK] Remove 'gfx900' and 'gfx940' from GPU targets
* Remove unsupported GPU targets from workflow
* Remove gfx900 and gfx940 from GPU targets
* Change how cache manager handles child process trace cache
* Sampling and backtrace metrics to cache
* Apply cmake formatting
* Fix parsing of metadata json
* Code clean up
* Fix build nlohmann json from source
* Fix storage parsed finished callback
* Revert sampling for child process
* Change cache file name generating
* Fix thread start stop
* Fix process start end timestamp
* Applied suggestions from code review
* Try with late start of flushing task thread
* Change dockerfiles for ci
* Revert changes on github workflows
* Remove json_fwd.hpp include
* fix dump
* Build nlohmann/json by default
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Update location of build artifacts for nlohmann/json
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Revert use_output_suffix
* Remove unused logs
* Fix cache store inside counter due to structure change
* Remove decode tests from debian ci
* Fix issue where all databases have the same UUID (#1499)
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Removing the cpack and install steps to save space
* Revert "Remove decode tests from debian ci"
This reverts commit ddabf6dd142dcf438e6b8997b8abe86f2c868468.
* Revert "Removing the cpack and install steps to save space"
This reverts commit 973da3a1ba99d99d529af5269d30e177092f9bfa.
* Add prepare-runner job as dependency to clean up the space
* Fix formatting
* Free up even more space
* Remove verbose for workflows
* remove hw_counters from ext_data
* move space clean up inside container
* try to remove external folder to free up space
* Check space
* Refactor Cleanup to its own step
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Djordjevic <aleksandar.djordjevic@amd.com>
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Upgrade min python version from 3.8 to 3.9
* Set min version for textual-fspicker for TUI support
* Update workflows to use python 3.9 instead of 3.8
* fix formatting
* fix bug
---------
Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
* migrate docs update workflow from rocm-libraries
* add test branch to the trigger condition
* modify docs to test workflow
* temporarily rename project folder name to match the test project
* add more content for testing
* test successful, restore test modifications
* Add GHCR retry logic
* Add retries to Install ROCm Packages step in rocprofiler-systems-redhat.yml
* Update containers-ci.yml file to use latest RHEL9/10 releases
* Use build-docker-ci script in rocprofiler-systems-containers
* Remove working-directory from step in rocprofiler-systems-redhat.yml
* Remove shell bash from Install ROCm Packages step
* Revert RHEL version change in rocprofiler-systems-redhat.yml
* Try outputting LastTest.log
* Update if condition for outputting log
* Another attempt
* Only run Ubuntu Noble on MI355 in push/PR
* Try exclude matrix
* Move conditional statement in matrix exclusion
* Create ci-matrix.yml file
* Add needs parameter to ubuntu job
* Fix typo in matrix output variable
* Add back pull_request_template.md
* Add back pull_request_template.md
* Initial steps added for rocprofiler-systems-continuous-integration.yml
* Add new line to end of rocprofiler-systems-continuous-integration.yml
* Fix matrix issue in rocprofiler-systems CI workflow
* Update runner to use mi355
* Remove sudo from ROCm download step
* Add Python venv
* Try to install python venv
* Add -y to pip venv install commands
* Add shell: bash to download ROCm step
* Fix issue in if statement
* Fix typo in mv command
* Fix mv command
* Update paths
* add directory in install step
* Use default runner for now while debugging setup
* Add set -e to steps
* debug build step
* Add amdgpu install step
* remove working-directory from amdgpu install step
* add path/ld lib path, add -S argument to run-ci.py
* Fix typo in DCMAKE_PREFIX_PATH
* Add DGPU_TARGETS to run-ci.py command
* add Docker options, remove GPU_TARGETS
* Install amd-smi-lib
* Add DCMAKE_BUILD_TYPE, update path
* Remove mkdir
* Add build Dyninst cmake arguments
* Update cmake arguments again
* Add missing \ to run-ci.py command
* add libdw dependency
* Add later install step
* Increase timeout of configure/build/test step
* use 16 jobs to try and speed up pipeline time
* Add GHCR image, remove TheRock tarball download step, minor changes for debugging
* Add credentials to container portion of step
* Add package read permissions to ubuntu step
* Update tarball name
* Increase jobs to 16, disable some tests for now due to timeouts
* Modify to only include gpu tests
* Fix configuration
* Enable MPI on run-ci.py run
* Add install MPI step, changed tests to be run
* Enable OMPI flags, enable network counter access
* Use new Docker image names, add privileged option to Docker
* Change cmake build type
* Add fail-fast false option for CI
* Update ROCM_VERSION variable to reflect docker changes
* Specify TARBALL_ROCM_VERSION as separate
* Add MI325 to debug pipeline errors
* Move location of env variables
* Only test on jammy for now, run all tests to assess other issues
* test with branch that contains fix for openmp
* Exclude "ompvv"
We will re-add it once the ticket is fixed.
* Test: Disable USE_MPI
* Replace TheRock ROCm install with rocm-dev for now
* Try out MI355 noble and MI325 for jammy/noble
* Update amdgpu step to support different ROCm versions
* Remove unused env variables
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
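Several commits in the list above add retries around flaky network steps (GHCR access, the Install ROCm Packages step). A generic retry wrapper could be sketched as follows; `retry` is a hypothetical helper, not the code that landed in the workflows, and the attempt count and delay are up to the caller.

```bash
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The body runs in a subshell, so COMMAND cannot change the caller's
# shell variables; file and network side effects still apply.
retry() (
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf 'Giving up after %s attempts: %s\n' "$attempt" "$*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
)
```

For example, `retry 3 10 docker pull "$IMAGE"` would try the pull up to three times with ten seconds between attempts, where `$IMAGE` stands for whichever image the job needs.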
* TheRock CI points to rocm systems
* Fixing depth
* Fixing cache path
* Adding core components
* Adding more packages
* try this for windows building
* Add math libs
* Adding core only
* Attempt with no ccache
* adding patching
* Adding ls test
* adding this
* removing ls test
* changing dir name
* Adding cleanup for patch
* Adding ref
* adding correct no include
* Adding new temp branch for testing
* empty commit
* empty commit
* Adding commit hash bump
* Adding new hash for removed patches
* Adding TheRock submodule bump
* trying with compiler removed test
* Try dvc pull windows
* Update .github/workflows/therock-ci-linux.yml
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
* Adding correct env
* revert to ../
* Adding path
* try new var
* Adding new branch
* Adding correct hash
* Update .github/workflows/therock-ci-linux.yml
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
* Update .github/workflows/therock-ci-windows.yml
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
---------
Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
- Rename the GHCR packages for rocprofiler Docker images to reduce the number of packages that will be released on the repository
- Changed package name to only include the OS instead of OS+Version - version moved to the tag instead.
- Updated Dockerfile.*.ci files to specify target ROCm version from tarball in name.
- The Get the latest therock build step hit 404 Not Found errors when downloading dependencies. Running `sudo apt-get update` first avoids this.
- Added `sudo apt-get update` to the rocprofiler-sdk-build-ci-docker-images.yml workflow.
This updates the CI to match TheRock's CI by introducing a
Python script that reports on health status. Commands that were in the
section that modified the status were moved to separate sections
(ccache, git config, ..).
Related PR:
https://github.com/ROCm/TheRock/pull/1516
* Initial skeleton code for rocprofiler-systems-continuous-integration.yml
* Add python3-devel to opensuse and rhel ci images
* Update rocprofiler-systems-containers.yml to include TheRock tarballs
* Update pip install command for Dockerfile.ubuntu.ci
* Fix pip install again for Dockerfile.ubuntu.ci
* Remove skeleton workflow for CI
* Add new ci-gfx containers for TheRock installs
* Add set -e and pipefail to ci Dockerfiles to detect errors
* Upgrade pip in Dockerfile.ubuntu.ci
* revert pipefail set -e change
* Replace build-docker-ci.sh script with Docker step for ci-base
* Add support for gfx950, add containers-ci-gfx.yml
* Add working-directory to matrix setup steps
* Try changing containers-ci-gfx.yml
* make more changes to containers-ci-gfx.yml
* Remove build-docker-ci.sh script from gfx step, fix typo in Dockerfile
* Remove gfx110X and gfx120X for now
* Update ci-gfx docker workflow to use ghcr.io
* Temporary change to test one image
* Enable push to test out ghcr package
* Add labels to debug oauth issue
* add packages permissions to step
* add rocprofiler-systems-ghcr.yml workflow
* Remove cache from Docker push action step
* Add prefix to tag
* Add back gfx94X and gfx950 support, add back no push on PR
* Remove gfx container creation from rocprofiler-systems-containers.yml
* Add a gfx950 image for now
* Revert change
* Replace cmake-format with gersemi in rocprofiler-compute-formatting.yml
* Run gersemi formatting on CMakeLists.txt files
* Remove .cmake-format.yaml, add .gersemirc file
* Add more options to .gersemirc
* Add new line to .gersemirc
* Add new line to CMakeLists.txt
* Run gersemi again with new options