## Overview and rationale
This reverts https://github.com/ROCm/rocm-systems/pull/1886, which...
* Re-applies https://github.com/ROCm/rocm-systems/pull/1866
* Reverts https://github.com/ROCm/rocm-systems/pull/1728
(So it restores the [`amdgpu-windows-interop/`](https://github.com/ROCm/rocm-systems/tree/develop/shared/amdgpu-windows-interop) folder to its state from a few weeks ago.)
The rationale for this change is at https://github.com/ROCm/rocm-systems/pull/1866:
> Last PAL update broke applications on gfx12 Windows.
## Cross-repository change details
That PR failed to build but was merged with this explanation:
> TheRock CI Windows build fails as expected with this revert.
>
> References to these PAL members need to be stripped out in a patch on TheRock.
>
> ```
> 11.3 C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(152): error C2039: 'RegisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\projects\clr\rocclr\device\pal\palubercapturemgr.cpp(195): error C2039: 'UnregisterTraceStateChangeCallback': is not a member of 'GpuUtil::TraceSession'
> 11.4 C:\home\runner\_work\rocm-systems\rocm-systems\shared\amdgpu-windows-interop\pal\inc\gpuUtil\palTraceSession.h(372): note: see declaration of 'GpuUtil::TraceSession'
> ```
The patch in TheRock was updated in https://github.com/ROCm/TheRock/pull/2154. This rolls forward by updating the ref for TheRock.
That original PR could have been sequenced differently to avoid a build break - perhaps by
* Pointing to a branch in TheRock with the patch rebased
* Deleting the patch in the workflows here but holding a local copy of the patch to be applied in workflows
* Landing the patch as a normal commit instead of carrying it at all
## Test plan
1. Watch TheRock CI here (https://github.com/ROCm/rocm-systems/actions/runs/19447202693/job/55644411119?pr=1893)
2. Build locally:
```bash
# In rocm-systems
# Paths are single-quoted so the shell keeps the backslashes literal
git am --whitespace=nowarn 'D:\projects\TheRock\patches\amd-mainline\rocm-systems\0001-Revert-SWDEV-543498-Some-compute-Ubertrace-profiles-.patch'
git am --whitespace=nowarn 'D:\projects\TheRock\patches\amd-mainline\rocm-systems\0003-Use-is_versioned-true-consistently-in-both-Comgr-Loa.patch'
git am --whitespace=nowarn 'D:\projects\TheRock\patches\amd-mainline\rocm-systems\0006-Explicitly-load-libamdhip64.so.7.patch'
# Note: the build fails with the observed errors if patch 0001 is not applied!
# In TheRock
# IMPORTANT: point THEROCK_ROCM_SYSTEMS_SOURCE_DIR at the rocm-systems checkout patched above
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=cl.exe -DCMAKE_CXX_COMPILER=cl.exe \
-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DPython3_EXECUTABLE=d:/projects/TheRock/.venv/Scripts/python \
-DTHEROCK_ROCM_SYSTEMS_SOURCE_DIR=d:/projects/TheRock/../rocm-systems \
-DTHEROCK_AMDGPU_FAMILIES=gfx110X-all \
-DBUILD_TESTING=ON \
-DTHEROCK_ENABLE_ALL=ON \
-Damd-llvm_BUILD_TYPE=RelWithDebInfo \
-S D:/projects/TheRock \
-B D:/projects/TheRock/build \
-G Ninja
cmake --build D:/projects/TheRock/build --target hip-clr
# [build] Build finished with exit code 0
cmake --build D:/projects/TheRock/build --target ocl-clr+dist
# [build] Build finished with exit code 0
```
Use fork/waitpid to isolate API call and detect SIGKILL from kernel
Signed-off-by: Sumanth Gavini <sumanth.gavini@amd.com>
[ROCm/amdsmi commit: a044536b8d]
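The fork/waitpid isolation named in the commit above can be sketched as follows. This is a minimal Python illustration of the general technique, not the amdsmi implementation: `run_isolated` and the callable passed to it are hypothetical names, and the sketch assumes a POSIX system where `os.fork` is available.

```python
import os
import signal


def run_isolated(call):
    """Run `call` in a forked child and report how the child ended.

    Hypothetical helper: returns ("signaled", signum) if the child was
    killed by a signal (for example SIGKILL), else ("exited", exit_code).
    """
    pid = os.fork()
    if pid == 0:
        # Child process: run the risky call, then leave immediately with
        # os._exit so none of the parent's cleanup handlers run here.
        try:
            call()
            os._exit(0)
        except BaseException:
            os._exit(1)

    # Parent process: block until the child terminates, then decode the
    # wait status to tell a normal exit apart from death by signal.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("exited", os.WEXITSTATUS(status))


if __name__ == "__main__":
    # A child that kills itself stands in for an API call that the
    # kernel terminates with SIGKILL.
    outcome = run_isolated(lambda: os.kill(os.getpid(), signal.SIGKILL))
    print(outcome)
```

Because the call runs in a separate process, a SIGKILL takes down only the child; the parent survives and can report the failure instead of dying with it.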
* Added Product Serial Number to the raw_bytes cper entries
* Added Product Serial Number to the Python API return
---------
Signed-off-by: Saeed, Oosman <Oosman.Saeed@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: 05ea00dcc4]
* [SWDEV-563828] Fix incorrect help text for --perf-determinism flag to indicate it expects GFXCLK frequency in MHz
---------
Signed-off-by: Billakanti, Koushik <Koushik.Billakanti@amd.com>
[ROCm/amdsmi commit: 23f68555db]
* Add XGMI and PCIe metrics to the profiling data
Add support for AMD XGMI (GPU-to-GPU interconnect) and PCIe metrics:
- XGMI metrics:
  * XGMI link width in bits
  * XGMI link speed in GT/s
  * Per-link read bandwidth (KB)
  * Per-link write bandwidth (KB)
- Add new categories for PCIe metrics:
  * PCIe link width
  * PCIe link speed in GT/s
  * Accumulated bandwidth (MB)
  * Instantaneous bandwidth (MB/s)
* Fix VCN/JPEG insert logic
* Modify the gpu_metrics struct to accommodate XCP structure
* Add ctest automation for gpu interconnect metrics
* Refactor to move gpu_metrics struct and serialization to another file
* Possible fix for timeout in CI
Fix redundant skip check in ctest
Add xgmi and pcie option in rocprof-sys-avail.
* Change2: Address review comments
Change ctest sampling to avoid timeout
Change variable name and code structuring
* Add option in ctest to run rocprof-sys-run without rewrite
Run transferbench with rocprof-sys-run without sampling
* Change3: Fix sample insert bug and address review comments
xgmi and pci support check
renaming variables
additional hip_api validation in rocpd
* Reduce the load from the TransferBench sample
The CI builds were timing out when flushing a big temporary file to the
DB: (2720824.23 KB / 2720.82 MB / 2.72 GB)...
Added check for GCC versions prior to 9.0 and
link with libstdc++fs when needed. This fixes
undefined symbols on older systems like Deb10
with GCC 8.3.0.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/amdsmi commit: e1b3d5f02e]
Replicating https://github.com/ROCm/TheRock/pull/2147#discussion_r2528008441
## Motivation
Fixes https://github.com/ROCm/TheRock/issues/875, where Windows builds fail randomly with a `SignatureDoesNotMatch` error when uploading to S3. The cause is special characters in the AWS access keys generated by the `configure-aws-credentials` action, which are passed to `aws-cli` through Windows environment variables. More details below.
## Technical Details
https://github.com/ROCm/TheRock/issues/875#issuecomment-3530851762
In summary, Windows workflows now set the `special-characters-workaround` option to true for the `configure-aws-credentials` action, which regenerates the access keys until they contain no special characters that might not pass through Windows environment variables correctly.
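The retry-until-clean idea can be sketched as below. This is only an illustration of the approach, not the action's actual code: `fetch_key`, `fetch_until_clean`, `has_special_characters`, and `max_attempts` are hypothetical names, and treating "special" as "anything other than ASCII letters and digits" is an assumption.

```python
import string

# Assumption: only ASCII letters and digits are considered safe to pass
# through Windows environment variables unchanged.
SAFE_CHARACTERS = set(string.ascii_letters + string.digits)


def has_special_characters(key):
    """Return True if `key` contains any character outside the safe set."""
    return any(ch not in SAFE_CHARACTERS for ch in key)


def fetch_until_clean(fetch_key, max_attempts=10):
    """Call `fetch_key` repeatedly until it returns a key with no special characters.

    `fetch_key` is a hypothetical zero-argument callable standing in for
    whatever requests a fresh set of credentials.
    """
    for _ in range(max_attempts):
        key = fetch_key()
        if not has_special_characters(key):
            return key
    raise RuntimeError(f"no clean key after {max_attempts} attempts")


if __name__ == "__main__":
    # Fake credential source: the first two keys contain '+' or '/'.
    fake_keys = iter(["ab+cd", "ef/gh", "ABC123xyz"])
    print(fetch_until_clean(lambda: next(fake_keys)))
```

Bounding the loop with `max_attempts` is a design choice of this sketch, so that a source that never yields a clean key fails loudly instead of spinning forever.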
## Test Plan
Observe CI.
## Test Result
TBD.
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* rocr: Fix exception on AsyncEventControl init
Fix exception on init when compiling in release mode.
* rocr: Fix crash when interrupts are disabled
Fix segfault due to assert for signal->EopEvent() being false when
HSA_ENABLE_INTERRUPT=0. Use Signal::WaitMultiple(..) when interrupt is
disabled.
---------
Co-authored-by: JeniferC99 <150404595+JeniferC99@users.noreply.github.com>
* SWDEV-533237 Added test cases for hipOccupancyAvailableDynamicSMemPerBlock API
* SWDEV-533237 : Added test cases for hipOccupancyAvailableDynamicSMemPerBlock
* SWDEV-533237 : Addressed review comments for hipOccupancyAvailableDynamicSMemPerBlock API test cases
---------
Co-authored-by: jainprad <92369414+jainprad@users.noreply.github.com>
Handled NUMA data, including CPU and socket list, bitmask,
and affinity, for CSV format.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
[ROCm/amdsmi commit: 1b027d15bd]
* Added Python & C APIs for new node devices. Currently these are functional for node 0 only.
- amdsmi_get_node_handle
- amdsmi_get_npm_info
* Added `amd-smi node` CLI for Node Power Management
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: f8e4771363]
* Forward ctest labels from the execution test to the validation test.
* Adjust test validation parameters for amd_smi samples
The actual number of samples will vary depending on the GPU. This test
is just to validate the presence of the samples
Changes:
- Simplified reset calls
- Updated static limit N/A values to all possible data
(helps csv format be consistent)
- Unit format was broken on static
- get_power_cap() had min/max values swapped, and the return
was missing two fields
- Updated changelog to reflect all changes
Change-Id: I23713471b984f52085372486c6e6ff852e2f42f8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 00a893d299]
- Updated python integration test to account for PPT1 support changes
- Updated set/reset power-cap input format
- Adjusted python API and updated C++ API test
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
Change-Id: Ia9d02868b6e91c88c10a9772d9e2d9f37c3c352f
[ROCm/amdsmi commit: 18faddf6f3]