* Add verbose output for submods step
* Remove git config setting
* Determine git version
* Try different git install
* Update Dockerfile.ci
* Revert git location in Ubuntu jobs
* Update RHEL and SLES sections to use 2.52 as well
* Add git --version to each step, fix typo in SLES Docker
## Motivation
The `rocprof-sys-avail -H -c GPU` command returns blank output instead of the expected list of available GPU hardware counters.
The `rocprof-sys-sample` and `rocprof-sys-run` commands are missing the `--gpu-events` option for specifying GPU counter events during profiling.
## Technical Details
The initialize_event_info() function had a logic bug where it only called set_agents() if the agent_manager was empty, but the actual issue was that the gpu_agents and cpu_agents vectors were empty even when agents were discovered.
Fixed the conditional logic to properly call set_agents() when gpu_agents and cpu_agents are empty, regardless of the agent_manager state.
Added the `--gpu-events (-G)` option which sets the `ROCPROFSYS_ROCM_EVENTS` environment variable to the specified values.
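A minimal sketch of how a launcher option can be forwarded to the profiler through an environment variable. Only the option names (`--gpu-events`, `-G`) and the variable `ROCPROFSYS_ROCM_EVENTS` come from this change; the function `parse_gpu_events`, the parser setup, and the example counter names are assumptions for illustration and are not the actual rocprof-sys implementation.

```python
import argparse
import os


def parse_gpu_events(argv, env=None):
    """Parse a --gpu-events/-G option and export its value to `env`.

    Hypothetical helper: the value is passed through verbatim, so the
    delimiter between events is whatever the caller typed.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="gpu-events-sketch")
    parser.add_argument("-G", "--gpu-events", default=None, metavar="EVENTS")
    # Unknown arguments (for example the target application) are returned as-is.
    args, remaining = parser.parse_known_args(argv)
    if args.gpu_events:
        env["ROCPROFSYS_ROCM_EVENTS"] = args.gpu_events
    return remaining


if __name__ == "__main__":
    sandbox_env = {}
    rest = parse_gpu_events(["-G", "SQ_WAVES,GRBM_COUNT", "./app"], sandbox_env)
    print(sandbox_env, rest)
```

Passing a plain dict as `env` keeps the sketch side-effect free; leaving it unset writes to the real process environment so that a child process would inherit the variable.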
Fixes an issue where an unsupported GPU/APU architecture was not being skipped gracefully; more details about this issue are in the comment below.
* Update device.h for hip_bfloat16 inclusion guard
Prevents other files in ROCm from including the old hip/hip_bfloat16.h, which is guarded by _HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BFLOAT16_H_ and _HIP_BFLOAT16_H_
* Update device.h to handle old hip_bfloat16.h
Added a workaround for old hip_bfloat16.h header usage.
[ROCm/rccl commit: 8e4dbfdf37]
* Replace O(n²) nested loop with O(1) dictionary lookup when associating
metric values with metrics. Pre-group values by (metric_id, kernel_name)
to eliminate redundant iteration over entire values dataframe for each
metric-kernel combination.
* This optimization significantly improves database write performance for
workloads with large numbers of metrics and kernels.
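The pre-grouping idea can be sketched as follows. This is an illustration under stated assumptions, not the code from this change: rows are modeled as plain dicts rather than a dataframe, and the names `group_values` and `attach_values` are hypothetical.

```python
from collections import defaultdict


def group_values(values):
    """Group value rows by (metric_id, kernel_name) in a single pass."""
    grouped = defaultdict(list)
    for row in values:
        grouped[(row["metric_id"], row["kernel_name"])].append(row["value"])
    return grouped


def attach_values(metric_ids, kernel_names, values):
    """Associate values with each metric-kernel pair via dictionary lookup.

    The values collection is scanned once up front instead of once per
    metric-kernel combination.
    """
    grouped = group_values(values)
    return {
        (metric_id, kernel): grouped.get((metric_id, kernel), [])
        for metric_id in metric_ids
        for kernel in kernel_names
    }


if __name__ == "__main__":
    rows = [
        {"metric_id": 1, "kernel_name": "gemm", "value": 0.5},
        {"metric_id": 1, "kernel_name": "gemm", "value": 0.7},
        {"metric_id": 2, "kernel_name": "copy", "value": 3.0},
    ]
    print(attach_values([1, 2], ["gemm", "copy"], rows))
```

Each lookup is an average-case O(1) hash access, so the total cost is one pass over the values plus one lookup per metric-kernel pair.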
* Standalone roofline should create HTML instead of PDF
* Eliminate the dependency on kaleido and plotly_get_chrome by moving
towards plotly native HTML image roofline chart generation
* Address review comments
SWDEV-539526 - Add support for Mipmapped Array in Rocr
Add support for Mipmapped Array functionality in the ROCr Runtime, enabling GPU applications to work with multi-level texture mipmaps. The implementation introduces new public APIs for creating, querying, and managing mipmapped arrays across different GPU architectures.
Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>
Co-authored-by: Shweta Khatri <shweta.khatri@amd.com>
Co-authored-by: taosang2 <tao.sang@amd.com>
The config_hashes JSON had mismatched MD5s for the delta_hash values; regenerated the file from the existing files in the develop branch.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Error out when IPC gets selected when it is impossible to run it.
* Use RTLD_LAZY when dlopening
* Do not dlclose libbnxt/ionic/mlx5.so as that breaks libibverbs
[ROCm/rocshmem commit: 47f6fa6267]
* Data imputation strategy for iteration multiplexing
* Implement data imputation methodology to handle missing counter values
in case of iteration multiplexing
* Enable dispatch filtering with iteration multiplexing since we are no
longer merging dispatches
* Bugfix to prevent check for missing counter values when using csv
format when profiling with iteration multiplexing
* Move warning and info message in case of iteration multiplexing to
sanitize function which comes earlier in analyze mode
* Address review comments
* Fix typo in documentation
* Move profiling config init. after path check in sanitize()
* Graceful handling of dispatches with all counters empty within data
imputation logic
* Improve info message for iteration multiplexing based analysis
* Ensure proper error message when trying to run iteration multiplexing with attach/detach
* fix test case
Implements automatic device wake using the getDRMDeviceId() DRM call when GPUs
are detected in a low-power state. This ensures rocm-smi can access device
information on suspended GPUs.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Added python test runner to execute rccl tests
* Disabled capture output to avoid hangs
* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile
* Converted test_type to boolean gtest flag
* Removed unused return values
* Added custom rccl library usage
* Removed json output
* Updates to test_runner: added num_gpus field
* Address review comments
* Prepend env vars for single node, single process executions
* Added separate enums for exit and result codes
* Update configuration files
* Moved configurations to its own dir
* Address review comments
* Update tools/scripts/test_runner/README.md
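A rough sketch of two items from the list above: reading the hostfile path from `RCCL_TEST_MPI_HOSTFILE`, and keeping process exit codes separate from per-test result codes. Only the environment variable name comes from this change; the enum names, their members, and both helper functions are hypothetical and may not match the actual test runner.

```python
import enum
import os


class ExitCode(enum.IntEnum):
    """Hypothetical process exit codes for the runner itself."""
    OK = 0
    FAILURE = 1


class ResultCode(enum.Enum):
    """Hypothetical per-test outcomes, kept separate from exit codes."""
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"


def hostfile_from_env(env=None):
    """Return the MPI hostfile path from RCCL_TEST_MPI_HOSTFILE, or None."""
    env = os.environ if env is None else env
    return env.get("RCCL_TEST_MPI_HOSTFILE") or None


def exit_code_for(results):
    """Map a collection of per-test results to a single exit code."""
    if any(result is ResultCode.FAILED for result in results):
        return ExitCode.FAILURE
    return ExitCode.OK


if __name__ == "__main__":
    print(hostfile_from_env({"RCCL_TEST_MPI_HOSTFILE": "/tmp/hosts"}))
    print(exit_code_for([ResultCode.PASSED, ResultCode.SKIPPED]))
```

Separating the two enums lets a skipped test be reported as such without affecting the exit status.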
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
[ROCm/rccl commit: 0c2c61d2f1]
The scratch_size_per_wave_ and dispatch_waves_ should use
the maximum values from all packets in the batch.
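The intended behavior can be illustrated with a small model. The `Packet` type and `batch_scratch_requirements` function are hypothetical stand-ins; only the two field names mirror the members mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical stand-in for one dispatch packet in a batch."""
    scratch_size_per_wave: int
    dispatch_waves: int


def batch_scratch_requirements(packets):
    """Return (scratch_size_per_wave, dispatch_waves) sized for the whole batch.

    Each value is the maximum over all packets, taken independently, so the
    two maxima may come from different packets.
    """
    if not packets:
        raise ValueError("batch must contain at least one packet")
    return (
        max(p.scratch_size_per_wave for p in packets),
        max(p.dispatch_waves for p in packets),
    )


if __name__ == "__main__":
    batch = [Packet(64, 10), Packet(256, 4), Packet(128, 32)]
    print(batch_scratch_requirements(batch))
```

Using only the first or last packet's values could undersize the scratch allocation for other packets in the same batch, which is the case the per-batch maximum avoids.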
Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
Reviewed-by: Flora Cui <flora.cui@amd.com>
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
async copies since copy_engine_status may report engines as busy
2. Busy and preferred engine checks ran for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
suboptimal resource allocation
Solution:
Implemented a global SDMA engine allocator with per-stream affinity:
- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
* Maintains global map of active assignments
* Enforces exclusivity: different streams use different engines (except
inter-GPU copies where preferred engines are prioritized for optimal
hardware paths like XGMI links)
* Thread-safe allocation/release with Monitor lock
- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
for fast lookup without map access on hot path
- Refactored rocrCopyBuffer() to:
1. Check local cached engine first → use if assigned
2. Call AllocateSdmaEngine() if not assigned → cache result
- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
into AllocateEngine() for cleaner separation of concerns
- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
* Improves engine utilization by releasing earlier
* Added virtual ReleaseSdmaEngines() method to device::VirtualDevice
- Added future path for simple round-robin allocation (kUseSimpleRR) for
next-gen GPUs with uniform SDMA bandwidth (disabled by default)
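The allocation scheme described in this section can be modeled in a few lines. This is a toy sketch in Python rather than the runtime's C++, and it rests on assumptions: engines are plain integer indices, the busy set and preferred engine are passed in by the caller instead of being queried from HSA, and returning `None` when no engine is free is a placeholder for whatever fallback the real allocator uses. The `Stream` class is a hypothetical stand-in for VirtualGPU.

```python
import threading


class SdmaEngineAllocator:
    """Toy global map of stream -> engine with exclusivity between streams."""

    def __init__(self, num_engines):
        self._lock = threading.Lock()
        self._num_engines = num_engines
        self._assignments = {}  # stream id -> engine index

    def allocate(self, stream_id, busy=frozenset(), preferred=None, inter_gpu=False):
        with self._lock:
            if stream_id in self._assignments:
                return self._assignments[stream_id]
            if inter_gpu and preferred is not None:
                # Inter-GPU copies take the preferred engine even if shared.
                engine = preferred
            else:
                taken = set(self._assignments.values())
                free = [e for e in range(self._num_engines)
                        if e not in taken and e not in busy]
                if not free:
                    return None  # assumption: caller falls back elsewhere
                engine = preferred if preferred in free else free[0]
            self._assignments[stream_id] = engine
            return engine

    def release(self, stream_id):
        with self._lock:
            self._assignments.pop(stream_id, None)


class Stream:
    """Caches its assigned engine so the hot path skips the global map."""

    def __init__(self, stream_id, allocator):
        self.stream_id = stream_id
        self._allocator = allocator
        self.assigned_engine = None

    def engine_for_copy(self, **kwargs):
        if self.assigned_engine is None:
            self.assigned_engine = self._allocator.allocate(self.stream_id, **kwargs)
        return self.assigned_engine

    def finish(self):
        # Release on finish rather than only on destruction.
        self._allocator.release(self.stream_id)
        self.assigned_engine = None
```

Once a stream has an engine, later copies reuse it regardless of the busy status reported at that moment, which is what gives consecutive async copies on one stream a consistent engine.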
Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager
Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
- Fixes SWDEV-559349
- Fix build failure caused by correct libunwind not being found in some environments.
- Updated the `timemory` submodule to commit `24407d37ab85c46ba6c18fba9498320f825ee4e4`.
* Use static catch2.lib instead of catch2.dll
Using catch2.dll increases execution time by 12x
* handle debug option for static catch2
* SWDEV-573539 - skip atomics on Windows since it's taking a very long time to execute
mlsejenkins needs newer cmake but compiler breaks with newer versions
so skipping on windows can be a workaround for now
---------
Co-authored-by: Joseph Macaranas <145489236+jayhawk-commits@users.noreply.github.com>