Commit Graph

74743 Commits

Author SHA1 Message Date
Sajina PK b3f59a37e4 [Rocprofiler-system]: Fix GPU event enumeration for rocprof-sys-avail and CLI option for parsing GPU HW Counters (#2476)
## Motivation

The `rocprof-sys-avail -H -c GPU` command returns blank output instead of the expected list of available GPU hardware counters.
The `rocprof-sys-sample` and `rocprof-sys-run` tools are missing the `--gpu-events` option for specifying GPU counter events during profiling.

## Technical Details

The `initialize_event_info()` function had a logic bug: it called `set_agents()` only when `agent_manager` was empty, so the `gpu_agents` and `cpu_agents` vectors stayed empty even after agents had been discovered.
Fixed the conditional logic to call `set_agents()` whenever `gpu_agents` and `cpu_agents` are empty, regardless of the `agent_manager` state.
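A minimal Python sketch contrasting the old and new guard. The real code is C++ in rocprofiler-systems; the `EventInfo` class, the dict-based agent records, and the `legacy` switch are hypothetical stand-ins, and "empty" is read here as "both lists empty".

```python
class EventInfo:
    """Hypothetical stand-in for the state touched by initialize_event_info()."""

    def __init__(self, agent_manager):
        # agent_manager: agents already discovered, e.g. [{"type": "gpu"}, ...]
        self.agent_manager = agent_manager
        self.gpu_agents = []
        self.cpu_agents = []

    def set_agents(self):
        # Split the discovered agents into the per-type lists.
        self.gpu_agents = [a for a in self.agent_manager if a["type"] == "gpu"]
        self.cpu_agents = [a for a in self.agent_manager if a["type"] == "cpu"]

    def initialize_event_info(self, legacy=False):
        if legacy:
            # Old guard: once agents were discovered, set_agents() was skipped,
            # leaving gpu_agents/cpu_agents empty.
            if not self.agent_manager:
                self.set_agents()
        elif not self.gpu_agents and not self.cpu_agents:
            # New guard: populate the lists whenever they are empty,
            # regardless of the agent_manager state.
            self.set_agents()
```

With one GPU and one CPU agent discovered, the legacy path leaves `gpu_agents` empty, which matches the blank `rocprof-sys-avail` output, while the new path fills both lists.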

Added the `--gpu-events` (`-G`) option, which sets the `ROCPROFSYS_ROCM_EVENTS` environment variable to the specified values.
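The option-to-environment mapping can be sketched with Python's `argparse`. This is an illustration only: the real `rocprof-sys-sample`/`rocprof-sys-run` parsers are C++, `build_parser` and `apply_gpu_events` are hypothetical names, and the value is forwarded verbatim because the commit message does not describe any splitting or validation.

```python
import argparse
import os


def build_parser():
    # Hypothetical sketch; not the actual rocprof-sys command-line parser.
    parser = argparse.ArgumentParser(prog="rocprof-sys-run-sketch")
    parser.add_argument(
        "-G", "--gpu-events",
        help="GPU hardware counter events to collect",
    )
    return parser


def apply_gpu_events(argv, environ=None):
    """Parse argv and export --gpu-events as ROCPROFSYS_ROCM_EVENTS."""
    environ = os.environ if environ is None else environ
    # parse_known_args leaves any other tool options untouched.
    args, _unknown = build_parser().parse_known_args(argv)
    if args.gpu_events is not None:
        # Forward the value as-is; no splitting or validation here.
        environ["ROCPROFSYS_ROCM_EVENTS"] = args.gpu_events
    return environ
```

For example, `apply_gpu_events(["-G", "SQ_WAVES"], {})` yields `{"ROCPROFSYS_ROCM_EVENTS": "SQ_WAVES"}`; the counter name is illustrative.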

Fixes an issue so that unsupported GPU/APU architectures are skipped gracefully (see the PR discussion for more details).
2026-01-09 11:59:45 -05:00
Dingming Wu 4e15dc142c Update device.h for hip_bfloat16 inclusion guard (#2107)
* Update device.h for hip_bfloat16 inclusion guard

Prevents other files in ROCm from including the old hip/hip_bfloat16.h, which is guarded by _HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BFLOAT16_H_ and _HIP_BFLOAT16_H_

* Update device.h to handle old hip_bfloat16.h

Added a workaround for old hip_bfloat16.h header usage.

[ROCm/rccl commit: 8e4dbfdf37]
2026-01-09 09:45:47 -05:00
Dingming Wu 8e4dbfdf37 Update device.h for hip_bfloat16 inclusion guard (#2107)
* Update device.h for hip_bfloat16 inclusion guard

Prevents other files in ROCm from including the old hip/hip_bfloat16.h, which is guarded by _HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BFLOAT16_H_ and _HIP_BFLOAT16_H_

* Update device.h to handle old hip_bfloat16.h

Added a workaround for old hip_bfloat16.h header usage.
2026-01-09 09:45:47 -05:00
vedithal-amd ebe22b5907 Add pre-processor guards for rocflop (#2534) 2026-01-09 09:06:52 -05:00
vedithal-amd d65de0a203 Performance optimization of analysis database (#2557)
* Replace O(n²) nested loop with O(1) dictionary lookup when associating
metric values with metrics. Pre-group values by (metric_id, kernel_name)
to eliminate redundant iteration over entire values dataframe for each
metric-kernel combination.

* This optimization significantly improves database write performance for
workloads with large numbers of metrics and kernels.
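The pre-grouping idea can be sketched with plain dictionaries. The real code operates on a pandas dataframe (where `df.groupby(["metric_id", "kernel_name"])` builds equivalent buckets); the record layout and the `value` field below are assumptions for illustration.

```python
from collections import defaultdict


def associate_naive(metrics, values):
    # O(len(metrics) * len(values)): rescans every value for each metric.
    out = {}
    for m in metrics:
        key = (m["metric_id"], m["kernel_name"])
        out[key] = [v["value"] for v in values
                    if (v["metric_id"], v["kernel_name"]) == key]
    return out


def associate_grouped(metrics, values):
    # One pass to bucket values by (metric_id, kernel_name) ...
    groups = defaultdict(list)
    for v in values:
        groups[(v["metric_id"], v["kernel_name"])].append(v["value"])
    # ... then an average O(1) dictionary lookup per metric.
    return {
        (m["metric_id"], m["kernel_name"]):
            groups.get((m["metric_id"], m["kernel_name"]), [])
        for m in metrics
    }
```

Both functions return the same mapping; only the grouped version avoids rescanning the full set of values for every metric-kernel pair.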
2026-01-09 09:06:33 -05:00
vedithal-amd 51ba3c3a53 [rocprofiler-compute] Standalone roofline should create HTML instead of PDF (#2535)
* Standalone roofline should create HTML instead of PDF

* Eliminate the dependency on kaleido and plotly_get_chrome by switching
  to Plotly's native HTML output for roofline chart generation

* Address review comments
2026-01-09 09:05:49 -05:00
dependabot[bot] 12d9d45667 Bump urllib3 from 2.6.0 to 2.6.3 in /docs/sphinx (#383)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.6.0 to 2.6.3.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.6.0...2.6.3)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.6.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rocshmem commit: f9fc022ed5]
2026-01-09 08:27:43 -05:00
dependabot[bot] f9fc022ed5 Bump urllib3 from 2.6.0 to 2.6.3 in /docs/sphinx (#383)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.6.0 to 2.6.3.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.6.0...2.6.3)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.6.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-09 08:27:43 -05:00
Karthikeyan Arumugam 94499918b3 Add check for P2pPolicy for rocm-ib (#2122)
[ROCm/rccl commit: d0d00c33ee]
2026-01-09 11:33:05 +00:00
Karthikeyan Arumugam d0d00c33ee Add check for P2pPolicy for rocm-ib (#2122) 2026-01-09 11:33:05 +00:00
Flora Cui 029690f0a4 wsl/librocdxg: fix deb package name and add version macro
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
2026-01-09 13:05:25 +08:00
Wenkai Du 07453ebfaf Improve RCCL kernel coll trace (#2061)
[ROCm/rccl commit: 1d22c87167]
2026-01-08 16:07:18 -08:00
Wenkai Du 1d22c87167 Improve RCCL kernel coll trace (#2061) 2026-01-08 16:07:18 -08:00
Apurv Mishra be375c2dbf rocr: Add support for Mipmapped Array (#1847)
SWDEV-539526 - Add support for Mipmapped Array in Rocr

Add support for Mipmapped Array functionality in the Rocr Runtime, enabling GPU applications to work with multi-level texture mipmaps. The implementation introduces new public APIs for creating, querying, and managing mipmapped arrays across different GPU architectures.

Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>
Co-authored-by: Shweta Khatri <shweta.khatri@amd.com>
Co-authored-by: taosang2 <tao.sang@amd.com>
2026-01-08 17:14:39 -06:00
Wenkai Du 721c624de8 Remove iommu warning in KVM env (#2112)
* Remove iommu warning in KVM env

* Fix for review comments

[ROCm/rccl commit: de931f4c53]
2026-01-08 13:55:40 -08:00
Wenkai Du de931f4c53 Remove iommu warning in KVM env (#2112)
* Remove iommu warning in KVM env

* Fix for review comments
2026-01-08 13:55:40 -08:00
Mario Limonciello 8b529e7b29 Run pre-commit's whitespace related hooks on projects/rocr-runtime/samples (#2126)
In order for pre-commit to be useful, everything needs to meet a common
baseline.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2026-01-08 15:36:57 -05:00
cfallows-amd ae1abe4254 [rocprofiler-compute] Update .config_hashes.json (#2530)
The config_hashes JSON had mismatched MD5s for the delta_hash values; regenerated the file from the existing files in the develop branch.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
2026-01-08 14:33:36 -05:00
Yiannis Papadopoulos e8fef02e5a rocr/aie: Use util/os to get system memory (#2520) 2026-01-08 12:40:22 -06:00
Aurelien Bouteiller 6cad766d4e dlclosing the dvlib may leave libibverbs in a broken state (#381)
* Error out when IPC gets selected when it is impossible to run it.

* Use RTLD_LAZY when dlopening

* Do not dlclose libbnxt/ionic/mlx5.so as that breaks libibverbs

[ROCm/rocshmem commit: 47f6fa6267]
2026-01-08 13:40:11 -05:00
Aurelien Bouteiller 47f6fa6267 dlclosing the dvlib may leave libibverbs in a broken state (#381)
* Error out when IPC gets selected when it is impossible to run it.

* Use RTLD_LAZY when dlopening

* Do not dlclose libbnxt/ionic/mlx5.so as that breaks libibverbs
2026-01-08 13:40:11 -05:00
Yazen AL Musaffar d8a914d8cc comment update for wrong units associated with RDC (#2299)
* comment update for wrong units associated with RDC_FI_GPU_MEMORY_CUR_BANDWIDTH

Signed-off-by: yalmusaf_amdeng <Yazen.ALMusaffar@amd.com>

* Update rdc.h

---------

Signed-off-by: yalmusaf_amdeng <Yazen.ALMusaffar@amd.com>
2026-01-08 12:14:51 -06:00
vedithal-amd 769d3dd67a [rocprofiler-compute] Data imputation strategy for iteration multiplexing (#2468)
* Data imputation strategy for iteration multiplexing

* Implement data imputation methodology to handle missing counter values
  in case of iteration multiplexing

* Enable dispatch filtering with iteration multiplexing since we are no
  longer merging dispatches

* Bugfix to skip the check for missing counter values when using the CSV
  format while profiling with iteration multiplexing

* Move warning and info messages for iteration multiplexing to the
  sanitize function, which runs earlier in analyze mode

* Address review comments

* Fix typo in documentation

* Move profiling config init. after path check in sanitize()

* Graceful handling of dispatches with all counters empty within data
  imputation logic

* Improve info message for iteration multiplexing based analysis

* Ensure proper error message when trying to run iteration multiplexing with attach/detach

* fix test case
2026-01-08 12:01:51 -05:00
Yiltan 51d26b7cea Fix __match_any_sync on ROCm 6.x (#382)
[ROCm/rocshmem commit: e47cff7f45]
2026-01-08 11:25:16 -05:00
Yiltan e47cff7f45 Fix __match_any_sync on ROCm 6.x (#382) 2026-01-08 11:25:16 -05:00
systems-assistant[bot] 53c56fca5f [SWDEV-558534] AMD-SMI bad pages add flag to convert to hex (#1900)
* Simplify hex flag check for bad page info
* moved the hex help text up with the other help text

---------

Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Authored-by: Koushik Billakanti <Koushik.Billakanti@amd.com>
Co-authored-by: Koushik Billakanti <Koushik.Billakanti@amd.com>
2026-01-08 10:21:10 -06:00
Bindhiya Kanangot Balakrishnan 8326c33d33 [SWDEV-573540] Add DRM-based wake for suspended AMD GPUs (#2510)
Implements automatic device wake using getDRMDeviceId() DRM call when GPUs
are detected in low-power state. This ensures rocm-smi can access device
information on suspended GPUs.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2026-01-08 10:19:45 -06:00
Atul Kulkarni 30d36661c2 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to their own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
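A hypothetical sketch of how a runner could honor `RCCL_TEST_MPI_HOSTFILE` and prepend environment variables for single-node, single-process runs. `build_command` and its parameters are illustrative names, not the runner's actual API, and the MPI branch assumes Open MPI's `mpirun` (`--hostfile`, `-x VAR=value`).

```python
import os


def build_command(test_binary, env_vars, num_procs=1, environ=None):
    """Assemble one test command line (hypothetical sketch)."""
    environ = os.environ if environ is None else environ
    hostfile = environ.get("RCCL_TEST_MPI_HOSTFILE")
    if num_procs == 1 and hostfile is None:
        # Single node, single process: prepend VAR=value pairs via env(1).
        return ["env"] + [f"{k}={v}" for k, v in env_vars.items()] + [test_binary]
    cmd = ["mpirun", "-np", str(num_procs)]
    if hostfile:
        cmd += ["--hostfile", hostfile]
    for k, v in env_vars.items():
        # Open MPI syntax for exporting an environment variable to all ranks.
        cmd += ["-x", f"{k}={v}"]
    return cmd + [test_binary]
```

Returning an argument list rather than a shell string lets the runner pass it straight to `subprocess.run` without quoting issues.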

[ROCm/rccl commit: 0c2c61d2f1]
2026-01-08 10:04:41 -06:00
Atul Kulkarni 0c2c61d2f1 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to their own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
2026-01-08 10:04:41 -06:00
Kapil S. Pawar 868f40c49d [NAVI3X] [MI308X] Fix UT hangs and failures for ROCm RCCL builds (#2124)
* Update toolchain with compiler flags for RelWithDebInfo

[ROCm/rccl commit: e905d52fc0]
2026-01-08 08:58:19 -06:00
Kapil S. Pawar e905d52fc0 [NAVI3X] [MI308X] Fix UT hangs and failures for ROCm RCCL builds (#2124)
* Update toolchain with compiler flags for RelWithDebInfo
2026-01-08 08:58:19 -06:00
koushikbillakanti-amd ac1fa8dccb [SWDEV-567284] AMDSMI conceptual documentation for setting perf determinism (#2529)
Authored-by: Koushik Billakanti <kbillaka@amd.com>
2026-01-08 08:04:23 -06:00
Alexandra Sidorova 38a359f5f3 [CLR] prevent compilation errors for non-HIP compilers in amd_hip_mx_common.h and amd_hip_ocp_types.h (#2448)
Co-authored-by: Andrei Kochin <andrei.kochin@amd.com>
2026-01-08 17:49:13 +04:00
Longlong Yao e67113a741 wsl/librocdxg: correct scratch info for kernel dispatch
The scratch_size_per_wave_ and dispatch_waves_ should use
the maximum values from all packets in the batch.
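A small sketch of the rule stated above. The packet layout (keys `scratch_size_per_wave` and `waves`) is hypothetical; the point is that each field takes its own maximum, so the two maxima may come from different packets.

```python
def batch_scratch_info(packets):
    """Return (scratch_size_per_wave, dispatch_waves) covering a whole batch.

    Taking the maximum of each field independently sizes scratch for the
    most demanding dispatch in the batch.
    """
    if not packets:
        return 0, 0
    scratch = max(p["scratch_size_per_wave"] for p in packets)
    waves = max(p["waves"] for p in packets)
    return scratch, waves
```

For a batch where one packet needs 1024 bytes per wave with 8 waves and another needs 256 bytes per wave with 32 waves, the result is `(1024, 32)`.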

Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
Reviewed-by: Flora Cui <flora.cui@amd.com>
2026-01-08 16:10:36 +08:00
SaleelK 6b28faa532 clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480)
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
   async copies since copy_engine_status may report engines as busy
2. Busy and preferred engine checks ran on every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
   suboptimal resource allocation

Solution:
Implemented a global SDMA engine allocator with per-stream affinity:

- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
  * Maintains global map of active assignments
  * Enforces exclusivity: different streams use different engines (except
    inter-GPU copies where preferred engines are prioritized for optimal
    hardware paths like XGMI links)
  * Thread-safe allocation/release with Monitor lock

- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
  for fast lookup without map access on hot path

- Refactored rocrCopyBuffer() to:
  1. Check local cached engine first → use if assigned
  2. Call AllocateSdmaEngine() if not assigned → cache result

- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
  into AllocateEngine() for cleaner separation of concerns

- Release engines on HostQueue::finish() instead of only on VirtualGPU destruction
  * Improves engine utilization by releasing earlier
  * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice

- Added future path for simple round-robin allocation (kUseSimpleRR) for
  next-gen GPUs with uniform SDMA bandwidth (disabled by default)
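A minimal Python model of the allocator described above, for illustration only. The real implementation is C++ in CLR; names here only echo the commit message (`assigned_sdma_engine_`, `kUseSimpleRR`, release on finish), the HSA engine-status and preferred-engine queries are replaced by a plain `preferred` argument, and falling back to engine 0 when every engine is taken is an assumption.

```python
import threading


class SdmaEngineAllocator:
    """Hypothetical sketch of a global stream -> SDMA engine map."""

    def __init__(self, num_engines, use_simple_rr=False):
        self._lock = threading.Lock()   # stands in for the Monitor lock
        self._assignments = {}          # stream id -> engine index
        self._num_engines = num_engines
        self._use_simple_rr = use_simple_rr
        self._next_rr = 0

    def allocate(self, stream, preferred=None):
        with self._lock:
            if stream in self._assignments:
                return self._assignments[stream]
            if self._use_simple_rr:
                # Round-robin path for GPUs with uniform SDMA bandwidth.
                engine = self._next_rr % self._num_engines
                self._next_rr += 1
            elif preferred is not None:
                # Inter-GPU copy: the preferred engine (e.g. on an XGMI path)
                # wins even if another stream already holds it.
                engine = preferred
            else:
                # Exclusivity: pick an engine no other stream is using.
                in_use = set(self._assignments.values())
                free = [e for e in range(self._num_engines) if e not in in_use]
                engine = free[0] if free else 0  # assumption when exhausted
            self._assignments[stream] = engine
            return engine

    def release(self, stream):
        with self._lock:
            self._assignments.pop(stream, None)


class VirtualGpu:
    """Caches its engine so the hot path skips the shared map and lock."""

    def __init__(self, allocator):
        self._allocator = allocator
        self.assigned_sdma_engine = None

    def copy_buffer(self, preferred=None):
        if self.assigned_sdma_engine is None:
            self.assigned_sdma_engine = self._allocator.allocate(id(self), preferred)
        return self.assigned_sdma_engine

    def finish(self):
        # Release early (on finish) rather than only at destruction.
        self._allocator.release(id(self))
        self.assigned_sdma_engine = None
```

With two engines, two streams receive different engines, repeated copies on one stream reuse its cached engine, and an engine released by `finish()` becomes available to the next stream.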

Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager

Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
2026-01-07 19:37:45 -08:00
Flora Cui be04fa8250 rocr: reorder HsaNodeProperties to improve compatibility (#2447)
Signed-off-by: Flora Cui <flora.cui@amd.com>
2026-01-08 09:56:39 +08:00
David Galiffi cb17e59a57 [rocprofiler-systems] Improve build time by refactoring RCCL test cmake (#1656)
Improve cmake configuration time by making sure the rccl-tests are built during the build phase rather than the configuration phase.
2026-01-07 19:51:54 -05:00
anujshuk-amd c35a7dd8cb [rocprofiler-systems] Update timemory submodule (#2440)
- Fixes SWDEV-559349
- Fix build failure caused by the correct libunwind not being found in some environments.
- Updated the `timemory` submodule to commit `24407d37ab85c46ba6c18fba9498320f825ee4e4`.
2026-01-07 19:35:23 -05:00
Ajay GunaShekar 95ab459a4c Use static catch2.lib instead of catch2.dll (#2419)
* Use static catch2.lib instead of catch2.dll

Using catch2.dll increases execution time by 12x

* handle debug option for static catch2

* SWDEV-573539 - skip atomics on Windows since it's taking a very long time to execute

mlsejenkins needs a newer CMake, but the compiler breaks with newer versions,
so skipping on Windows is a workaround for now

---------

Co-authored-by: Joseph Macaranas <145489236+jayhawk-commits@users.noreply.github.com>
2026-01-07 14:35:25 -08:00
Alysa Liu 5be4fddf06 kfdtest: Support blit kernel copy (#677)
Add support for blit kernel copy.
Add GpuMemCopyTest test for KFDQMTest.
2026-01-07 16:48:11 -05:00
dependabot[bot] 645236aadd Bump pynacl from 1.5.0 to 1.6.2 in /docs/sphinx (#379)
Bumps [pynacl](https://github.com/pyca/pynacl) from 1.5.0 to 1.6.2.
- [Changelog](https://github.com/pyca/pynacl/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/pynacl/compare/1.5.0...1.6.2)

---
updated-dependencies:
- dependency-name: pynacl
  dependency-version: 1.6.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rocshmem commit: fb644ddfa9]
2026-01-07 14:39:00 -05:00
dependabot[bot] fb644ddfa9 Bump pynacl from 1.5.0 to 1.6.2 in /docs/sphinx (#379)
Bumps [pynacl](https://github.com/pyca/pynacl) from 1.5.0 to 1.6.2.
- [Changelog](https://github.com/pyca/pynacl/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/pynacl/compare/1.5.0...1.6.2)

---
updated-dependencies:
- dependency-name: pynacl
  dependency-version: 1.6.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-07 14:39:00 -05:00
David Yat Sin 7178747ebc Update CODEOWNERS for ROCR-Runtime (#2521) 2026-01-07 14:22:11 -05:00
Aleksandar Djordjevic aecea25a61 [rocprofiler-systems] CMake Cleanup (#2455)
## Technical Details

- Removed the `configure_file()` call that generated `defines.hpp` from `defines.hpp.in` and updated the CMake file to reference the renamed file.
- Removed duplicate `find_library(pthread_LIBRARY NAMES pthread pthreads)`
2026-01-07 14:07:37 -05:00
anujshuk-amd 596ffce5fe [rocprof-sys] Fix segfault from thread ID array overflow (#2172)
**Thread limit configuration and enforcement:**

* Added a check in `CMakeLists.txt` to ensure `ROCPROFSYS_MAX_THREADS` is at least 128, automatically setting it to 128 with a warning if a lower value is provided.
* Replaced hardcoded thread limit (`allowed_max_threads`) in `pthread_create_gotcha.cpp` with the configurable `ROCPROFSYS_MAX_THREADS` value, ensuring all runtime checks and warnings use the actual configured limit.
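A Python model of the two rules above, for illustration. The real checks live in `CMakeLists.txt` and `pthread_create_gotcha.cpp`; `resolve_max_threads` and `ThreadIdTable` are hypothetical names, and the table only shows why a fixed-size thread-ID array needs a bounds check before storing a new entry.

```python
import warnings

MIN_MAX_THREADS = 128


def resolve_max_threads(requested):
    """Build-time rule: values below 128 are raised to 128 with a warning."""
    if requested < MIN_MAX_THREADS:
        warnings.warn(
            f"ROCPROFSYS_MAX_THREADS={requested} is below {MIN_MAX_THREADS}; "
            f"using {MIN_MAX_THREADS}"
        )
        return MIN_MAX_THREADS
    return requested


class ThreadIdTable:
    """Fixed-capacity table that refuses extra threads instead of overflowing."""

    def __init__(self, max_threads):
        self._slots = [None] * max_threads
        self._count = 0

    def register(self, native_id):
        if self._count >= len(self._slots):
            # Graceful path: warn and skip tracking rather than write past the end.
            warnings.warn(f"thread limit {len(self._slots)} reached; "
                          f"not tracking thread {native_id}")
            return None
        index = self._count
        self._slots[index] = native_id
        self._count += 1
        return index
```

With a capacity of 2, registering three threads returns `0`, `1`, and then `None` for the third, instead of writing out of bounds.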

**Documentation improvements:**

* Updated the development guide to explain the new thread limit behavior, including how exceeding the limit is handled gracefully, how to configure it, and the build-time validation rules.

**Test updates:**

* Modified thread limit tests to use the configurable `ROCPROFSYS_MAX_THREADS` value instead of a hardcoded limit and expanded the range of tested thread values.
* Increased test timeouts to accommodate larger thread counts and ensure reliability with higher limits.
2026-01-07 14:03:37 -05:00
Aurelien Bouteiller 8d2dca4505 Fix DEBUG build (#378)
[ROCm/rocshmem commit: 27d87b8b67]
2026-01-07 10:39:57 -05:00
Aurelien Bouteiller 27d87b8b67 Fix DEBUG build (#378) 2026-01-07 10:39:57 -05:00
vedithal-amd 050e88ee71 Remove unused python packages (#2437)
* Remove dependency on following unused python packages by updating
  requirements.txt, LICENSE, standalone binary requirements, cmake and
  docker requirements
    * matplotlib
    * kaleido
    * pymongo
    * colorlover
    * tqdm

* Remove unused code from src/utils/gui.py

* Reformat python using ruff
2026-01-07 09:03:49 -05:00
Godavarthy Surya, Anusha 1ef6a86ee3 SWDEV-549711 - Improve graph DEBUG dot print for segments (#2205)
Co-authored-by: Anusha GodavarthySurya <agodavar@amd.com>
2026-01-07 14:07:49 +05:30
Allen Hubbe 67536a85ef gda ionic: collapsed cqe (#345)
* util: dlsym optional helper

Like DLSYM_HELPER, but does not return if the symbol is not found.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

* gda ionic: sync dv and fw headers

Sync dv and fw headers to match out-of-tree libionic and firmware.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

* gda ionic: collapsed cqe

Detect and enable collapsed cqe if supported by drivers and firmware.
Fall back to regular completion queue.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
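A loose Python analogue of the optional-symbol helper and the collapsed-CQE fallback, for illustration only. The real code is C (a macro alongside `DLSYM_HELPER`); `dlsym_optional`, `select_cq_mode`, and the default symbol name are hypothetical, and only the "symbol present in the provider library" part of the detection is modeled, not the driver or firmware checks.

```python
def dlsym_optional(lib, name):
    """Return the symbol if present, else None, and let the caller continue.

    Works with ctypes.CDLL objects, whose attribute lookup raises
    AttributeError for undefined symbols.
    """
    return getattr(lib, name, None)


def select_cq_mode(lib, symbol="hypothetical_enable_collapsed_cqe"):
    # Enable collapsed CQEs only when the provider exports the entry point;
    # otherwise fall back to a regular completion queue.
    return "collapsed" if dlsym_optional(lib, symbol) is not None else "regular"
```

Unlike a helper that returns an error on a missing symbol, the optional variant lets setup continue, so the feature can be enabled when available without making it a hard requirement.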

---------

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

[ROCm/rocshmem commit: 1494c24f9a]
2026-01-06 20:42:15 -05:00