Commit Graph

74719 commits

Author SHA1 Message Commit date
Yiltan e47cff7f45 Fix __match_any_sync on ROCm 6.x (#382) 2026-01-08 11:25:16 -05:00
systems-assistant[bot] 53c56fca5f [SWDEV-558534] AMD-SMI bad pages add flag to convert to hex (#1900)
* Simplify hex flag check for bad page info
* moved the hex help text up with the other help text

---------

Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Authored-by: Koushik Billakanti <Koushik.Billakanti@amd.com>
Co-authored-by: Koushik Billakanti <Koushik.Billakanti@amd.com>
2026-01-08 10:21:10 -06:00
Bindhiya Kanangot Balakrishnan 8326c33d33 [SWDEV-573540] Add DRM-based wake for suspended AMD GPUs (#2510)
Implements automatic device wake using getDRMDeviceId() DRM call when GPUs
are detected in low-power state. This ensures rocm-smi can access device
information on suspended GPUs.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2026-01-08 10:19:45 -06:00
Atul Kulkarni 30d36661c2 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to its own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 0c2c61d2f1]
2026-01-08 10:04:41 -06:00
Atul Kulkarni 0c2c61d2f1 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to its own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
2026-01-08 10:04:41 -06:00
Kapil S. Pawar 868f40c49d [NAVI3X] [MI308X] Fix UT hangs and failures for ROCm RCCL builds (#2124)
* Update toolchain with compiler flags for RelWithDebInfo

[ROCm/rccl commit: e905d52fc0]
2026-01-08 08:58:19 -06:00
Kapil S. Pawar e905d52fc0 [NAVI3X] [MI308X] Fix UT hangs and failures for ROCm RCCL builds (#2124)
* Update toolchain with compiler flags for RelWithDebInfo
2026-01-08 08:58:19 -06:00
koushikbillakanti-amd ac1fa8dccb [SWDEV-567284] AMDSMI conceptual documentation for setting perf determinism (#2529)
Authored-by: Koushik Billakanti <kbillaka@amd.com>
2026-01-08 08:04:23 -06:00
Alexandra Sidorova 38a359f5f3 [CLR] prevent compilation errors for non-HIP compilers in amd_hip_mx_common.h and amd_hip_ocp_types.h (#2448)
Co-authored-by: Andrei Kochin <andrei.kochin@amd.com>
2026-01-08 17:49:13 +04:00
Longlong Yao e67113a741 wsl/librocdxg: correct scratch info for kernel dispatch
The scratch_size_per_wave_ and dispatch_waves_ should use
the maximum values from all packets in the batch.

Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
Reviewed-by: Flora Cui <flora.cui@amd.com>
2026-01-08 16:10:36 +08:00
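The batch rule in the commit above (use the maximum values across all packets) can be illustrated with a minimal sketch. The `Packet` class and its field names are hypothetical stand-ins chosen for illustration; they are not librocdxg types.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Packet:
    # Hypothetical per-packet fields; not the real librocdxg packet layout.
    scratch_size_per_wave: int
    dispatch_waves: int


def batch_scratch_info(packets: Iterable[Packet]) -> Tuple[int, int]:
    """Return (max scratch size per wave, max dispatch waves) over a batch."""
    max_size = 0
    max_waves = 0
    for p in packets:
        # Track each maximum independently; they may come from different packets.
        max_size = max(max_size, p.scratch_size_per_wave)
        max_waves = max(max_waves, p.dispatch_waves)
    return max_size, max_waves
```

Taking the two maxima independently means the result can be larger than what any single packet needs, which is the conservative choice for sizing a shared allocation.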
SaleelK 6b28faa532 clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480)
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
   async copies since copy_engine_status may report engines as busy
2. Busy and preferred engine checks were performed for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
   suboptimal resource allocation

Solution:
Implemented a global SDMA engine allocator with per-stream affinity:

- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
  * Maintains global map of active assignments
  * Enforces exclusivity: different streams use different engines (except
    inter-GPU copies where preferred engines are prioritized for optimal
    hardware paths like XGMI links)
  * Thread-safe allocation/release with Monitor lock

- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
  for fast lookup without map access on hot path

- Refactored rocrCopyBuffer() to:
  1. Check local cached engine first → use if assigned
  2. Call AllocateSdmaEngine() if not assigned → cache result

- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
  into AllocateEngine() for cleaner separation of concerns

- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
  * Improves engine utilization by releasing earlier
  * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice

- Added future path for simple round-robin allocation (kUseSimpleRR) for
  next-gen GPUs with uniform SDMA bandwidth (disabled by default)

Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager

Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
2026-01-07 19:37:45 -08:00
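The per-stream affinity scheme described in the commit above can be sketched roughly as follows. This is a toy model under stated assumptions, not the CLR implementation: the class and method names are illustrative, engines are plain integers, and the inter-GPU exception (where preferred engines take priority over exclusivity) is not modelled.

```python
import threading
from typing import Dict, Optional, Set


class SdmaEngineAllocator:
    """Toy global allocator: each stream id gets at most one exclusive engine."""

    def __init__(self, num_engines: int) -> None:
        self._lock = threading.Lock()  # stands in for the Monitor lock
        self._free: Set[int] = set(range(num_engines))
        self._assigned: Dict[int, int] = {}

    def allocate(self, stream_id: int, preferred: Optional[int] = None) -> Optional[int]:
        with self._lock:
            if stream_id in self._assigned:
                return self._assigned[stream_id]
            if not self._free:
                return None  # no engine available; caller must fall back
            # Use the preferred engine when it is free, else the lowest free id.
            engine = preferred if preferred in self._free else min(self._free)
            self._free.remove(engine)
            self._assigned[stream_id] = engine
            return engine

    def release(self, stream_id: int) -> None:
        with self._lock:
            engine = self._assigned.pop(stream_id, None)
            if engine is not None:
                self._free.add(engine)


class Stream:
    """Caches its engine locally so the hot path skips the global map."""

    def __init__(self, stream_id: int, allocator: SdmaEngineAllocator) -> None:
        self.stream_id = stream_id
        self._allocator = allocator
        self._cached_engine: Optional[int] = None

    def engine_for_copy(self) -> Optional[int]:
        if self._cached_engine is None:
            self._cached_engine = self._allocator.allocate(self.stream_id)
        return self._cached_engine

    def finish(self) -> None:
        # Release on finish rather than only at destruction.
        self._allocator.release(self.stream_id)
        self._cached_engine = None
```

In this sketch, consecutive copies on one stream always see the same engine, and releasing in `finish()` lets a waiting stream pick up the freed engine sooner.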
Flora Cui be04fa8250 rocr: reorder HsaNodeProperties to improve compatibility (#2447)
Signed-off-by: Flora Cui <flora.cui@amd.com>
2026-01-08 09:56:39 +08:00
David Galiffi cb17e59a57 [rocprofiler-systems] Improve build time by refactoring RCCL test cmake (#1656)
Improve cmake configuration time by making sure the rccl-tests are built during the build phase rather than the configuration phase.
2026-01-07 19:51:54 -05:00
anujshuk-amd c35a7dd8cb [rocprofiler-systems] Update timemory submodule (#2440)
- Fixes SWDEV-559349
- Fix build failure caused by the correct libunwind not being found in some environments.
- Updated the `timemory` submodule to commit `24407d37ab85c46ba6c18fba9498320f825ee4e4`.
2026-01-07 19:35:23 -05:00
Ajay GunaShekar 95ab459a4c Use static catch2.lib instead of catch2.dll (#2419)
* Use static catch2.lib instead of catch2.dll

Using catch2.dll increases execution time by 12x

* handle debug option for static catch2

* SWDEV-573539 - skip atomics on Windows since it's taking a very long time to execute

mlsejenkins needs newer cmake but compiler breaks with newer versions
so skipping on windows can be a workaround for now

---------

Co-authored-by: Joseph Macaranas <145489236+jayhawk-commits@users.noreply.github.com>
2026-01-07 14:35:25 -08:00
Alysa Liu 5be4fddf06 kfdtest: Support blit kernel copy (#677)
Add support for blit kernel copy.
Add GpuMemCopyTest test for KFDQMTest.
2026-01-07 16:48:11 -05:00
dependabot[bot] 645236aadd Bump pynacl from 1.5.0 to 1.6.2 in /docs/sphinx (#379)
Bumps [pynacl](https://github.com/pyca/pynacl) from 1.5.0 to 1.6.2.
- [Changelog](https://github.com/pyca/pynacl/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/pynacl/compare/1.5.0...1.6.2)

---
updated-dependencies:
- dependency-name: pynacl
  dependency-version: 1.6.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rocshmem commit: fb644ddfa9]
2026-01-07 14:39:00 -05:00
dependabot[bot] fb644ddfa9 Bump pynacl from 1.5.0 to 1.6.2 in /docs/sphinx (#379)
Bumps [pynacl](https://github.com/pyca/pynacl) from 1.5.0 to 1.6.2.
- [Changelog](https://github.com/pyca/pynacl/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/pynacl/compare/1.5.0...1.6.2)

---
updated-dependencies:
- dependency-name: pynacl
  dependency-version: 1.6.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-07 14:39:00 -05:00
David Yat Sin 7178747ebc Update CODEOWNERS for ROCR-Runtime (#2521) 2026-01-07 14:22:11 -05:00
Aleksandar Djordjevic aecea25a61 [rocprofiler-systems] CMake Cleanup (#2455)
## Technical Details

- Removed the `configure_file()` call that was generating `defines.hpp` from `defines.hpp.in` and updated the CMake file to reference the renamed file.
- Removed the duplicate `find_library(pthread_LIBRARY NAMES pthread pthreads)`
2026-01-07 14:07:37 -05:00
anujshuk-amd 596ffce5fe [rocprof-sys] Fix segfault from thread ID array overflow (#2172)
**Thread limit configuration and enforcement:**

* Added a check in `CMakeLists.txt` to ensure `ROCPROFSYS_MAX_THREADS` is at least 128, automatically setting it to 128 with a warning if a lower value is provided.
* Replaced hardcoded thread limit (`allowed_max_threads`) in `pthread_create_gotcha.cpp` with the configurable `ROCPROFSYS_MAX_THREADS` value, ensuring all runtime checks and warnings use the actual configured limit.

**Documentation improvements:**

* Updated the development guide to explain the new thread limit behavior, including how exceeding the limit is handled gracefully, how to configure it, and the build-time validation rules.

**Test updates:**

* Modified thread limit tests to use the configurable `ROCPROFSYS_MAX_THREADS` value instead of a hardcoded limit and expanded the range of tested thread values.
* Increased test timeouts to accommodate larger thread counts and ensure reliability with higher limits.
2026-01-07 14:03:37 -05:00
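The build-time floor and the runtime bound check described in the commit above could look roughly like this. It is a hedged sketch: the function names are hypothetical, and the real checks live in `CMakeLists.txt` and `pthread_create_gotcha.cpp` rather than in Python.

```python
import warnings

MIN_MAX_THREADS = 128  # floor stated in the commit message


def effective_max_threads(requested: int) -> int:
    """Mimic the build-time rule: values below 128 are raised to 128 with a warning."""
    if requested < MIN_MAX_THREADS:
        warnings.warn(
            f"ROCPROFSYS_MAX_THREADS={requested} is below {MIN_MAX_THREADS}; "
            f"using {MIN_MAX_THREADS}"
        )
        return MIN_MAX_THREADS
    return requested


def can_track_thread(thread_index: int, max_threads: int) -> bool:
    """Runtime guard: only indices inside the configured limit are tracked."""
    return 0 <= thread_index < max_threads
```

Checking against the configured limit instead of a hardcoded constant keeps the guard and the array size in agreement, which is the mismatch an overflow of this kind typically comes from.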
Aurelien Bouteiller 8d2dca4505 Fix DEBUG build (#378)
[ROCm/rocshmem commit: 27d87b8b67]
2026-01-07 10:39:57 -05:00
Aurelien Bouteiller 27d87b8b67 Fix DEBUG build (#378) 2026-01-07 10:39:57 -05:00
vedithal-amd 050e88ee71 Remove unused python packages (#2437)
* Remove dependency on following unused python packages by updating
  requirements.txt, LICENSE, standalone binary requirements, cmake and
  docker requirements
    * matplotlib
    * kaleido
    * pymongo
    * colorlover
    * tqdm

* Remove unused code from src/utils/gui.py

* Reformat python using ruff
2026-01-07 09:03:49 -05:00
Godavarthy Surya, Anusha 1ef6a86ee3 SWDEV-549711 - Improve graph DEBUG dot print for segments (#2205)
Co-authored-by: Anusha GodavarthySurya<agodavar@amd.com>
2026-01-07 14:07:49 +05:30
Allen Hubbe 67536a85ef gda ionic: collapsed cqe (#345)
* util: dlsym optional helper

Like DLSYM_HELPER, but does not return if the symbol is not found.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

* gda ionic: sync dv and fw headers

Sync dv and fw headers to match out-of-tree libionic and firmware.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

* gda ionic: collapsed cqe

Detect and enable collapsed cqe if supported by drivers and firmware.
Fall back to regular completion queue.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

---------

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

[ROCm/rocshmem commit: 1494c24f9a]
2026-01-06 20:42:15 -05:00
Allen Hubbe 1494c24f9a gda ionic: collapsed cqe (#345)
* util: dlsym optional helper

Like DLSYM_HELPER, but does not return if the symbol is not found.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

* gda ionic: sync dv and fw headers

Sync dv and fw headers to match out-of-tree libionic and firmware.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

* gda ionic: collapsed cqe

Detect and enable collapsed cqe if supported by drivers and firmware.
Fall back to regular completion queue.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

---------

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
2026-01-06 20:42:15 -05:00
Aurelien Bouteiller bcdf60def6 Enable new a2a (pr 334) on ionic as well (#366)
* Enable new a2a (pr 334) on ionic as well

* Apply suggestions from AI code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/rocshmem commit: 82d91433c9]
2026-01-06 20:41:51 -05:00
Aurelien Bouteiller 82d91433c9 Enable new a2a (pr 334) on ionic as well (#366)
* Enable new a2a (pr 334) on ionic as well

* Apply suggestions from AI code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-06 20:41:51 -05:00
Stella Laurenzo 81eed26ec6 [amdsmi] Add include dirs for libdrm. (#2504)
This has started failing on various developer build systems. Looking at it, it is not precisely clear how this ever worked given that nothing appears to be adding the DRM include dirs.

I'd prefer that we remove this delay loading (at least for TheRock builds where it is never needed), but in the meantime, this does fix the issue and is verified on an affected system.

Fixes https://github.com/ROCm/TheRock/issues/2744
2026-01-06 15:18:20 -08:00
Yazen AL Musaffar cb372748f8 [ROCM-SMI] [SWDEV-569731] rsmi tests failing on Frequency/Power/GpuMetrics ReadOnly Fix (#2303)
* Updated unsupported metric version file for rocm_smi_tests Frequency/Power/GpuMetrics ReadOnly tests

Signed-off-by: yalmusaf_amdeng <Yazen.ALMusaffar@amd.com>
2026-01-06 16:46:38 -06:00
Gerardo Hernandez 50644f5aef SWDEV-508225 remove assertions when loading fat binary (#2013)
* SWDEV-508225 - do not assert() after calling digestFatBinary() if it fails. Otherwise this causes assertions to trigger easily in systems that have an APU and a discrete GPU and the code was compiled for the discrete one

* SWDEV-508225 - fix that when using a non-existent ordinal in HIP_VISIBLE_DEVICES, getCurrentArch() would crash
2026-01-06 21:53:32 +00:00
Daniel Oliveira 32fde0f73d [SWDEV-568613] Add gpu_metrics 1.0 support for older GPUs (#2444)
fix: Add gpu_metrics 1.0 support which is still used by some hardware

Code changes related to the following:
  * APIs
  * Unit tests

Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2026-01-06 14:25:13 -06:00
systems-assistant[bot] c6b7448227 Add support for get and set APIs for CPUISOFreqPolicy and DFCState Co… (#1901)
* Add support for get and set APIs for CPUISOFreqPolicy and DFCState Control

  - Add support for get and set APIs for CPUISOFreqPolicy and DFCState Control
    in AMD SMI and also in the CLI tool

* CHANGELOG.md file updated

* SWDEV-562837: Update amdsmi-py-api.md as per the new APIs

Updated amdsmi-py-api.md as per the new APIs added.

---------

Signed-off-by: Soumya <sranjanr@amd.com>
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Co-authored-by: Saka Sitharammurthy <SitharamMurthy.Saka@amd.com>
2026-01-06 10:37:07 -06:00
SakaSitharammurthy 6c98c49362 [SWDEV-568731] Updated example code in amdsmi-py-api.md file (#2311)
Addresses:
- SWDEV-568731
- SWDEV-568724
- SWDEV-568695

Signed-off-by: Saka, SitharamMurthy <SitharamMurthy.Saka@amd.com>
2026-01-06 10:34:36 -06:00
Edgar Gabriel e38f98fad5 fix reduction test for gfx1201 (#374)
* fix reduction for gfx942 and 1201

match the synchronization of internal_putmem_wg and internal_getmem_wg
to their non-internal counterparts. the internal_putmem_wg is used in
the ipc reduction

* move specialization to internal_putmem

[ROCm/rocshmem commit: 8d2504d6c1]
2026-01-06 10:15:38 -06:00
Edgar Gabriel 8d2504d6c1 fix reduction test for gfx1201 (#374)
* fix reduction for gfx942 and 1201

match the synchronization of internal_putmem_wg and internal_getmem_wg
to their non-internal counterparts. the internal_putmem_wg is used in
the ipc reduction

* move specialization to internal_putmem
2026-01-06 10:15:38 -06:00
pghoshamd 637b0d71f0 SWDEV-569319 Replace ScopedAcquire with stdcpp wrappers (#2146)
* SWDEV-569319 Replace ScopedAcquire with stdcpp wrappers

* Remove KernelMutex and KernelSharedMutex abstractions with std::mutex and std::shared_mutex

* Replaced unique_locks with lock_guards

* More changes

* Replace new and deletes with smart pointers

* Replaced some more with shared ptrs

* Replacements with smart pointers - pt 2

* missed change
2026-01-06 10:59:34 -05:00
Mustafa Abduljabbar 5bba932529 [WarpSpeed] Improve handling for auto and manual modes (#2125)
* Force ring in WarpSpeed manual mode and log event

* Skip usage for non-ring in WarpSpeed auto mode

* Enable WarpSpeed when its CU count is set

[ROCm/rccl commit: 93fdcb160c]
2026-01-06 10:21:49 -05:00
Mustafa Abduljabbar 93fdcb160c [WarpSpeed] Improve handling for auto and manual modes (#2125)
* Force ring in WarpSpeed manual mode and log event

* Skip usage for non-ring in WarpSpeed auto mode

* Enable WarpSpeed when its CU count is set
2026-01-06 10:21:49 -05:00
vedithal-amd e005f8487b [rocprofiler-compute] Add gfx arch. based pre-processor guards and runtime checks in rocflop.cpp (#2487)
* Remove MFMA functionality in rocflop sample since it's not supported on MI50

* Add gfx arch based support for MFMA and SMFMAC in rocflop.cpp

* Add --int32 usage doc

* Address review comments
2026-01-06 10:17:54 -05:00
Nusrat Islam 49d9f8cc27 use memcpy for local copies (#2121)
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>

[ROCm/rccl commit: b4a86ef680]
2026-01-06 09:00:57 -06:00
Nusrat Islam b4a86ef680 use memcpy for local copies (#2121)
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
2026-01-06 09:00:57 -06:00
Edgar Gabriel cc727261de disable the putmem_signal_on_stream on RO (#376)
it fails in about 50% of the cases. Will revisit later why it fails,
but RO is at the moment lower priority, so disabling the test for now.

[ROCm/rocshmem commit: ed2f75f1de]
2026-01-06 08:10:46 -06:00
Edgar Gabriel ed2f75f1de disable the putmem_signal_on_stream on RO (#376)
it fails in about 50% of the cases. Will revisit later why it fails,
but RO is at the moment lower priority, so disabling the test for now.
2026-01-06 08:10:46 -06:00
Jonathan R. Madsen 7fcea905f3 [rocprofiler-sdk] Fix double-buffering emplace and flush synchronization (#2334)
* Fix buffer tracing synchronization lock

- PR #529 (in rocprofiler-sdk-internal) introduced waiting on the syncer flag when emplacing in a buffer, to prevent overwriting buffer records that are currently being processed in a buffer flush callback
- The above fix introduced a block on both buffers while a buffer flush callback was executing, instead of a block on only the buffer being flushed.

* Add rocpd tests for duplicate records

* Address code review comments
2026-01-06 06:06:18 -06:00
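The intended behaviour in the commit above (block only the buffer being flushed, not both) can be sketched with a small double buffer. This is an illustrative model with hypothetical names; it does not reproduce rocprofiler-sdk's syncer flag or its record types.

```python
import threading
from typing import Callable, List


class DoubleBuffer:
    """Two buffers with one lock each; a flush holds only the flushed buffer's lock."""

    def __init__(self) -> None:
        self._buffers: List[list] = [[], []]
        self._locks = [threading.Lock(), threading.Lock()]
        self._active = 0
        self._swap_lock = threading.Lock()

    def emplace(self, record) -> None:
        while True:
            idx = self._active
            with self._locks[idx]:
                # Re-check after acquiring: a flush may have swapped buffers
                # while this thread was waiting on the lock.
                if idx == self._active:
                    self._buffers[idx].append(record)
                    return

    def flush(self, callback: Callable[[list], None]) -> None:
        with self._swap_lock:
            idx = self._active
            self._active = 1 - idx  # new records now go to the other buffer
        with self._locks[idx]:
            # Only the flushed buffer is blocked while the callback runs.
            callback(self._buffers[idx])
            self._buffers[idx].clear()
```

Because the callback runs while only one buffer lock is held, a record emplaced during the callback lands in the other buffer and shows up in the next flush.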
habajpai-amd 9e4d1c31c7 fix: prevent static initialization deadlock in thread_data (#2474)
* fix: prevent static initialization deadlock in thread_data

* update comment
2026-01-06 16:39:32 +05:30
Longlong Yao c34ec1e52f wsl/librocdxg: Change scratch memory allocation
Calculate the actual scratch memory size required based on the
packet information for kernel dispatch.

If the required size exceeds the total allocated memory, scratch
memory must be reallocated. Otherwise, no action is needed.

miopen_gtest: Full/GPU_MIOpenDriverRegressionTest_FP16.MIOpenDriverRegressionHalf/0

Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
Reviewed-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Horatio Zhang <Hongkun.Zhang@amd.com>
2026-01-06 10:12:04 +08:00
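The reallocate-only-when-needed rule in the commit above can be sketched as follows. The `ScratchPool` class is a hypothetical stand-in, and the required size is passed in directly, since the commit message does not say how it is derived from the packet information.

```python
class ScratchPool:
    """Toy scratch allocation that grows only when a dispatch needs more."""

    def __init__(self, allocated_bytes: int = 0) -> None:
        self.allocated_bytes = allocated_bytes
        self.reallocations = 0

    def ensure(self, required_bytes: int) -> bool:
        """Return True if a reallocation happened, False if none was needed."""
        if required_bytes <= self.allocated_bytes:
            return False  # current allocation is large enough; no action
        # Required size exceeds the total allocated memory: reallocate.
        self.allocated_bytes = required_bytes
        self.reallocations += 1
        return True
```

For example, a pool created with 1024 bytes leaves `ensure(512)` as a no-op, while `ensure(4096)` triggers one reallocation.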
Longlong Yao c3f55c8e59 wsl/librocdxg: Change scratch memory allocation
Calculate the actual scratch memory size required based on the
packet information for kernel dispatch.

If the required size exceeds the total allocated memory, scratch
memory must be reallocated. Otherwise, no action is needed.

miopen_gtest: Full/GPU_MIOpenDriverRegressionTest_FP16.MIOpenDriverRegressionHalf/0

Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
Reviewed-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Horatio Zhang <Hongkun.Zhang@amd.com>
2026-01-06 10:12:04 +08:00
Aurelien Bouteiller abb1e0684a Do not hardcode wf_size==64 in ionic provider (#367)
* Do not hardcode wf_size==64 in ionic provider

* Simpler same_qp_mask in ionic


[ROCm/rocshmem commit: 0c496d83d6]
2026-01-05 18:36:58 -05:00