402 Commits

Autor SHA1 Nachricht Datum
SaleelK 340f3aa887 clr: Implement dynamic stream to HWq logic (#1958)
* clr: Implement dynamic stream to HW queue assignment

This change implements dynamic stream to hardware queue (HWq) mapping
with the following features:

* Queue depth heuristics with weights for optimal HWq assignment
* Make last used queue sticky for better locality
* Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to
  pipe mapping based on creation order (single process per device only,
  as pipe ID is statically assigned by runtime)
* More aggressive heuristic usage for better queue distribution
* Extend dynamic queues support for all stream priorities

Environment variables:
* DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics 2 -
  Depth+Pipe heuristics
* DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation

* clr: Clean up last_used_queue_
2026-01-23 10:40:54 -08:00
Tao Sang 163e44d0a8 SWDEV-555889 - Support mipmap on rocr (#2082)
* SWDEV-555889 - Support mipmap on rocr

Support mipmap in hip-rt on rocr backend.
Enable all mipmap tests in Windows.
Some other minor improvement.

Add some SRD logs that will be removed finally.

* Add sampler.mipFilter to fix sampler issues on mipmap in rocr.
Fix format issues of view of leveled image and  mipmap image in blit kernel in rocr.
Enabled disabled mipmap tests.

* Rewrite view logic

* Set word4.f.PITCH = 0 for mipmap SRD on navi31 to fix unstable test issues.
Reset last error in nagative tests.

* Remove SRD dump log from hip-rt
Let Rocr mipmap log be in condition.

* minor format chang

* Exclude mipmap tests for mi200+ which don't support mipmap.
2026-01-21 09:10:29 -08:00
Karthik Jayaprakash 6a84a00208 Use size_t datatype for global dimensions. (#2604) 2026-01-20 20:39:07 -05:00
SaleelK 6b28faa532 clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480)
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
   async copies since copy_engine_status may report engines as busy
2. Busy and Preferred engine check for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
   suboptimal resource allocation

Solution:
Implemented a global SDMA engine allocator with per-stream affinity:

- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
  * Maintains global map of active assignments
  * Enforces exclusivity: different streams use different engines (except
    inter-GPU copies where preferred engines are prioritized for optimal
    hardware paths like XGMI links)
  * Thread-safe allocation/release with Monitor lock

- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
  for fast lookup without map access on hot path

- Refactored rocrCopyBuffer() to:
  1. Check local cached engine first → use if assigned
  2. Call AllocateSdmaEngine() if not assigned → cache result

- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
  into AllocateEngine() for cleaner separation of concerns

- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
  * Improves engine utilization by releasing earlier
  * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice

- Added future path for simple round-robin allocation (kUseSimpleRR) for
  next-gen GPUs with uniform SDMA bandwidth (disabled by default)

Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager

Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
2026-01-07 19:37:45 -08:00
Ioannis Assiouras 49b8900158 SWDEV-558849 - keep the lastEnqueueCommand_ when PAL backend is enabled (#2320) 2025-12-23 21:24:09 +00:00
German Andryeyev 3895aadba6 SWDEV-558849 - Make ROCR path in Windows more stable (#2181) 2025-12-10 12:37:10 -05:00
SaleelK c105dcd05b clr: Use graph segment scheduling to process HIP Graphs (#1372)
* clr: Use graph segment scheduling to process HIP Graphs

* Add a broader path to use capture packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path,
  Enabled by default

* clr: Few fixes and improvements

* clr: Detect complex graphs to take classic path

* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
  path

* clr: Fix a cornercase stack corruption

* clr: Track commands of segments instead of snapshots

* clr: Fix Batch dispatch logic

* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accomodate dirty fence on cross
  streams

---------

Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
2025-12-01 12:49:26 -08:00
AidanBeltonS d849b88aef SWDEV-558080 - Add recommended granularity (#1176)
* Add recommended granularity

* Improve granularity testing

* Update based on feedback
2025-11-26 16:10:58 +00:00
Karthik Jayaprakash 740a06d567 SWDEV-559267 - Use CLPrint to DevLogPrintf with Log Level - detail debug. (#1160) 2025-11-25 19:25:32 -05:00
cadolphe-amd cce94f6ee0 SWDEV-557412 - Incorporate proper chunk offset when remapping virtual memory (#1848)
* SWDEV-557412 - Incorporate proper offset when remapping virtual memory

* Fix condition to check if VMHeap allocation address matches a chunk address

* Move offset calculation outside if/else block

---------

Co-authored-by: JeniferC99 <150404595+JeniferC99@users.noreply.github.com>
2025-11-25 18:05:25 -05:00
Pengda Xie 6c31785eaf SWDEV-562761 - Cleanup static fatbin on runtime teardown (#1873) 2025-11-24 21:57:46 -08:00
sluzynsk-amd 2cf9faa93f SWDEV-563777 - fix warnings related to inconsistent overrides (#1625)
This patch adds missing override keywords. Fixes this class of warnings.

Signed-off-by: Sebastian Luzynski <Sebastian.Luzynski@amd.com>
2025-11-24 18:50:07 +01:00
Ioannis Assiouras 36029ea1a8 SWDEV-559166 - Fix race condition in getDemangledName (#1868) 2025-11-23 08:45:45 +00:00
Ioannis Assiouras 4f91b68988 SWDEV-559166 - Remove obsolete member execInfoOffset from KernelParameters (#1790) 2025-11-12 17:20:36 +00:00
Pengda Xie 93947241d0 SWDEV-556684 - HSAIL cleanup (#1657) 2025-11-08 02:22:03 -08:00
Rahul Manocha 4f075902fc SWDEV-555347 - Remove lock contention in async events loop (#878)
* SWDEV-555347 - Remove lock contention in async events loop

* SWDEV-555347 - Introduce Pool of AsyncEventItems

* create generic mempool for AsyncEventItem

* Use BaseShared allocate and free for async event pool

---------

Co-authored-by: Rahul Manocha <rmanocha@amd.com>
2025-10-24 08:43:00 -07:00
Ioannis Assiouras 602ea0be1e SWDEV-558078 - Fix use-after-free in graph tests due to AsyncEventHandler (#1502) 2025-10-23 22:49:24 +01:00
Pengda Xie a4bbd73dc6 SWDEV-556684 - Remove HSAIL support (#1183) 2025-10-23 11:21:49 -07:00
Ioannis Assiouras 6d6b136374 SWDEV-559166 - Fix data races in GetSubmissionBatch, CaptureAndSet and SetQueueStatus (#1441) 2025-10-23 12:18:31 +01:00
Jimbo 37f2be9140 SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister (#962)
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister

* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister

* SWDEV-554174 Added hipHostRegisterIoMemory flag in test cases

* SWDEV-554174 : Did formatting corrections

* SWDEV-554608 - set HSA_AMD_MEMORY_POOL_UNCACHED_FLAG if IoMemory is set

* SWDEV-554608 - set HSA_AMD_MEMORY_POOL_UNCACHED_FLAG if IoMemory is set

* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister

---------

Co-authored-by: Anavena Venkatesh <Anavena.Venkatesh@amd.com>
Co-authored-by: Rambabu Swargam <rambabu.swargam@amd.com>
2025-10-22 20:25:59 -04:00
Ioannis Assiouras 30a14a8a05 SWDEV-559166 - Fix potential data race in ReferenceCountedObject::release() (#1388)
Use fetch_sub(std::memory_order_acq_rel) on release
so the destroying thread acquires prior writes.
2025-10-20 17:15:56 +01:00
cadolphe-amd 207a278d41 SWDEV-516307 - Clean up ICD references in HIP (#1019)
Moved default empty dispatch table and associated Platform initialization for HIP from fixme.cpp into the respective struct definitions.
2025-10-08 09:49:35 -04:00
Godavarthy Surya, Anusha fb72d7f851 SWDEV-524746 - Part-II Add multi device support for hip graph. Updated kernel arg manager for each device (#813)
- Updated kernel arg manager to support allocating kernel args on multiple devices for single graph.
- Updated AQL path to capture on the device where graph node is added.

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
2025-09-25 20:38:18 +05:30
SaleelK 34b9184686 clr: Fix memory corruption for memset nodes (#1068)
* Detect graph capture and use graph kernelarg memory for FillBuffer pattern
2025-09-23 17:17:33 -07:00
Godavarthy Surya, Anusha ce560304a8 SWDEV-548417 - Fix Memleaks in Graph (#713)
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
2025-09-19 17:39:36 +05:30
SaleelK 149dc17c90 clr: Optimize doorbell ring (#1030)
*Lay foundation to batch packets efficiently for graphs
*Dynamically copy packets with max threshold set with
DEBUG_HIP_GRAPH_BATCH_SIZE, if not stagger packet copy with pow2
*Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256
*If TS are not collected for a signal for reuse, create a new signal.
This can potentially increase signal footprint if the handler doesn't run
fast enough.
2025-09-18 15:02:10 -07:00
Ioannis Assiouras 35629e433d SWDEV-546146 - Added support for hipMemLocationTypeHost in hipMemSetAccess (#682) 2025-09-10 23:06:20 +01:00
SaleelK c4537e8050 SWDEV-553126 - Improve logging (#835)
* Ability to mask COPY api usage in logs
* Show total graph nodes in logs
* Add another log level for detailed debug
2025-09-04 10:08:41 -07:00
German Andryeyev 7a1a6682e2 SWDEV-552846 - Unpin memory for hip before exit the copy (#851) 2025-09-04 20:04:01 +05:30
systems-assistant[bot] ded5b86e83 SWDEV-540609 - capture of MIOpen OCL kernels needs remainder globalWorkSize (#431)
Co-authored-by: Rakesh Roy <rakesh.roy@amd.com>
2025-08-26 16:11:31 -04:00
Danylo Lytovchenko 2ff2316227 Adjust clang format to the new versions, revert broken macro layout (#714) 2025-08-22 17:23:22 +02:00
Danylo Lytovchenko f7338717ae SWDEV-470698 - fix formatting, add format check workflow (#657) 2025-08-20 19:58:06 +05:30
Betigeri, Sourabh 35e48d1eaf SWDEV-546293 - hipMemPrefetchAsync_v2 and hipMemAdvise_v2 implementation (#869)
SWDEV-546293 - hipMemPrefetchAsync hipMemAdvise_v2

Please enter the commit message for your changes. Lines starting

[ROCm/clr commit: cbee74a80e]
2025-08-15 22:40:04 -07:00
Manocha, Rahul b3ccf487da SWDEV-545952 - API definitions for hipStreamSet/GetAttribute (#831)
Co-authored-by: Rahul Manocha <rmanocha@amd.com>

[ROCm/clr commit: 0f49c4a97f]
2025-08-15 12:51:35 -07:00
Arandjelovic, Marko 208d124f54 SWDEV-547453 Release the kernel command if the operation returns an error (#807)
* SWDEV-547453 Release the kernel command if the operation returns an error

* SWDEV-547453 - Initialize parameters_ to default value

* SWDEV-547453 - Run clang-format

[ROCm/clr commit: a15957fee9]
2025-08-14 20:08:53 +02:00
Stojiljkovic, Vladana 33085dd232 SWDEV-533220 - Release marker when HostQueue is destroyed (#460)
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>

[ROCm/clr commit: 14760c6eba]
2025-08-13 15:15:31 +02:00
Andryeyev, German 6df9a49437 SWDEV-465041 - Add support for user events with DD (#321)
* SWDEV-465041 - Add support for user events with DD

User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals

* Fix notifyCmdQueue() logic for OCL

* Avoid blocking calls in OCL with DD

* Add event  destruciton in a case of the failure.

[ROCm/clr commit: 2305f8ae56]
2025-08-12 19:04:36 -04:00
Kudchadker, Saleel 3a849c6962 SWDEV-538195 - Introduce threshold for handler submission (#723)
- When doing device/stream sync, we can submit a handler which may
  introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
  batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr but need to address this for now in
  HIP runtime.

[ROCm/clr commit: 9b045922a8]
2025-08-06 20:34:42 -07:00
Xie, Pengda 9cbbee4d6e SWDEV-520384 - Improve Fat Binary loading latency (#390)
Init and fini kernel needs to be launched when we load and unload code object. Avoid looping through all kernels within a code object just to run the init and fini kernels. Compiler currently only generates 1 init and fini kernel.

[ROCm/clr commit: cd46294b31]
2025-08-05 14:02:05 -07:00
Betigeri, Sourabh 40999496c1 SWDEV-545273 - Respect HIP_LAUNCH_PARAM_BUFFER_SIZE (#770)
[ROCm/clr commit: 2a02d2c2f3]
2025-08-03 17:32:52 -07:00
Manocha, Rahul 4a93a614e5 SWDEV-539710 - Defer allocation of managed variable (#652)
Co-authored-by: Rahul Manocha <rmanocha@amd.com>

[ROCm/clr commit: 3f6f9d6081]
2025-07-31 08:30:23 -07:00
Manocha, Rahul 8e8dc41cf0 SWDEV-532420 - Fix kokkos P2P copy failure with vmheap (#426)
Co-authored-by: Rahul Manocha <rmanocha@amd.com>

[ROCm/clr commit: 22b1ca4d8c]
2025-07-07 17:27:13 -07:00
Trinh, Ethan 66b125d5dc SWDEV-539861 - add ROCclr version to logs (#612)
[ROCm/clr commit: 2ce45143d8]
2025-06-27 16:37:58 -04:00
Guan, Zichuan c43b8ca05c Fix memory leak in opengl interop (#541)
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>

[ROCm/clr commit: 51508e0bef]
2025-06-27 09:51:18 -04:00
Patel, Jaydeepkumar 821a1d89b0 SWDEV-536226 - Avoid waiting for lastCommand completion if GPU has already reported an error otherwise it causes hang due to status of cmd is not becoming CL_COMPLETE. (#478)
[ROCm/clr commit: a60212b9b4]
2025-06-25 20:59:17 +05:30
Manocha, Rahul b936e55e81 SWDEV-475482 - [6.4 Preview] Match hipTexObjectCreate with cuda (#408)
Co-authored-by: Rahul Manocha <rmanocha@amd.com>

[ROCm/clr commit: b5e9bc55cb]
2025-05-28 13:21:45 +05:30
Jayaprakash, Karthik f4c9056cdc Revert "SWDEV-522707 - Set phys_mem_handle type to sizeof(size_t) to avoid blocking address range. (#105)" (#348)
This reverts commit 0071d33754.

[ROCm/clr commit: bb7750a946]
2025-05-19 15:19:16 +05:30
Jayaprakash, Karthik 4ea2d9a5ee SWDEV-531711 - Report correct error code based on device failure. (#286)
[ROCm/clr commit: f5b8db33f1]
2025-05-17 06:33:13 -04:00
Andryeyev, German b13eec7049 SWDEV-345024 - Retain the program on Fini kernel execution (#307)
Fini kernel is executed during the invocation of amd::Program destructor,
but the dispatch logic can retain/release the reference counter and
cause double free. Avoid double free with an extra retain() call

[ROCm/clr commit: bddb8f14d1]
2025-05-14 21:21:26 +05:30
Assiouras, Ioannis 71f19d7017 SWDEV-529449 - Bug fix when retrieving a memobj from the IPC mem handle
[ROCm/clr commit: f7482ef0a6]
2025-05-13 19:18:22 +01:00