147 Commits

Author SHA1 Message Date
SaleelK e6e0378acd clr: Always query new engine for intergpu copies (#2559) 2026-01-12 11:01:02 -08:00
SaleelK 6b28faa532 clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480)
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
   async copies since copy_engine_status may report engines as busy
2. Busy and preferred engine checks ran for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
   suboptimal resource allocation

Solution:
Implemented a global SDMA engine allocator with per-stream affinity:

- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
  * Maintains global map of active assignments
  * Enforces exclusivity: different streams use different engines (except
    inter-GPU copies where preferred engines are prioritized for optimal
    hardware paths like XGMI links)
  * Thread-safe allocation/release with Monitor lock

- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
  for fast lookup without map access on hot path

- Refactored rocrCopyBuffer() to:
  1. Check local cached engine first → use if assigned
  2. Call AllocateSdmaEngine() if not assigned → cache result

- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
  into AllocateEngine() for cleaner separation of concerns

- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
  * Improves engine utilization by releasing earlier
  * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice

- Added future path for simple round-robin allocation (kUseSimpleRR) for
  next-gen GPUs with uniform SDMA bandwidth (disabled by default)

Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager

Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
2026-01-07 19:37:45 -08:00
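
The per-stream affinity scheme described in commit 6b28faa532 can be sketched roughly as follows. This is a minimal illustration, not the clr implementation: the class and member names (`SdmaEngineAllocator`, `Stream`, `EngineForCopy`) and the bitmask layout are assumptions, `std::mutex` stands in for the Monitor lock, and the inter-GPU exception that lets streams share a preferred engine is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <optional>
#include <unordered_map>

// Hypothetical stand-in for the allocator described in the commit message.
// Engines are modelled as bits in a 32-bit mask (an assumption).
class SdmaEngineAllocator {
 public:
  explicit SdmaEngineAllocator(uint32_t engine_mask) : engine_mask_(engine_mask) {
    assert(engine_mask != 0 && "at least one engine must be available");
  }

  // Returns a one-hot engine mask for `stream_id`, or nullopt if every engine
  // is taken. `preferred_mask` models the preferred-engine hint.
  std::optional<uint32_t> Allocate(uint64_t stream_id, uint32_t preferred_mask = 0) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = assignments_.find(stream_id);
    if (it != assignments_.end()) return it->second;  // affinity: reuse the engine

    const uint32_t free_mask = engine_mask_ & ~in_use_;
    const uint32_t preferred_free = free_mask & preferred_mask;
    const uint32_t candidates = preferred_free ? preferred_free : free_mask;
    if (candidates == 0) return std::nullopt;

    const uint32_t engine = candidates & (~candidates + 1u);  // lowest set bit
    in_use_ |= engine;
    assignments_[stream_id] = engine;
    return engine;
  }

  // Frees the engine held by `stream_id`, if any.
  void Release(uint64_t stream_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = assignments_.find(stream_id);
    if (it == assignments_.end()) return;
    in_use_ &= ~it->second;
    assignments_.erase(it);
  }

 private:
  std::mutex mutex_;
  uint32_t engine_mask_;
  uint32_t in_use_ = 0;
  std::unordered_map<uint64_t, uint32_t> assignments_;
};

// Hypothetical stream that caches its engine so the hot path skips the map.
struct Stream {
  uint64_t id;
  uint32_t cached_engine = 0;  // 0 means "no engine assigned"

  uint32_t EngineForCopy(SdmaEngineAllocator& alloc) {
    if (cached_engine == 0) cached_engine = alloc.Allocate(id).value_or(0);
    return cached_engine;
  }

  // Mirrors the idea of releasing on finish rather than on destruction.
  void Finish(SdmaEngineAllocator& alloc) {
    alloc.Release(id);
    cached_engine = 0;
  }
};
```

In this sketch a stream keeps the same engine across consecutive copies until `Finish()` runs, which is the consistency property the commit message aims for.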
SaleelK c105dcd05b clr: Use graph segment scheduling to process HIP Graphs (#1372)
* clr: Use graph segment scheduling to process HIP Graphs

* Add a broader path to use packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path;
  enabled by default

* clr: Few fixes and improvements

* clr: Detect complex graphs to take classic path

* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
  path

* clr: Fix a corner-case stack corruption

* clr: Track commands of segments instead of snapshots

* clr: Fix Batch dispatch logic

* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accommodate dirty fence on cross
  streams

---------

Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
2025-12-01 12:49:26 -08:00
Karthik Jayaprakash 740a06d567 SWDEV-559267 - Use CLPrint to DevLogPrintf with Log Level - detail debug. (#1160) 2025-11-25 19:25:32 -05:00
SaleelK 5e418ca256 clr: Allow all engines but prefer recommended engines (#1750)
* Also honor ROC_P2P_SDMA_SIZE for IPC, since IPC can also mean P2P
2025-11-10 13:10:46 -08:00
Ioannis Assiouras 538ebc5409 SWDEV-556877 - Ensure pinned memory is released if hsa copy fails (#1137) 2025-10-14 10:08:49 +01:00
Godavarthy Surya, Anusha fb72d7f851 SWDEV-524746 - Part-II Add multi device support for hip graph. Updated kernel arg manager for each device (#813)
- Updated kernel arg manager to support allocating kernel args on multiple devices for single graph.
- Updated AQL path to capture on the device where graph node is added.

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
2025-09-25 20:38:18 +05:30
SaleelK 34b9184686 clr: Fix memory corruption for memset nodes (#1068)
* Detect graph capture and use graph kernelarg memory for FillBuffer pattern
2025-09-23 17:17:33 -07:00
German Andryeyev ea89ddd589 SWDEV-547108 - Add dll loader for Windows build (#1004)
The ROCR backend build will be enabled by default on Windows.
It requires the dll loader until the ROCR dll is always available on Windows for any configuration.
2025-09-19 11:25:30 -04:00
SaleelK ec5e9673ad clr: Use current device copy engine for inter-dev copy (#945)
* For inter-device copies always use the SDMA engine of current device
* ROCr uses srcAgent SDMA engine, and it could be a remote device
2025-09-16 12:56:07 -07:00
SaleelK c8e91b3f3e clr: Fix condition for taking shader path (#884)
* SWDEV-551080
* Fix condition for taking shader path, the size check was moved
  incorrectly
* Also account for a bitmask returned for preferred engines
2025-09-09 13:13:29 -07:00
SaleelK c4537e8050 SWDEV-553126 - Improve logging (#835)
* Ability to mask COPY api usage in logs
* Show total graph nodes in logs
* Add another log level for detailed debug
2025-09-04 10:08:41 -07:00
German Andryeyev 7a1a6682e2 SWDEV-552846 - Unpin memory for hip before exit the copy (#851) 2025-09-04 20:04:01 +05:30
SaleelK ddba20579d SWDEV-551080 - Fix hipMemcpyDeviceToDeviceNoCU path (#683)
* hipMemcpyDeviceToDeviceNoCU should always take SDMA path as per the
  flag usage
2025-08-25 15:13:02 -07:00
Danylo Lytovchenko 2ff2316227 Adjust clang format to the new versions, revert broken macro layout (#714) 2025-08-22 17:23:22 +02:00
systems-assistant[bot] 621da5410a SWDEV-465041 - Avoid wait in device enqueue (#443)
If we have PCIe atomics, we can avoid the workaround in the scheduler, which requires an explicit wait on the CPU
2025-08-20 12:46:47 -04:00
Danylo Lytovchenko f7338717ae SWDEV-470698 - fix formatting, add format check workflow (#657) 2025-08-20 19:58:06 +05:30
Andryeyev, German 72b9408fed SWDEV-547108 - Fix compilation errors under Windows (#867)
Interop and numa are not enabled.

[ROCm/clr commit: 0ac913e64c]
2025-08-17 02:33:31 -04:00
Andryeyev, German 6df9a49437 SWDEV-465041 - Add support for user events with DD (#321)
* SWDEV-465041 - Add support for user events with DD

User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals

* Fix notifyCmdQueue() logic for OCL

* Avoid blocking calls in OCL with DD

* Add event destruction in case of failure.

[ROCm/clr commit: 2305f8ae56]
2025-08-12 19:04:36 -04:00
Kudchadker, Saleel 3a849c6962 SWDEV-538195 - Introduce threshold for handler submission (#723)
- When doing device/stream sync, we can submit a handler which may
  introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
  batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr but need to address this for now in
  HIP runtime.

[ROCm/clr commit: 9b045922a8]
2025-08-06 20:34:42 -07:00
Kudchadker, Saleel 433c25eab0 SWDEV-539378 - Use agent of IPC memory owner (#570)
- Currently runtime just uses the local agent as it did not check for
  IPCShared()
- With this fix we query hsa_amd_pointer_info and get the right agent
  for the memory to pass it to the HSA copy api

[ROCm/clr commit: 46d766e4e2]
2025-07-08 12:02:01 -07:00
Kudchadker, Saleel ee7c5554db SWDEV-523279 - Use preferred engine mask for SDMA (#317)
- ROCr now reports preferred engine for copy status. We can leverage
this for max bandwidth for inter-GPU copies
- Cleanup logging

[ROCm/clr commit: 1b0ea080e4]
2025-05-19 16:04:51 -07:00
Kudchadker, Saleel 5fd5e846e2 SWDEV-531518 - Fix offset accumulation (#333)
srcAddress/dstAddress accumulation was cumulative, which shouldn't be
done if we increment offset.

[ROCm/clr commit: 5712944c7c]
2025-05-19 18:03:06 +05:30
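
The kind of bug commit 5fd5e846e2 describes can be illustrated with a chunked copy loop. This is a hedged sketch of the general pattern only: `ChunkedCopy` and its parameters are hypothetical, and `std::memcpy` stands in for the real copy call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `size` bytes from `src` to `dst` in chunks of at most `chunk` bytes.
inline void ChunkedCopy(unsigned char* dst, const unsigned char* src,
                        std::size_t size, std::size_t chunk) {
  assert(chunk > 0);
  std::size_t offset = 0;
  while (offset < size) {
    const std::size_t n = std::min(chunk, size - offset);
    // Correct: derive each chunk address from the fixed base plus the running
    // offset. The buggy pattern would be `src += offset; dst += offset;`,
    // which adds a growing offset to an already advanced pointer, so the
    // chunks would start at 0, c, 3c, 6c, ... instead of 0, c, 2c, 3c, ...
    std::memcpy(dst + offset, src + offset, n);
    offset += n;
  }
}
```

Either the base pointer or the offset should advance, not both.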
Andryeyev, German a9df586812 SWDEV-459758 - Pass workgroup size explicitly (#185)
It's easier for compiler to move explicit kernel arguments into user SGPRs

[ROCm/clr commit: 3fd7650fe3]
2025-04-15 15:22:15 -04:00
Saleel Kudchadker c94c02a2e6 SWDEV-519596 - Avoid passing dep signal to SDMA
- For D2H cases avoid passing dependent signals to SDMA; the signals
  take a while to resolve on the SDMA engine

Change-Id: I569635228af977847f201c82ca897002f8f2f4a8


[ROCm/clr commit: 78d0ff2dbc]
2025-03-07 17:37:21 -05:00
Saleel Kudchadker c8f39ec2b0 SWDEV-502365 - Track last used command
- This change tries to save extra synchronization packets we may insert
  because we didn't track the completion signals for every command. We track
  the current enqueued command until it exits the enqueue stage. We also
  record the exit scope to know if we flushed the caches
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
  passed as the argument

Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc


[ROCm/clr commit: e03e4f3b5d]
2025-03-04 16:05:02 -05:00
Saleel Kudchadker d0a7ae02cf SWDEV-513197 - Unify getBuffer implementation
- Use getBuffer/releaseBuffer in BlitManager
- Cleanup XferBuffer as we use ManagedBuffer for both reads/writes

Change-Id: I2661b85dd012763b17a38a743fec1b1d79125f67


[ROCm/clr commit: 37d606d193]
2025-02-28 12:47:51 -05:00
Saleel Kudchadker ef505c7cd8 SWDEV-513197 - Improve launch perf for Device Heap kernels
- If any kernel uses device heap, the launch needs to be preceded by an
  init kernel. Save on the extra barrier packet launch/flush between the
  init heap kernel and user kernel

Change-Id: I8ebc6246188200e5f673dc464bc76a53bcb8b7c6


[ROCm/clr commit: ca530c660b]
2025-02-27 19:17:51 -05:00
Rahul Manocha 90337103ac SWDEV-510849 - Restore pinned memory copy path
1) Create getBuffer method to return pinned host memory or staging buffer
2) for D2H path use managed buffer instead of static buffer
3) use staging buffer copy for 16KB < size < 1MB
4) use pinned memory copy for size > 1MB

Change-Id: I13d4d6ab60691bc6c7724239db1e11e23f0f3dc2


[ROCm/clr commit: 4bf634dfca]
2025-02-26 11:25:02 -05:00
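
The size thresholds listed in commit 90337103ac can be sketched as a small selection function. The names below (`CopyPath`, `SelectCopyPath`) are hypothetical. The commit message does not say what happens at exactly 1 MB or at 16 KB and below, so this sketch assumes 1 MB falls to the staging path and groups smaller sizes under a separate `kSmallCopy` value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the copy paths named in the commit message.
enum class CopyPath { kSmallCopy, kStagingBuffer, kPinnedMemory };

constexpr std::size_t kKiB = 1024;
constexpr std::size_t kStagingMin = 16 * kKiB;   // staging used above 16 KB
constexpr std::size_t kPinnedMin = 1024 * kKiB;  // pinned used above 1 MB

// Picks a path from the transfer size alone. Boundary handling is an
// assumption: sizes equal to a threshold fall to the smaller bucket.
constexpr CopyPath SelectCopyPath(std::size_t size) {
  if (size > kPinnedMin) return CopyPath::kPinnedMemory;
  if (size > kStagingMin) return CopyPath::kStagingBuffer;
  return CopyPath::kSmallCopy;  // not covered by this commit message
}
```

For example, a 64 KB transfer maps to the staging buffer and a 2 MB transfer maps to pinned memory.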
kjayapra-amd 010253430f SWDEV-516303 - Remove SDMA retainer logic to select the engine.
Change-Id: I818129444131825cdb87e06cb495afa3e5cdb683


[ROCm/clr commit: 1f583a6870]
2025-02-20 11:34:38 -05:00
Saleel Kudchadker 71e1a0b10d SWDEV-504494 - Further copy improvements
- Fix regression for D2H pinned copies which adds systemscope release.
- Skip cpu wait for D2H unpinned copies as we can pass the signal of the
  barrier to rocr copy.
- Fix an old bug in sdmaEngineRetainCount_ logic
- Improve logging

Change-Id: If074bddb05564b15949b0d5f9bf12acd3692174e


[ROCm/clr commit: 4c95ee5e1e]
2025-02-11 00:55:52 -05:00
kjayapra-amd 892d7bb064 SWDEV-488290 - Remove Stream to Engine logic and rely on engine query status HSA API.
Change-Id: I469ab6679360c8ee8d4ee515678a8aa8d4578ebf


[ROCm/clr commit: cc62a82347]
2025-02-04 13:00:16 -05:00
Jimbo Xie cc229f251f SWDEV-504383 - Cleaned up kForcedTimeout10us and removed IsHwEventReadyForcedWait
Also removed active_wait_timeout

Change-Id: I7a429f003c09a4df267b5c0983050704260094c6


[ROCm/clr commit: 4872b420c9]
2025-01-31 14:40:18 -05:00
Saleel Kudchadker c6eef97e3e SWDEV-504494 - Set active engine for SDMA
Change-Id: I4cec84e71903c5813a7063e8b9ff1ea4473f4720


[ROCm/clr commit: d208e8052f]
2025-01-27 17:54:36 -05:00
Saleel Kudchadker 16f14e4b00 SWDEV-504494 - Use system scope for D2H
- When using shader copy, make sure to use release scope for the AQL
  packet. This is a potential bug, but it is hidden because hipMemcpyAsync
  always needs synchronization (which inserts a barrier with release scope).
  For hipMemcpy we use a barrier packet to make sure it is blocking. Either
  way, a barrier is always used, which partly hides the potential bug.

Change-Id: I57fb7f769c3179e76d712471c0905104c801d7ba


[ROCm/clr commit: c9dd95bf6c]
2025-01-10 00:34:08 -05:00
Sourabh Betigeri 36f3d7647c SWDEV-505971 - Fix size mismatch of count type to uint32_t
Change-Id: Ie526f828f816e6681ef1735d5edb2db895dace57


[ROCm/clr commit: f5b2516f5d]
2025-01-08 12:47:36 -05:00
Saleel Kudchadker a18f2c549c SWDEV-504494 - Flush to systemscope when copying non-coherent mem
- When we use blit (compute) copies, two subsequent copies may read from
  the same source buffer. The buffer may get modified by the host in
  between, and if the src buffer was allocated with the non-coherent flag,
  the device may simply use a stale value from a previous cacheline fetch.
  This is a corner case.

Change-Id: I2ce261c6f6fa4e5bb608f116548e5cc711ae6f3c


[ROCm/clr commit: b63005d550]
2025-01-07 12:49:22 -05:00
Jatin Jaikishan Chaudhary 8b1d0cff83 Revert "SWDEV-505971 - change setArgument arg from uint32_t to uint64_t"
This reverts commit 0830d95f6d.

Reason for revert: There needs to be memcpy size change

Change-Id: If4f51769731e54743ac705b19b4f81b2d5925d5a


[ROCm/clr commit: 446ed661a0]
2025-01-06 18:03:23 -05:00
Jatin Chaudhary 0830d95f6d SWDEV-505971 - change setArgument arg from uint32_t to uint64_t
We are passing this arg as an address, and memcpy complains about
overreading (8 bytes instead of 4).

Change-Id: Ica9207f6c5f6056a4bfc968280c76e779ded13ae


[ROCm/clr commit: a6f2a2c2af]
2025-01-06 08:16:59 -05:00
Sourabh Betigeri 7261404002 SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs
Change-Id: I5ac63a6626af8c2b4ac382c52dfe1aaf0b3716b8


[ROCm/clr commit: 03dbcd8ca7]
2024-12-12 19:29:24 -05:00
Saleel Kudchadker 7d7aa8b69c SWDEV-497145 - Use rocr copyOnEngine API for staged copies
- Refactor blit code and clean ASAN instrumentation
- Use unified function for rocr copy
- Enable shader copy path for unpinned writeBuffer/readBuffer paths
- Set GPU_FORCE_BLIT_COPY_SIZE=16 which means we will use BLIT copy for
  pinned copies or unpinned H2D/D2H copies < 16KB

Change-Id: I42045cca79234b340dbf53dafb93044199736ae4


[ROCm/clr commit: 7863eb92dc]
2024-12-04 13:38:13 -05:00
Sourabh Betigeri 1712acdd2e Revert "SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs"
This reverts commit ab0ff9163d.

Reason for revert: hipInfo fails on windows. Updating llvm amd-mainline-closed

Change-Id: I57e1fa1945188b0bc0a799c4f3d540f2b7713003


[ROCm/clr commit: 2ca644cf22]
2024-12-02 16:46:12 -05:00
Sourabh Betigeri ab0ff9163d SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs
Change-Id: I449ffca44bbb04d13348d112e896d603c70fd485


[ROCm/clr commit: bd5d8e9baf]
2024-11-30 17:54:32 -05:00
Saleel Kudchadker 35201ff32d SWDEV-483586 - Do not take pinned path for read/write
- When GPU_FORCE_BLIT_COPY_SIZE is set do not take pinned path

Change-Id: Iaa065db63cc8fda61f82e6c9701e9fdaec5c54cb


[ROCm/clr commit: f1e98ab6e4]
2024-11-01 12:55:15 -04:00
Anusha GodavarthySurya c0ceb1cf12 SWDEV-477324 - Capture Memcpy1D pinned H2D D2H
Change-Id: I1f4744f20a9caeed005ec68da44e5fde737e09f7


[ROCm/clr commit: 742b0210d3]
2024-09-30 01:01:30 -04:00
German Andryeyev f8fc11c2d8 SWDEV-483586 - Unblock staging H2D transfers
Although unpinned copies require synchronizations
in HIP, runtime can avoid syncs for H2D copies with
a staging buffer

Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2


[ROCm/clr commit: 29cc678d8d]
2024-09-21 10:25:27 -04:00
kjayapra-amd f19260d568 SWDEV-480772 - Remove name variable from amd::Monitor class.
Change-Id: Ie2a4fa44f485786227230f8a892e090e718aa30e


[ROCm/clr commit: 12a39fbf22]
2024-09-19 11:55:01 -04:00
Saleel Kudchadker 16920809d7 SWDEV-301667 - Refactor Blit force env var
Change-Id: I5344ac2e6442cd8f526118e688f1b1412cc5b45a


[ROCm/clr commit: d379f4efd0]
2024-07-25 15:15:10 -04:00
Anusha GodavarthySurya 291f079669 SWDEV-467102 - Hidden heap init for graph capture
If the graph has kernels that do device-side allocation, the heap is allocated during packet
capture because the heap pointer has to be added to the AQL packet, and it is initialized during
graph launch.

Handle race with wait when 2 kernels with device heap are enqueued on multiple streams.

Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688


[ROCm/clr commit: 57156c524d]
2024-06-17 02:07:25 -04:00
Ioannis Assiouras 407d1346f2 SWDEV-463865 - changed device,roc and pal namespaces to be nested under amd
Change-Id: Icad342843c039c634e249a13a7aa31400730b1dd


[ROCm/clr commit: 775dc204aa]
2024-06-07 12:23:06 -04:00