98 Commits

Author SHA1 Message Date
SaleelK c105dcd05b clr: Use graph segment scheduling to process HIP Graphs (#1372)
* clr: Use graph segment scheduling to process HIP Graphs

* Add a broader path to use packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path,
  Enabled by default

* clr: Few fixes and improvements

* clr: Detect complex graphs to take classic path

* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
  path

* clr: Fix a corner-case stack corruption

* clr: Track commands of segments instead of snapshots

* clr: Fix Batch dispatch logic

* Track fence_dirty_ flag for commands of other streams
* Dependency resolution markers can now accommodate a dirty fence on cross
  streams

---------

Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
2025-12-01 12:49:26 -08:00
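The toggle described in the commit above could be read roughly as follows. This is a minimal sketch, not the runtime's code: `GraphPath`, `SelectGraphPath` and the `graph_is_complex` flag are hypothetical names, and the mapping of values (unset treated as 1, 0 as the classic path) is an assumption inferred from "Enabled by default" and "=2 to force segment scheduling".

```cpp
#include <cstdlib>

// Hypothetical names; the real runtime types are not shown in this log.
enum class GraphPath { Classic, SegmentScheduling };

// Interprets a raw DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING value.
// Assumptions: unset means 1 (enabled by default), 0 selects the classic
// path, 1 lets complex graphs fall back to the classic path, and 2 forces
// segment scheduling even for complex graphs.
GraphPath SelectGraphPath(const char* raw_value, bool graph_is_complex) {
  int mode = 1;
  if (raw_value != nullptr && *raw_value != '\0') {
    mode = std::atoi(raw_value);  // non-numeric text parses as 0
  }
  if (mode <= 0) return GraphPath::Classic;
  if (mode >= 2) return GraphPath::SegmentScheduling;
  return graph_is_complex ? GraphPath::Classic : GraphPath::SegmentScheduling;
}

// Convenience wrapper that reads the variable from the process environment.
GraphPath SelectGraphPathFromEnv(bool graph_is_complex) {
  return SelectGraphPath(std::getenv("DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING"),
                         graph_is_complex);
}
```

Keeping the parsing separate from `std::getenv` makes the decision table easy to check without touching the environment.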
sluzynsk-amd 2cf9faa93f SWDEV-563777 - fix warnings related to inconsistent overrides (#1625)
This patch adds missing override keywords. Fixes this class of warnings.

Signed-off-by: Sebastian Luzynski <Sebastian.Luzynski@amd.com>
2025-11-24 18:50:07 +01:00
Ioannis Assiouras 36029ea1a8 SWDEV-559166 - Fix race condition in getDemangledName (#1868) 2025-11-23 08:45:45 +00:00
Ioannis Assiouras 6d6b136374 SWDEV-559166 - Fix data races in GetSubmissionBatch, CaptureAndSet and SetQueueStatus (#1441) 2025-10-23 12:18:31 +01:00
Godavarthy Surya, Anusha fb72d7f851 SWDEV-524746 - Part-II Add multi device support for hip graph. Updated kernel arg manager for each device (#813)
- Updated kernel arg manager to support allocating kernel args on multiple devices for single graph.
- Updated AQL path to capture on the device where graph node is added.

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
2025-09-25 20:38:18 +05:30
SaleelK 34b9184686 clr: Fix memory corruption for memset nodes (#1068)
* Detect graph capture and use graph kernelarg memory for FillBuffer pattern
2025-09-23 17:17:33 -07:00
SaleelK 149dc17c90 clr: Optimize doorbell ring (#1030)
* Lay foundation to batch packets efficiently for graphs
* Dynamically copy packets with the max threshold set by
  DEBUG_HIP_GRAPH_BATCH_SIZE; if not, stagger packet copy with pow2
* Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256
* If TS are not collected for a signal for reuse, create a new signal.
  This can potentially increase signal footprint if the handler doesn't run
  fast enough.
2025-09-18 15:02:10 -07:00
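One possible reading of the "stagger packet copy with pow2" note above is that packets are copied in batches that double in size until they reach the threshold. The sketch below illustrates only that reading; `StaggeredBatchSizes` is a hypothetical helper, and the doubling schedule is an assumption rather than the commit's confirmed behavior. Only the default of 256 comes from the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Splits total_packets into batch sizes 1, 2, 4, ... capped at max_batch.
// max_batch stands in for DEBUG_HIP_GRAPH_BATCH_SIZE (default 256 per the
// commit message); the power-of-two ramp is an assumed interpretation.
std::vector<std::size_t> StaggeredBatchSizes(std::size_t total_packets,
                                             std::size_t max_batch = 256) {
  std::vector<std::size_t> batches;
  if (max_batch == 0) max_batch = 1;  // guard against a zero threshold
  std::size_t next = 1;
  std::size_t remaining = total_packets;
  while (remaining > 0) {
    const std::size_t batch = std::min({next, max_batch, remaining});
    batches.push_back(batch);
    remaining -= batch;
    if (next < max_batch) next *= 2;  // stop doubling once the cap is reached
  }
  return batches;
}
```

Under this reading, small graphs ring the doorbell early while large graphs quickly settle at the threshold-sized batch.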
German Andryeyev 7a1a6682e2 SWDEV-552846 - Unpin memory for hip before exit the copy (#851) 2025-09-04 20:04:01 +05:30
Danylo Lytovchenko f7338717ae SWDEV-470698 - fix formatting, add format check workflow (#657) 2025-08-20 19:58:06 +05:30
Betigeri, Sourabh 35e48d1eaf SWDEV-546293 - hipMemPrefetchAsync_v2 and hipMemAdvise_v2 implementation (#869)
SWDEV-546293 - hipMemPrefetchAsync hipMemAdvise_v2

[ROCm/clr commit: cbee74a80e]
2025-08-15 22:40:04 -07:00
Andryeyev, German 6df9a49437 SWDEV-465041 - Add support for user events with DD (#321)
* SWDEV-465041 - Add support for user events with DD

User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals

* Fix notifyCmdQueue() logic for OCL

* Avoid blocking calls in OCL with DD

* Add event destruction in case of failure.

[ROCm/clr commit: 2305f8ae56]
2025-08-12 19:04:36 -04:00
Betigeri, Sourabh 40999496c1 SWDEV-545273 - Respect HIP_LAUNCH_PARAM_BUFFER_SIZE (#770)
[ROCm/clr commit: 2a02d2c2f3]
2025-08-03 17:32:52 -07:00
Kudchadker, Saleel cd14def193 SWDEV-521647 - Fix tracking of hw_event (#206)
- When a command can have two packets (like the device heap
  initializer) and there is no signal on the main kernel packet, the
  tracking was broken because it marked the first packet's signal as the
  HW event of the command.
- If no completion signal is attached to the second packet, clear the
  HW event for the command.

[ROCm/clr commit: 072fb0804e]
2025-04-25 08:46:44 -07:00
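The rule in the commit above can be sketched as a small selection function. The types `Signal` and `CommandPackets` and the function `ResolveHwEvent` are hypothetical stand-ins, not the runtime's classes; the sketch only encodes the stated behavior that a two-packet command without a completion signal on its second packet ends up with no HW event.

```cpp
#include <cstddef>

// Hypothetical stand-in for a completion signal.
struct Signal {
  int id;
};

// Hypothetical view of the packets a single command produced.
struct CommandPackets {
  const Signal* first_packet_signal;   // e.g. the device heap initializer
  bool has_second_packet;              // true when a main kernel packet follows
  const Signal* second_packet_signal;  // may be null: no completion signal
};

// Returns the signal to record as the command's HW event, or null to clear it.
const Signal* ResolveHwEvent(const CommandPackets& cmd) {
  if (!cmd.has_second_packet) {
    return cmd.first_packet_signal;
  }
  // With two packets, only the last one may define the HW event. If it has
  // no completion signal, the HW event is cleared rather than left pointing
  // at the first packet's signal.
  return cmd.second_packet_signal;
}
```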
Saleel Kudchadker c8f39ec2b0 SWDEV-502365 - Track last used command
- This change tries to save the extra synchronization packets we may
  insert because we didn't track the completion signals for every command.
  We track the current enqueued command until it exits the enqueue stage.
  We also record the exit scope to know if we flushed the caches.
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
  passed as the argument

Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc


[ROCm/clr commit: e03e4f3b5d]
2025-03-04 16:05:02 -05:00
Anusha GodavarthySurya 08c92f4793 SWDEV-480209 - Make internal callbacks non-blocking
Change-Id: Ic918d08f341abfd9a7c167d09f9c723cdc43157f


[ROCm/clr commit: 683a942364]
2025-01-10 02:16:11 -05:00
Sourabh Betigeri 7261404002 SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs
Change-Id: I5ac63a6626af8c2b4ac382c52dfe1aaf0b3716b8


[ROCm/clr commit: 03dbcd8ca7]
2024-12-12 19:29:24 -05:00
Sourabh Betigeri 1712acdd2e Revert "SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs"
This reverts commit ab0ff9163d.

Reason for revert: hipInfo fails on windows. Updating llvm amd-mainline-closed

Change-Id: I57e1fa1945188b0bc0a799c4f3d540f2b7713003


[ROCm/clr commit: 2ca644cf22]
2024-12-02 16:46:12 -05:00
Sourabh Betigeri ab0ff9163d SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs
Change-Id: I449ffca44bbb04d13348d112e896d603c70fd485


[ROCm/clr commit: bd5d8e9baf]
2024-11-30 17:54:32 -05:00
German Andryeyev faea40cbb3 SWDEV-486602 - Optimize HSA callback performance
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for CPU status update of the submitted commands
every 50 calls. That allows the commands to drain and the
HSA signals to be destroyed.

Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9


[ROCm/clr commit: 364dfb0ed1]
2024-10-11 14:50:25 -04:00
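The "every 50 calls" policy in the commit above amounts to a simple submission counter. A minimal sketch, assuming a hypothetical `SubmissionThrottle` class (the runtime's actual bookkeeping is not shown in this log):

```cpp
#include <cstdint>

// Counts submissions and reports when the caller should wait for the CPU
// status update of previously submitted commands. The class name and API
// are hypothetical; only the interval of 50 comes from the commit message.
class SubmissionThrottle {
 public:
  explicit SubmissionThrottle(std::uint32_t interval = 50)
      : interval_(interval == 0 ? 1 : interval) {}

  // Call once per submitted command. Returns true on every interval-th call,
  // which is the point where pending commands would be drained so their
  // HSA signals can be destroyed.
  bool OnSubmit() {
    if (++count_ < interval_) return false;
    count_ = 0;
    return true;
  }

 private:
  std::uint32_t interval_;
  std::uint32_t count_ = 0;
};
```

A caller would typically write `if (throttle.OnSubmit()) { /* wait and drain */ }` right after enqueueing a command.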
Todd tiantuo Li 170e45b879 SWDEV-472357 - support Rect copy with staging buffer for 2D & 3D memcpy in PAL
Change-Id: Ie32f3e5a6fa077f6b2db20fc1ab1e2e0da8344cb


[ROCm/clr commit: 41dc4545fc]
2024-10-10 18:00:19 -04:00
Anusha GodavarthySurya c0ceb1cf12 SWDEV-477324 - Capture Memcpy1D pinned H2D D2H
Change-Id: I1f4744f20a9caeed005ec68da44e5fde737e09f7


[ROCm/clr commit: 742b0210d3]
2024-09-30 01:01:30 -04:00
Vladana Stojiljkovic 887b11894b SWDEV-482086 - Fix hipGraphInstantiate leak
* In a scenario where kernel is launched with hipExtLaunchKernelGGL and stop event is used, hipGraphInstantiate leaks. Since stop event is used, profiling is enabled and Timestamp (ReferencedCountedObject) is created, but it doesn't get released.
* The idea behind this solution is that profiling should be disabled when command is captured, hence the timestamp should not be created. Because information about capturing isn't available when kernel command is created, packet capturing state is used to determine whether to create a timestamp or not.

Change-Id: Ia23adac4592ded4fb5e236acf99e12e729f63692


[ROCm/clr commit: da5f1a6146]
2024-09-29 11:36:53 -04:00
Chong Li 4979c2f206 SWDEV-478929 - Benchmark ReallyQuickPureX Failed
Ensure the member functions Alloc() and Free() of command_pool_ are not
accessed after command_pool_ is destroyed.

Signed-off-by: Chong Li <chongli2@amd.com>
Change-Id: Ic2d36423302518a030bd61fa399290ebe2ed8194


[ROCm/clr commit: e6a5c81221]
2024-09-10 22:08:18 -04:00
German Andryeyev 35c7a87014 SWDEV-470612 - Add the optimized multistream path
- Added the optimized multi-stream path in graph execution. It uses a fixed number of async streams in the execution
- Optimize the launch latency: command creation and execution
are done at the same time
- Optimize the scheduling to use fewer barriers and waiting signals if
the same queue can be detected
- The new path is controlled by the DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 will use the original path and any other
value will force the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single-queue async
execution in graphs (applicable for Navi families only)

Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e


[ROCm/clr commit: 9db52f9a46]
2024-08-02 14:19:44 -04:00
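The DEBUG_HIP_FORCE_GRAPH_QUEUES semantics described above (0 keeps the original path, any other value sets the number of asynchronous queues) can be sketched as below. `GraphQueueConfig` and `ParseForceGraphQueues` are hypothetical names, the default is left to the caller because the log does not state it, and treating negative input like 0 is an assumption.

```cpp
#include <cstdlib>

// Hypothetical result type for the parsed setting.
struct GraphQueueConfig {
  bool use_optimized_path;     // false: original graph execution path
  unsigned num_async_queues;   // meaningful only when use_optimized_path is true
};

// raw_value is the text of DEBUG_HIP_FORCE_GRAPH_QUEUES, or null when unset.
// default_queues is supplied by the caller; this log does not give the
// runtime's real default.
GraphQueueConfig ParseForceGraphQueues(const char* raw_value,
                                       unsigned default_queues) {
  long value = static_cast<long>(default_queues);
  if (raw_value != nullptr && *raw_value != '\0') {
    value = std::strtol(raw_value, nullptr, 10);
  }
  if (value <= 0) {
    // 0 selects the original path; negative input is treated the same way
    // here, which is an assumption of this sketch.
    return GraphQueueConfig{false, 0};
  }
  return GraphQueueConfig{true, static_cast<unsigned>(value)};
}
```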
Anusha GodavarthySurya 31927fefd6 SWDEV-468424 - Add support to capture multiple AQL Packets
=> Added support to capture multiple AQL Packets.
=> Added Interface to callback to hip runtime from rocclr to allocate
kernel args from the graph kernel arg pool.
=> Enabled Support to capture memset node.

Change-Id: I7e1c2ba06927459e024653058af142bd82192c43


[ROCm/clr commit: bd3a35bde1]
2024-08-01 23:55:51 -04:00
Anusha GodavarthySurya 7985a72073 SWDEV-468424 - hipgraph capture memset node
Capture AQL packets during GraphInstantiation and enqueue AQL packets during graph launch.

Added support to capture single graph memset node.
Capture support for memset node is currently disabled.
Memset capture will be enabled when capture for multiple packets is supported.

Change-Id: I14dfbc41731025cc3a548a730558915def3fa384


[ROCm/clr commit: 346da4bb40]
2024-07-19 23:52:50 -04:00
Tao Sang b8cf863eaa SWDEV-458943 - Implement std::mutex based monitor
Implement a std::mutex based monitor that has much
simpler logic than the legacy monitor.
Create DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR to
toggle between them.
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = false
  (by default), use legacy monitor;
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = true,
  use std::mutex based monitor.
If the std::mutex based monitor shows no perf drop,
the legacy one will be removed later.

Change-Id: I1d21368ff462477d3238d71e4e2a1a7d6b9167ad


[ROCm/clr commit: 73c02041e1]
2024-07-04 11:50:46 -04:00
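For readers unfamiliar with the pattern, a monitor built on `std::mutex` usually pairs the mutex with a `std::condition_variable`. The class below is a generic sketch of that pattern under the hypothetical name `StdMutexMonitor`; it is not the implementation added by the commit, and it omits features a production monitor may need (for example recursive locking).

```cpp
#include <condition_variable>
#include <mutex>

// Generic monitor sketch: a lock plus wait/notify on the same lock.
class StdMutexMonitor {
 public:
  void lock() { mutex_.lock(); }
  void unlock() { mutex_.unlock(); }
  bool try_lock() { return mutex_.try_lock(); }

  // Blocks until pred() returns true. The caller must already hold the lock
  // and still holds it when wait() returns.
  template <class Predicate>
  void wait(Predicate pred) {
    std::unique_lock<std::mutex> guard(mutex_, std::adopt_lock);
    cv_.wait(guard, pred);
    guard.release();  // hand ownership of the held lock back to the caller
  }

  void notify() { cv_.notify_one(); }
  void notifyAll() { cv_.notify_all(); }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
};
```

Because `lock()`, `unlock()` and `try_lock()` follow the standard Lockable naming, the class also works with `std::lock_guard<StdMutexMonitor>`.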
kjayapra-amd 41cb6dadf9 SWDEV-460948 - Changes to alloc, set, capture under single function.
Change-Id: I7b2d40e99e812b97c53535c5e63c41ad64a8f543


[ROCm/clr commit: 892071aeb2]
2024-06-06 16:57:53 -04:00
Ioannis Assiouras 0e023d1a0a SWDEV-463865 - symbol renamings to prevent conflicts in static build
Change-Id: Id7fbb638c1088c23df52fee877cd790d637b1ffb


[ROCm/clr commit: b8c2ac4de4]
2024-06-06 04:05:55 -04:00
Saleel Kudchadker 0b3e421451 SWDEV-301667 - Refactor graph code
- Remove Last graph node optimization and instead submit a barrier NOP
packet always. This simplifies the code.

Change-Id: Ied443173ba47a08b6df148ac7e3ead712acda11c


[ROCm/clr commit: badf2b0880]
2024-05-28 06:28:17 +00:00
German Andryeyev 68344576d3 SWDEV-460242 - Add system memory suballocator
Switch commands creation to the new suballocator to avoid
frequent expensive OS calls

Change-Id: I3597c811820e577c15708bad8b8a41aa53acc400


[ROCm/clr commit: 5b0bfdcbad]
2024-05-28 06:28:17 +00:00
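The idea behind a suballocator, as in the commit above, is to request memory from the OS in large chunks and serve many small allocations from them. The fixed-block pool below is a minimal sketch of that idea under the hypothetical name `FixedBlockSuballocator`; the layout, sizes and free-list policy are assumptions, and it is not thread-safe.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Serves fixed-size blocks out of larger heap chunks so that most Alloc()
// calls do not reach the system allocator at all.
class FixedBlockSuballocator {
 public:
  FixedBlockSuballocator(std::size_t block_size, std::size_t blocks_per_chunk)
      : block_size_(RoundUp(block_size)),
        blocks_per_chunk_(blocks_per_chunk == 0 ? 1 : blocks_per_chunk) {}

  void* Alloc() {
    if (free_list_.empty()) Grow();  // the only place that allocates a chunk
    void* block = free_list_.back();
    free_list_.pop_back();
    return block;
  }

  // Returns a block to the pool; the memory stays reserved for reuse.
  void Free(void* block) {
    if (block != nullptr) free_list_.push_back(block);
  }

  std::size_t chunk_count() const { return chunks_.size(); }

 private:
  // Rounds the block size up so every block keeps maximal alignment.
  static std::size_t RoundUp(std::size_t size) {
    const std::size_t align = alignof(std::max_align_t);
    if (size == 0) size = 1;
    return (size + align - 1) / align * align;
  }

  void Grow() {
    chunks_.push_back(std::unique_ptr<unsigned char[]>(
        new unsigned char[block_size_ * blocks_per_chunk_]));
    unsigned char* base = chunks_.back().get();
    for (std::size_t i = 0; i < blocks_per_chunk_; ++i) {
      free_list_.push_back(base + i * block_size_);
    }
  }

  std::size_t block_size_;
  std::size_t blocks_per_chunk_;
  std::vector<std::unique_ptr<unsigned char[]>> chunks_;
  std::vector<void*> free_list_;
};
```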
Saleel Kudchadker 588e870000 SWDEV-301667 - Pass reference to kernel name
Change-Id: I21abe109ddfabfe7640bf78a96c81a1317d31952


[ROCm/clr commit: 4a9d24a211]
2024-05-05 16:38:20 -04:00
Saleel Kudchadker f3aedfbec0 SWDEV-301667 - Create TS for each node recorded in graph
- Create a vector to allow multiple TS to be stored in Command.
- This means we don't wait for the entire batch in Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasn't included in Torch profile scope
because we previously entered kernel names only when profiler is
attached.

Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006


[ROCm/clr commit: c157bfb202]
2024-03-26 14:47:24 -04:00
Saleel Kudchadker cc0b04cc60 SWDEV-301667 - Reset profiler correlation_id_
- The correlation_id had random junk values which we were inserting in
the dispatch AQL packet even when no profiler was attached, as long as we
had a valid timestamp.
- Also make sure we don't even write the reserved2 field in the AQL
packet if no profiler is attached.

Change-Id: Icdb7493198c1bb5e2d786a97e027288660854cd7


[ROCm/clr commit: 9a6ddae7b2]
2024-02-05 05:08:11 +00:00
Saleel Kudchadker dfb1087c3e SWDEV-422207 - Tag captured kernel names for graphs
Change-Id: I9540daa4abf9c340541a681037e2dca4eec821ed


[ROCm/clr commit: dfd4635f91]
2024-01-03 11:50:05 -05:00
Saleel Kudchadker cb9a715e04 SWDEV-422207 - Report kernel names for activity profiling
- Report kernel names for optimized graph path
- Refactor code so that we store profiling info in Accumulate command

Change-Id: Ib97735a0239aeb9fc3a50a4bb7126dd0bcadc8af


[ROCm/clr commit: b056686607]
2023-11-15 14:38:07 -05:00
Saleel Kudchadker be743bcd59 SWDEV-422207 - Optimize graph end detection
- Do not use extra barrier to detect graph end. If it's a kernel node we
can use a completion signal for the last packet. Saves roughly 6us for
Phantom testcase per graph launch.

Change-Id: I5e0c2479d9964fbeda86ed97533f6718f49a7f91


[ROCm/clr commit: c3bd229f4f]
2023-11-10 11:57:02 -05:00
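The decision described in the commit above reduces to a check on the kind of the last graph node. A minimal sketch with hypothetical enum and function names (the log does not show the runtime's node types):

```cpp
// Hypothetical node kinds; the real graph node hierarchy is not shown here.
enum class NodeKind { Kernel, Memcpy, Memset, Other };

// How the end of a graph launch is detected.
enum class EndDetection { LastPacketCompletionSignal, ExtraBarrier };

// If the last node is a kernel node, its dispatch packet can carry the
// completion signal, so no extra barrier packet is needed. Other node kinds
// keep the barrier in this sketch, which is an assumption.
EndDetection ChooseEndDetection(NodeKind last_node) {
  return last_node == NodeKind::Kernel
             ? EndDetection::LastPacketCompletionSignal
             : EndDetection::ExtraBarrier;
}
```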
Saleel Kudchadker 19ea94729c SWDEV-422207 - Report TS for Accumulate command
Change-Id: Iba193a6068c1a2d25c2136643faee2c1e2591a07


[ROCm/clr commit: f5c6fc4dfa]
2023-11-07 18:19:40 +00:00
Saleel Kudchadker 5f009b7cb1 SWDEV-422207 - Track commands for capture
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- Explicit TS object now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues

Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f


[ROCm/clr commit: 40f41f4d0b]
2023-11-03 05:09:04 +00:00
Anusha GodavarthySurya 0404df28ef SWDEV-422207 - Capture AQL packets for graph kernel nodes during graph instantiation and enqueue AQL packets during launch
Change-Id: I1e5f7f9e2a70bd500d190193cb6ba0867f5a63e7


[ROCm/clr commit: e63c280d4d]
2023-10-05 00:34:29 -04:00
German 5d9912f48b SWDEV-407533 - [ABI Break]Remove Wavelimiter
Change-Id: I6a2f6fb5a0c3acea93fa0200a69679783e76f5bd


[ROCm/clr commit: 7be3a5e33e]
2023-09-07 09:58:41 -04:00
Tao Sang 3fdd346cf2 SWDEV-417727 - Fix hipSignalExternalSemaphoresAsync()
This reverts commit cab71e6e00.

Implement the right way to make ExternalSemaphores be signalled
only after prior work on the stream has finished.

Change-Id: I9d5974e05d5f229170b928db4566c14e40e3cbaa


[ROCm/clr commit: d433df4761]
2023-08-23 22:31:27 -04:00
taosang2 cab71e6e00 SWDEV-417727 - Fix hipSignalExternalSemaphoresAsync()
Let ExternalSemaphores be signalled only after prior work on the
stream has finished.

Change-Id: I856917db905f68f55fdf484f5267f7fe8ea3117f


[ROCm/clr commit: 44a3935cda]
2023-08-23 14:58:37 -04:00
Anusha GodavarthySurya 57467ef2c7 SWDEV-392732 - Initial commit for graph doorbell optimization(AQL Buffering)
Change-Id: I451725006c54c249dc530c55d2af2a31594bf49b


[ROCm/clr commit: b0e6f99ad7]
2023-07-16 07:56:00 -04:00
sdashmiz 2216908962 SWDEV-403638 - Fix warnings
- disable deprecated function use warning
- disable size_t to 'type' warning
- disable conversion from 'type1' to 'type2' warning

Signed-off-by: sdashmiz <shadi.dashmiz@amd.com>
Change-Id: I64161fd37cf56de3d132102103267ae8da40193a


[ROCm/clr commit: 38a67df312]
2023-06-15 12:17:22 -04:00
German 8d97827417 SWDEV-353281 - VM support in mempool for graphs
The change enables VM support in graphs on Windows. That avoids
caching of all allocations at the cost of map/unmap
overhead during memory create/destroy.

Change-Id: I792be00fba099e5e5d3cd44a963e1dfd6976a86d


[ROCm/clr commit: 04b696abee]
2023-05-05 15:31:26 -04:00
Saleel Kudchadker cb09d962ba SWDEV-384557 - Leverage SDMA engine status query
Change-Id: I5f386f2965de24a229ea43b6c4da82099692f91f


[ROCm/clr commit: 20ca8b8116]
2023-04-05 07:50:53 +00:00
German b5b078e036 SWDEV-377991 - Remove liquidflash support
Change-Id: Iba6455e5c0210c3223a06fec332404cd9f489154


[ROCm/clr commit: 53a10c9039]
2023-01-20 09:57:06 -05:00
Ioannis Assiouras 733c8d1d1c SWDEV-369581 - Convey copy API metadata to ROCclr
Change-Id: I569462d6d268700d419510255e201bf7d80d6714


[ROCm/clr commit: 72b45e2a1f]
2022-12-09 00:27:15 -05:00
Sourabh Betigeri 7aa958a8f7 SWDEV-305894 - Cooperative groups grid and multi grid sync support for gfx940+
Change-Id: I35d72f1cb50c3a96eee56a612b72d641852b145f


[ROCm/clr commit: 5d7f3f9f3c]
2022-12-05 16:30:30 -05:00