* clr: Implement dynamic stream to HW queue assignment
This change implements dynamic stream to hardware queue (HWq) mapping
with the following features:
* Queue depth heuristics with weights for optimal HWq assignment
* Make last used queue sticky for better locality
* Use HWq to pipe mapping: gfx9 follows a round-robin queue to
pipe mapping based on creation order (single process per device only,
as pipe ID is statically assigned by runtime)
* More aggressive heuristic usage for better queue distribution
* Extend dynamic queues support for all stream priorities
Environment variables:
* DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics, 2 -
Depth+Pipe heuristics
* DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation
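
The heuristics listed above could be sketched roughly as follows. This is a
minimal illustration, not the code from this change: the HwQueueInfo struct,
the PickQueue function, and all weight values are hypothetical. It assumes a
weighted cost per queue (depth, optionally plus the load of the pipe the queue
maps to) and a small bonus that keeps the last used queue sticky.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-queue state; not the actual clr data structure.
struct HwQueueInfo {
  std::uint32_t depth;  // commands currently in flight on this queue
  std::uint32_t pipe;   // pipe this queue is assumed to map to
};

// Assumed weights, chosen only for illustration.
constexpr std::int64_t kDepthWeight = 4;
constexpr std::int64_t kPipeWeight = 1;
constexpr std::int64_t kStickyBonus = 2;

// Returns the index of the cheapest queue. `queues` must not be empty.
// use_pipe_heuristic == false models mode 1 (depth only),
// use_pipe_heuristic == true models mode 2 (depth + pipe).
std::size_t PickQueue(const std::vector<HwQueueInfo>& queues,
                      std::size_t last_used, bool use_pipe_heuristic) {
  // Sum the depth of all queues that share a pipe.
  std::vector<std::int64_t> pipe_load;
  if (use_pipe_heuristic) {
    for (const HwQueueInfo& q : queues) {
      if (q.pipe >= pipe_load.size()) pipe_load.resize(q.pipe + 1, 0);
      pipe_load[q.pipe] += q.depth;
    }
  }

  std::size_t best = 0;
  std::int64_t best_cost = std::numeric_limits<std::int64_t>::max();
  for (std::size_t i = 0; i < queues.size(); ++i) {
    std::int64_t cost = kDepthWeight * queues[i].depth;
    if (use_pipe_heuristic) cost += kPipeWeight * pipe_load[queues[i].pipe];
    if (i == last_used) cost -= kStickyBonus;  // prefer the last used queue
    if (cost < best_cost) {
      best_cost = cost;
      best = i;
    }
  }
  return best;
}
```

With equal depths the sticky bonus keeps a stream on its previous queue. With
the pipe term enabled, an idle queue on an idle pipe wins over an idle queue
that shares a pipe with a busy one.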
* clr: Clean up last_used_queue_
* SWDEV-555889 - Support mipmap on rocr
Support mipmap in hip-rt on rocr backend.
Enable all mipmap tests in Windows.
Some other minor improvements.
Add some SRD logs that will be removed later.
* Add sampler.mipFilter to fix sampler issues on mipmap in rocr.
Fix format issues for views of leveled images and mipmap images in the blit kernel in rocr.
Enable previously disabled mipmap tests.
* Rewrite view logic
* Set word4.f.PITCH = 0 for mipmap SRD on navi31 to fix unstable test issues.
Reset last error in negative tests.
* Remove SRD dump log from hip-rt
Make the Rocr mipmap log conditional.
* Minor format change
* Exclude mipmap tests for mi200+ which don't support mipmap.
* SWDEV-465041 - Add support for user events with DD
User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals
* Fix notifyCmdQueue() logic for OCL
* Avoid blocking calls in OCL with DD
* Add event destruction in case of failure.
[ROCm/clr commit: 2305f8ae56]
- Use getBuffer/releaseBuffer in BlitManager
- Cleanup XferBuffer as we use ManagedBuffer for both reads/writes
Change-Id: I2661b85dd012763b17a38a743fec1b1d79125f67
[ROCm/clr commit: 37d606d193]
Use env var DEBUG_CLR_KERNARG_HDP_FLUSH_WA=1 to fall back to HDP flush
workaround. The default is 0
Change-Id: I7bdb9be61da60c30d15ac9991b7cd27351e1831c
[ROCm/clr commit: 9de6d4d46c]
On gfx8, gfx9 devices before MI100 and gfx10.0 or gfx10.1
none of the memory ordering workarounds for device kernel arguments
can be applied. Use host kernel arguments on these devices.
Change-Id: I9be6fbfe4b3986eb7d9f83998334df5f03fd4124
[ROCm/clr commit: 2b746de6de]
The Readback and Avoid HDP Flush memory ordering workaround is
used as a fallback solution only when HDP flush register is invalid
Change-Id: Ic284eba1f95ed22b0270d3abeb904fb902015b1a
[ROCm/clr commit: 6cb7b6ec6b]
- Enable Device kernel args for MI300* for now.
- Fix a perf issue which impacts graph instantiate when dev kernel args
are enabled.
Change-Id: I962e58fd9d8dd1a8db95e601cb03a8e9c7bac97f
[ROCm/clr commit: 68f40f78dd]
- Implement workaround to ensure HDP writes are done by writing and
reading the HDP MMIO register.
- Implement the same workaround for graphs, we no longer need sentinel
write/readback
Change-Id: I0d3027b46a1f61131ec62e3c8c669ff5184fa6b2
[ROCm/clr commit: f138e0d113]
Use only 16 workgroups for compute P2P copies.
That should be enough to utilize XGMI bandwidth.
Change-Id: I60dfe019279bb95f93c8874244c1738aad1896d8
[ROCm/clr commit: 31101c6219]
- Add the new fillBuffer kernel, which allows launching a limited
number of workgroups for memory fill operations
- Switch fill memory to 16 bytes write by default
- Allow to limit the workgroups with DEBUG_CLR_LIMIT_BLIT_WG
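
A host-side sketch of the idea, assuming the kernel behaves like a strided
loop in which a fixed number of work-items each write 16-byte elements. The
names FillLimited and Pattern16 are hypothetical, and the real fillBuffer
kernel runs on the device and may differ in detail.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical 16-byte fill pattern.
struct Pattern16 {
  unsigned char bytes[16];
};

// Emulates on the host a fill that uses at most
// max_workgroups * workgroup_size work-items (both must be nonzero).
// Each work-item writes 16-byte elements in a strided loop.
void FillLimited(unsigned char* dst, std::size_t size_bytes,
                 const Pattern16& pattern, std::size_t max_workgroups,
                 std::size_t workgroup_size) {
  const std::size_t elems = size_bytes / 16;
  const std::size_t total_items = max_workgroups * workgroup_size;
  for (std::size_t item = 0; item < total_items; ++item) {
    // Work-item `item` handles elements item, item + total_items, ...
    for (std::size_t i = item; i < elems; i += total_items) {
      std::memcpy(dst + i * 16, pattern.bytes, 16);
    }
  }
  // Handle a tail that is not a multiple of 16 bytes.
  const std::size_t tail = size_bytes % 16;
  std::memcpy(dst + elems * 16, pattern.bytes, tail);
}
```

The work-item count is independent of the buffer size, so a large fill does
not occupy more workgroups than the configured limit.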
Change-Id: Ibad1822f2d42b2fc71bcfc1917c31409c0623e8e
[ROCm/clr commit: f1dc81f427]
This reverts commit dfa7790030.
Reason for revert: Deferred to a future release.
Change-Id: Ia66c37f0ab9734dee73c930d10d7469d5fd57254
[ROCm/clr commit: 5dc104b3ea]
Pinned copy can cause big performance drops because of slow pinning under Windows.
Use up to 128MB for staging transfers. Change staging buffer size to 4MB.
Linux path should still have the old defaults.
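
One possible reading of this policy, as a rough sketch: transfers up to 128MB
go through 4MB staging chunks, and larger ones fall back to pinning. The names
ChoosePath and StagingChunks and the exact threshold comparison are
assumptions, not the actual clr logic.

```cpp
#include <cstddef>

constexpr std::size_t kMiB = 1024 * 1024;
constexpr std::size_t kStagingChunk = 4 * kMiB;    // staging buffer size
constexpr std::size_t kStagingLimit = 128 * kMiB;  // assumed staging cutoff

enum class XferPath { Staging, Pinned };

// Assumed rule: stage anything up to the limit, pin anything larger.
XferPath ChoosePath(std::size_t bytes) {
  return bytes <= kStagingLimit ? XferPath::Staging : XferPath::Pinned;
}

// Number of staging-buffer-sized chunks needed for a staged transfer.
std::size_t StagingChunks(std::size_t bytes) {
  return (bytes + kStagingChunk - 1) / kStagingChunk;
}
```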
Change-Id: I954edceb3ec89e8e670be116aa2d0a9564c8b11c
[ROCm/clr commit: 79d12df147]
This allows experimenting with env var GPU_PINNED_XFER_SIZE which is
still at a default of 32MB
Change-Id: I85ade700ed58d498eba29d1737601dc74d4c26a4
[ROCm/clr commit: 3f82b99f5d]
Add an env var ROC_USE_FGS_KERNARG to toggle kernel arg placement.
By default it's in the fine-grain kernel arg segment for supported ASICs.
Change-Id: I3d57ed69a1a4db2b392b0438ead499f3ddca4716
[ROCm/clr commit: e29b9c00ee]
ROC_BARRIER_SYNC will not work with direct dispatch.
Remove and cleanup.
Change-Id: I81368b2e65039477bd0343bb92708dab48867db6
[ROCm/clr commit: aa38af8c96]
- The logic will trace compute, sdma read/write operations and
apply signals when necessary
- ROC_CPU_WAIT_FOR_SIGNAL, ROC_SYSTEM_SCOPE_SIGNAL
and ROC_SKIP_COPY_SYNC were added to control the tracking
Change-Id: I9e8e6174c63bf7784f7ab00964e2918c8667d364
[ROCm/clr commit: dbc7abaecf]
- Correct GSL path to report targets using the TargetID syntax.
- Correct GSL path to check compatibility of code objects when
loading.
- Add the concept of a device ISA and create a registry used by ROCm,
PAL and GSL.
- Support XNACK and SRAMECC target features consistently for PAL and ROCm.
- Correct logic for NullDevices and asserts to avoid memory corruption.
- Allow all NullDevices to be created for HIP.
- Numerous other code improvements.
Change-Id: I40abf3d2b22249c1492d1af5919665f8184f4e0e
[ROCm/clr commit: c7e8d91e14]