SWDEV-79445 - OCL generic changes and code clean-up
- Make sure PAL_DISABLE_SDMA is fully functional. CP DMA is currently used for buffer transfers, and kernels are used for image and buffer rect copies.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#32 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldevice.cpp#150 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palsettings.cpp#92 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palsettings.hpp#25 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.cpp#142 edit
SWDEV-180872 - Runtime support changes for Cooperative Group Features
- Initial implementation of the core functionality. Disabled by default. Use GPU_ENABLE_COOP_GROUPS=1 to enable the feature.
- Runtime uses device queue for cooperative executions with a synchronization on the launched queue.
- The current implementation is a pure runtime change and works only if a single app uses this feature. No ROCr/KFD support was added or tested.
- Only inline assembler was tested
Affected files ...
... //depot/stg/opencl/drivers/opencl/api/hip/hip_device.cpp#20 edit
... //depot/stg/opencl/drivers/opencl/api/hip/hip_device_runtime.cpp#15 edit
... //depot/stg/opencl/drivers/opencl/api/hip/hip_hcc.def.in#15 edit
... //depot/stg/opencl/drivers/opencl/api/hip/hip_hcc.map.in#17 edit
... //depot/stg/opencl/drivers/opencl/api/hip/hip_module.cpp#28 edit
... //depot/stg/opencl/drivers/opencl/api/hip/hip_platform.cpp#32 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/device.hpp#338 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpudevice.cpp#606 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpudevice.hpp#171 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#31 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.hpp#9 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldevice.cpp#142 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldevice.hpp#39 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palschedcl.cpp#6 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.cpp#135 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.hpp#61 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocblit.cpp#32 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocblit.hpp#12 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocdevice.cpp#127 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocdevice.hpp#37 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocschedcl.cpp#3 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocvirtual.cpp#75 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocvirtual.hpp#23 edit
... //depot/stg/opencl/drivers/opencl/runtime/platform/command.cpp#94 edit
... //depot/stg/opencl/drivers/opencl/runtime/platform/command.hpp#92 edit
... //depot/stg/opencl/drivers/opencl/runtime/utils/flags.hpp#311 edit
SWDEV-189453 - [Navi10][OpenCl][x32][Converter] Process hang
- Use the argument size from the caller. With the LC path on 32 bit, the two sizes differ, and the runtime has to use the caller's size, which matches the host bitness, because the optimized path updates 32-bit values only.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#30 edit
SWDEV-132899 - [OCL][GFX10] 70 subtests of Conformance Mipmaps (clCopyImage) test failed for image type 1Darray
This is the follow up for CL#1517501
The copyImage1DA blit kernel uses the image2d_array_t type for src/dst images. On gfx10, the number of arrays/layers is expected in the Z component for a 2Darray image, so a swap is required for 1Darray images when we use a 2Darray image for the image copy. copyImage1DA has code for swapping the z and y components as follows:
    if (srcOrigin.w != 0) {
        coordsSrc.z = coordsSrc.y;
        coordsSrc.y = 0;
    }
    if (dstOrigin.w != 0) {
        coordsDst.z = coordsDst.y;
        coordsDst.y = 0;
    }
So, to use this path, force the w component to 1 for src and dst images on gfx10 if the image type is 1Darray.
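A minimal C++ sketch of the swap described above, written as host code rather than as the actual OpenCL blit kernel. The `Int4` struct, `remapCoords1DArray`, `prepareOrigin`, and the two boolean flags are hypothetical names introduced only for illustration; they are not taken from palblit.cpp.

```cpp
#include <cassert>

// Hypothetical stand-in for the int4 coordinate vectors used by the blit kernel.
struct Int4 {
  int x, y, z, w;
};

// Mirrors the swap quoted above: when origin.w is non-zero, the layer index
// that a 1Darray image keeps in Y is moved into Z, where a 2Darray view on
// gfx10 expects it.
inline Int4 remapCoords1DArray(Int4 coords, const Int4& origin) {
  if (origin.w != 0) {
    coords.z = coords.y;
    coords.y = 0;
  }
  return coords;
}

// Hypothetical host-side helper: force w to 1 for 1Darray images on gfx10 so
// that the kernel takes the swap path. Both flags are illustrative assumptions.
inline Int4 prepareOrigin(Int4 origin, bool isGfx10, bool is1DArray) {
  if (isGfx10 && is1DArray) {
    origin.w = 1;
  }
  return origin;
}
```

With `w` forced to 1, a coordinate such as (x=5, y=3) is remapped so that the layer index 3 ends up in Z and Y becomes 0; with `w` left at 0 the coordinate passes through unchanged.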
ReviewRequestURL = http://ocltc.amd.com/reviews/r/16538/diff/
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#28 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Remove mapping of some internal CL formats in PAL backend, since it shouldn't need them anymore.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#27 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldefs.hpp#43 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Use SDMA staging transfers for data upload if pinning fails. Fixes a HIP failure in a test that uses the code segment data for upload.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#26 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Following CL#1552596. Make sure virtual GPU is set for the internal allocations before the create() call, since the deferred alloc is disabled.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#128 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.cpp#416 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.hpp#144 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#22 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.cpp#96 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.hpp#51 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/rocm/rocblit.cpp#21 edit
SWDEV-151739 - [CQE OCL][DTB][Perf][QR][DTB-BLOCKER][VEGA10] Upto 18% performance drop observed while running Video Composition test sub test of Compubench due to faulty CL#1544622
- Implement customized TS tracking for managed buffers. The common TS tracking mechanism saves the event of the last command, assuming SDMA and compute operations occur in order, but that is not the case for managed buffers. Also, a managed buffer doesn't have to validate the TS for the parent resource.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#21 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palconstbuf.cpp#11 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palconstbuf.hpp#9 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palmemory.hpp#6 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palresource.hpp#22 edit
SWDEV-132899 - [OCL][GFX10] OCLCreateImage[3] fails for the image type 1Darray
Issue: The FillImage blit kernel does not work properly on gfx10 if the image type is 1Darray (i.e., it only fills the first slice/layer and ignores the rest of the layers when the number of layers is > 1).
Root cause: gfx10 HW expects the number of layers in the Z component.
Fix: Swap the Y and Z components in the image blit kernels if the image type is 1Darray on gfx10+.
ReviewBoardURL = http://ocltc.amd.com/reviews/r/14281/diff/
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#17 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palsettings.cpp#42 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palsettings.hpp#14 edit
SWDEV-129129 - [CQE OCL][Vega vs Fiji] Upto 12% Performance drop observed on VEGA10 compared to FIJI while running BlackMagic Davinci Resolve
The app creates/destroys hundreds of resources each frame. The PAL path was removing the destroyed resources from the resident list, although the resources were kept in the cache. This change does the following:
- Switch TS tracking from a map in VirtualGPU to resource
- Don't remove references until the actual memory destruction
- Add a residency threshold to avoid OS resident/eviction calls
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/blit.hpp#5 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#14 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.hpp#6 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldefs.hpp#19 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldevice.cpp#50 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/paldevice.hpp#17 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palkernel.cpp#35 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palkernel.hpp#13 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palmemory.cpp#14 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palprogram.cpp#46 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palprogram.hpp#19 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palresource.cpp#30 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palresource.hpp#13 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.cpp#52 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palvirtual.hpp#28 edit
SWDEV-107226 - [SDI] SDISpeedTest Corruption for OCL GPU to SDI RGBA
- A single-step copy using SDMA to a remote SDI buffer seems to cause corruption. This fix is a workaround that does the transfer via a staging buffer, which seems to fix the corruption. The issue is under investigation.
ReviewBoardURL = http://ocltc.amd.com/reviews/r/11882/diff/
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#125 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#12 edit
SWDEV-101448 - [CQE OCL][Brahma][PERF][QR] ~21% perf drop is observed with lulesh-cl subtest of ComputeApps tests : Faulty CL # 1306133
- Use the logic for transfer size before CL#1306133
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#124 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#10 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Improve image fill performance with multiple writes in a single thread. The current split has 3 regions.
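The general idea of several writes per thread can be sketched in C++ as a serial emulation. This is only an illustration under assumptions: `workItemsFor`, `fillImage`, and the `pixelsPerThread` parameter are hypothetical names, the image is flattened to a 1D array, and the three-region split mentioned above is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of work-items needed when each one writes up to pixelsPerThread pixels.
inline std::size_t workItemsFor(std::size_t pixelCount, std::size_t pixelsPerThread) {
  if (pixelsPerThread == 0) return 0;  // nothing can be scheduled
  return (pixelCount + pixelsPerThread - 1) / pixelsPerThread;
}

// Serial emulation of the fill: the outer loop plays the role of the global
// work-item id, the inner loop is the run of writes done by a single thread.
inline void fillImage(std::vector<unsigned>& pixels, unsigned value,
                      std::size_t pixelsPerThread) {
  const std::size_t items = workItemsFor(pixels.size(), pixelsPerThread);
  for (std::size_t gid = 0; gid < items; ++gid) {
    const std::size_t begin = gid * pixelsPerThread;
    const std::size_t end = std::min(begin + pixelsPerThread, pixels.size());
    for (std::size_t i = begin; i < end; ++i) {
      pixels[i] = value;
    }
  }
}
```

Fewer work-items are launched for the same image, and the last work-item is clamped so it never writes past the end.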
Affected files ...
... //depot/stg/opencl/drivers/opencl/library/common.hsa/src/blitKernels.cl#4 edit
... //depot/stg/opencl/drivers/opencl/library/common/src/blitKernels.cl#4 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#123 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.hpp#40 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#8 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.hpp#4 edit
SWDEV-101206 - [CQE OCL][Perf][G][QR] Upto ~9% Performance drop observed while running Video Composition subtest of Compubench; Faulty CL#1306133
- Use the original logic without DMA flush. Flush on staging write helps with a blocking op only, but currently VDI doesn't have that information.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#122 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#7 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Update staging copy path with a flush so CPU copy and SDMA transfer could run asynchronously.
- Tune chunk size for transfers
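A staging copy split into chunks can be sketched as follows. This is a simplified, single-threaded illustration, not the runtime's code: `stagedUpload` and its parameters are hypothetical, `dst` is ordinary host memory standing in for device memory, and the second `memcpy` is a placeholder for the SDMA transfer and flush.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical chunked staging upload. Returns the number of chunks used.
// In a real backend the DMA for chunk N would be submitted and flushed so it
// can run while the CPU fills the staging area for chunk N + 1.
inline std::size_t stagedUpload(const unsigned char* src, unsigned char* dst,
                                std::size_t size, std::size_t chunkSize) {
  if (chunkSize == 0) return 0;  // avoid an endless loop on a bad chunk size
  std::vector<unsigned char> staging(chunkSize);
  std::size_t chunks = 0;
  for (std::size_t offset = 0; offset < size; offset += chunkSize) {
    const std::size_t bytes = std::min(chunkSize, size - offset);
    std::memcpy(staging.data(), src + offset, bytes);  // CPU copy into staging
    std::memcpy(dst + offset, staging.data(), bytes);  // placeholder for DMA + flush
    ++chunks;
  }
  return chunks;
}
```

The chunk size is the tuning knob: smaller chunks give more opportunity to overlap CPU and DMA work at the cost of more submissions.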
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#121 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#6 edit
SWDEV-3 - AMDGPU: Expand unaligned accesses early
Due to visit order problems, in the case of an unaligned copy
the legalized DAG fails to eliminate extra instructions introduced
by the expansion of both unaligned parts.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274397 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: d4452f8fcf496a2e19c1a1c9792f5f063f4e9703
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#3 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Target/AMDGPU/AMDGPUISelLowering.h#3 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll#3 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/CodeGen/AMDGPU/sext-in-reg-failure-r600.ll#1 add
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/CodeGen/AMDGPU/sext-in-reg.ll#3 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/CodeGen/AMDGPU/unaligned-load-store.ll#2 edit
SWDEV-3 - [msan] Fix __msan_maybe_ for non-standard type sizes.
Fix incorrect calculation of the type size for __msan_maybe_warning_N
call that resulted in an invalid (narrowing) zext instruction and
"Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed."
Only happens in very large functions (with more than 3500 MSan
checks) operating on integer types that are not power-of-two.
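The arithmetic behind a narrowing cast of this kind can be illustrated with a small C++ sketch. The function names and the truncating variant are assumptions chosen to show the failure mode; they are not copied from the upstream patch.

```cpp
#include <cassert>

// Smallest k such that (1u << k) >= value, for value >= 1.
inline unsigned log2Ceil(unsigned value) {
  unsigned k = 0;
  while ((1u << k) < value) ++k;
  return k;
}

// Truncating variant: the byte count is rounded down before taking the log,
// so a 20-bit type is treated as 2 bytes.
inline unsigned sizeIndexTruncating(unsigned typeSizeInBits) {
  if (typeSizeInBits <= 8) return 0;
  return log2Ceil(typeSizeInBits / 8);
}

// Rounding variant: the byte count is rounded up first, so a 20-bit type is
// treated as 3 bytes and then widened to the next power of two (4 bytes).
inline unsigned sizeIndexRoundingUp(unsigned typeSizeInBits) {
  if (typeSizeInBits <= 8) return 0;
  return log2Ceil((typeSizeInBits + 7) / 8);
}

// Width in bits of the callback argument selected by a size index
// (index 0 -> 8 bits, 1 -> 16 bits, 2 -> 32 bits, 3 -> 64 bits).
inline unsigned callbackBits(unsigned sizeIndex) { return 8u << sizeIndex; }
```

For a 20-bit integer the truncating variant selects a 16-bit argument, which would require a "zero extension" from 20 to 16 bits; the rounding variant selects 32 bits, which is a valid widening. For power-of-two widths both variants agree.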
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274395 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: dcfa1b5241a1d0484ad1a67485329b1c7c13b575
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Transforms/Instrumentation/MemorySanitizer.cpp#4 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/Instrumentation/MemorySanitizer/with-call-type-size.ll#1 add
SWDEV-3 - [codeview] Don't record UDTs for anonymous structs
MSVC makes up names for these anonymous structs, but we don't (yet).
Eventually Clang should use getTypedefNameForAnonDecl() to put some name
in the debug info, and we can update the test case when that happens.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274391 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: 613f19910964eb95a63bd906b0b75d9aa20d9b06
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#5 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/DebugInfo/COFF/udts.ll#2 edit
SWDEV-3 - IR: Set TargetPrefix for some X86 and AArch64 intrinsics where it was missing
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274390 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: f0a4c116041f7c2aef7796c8b067f0947b69602d
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/include/llvm/IR/IntrinsicsAArch64.td#2 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/include/llvm/IR/IntrinsicsX86.td#2 edit
SWDEV-3 - Address two correctness issues in LoadStoreVectorizer
Summary:
GetBoundryInstruction returns the last instruction as the instruction which follows or end(). Otherwise the last instruction in the boundary set is not being tested by isVectorizable().
Partially solve reordering of instructions. More extensive solution to follow.
Reviewers: tstellarAMD, llvm-commits, jlebar
Subscribers: escha, arsenm, mzolotukhin
Differential Revision: http://reviews.llvm.org/D21934
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274389 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: 1e53a5fcec984e0f1cefe43dba3939e4b72a533f
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp#11 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/Transforms/LoadStoreVectorizer/AMDGPU/interleaved-mayalias-store.ll#2 edit
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/Transforms/LoadStoreVectorizer/X86/lit.local.cfg#1 add
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/Transforms/LoadStoreVectorizer/X86/preserve-order32.ll#1 add
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/test/Transforms/LoadStoreVectorizer/X86/preserve-order64.ll#1 add
SWDEV-3 - [PM] Preparatory cleanups to ArgumentPromotion.
This pulls some obvious changes out of http://reviews.llvm.org/D21921 to
minimize the diff.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274445 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: 197a7516a32b69da7d1243308cb8eb6c5f29de0c
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Transforms/IPO/ArgumentPromotion.cpp#2 edit
SWDEV-3 - [PM] Fix a small typo from when I ported JumpThreading
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274440 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: ea9886a5909183770b8d0baa9061150adf664b1a
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Transforms/Scalar/JumpThreading.cpp#2 edit
SWDEV-3 - [Hexagon] Create global std::map lazily.
This could of course be a simple binary search with no global state
involved at all if someone cares enough. Just don't make everyone
linking the hexagon backend pay for it on process startup and shutdown.
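One common way to build a table lazily in C++ is a function-local static, sketched below. This illustrates the general technique only and may differ from what the commit actually does; `opcodeTable`, `buildCount`, and the table contents are hypothetical.

```cpp
#include <cassert>
#include <map>

// Counts how often the table is built (present only to make the laziness visible).
inline int& buildCount() {
  static int count = 0;
  return count;
}

// Hypothetical lookup table; the keys and values are placeholders.
inline const std::map<unsigned, unsigned>& opcodeTable() {
  // Function-local static: constructed on the first call (thread-safe since
  // C++11) instead of during static initialization at process startup.
  static const std::map<unsigned, unsigned> table = [] {
    ++buildCount();
    return std::map<unsigned, unsigned>{{1u, 10u}, {2u, 20u}};
  }();
  return table;
}
```

A process that never calls `opcodeTable()` neither constructs nor destroys the map, which is the cost the description wants to avoid for users that only link the backend.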
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274437 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: b4e53350f9349677e2a0178bde5b8b0c3b743b5e
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/Target/Hexagon/MCTargetDesc/HexagonMCDuplexInfo.cpp#2 edit
SWDEV-3 - CodeGen: Use MachineInstr& in SlotIndexes.cpp, NFC
Avoid implicit conversions from iterator to pointer by preferring
MachineInstr& and using range-based for loops.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274354 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: effa4cc200078395a74decd1ae2d1e380c79a2f7
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/CodeGen/SlotIndexes.cpp#2 edit
SWDEV-3 - CodeGen: Use MachineInstr& in RegAllocFast, NFC
Use MachineInstr& instead of MachineInstr* in RegAllocFast to avoid
implicit conversions from MachineInstrBundleIterator. RAFast::spillAll
and RAFast::spillVirtReg still take iterators, since their argument may
be an end iterator from MachineBasicBlock::getFirstTerminator.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274353 91177308-0d34-0410-b5e6-96231b3b80d8
GitHash: ce5fdc00e7ed9f05c643b056d0561a8133b5438b
Affected files ...
... //depot/stg/opencl/drivers/opencl/compiler/llvm.git/lib/CodeGen/RegAllocFast.cpp#2 edit