SWDEV-101448 - [CQE OCL][Brahma][PERF][QR] ~21% perf drop is observed with lulesh-cl subtest of ComputeApps tests : Faulty CL # 1306133
- Use the logic for transfer size before CL#1306133
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#124 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#10 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Improve image fill performance with multiple writes in a single thread. The current split has 3 regions
Affected files ...
... //depot/stg/opencl/drivers/opencl/library/common.hsa/src/blitKernels.cl#4 edit
... //depot/stg/opencl/drivers/opencl/library/common/src/blitKernels.cl#4 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#123 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.hpp#40 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#8 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.hpp#4 edit
SWDEV-101206 - [CQE OCL][Perf][G][QR] Upto ~9% Performance drop observed while running Video Composition subtest of Compubench; Faulty CL#1306133
- Use the original logic without DMA flush. Flush on staging write helps with a blocking op only, but currently VDI doesn't have that information.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#122 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#7 edit
SWDEV-79445 - OCL generic changes and code clean-up
- Update staging copy path with a flush so CPU copy and SDMA transfer could run asynchronously.
- Tune chunk size for transfers
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#121 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/pal/palblit.cpp#6 edit
SWDEV-79151 - clenqueuereadImage is slow when using a pinned buffer and a row_picth!0
- Add a check if the provided rowPitch is equal to the actual transfer width. SDMA doesn't support row/slice pitches, thus runtime still has to fall back to compute in other cases
Affected files ...
... //depot/stg/opencl/drivers/opencl/api/opencl/amdocl/cl_memobj.cpp#78 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#120 edit
... //depot/stg/opencl/drivers/opencl/runtime/platform/memory.cpp#122 edit
... //depot/stg/opencl/drivers/opencl/runtime/platform/memory.hpp#92 edit
EPR #419072 - [OpenCL2.0] Enable 16MB large on device queues
- Enable device queue creation up to 12MB. That should allow to run Intel SDK sample from the EPR that requires 6MB queue only.
- Currently a queue with >12.5MB size has a significant performance degradation. Thus the current max possible is 12MB. In general it's preferable to use the queue size more suitable for the task, rather than max possible.
Affected files ...
... //depot/stg/opencl/drivers/opencl/library/hsa/hsail/src/devenq/schedule.cl#10 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#115 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.hpp#38 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpudefs.hpp#123 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpudevice.cpp#517 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpusched.hpp#17 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.cpp#372 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.hpp#131 edit
EPR #415638 - Improve APU performance
- Force remote allocation of local and persistent memory to Remote from RemoteUSWC:
- Use gpu copy for remote/pinned image/buffer.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#114 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuresource.cpp#211 edit
ECR #304775 - Optimize oclBandwidthTest from nVidia SDK
- Cache pinned memory, since the benchmark sends the same transfer in a single batch. Thus we could avoid pin/unpin
- Swap SDMA engine allocation order. Blit manager allocates a queue on device, thus the first app queue was getting the paging second SDMA.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#112 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.hpp#37 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.cpp#339 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.hpp#121 edit
ECR #304775 - Optimization for rectangular copies(Part2). Due to HW restriction of 14bits for src and dst pitch, its advantageous to choose optimal bpp. Higher the bpp the larger the byte pitch. This indirectly helps to reduce the number of packets for buffer copy(line by line vs a single sub_win raw packet)
ReviewBoardURL = http://ocltc.amd.com/reviews/r/5605/diff/
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#109 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuresource.cpp#191 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuresource.hpp#76 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gslbe/src/rt/GSLContext.cpp#64 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gslbe/src/rt/GSLContext.h#38 edit
ECR #304775 - Refactor code to do line by line copies for read\write Rect. This avoids taking the blit copy path which may be even slower.
ReviewBoardURL = http://ocltc.amd.com/reviews/r/5567/
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#108 edit
ECR #304775 - Device enqueuing
- Use atomic fetch for enqueue flags
- Switch to a multithreaded scheduler
- Add a workaround for Linux host_multi_queue failures. Linux has only 2 queues, but the test allocates multiple host queues and the same HW ring can be used
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#106 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpudevice.cpp#449 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpudevice.hpp#127 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuschedcl.cpp#22 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuvirtual.cpp#325 edit
ECR #304775 - Device enqueuing
- Switch to the single thread scheduler for now(the current version isn't friendly for single thread). Hopefully it's a temporary solution until synchronization issue with multithreaded scheduler will be identified.
Affected files ...
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpublit.cpp#104 edit
... //depot/stg/opencl/drivers/opencl/runtime/device/gpu/gpuschedcl.cpp#20 edit