Jonathan Kim 62f3f250ce Optimize and fix SDMA gang copies
Optimizations include:
- Gang greedily by placing gang leaders on the first D2D SDMA blit
context to avoid deadlocking with other gang leaders and items.  This
is acceptable because oversubscription cannot be avoided when there is
only one xGMI link anyway, so treat all xGMI links as a single pipe
for ganging.
- Non-leader gang items don't have to poll on dependency signals, which
opens up more non-blocking SDMA channels.
- Unlock the gang lock when gangs are not needed.
- Change the gang factor lookup from a vector of pairs to a map, and
register all GPUs in the lookup regardless of link type, so that we can
take advantage of the O(log N) direct key/value lookup time.
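
The map-based lookup in the last item could look roughly like the sketch
below. Everything here is illustrative: GangFactorTable, NodeId, Register
and Lookup are hypothetical names, and the fallback factor of 1 for an
unregistered peer is an assumption, not something taken from the change.

```cpp
#include <cstdint>
#include <map>

// Hypothetical names throughout; these are not the runtime's actual types.
using NodeId = std::uint32_t;

class GangFactorTable {
 public:
  // Register every peer GPU, regardless of link type, so that Lookup()
  // never needs a separate code path for non-xGMI peers.
  void Register(NodeId peer, std::uint32_t factor) { factors_[peer] = factor; }

  // std::map::find is O(log N) in the number of registered peers.
  // Assumption: an unregistered peer falls back to 1 (no ganging).
  std::uint32_t Lookup(NodeId peer) const {
    auto it = factors_.find(peer);
    return it == factors_.end() ? 1u : it->second;
  }

 private:
  std::map<NodeId, std::uint32_t> factors_;
};
```

Compared with scanning a vector of pairs, the caller does a single keyed
find instead of a linear search per copy.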

Fixes include:
- HSA_PAGE_SIZE_4KB was the wrong macro to use for the gang size limit.
As a result, small copies ended up ganging and hitting the latency
limit.  Use a hardcoded 4096 bytes instead.
- Cap the auxiliary gang factor to the number of non-xGMI SDMA engines.
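
As a rough sketch of how the two fixes above might combine, assuming
hypothetical names (kGangSizeLimitBytes, EffectiveGangFactor) and
assuming that copies at or below the limit are the ones that skip
ganging (the exact comparison is not stated in this message):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical constant: the hardcoded 4096-byte gang size limit.
constexpr std::size_t kGangSizeLimitBytes = 4096;

// Returns how many SDMA engines a copy would be split across.
// Assumption: copies at or below the limit are not ganged at all.
std::uint32_t EffectiveGangFactor(std::size_t copy_bytes,
                                  std::uint32_t requested_factor,
                                  std::uint32_t non_xgmi_engines) {
  if (copy_bytes <= kGangSizeLimitBytes) return 1;
  // Cap the factor to the number of non-xGMI SDMA engines available.
  std::uint32_t capped = std::min(requested_factor, non_xgmi_engines);
  // Never return 0, even if no engines are reported.
  return std::max<std::uint32_t>(capped, 1);
}
```

With these assumptions, a 4096-byte copy stays on a single engine, while
an 8192-byte copy requesting a factor of 4 on a device with 2 non-xGMI
engines is split across 2.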

Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
2024-01-25 10:42:27 -05:00
2017-08-28 12:57:11 -05:00
Description
No description provided
282 MiB
Languages
C++ 67.5%
C 20.6%
Python 6.6%
CMake 3.4%
Shell 0.6%
Other 1.1%