7965c8b53c
* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x