* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x

[ROCm/rccl commit: 7965c8b53c]