481a35bc59
* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x
[ROCm/rccl commit: 7965c8b53c]