0af4d3623f
- Move staging buffer locks inside the staging buffer code.
- Remove the dedicated per-device completion_signal and per-device lock; instead, allocate a signal from the per-stream pool. This eliminates the lock and allows more concurrency.
- Remove the HIP_DISABLE_BIDIR_MEMCPY switch.