-Move staging buffer locks inside the staging buffer code.
-Remove dedicated per-device completion_signal + per-device lock -
instead allocated signal from the per-stream pool. This elimintes
the lock and allows more concurrency.
-remove switch HIP_DISABLE_BIDIR_MEMCPY
- refactor staging buffer to operate on hsa* data structures not
hc::accelerator.
- use hsa_memory_allocate to allocate staging buffers rather than
am_alloc.
- Refactor device reset with single member function. Don't reallocate
staging buffers on reset.
- Properly track dependencies based on command type. Add new deps for
H2D and D2D rather than overloading H2D.