- deviceReset would lose track of the default stream, so subsequent
synchronization calls might not actually synchronize.
- Also deviceReset now correctly frees streams.
- fix waits in P2P staging copy: first phase (Device0-to-Staging) must
wait for second phase (Staging-to-Device1) to finish draining the
buffer.
- add P2P staging buffer copy.
- If copy device does not have sufficient access permissions, fall back
to staging buffer.
- improve docs for which copy device is used.
- NULL stream was waiting for itself to be empty before each command.
- Force "blocking" streams to wait for NULL to empty. This was missing
before.
- async copy was disabling itself via trueAsync=false for common cases.
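
A minimal host-only sketch of the staging-copy wait rule described
above: before phase 1 refills a staging buffer, phase 2 must have
drained it. All names here (StagingBuffer, drain, stagedCopy) are
hypothetical and are not taken from the HIP sources. Plain memcpy
stands in for the device copies, and the "wait" is modeled by draining
synchronously.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of one staging buffer (not the HIP data structure).
struct StagingBuffer {
    std::vector<char> data;
    std::size_t pendingBytes = 0;  // written by phase 1, not yet drained by phase 2
    char* drainTarget = nullptr;   // where phase 2 must deliver the pending bytes
};

// Phase 2 (Staging-to-Device1 in the real copy): empty the buffer.
inline void drain(StagingBuffer& buf) {
    std::memcpy(buf.drainTarget, buf.data.data(), buf.pendingBytes);
    buf.pendingBytes = 0;
}

// Copy `size` bytes from src to dst through `numBuffers` staging buffers.
inline void stagedCopy(char* dst, const char* src, std::size_t size,
                       std::size_t chunkSize, std::size_t numBuffers = 2) {
    assert(chunkSize > 0 && numBuffers > 0);
    std::vector<StagingBuffer> bufs(numBuffers);
    for (auto& b : bufs) b.data.resize(chunkSize);

    std::size_t offset = 0;
    for (std::size_t i = 0; offset < size; ++i) {
        StagingBuffer& buf = bufs[i % numBuffers];

        // The wait this fix is about: phase 1 may not overwrite a buffer
        // that phase 2 has not finished draining.
        if (buf.pendingBytes != 0) drain(buf);

        // Phase 1 (Device0-to-Staging in the real copy): fill the buffer.
        const std::size_t n = std::min(chunkSize, size - offset);
        std::memcpy(buf.data.data(), src + offset, n);
        buf.pendingBytes = n;
        buf.drainTarget = dst + offset;
        offset += n;
    }

    // Flush whatever is still sitting in the staging buffers.
    for (auto& b : bufs)
        if (b.pendingBytes != 0) drain(b);
}
```

In a real asynchronous implementation the drain check would be a wait
on a completion signal rather than an inline copy.
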
Refactor:
- rename _null_stream to _default_stream.
- move some null-stream sync functions into defaultSync and make it a
device member function.
- Move staging buffer locks inside the staging buffer code.
- Remove dedicated per-device completion_signal + per-device lock;
instead allocate the signal from the per-stream pool. This eliminates
the lock and allows more concurrency.
- remove switch HIP_DISABLE_BIDIR_MEMCPY
- refactor staging buffer to operate on hsa* data structures not
hc::accelerator.
- use hsa_memory_allocate to allocate staging buffers rather than
am_alloc.
- Refactor device reset into a single member function. Don't reallocate
staging buffers on reset.
- Properly track dependencies based on command type. Add new deps for
H2D and D2D rather than overloading H2D.
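
A minimal sketch of per-command-type dependency tracking, under the
assumption that consecutive commands of the same type on a stream are
already ordered and only a change of type needs an explicit wait. The
names (CommandType, StreamState, needsDependency) are hypothetical and
are not taken from the HIP sources; the point is only that each copy
direction gets its own type instead of sharing one.

```cpp
// Hypothetical command kinds; each copy direction is tracked separately.
enum class CommandType { None, Kernel, CopyH2D, CopyD2H, CopyD2D };

// Hypothetical per-stream bookkeeping: the type of the last enqueued command.
struct StreamState {
    CommandType last = CommandType::None;
};

// Record `next` on the stream and report whether it must wait on the
// previously enqueued command. Assumption: same-type commands are
// implicitly ordered, so only a type change creates a dependency.
inline bool needsDependency(StreamState& s, CommandType next) {
    const bool wait = (s.last != CommandType::None) && (s.last != next);
    s.last = next;
    return wait;
}
```

With this model a kernel followed by a D2D copy reports a dependency,
while two back-to-back H2D copies do not.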