hipEventRecord is much slower in hipclang/vdi
- Make sure default streams don't sync with each other.
- Add null stream into the list of default streams.
- Code clean-up to simplify queue look-up.
Change-Id: I36e1fc8d86a600e3dce806694d95d146ed8afd03
HIPPerfDispatchSpeed disparity between HIP/HCC vs HIP/VDI
Insert a wait marker command in the default stream only when
HIP has pending operations on other async streams.
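
The condition above can be sketched roughly as follows. This is a
minimal standalone model, not the actual runtime code: Stream,
anyPending and submitToDefault are hypothetical names introduced only
for illustration, and a "pending" counter stands in for real command
tracking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a stream: it only tracks in-flight commands.
struct Stream {
    std::size_t pending = 0;            // submitted but not yet completed
    std::vector<std::string> commands;  // submission log, for illustration

    void submit(const std::string& name) {
        commands.push_back(name);
        ++pending;
    }
    void completeAll() { pending = 0; }
};

// Returns true when at least one async stream still has work in flight.
bool anyPending(const std::vector<Stream*>& asyncStreams) {
    for (const Stream* s : asyncStreams) {
        assert(s != nullptr);
        if (s->pending > 0) return true;
    }
    return false;
}

// Submits a command to the default stream. A wait marker is inserted
// first only if some async stream has pending work; otherwise the
// marker is skipped. Returns true if a marker was inserted.
bool submitToDefault(Stream& def,
                     const std::vector<Stream*>& asyncStreams,
                     const std::string& name) {
    const bool needMarker = anyPending(asyncStreams);
    if (needMarker) def.submit("wait_marker");
    def.submit(name);
    return needMarker;
}
```

In this model a dispatch loop that only uses the default stream never
pays for a marker, which is the case the dispatch-speed test exercises.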
Change-Id: I68660a54867fab7571ba57eb1df5feb1bca1c61a
~45% to 50% performance drop on rocBLAS_int8 test
Enable the cudaSetDeviceFlags() API call. Use active wait by default
for all devices.
Change-Id: Ifc2ebe3dd9b0aa3fdbfbc9cb5c2cd8b3b726124f