~45% to 50% performance drop on rocBLAS_int8 test

Enable the cudaSetDeviceFlags() API call. Use active wait by default
for all devices.

Change-Id: Ifc2ebe3dd9b0aa3fdbfbc9cb5c2cd8b3b726124f