Support hipLaunchCooperativeKernelMultiDevice()
- Add validation logic for multi-GPU launches to pass a CUDA test
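
Rough sketch of the kind of checks meant here, not the actual runtime
code. It follows the rules the CUDA documentation gives for
cudaLaunchCooperativeKernelMultiDevice() as I understand them: no default
stream, at most one launch per device, and the same kernel, grid, block and
shared memory size on every device. Dim3, LaunchParams, Status and
validateMultiDeviceLaunch() are hypothetical names used only for
illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-ins for the HIP types; not the runtime's definitions.
struct Dim3 { unsigned x, y, z; };

inline bool operator==(const Dim3& a, const Dim3& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

struct LaunchParams {
    const void* func;       // kernel to launch
    Dim3 gridDim;           // grid dimensions
    Dim3 blockDim;          // block dimensions
    std::size_t sharedMem;  // dynamic shared memory in bytes
    int deviceId;           // device owning the stream (hypothetical field)
    bool isNullStream;      // true if the entry targets the default stream
};

enum class Status { Success, InvalidValue };

// Returns Success only if the whole launch list passes every check.
Status validateMultiDeviceLaunch(const std::vector<LaunchParams>& list) {
    if (list.empty()) return Status::InvalidValue;

    const LaunchParams& first = list.front();
    std::set<int> seenDevices;

    for (const LaunchParams& p : list) {
        // Each entry needs a kernel and an explicit (non-default) stream.
        if (p.func == nullptr || p.isNullStream) return Status::InvalidValue;

        // No two entries may target the same device.
        if (!seenDevices.insert(p.deviceId).second) return Status::InvalidValue;

        // Kernel and launch configuration must match across devices.
        if (p.func != first.func || !(p.gridDim == first.gridDim) ||
            !(p.blockDim == first.blockDim) || p.sharedMem != first.sharedMem) {
            return Status::InvalidValue;
        }
    }
    return Status::Success;
}
```

Rejecting the whole list before anything is enqueued keeps a failed
multi-device launch from leaving some devices with work already submitted.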
Change-Id: Iccca7fde43493fc3bc6685512d39202271ae3e92
~45% to 50% performance drop on rocBLAS_int8 test
Enable cudaSetDeviceFlags() API call. Use active wait by default
for all devices.
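
Illustrative sketch only, assuming the scheduling bits passed to
cudaSetDeviceFlags() are mapped to a host wait policy and that "no
preference" falls back to active wait. The constants are local copies of
the scheduling flag values as I understand them from the HIP headers;
WaitMode and waitModeFromFlags() are hypothetical names, not runtime APIs.

```cpp
#include <cstdint>

// Local copies of the scheduling flag values (assumed, see note above).
constexpr std::uint32_t kScheduleAuto         = 0x0;
constexpr std::uint32_t kScheduleSpin         = 0x1;
constexpr std::uint32_t kScheduleYield        = 0x2;
constexpr std::uint32_t kScheduleBlockingSync = 0x4;
constexpr std::uint32_t kScheduleMask         = 0x7;

enum class WaitMode { Active, Yield, Blocking };

// Hypothetical helper: picks the host wait policy from the device flags.
// Returns false, leaving `out` untouched, if several scheduling bits are set.
bool waitModeFromFlags(std::uint32_t flags, WaitMode& out) {
    switch (flags & kScheduleMask) {
        case kScheduleAuto:  // no preference given: default to active wait
        case kScheduleSpin:
            out = WaitMode::Active;
            return true;
        case kScheduleYield:
            out = WaitMode::Yield;
            return true;
        case kScheduleBlockingSync:
            out = WaitMode::Blocking;
            return true;
        default:
            return false;
    }
}
```

Active wait spends CPU cycles polling but avoids wake-up latency on short
kernels, which is the trade-off behind making it the default.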
Change-Id: Ifc2ebe3dd9b0aa3fdbfbc9cb5c2cd8b3b726124f