e0874ab540
Add support for both a CUDA-compatible implementation and a faster hcc
implementation, with a test
Change-Id: I79a22344f458391d7dffac5f147619a542e97e4e
[ROCm/hip commit: 8264d5d6bd]