Summary: 1. Remove the noinline attribute from AllReduceThreeKernel; 2. Change AUTOUNROLL for the tree functions to 1 or 2. Combining 1 and 2 reduces scratch usage from 1256 to 952. [ROCm/rccl commit: eec319038e]