d8a06589c9
Summary:
1. remove the noinline attribute for AllReduceTreeKernel;
2. change AUTOUNROLL for tree functions to 1 or 2;
Combining 1 and 2 reduces scratch usage from 1256 to 952.
[ROCm/rccl commit: eec319038e]