1ace5d05ed
* Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c.
* Decrease NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8
[ROCm/rccl commit: 3f7c08648f]