[Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization
* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)
* Remove need for FUNC_INDEX_TOTAL
* Add pipeline field to device function key construction logic
* Avoid unneeded codegen for LL/LL64 kernels
* Modify conditions and add pipeline dtypes env
* Optimize selection for both gfx942 and gfx950
* Increase pipeline bitfield width
* Use __forceinline__ for all device functions
* Realign reduceCopy with original form
* Add opt-out option to enable perf debugs
* Remove force-reduce-pipelining option from README
* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
@@ -64,7 +64,6 @@ RCCL build & installation helper script
     -t|--tests_build Build rccl unit tests, but do not run
     --time-trace Plot the build time of RCCL (requires `ninja-build` package installed on the system)
     --verbose Show compile commands
-    --force-reduce-pipeline Force reduce_copy sw pipeline to be used for every reduce-based collectives and datatypes
 ```
 
 By default, RCCL builds for all GPU targets defined in `DEFAULT_GPUS` in `CMakeLists.txt`. To target specific GPU(s), and potentially reduce build time, use `--amdgpu_targets` as a `;` separated string listing GPU(s) to target.