eb0b1387b721ec78a7b9abd48031a8540c4486d2
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
[ROCm/rccl commit: 12f51ba8bf]
Описание
No description provided
Languages
C++
67.5%
C
20.6%
Python
6.6%
CMake
3.4%
Shell
0.6%
Разное
1.1%