* Print KL/CL/KE events for all warps
* Fix count off-by-one issue
* Fix opCount in KE and restore CPU thread option
* Simplify count calculation

[ROCm/rccl commit: ebf7e2305e]