170cc1afde
* Print KL/CL/KE events for all warps
* Fix off-by-one error in count
* Fix opCount in KE and restore CPU thread option
* Simplify count calculation
[ROCm/rccl commit: ebf7e2305e]