From 9bc0e13ef0a7f4a232557d8f7cc82732d9434626 Mon Sep 17 00:00:00 2001
From: Mark Santesson <msantesson@nvidia.com>
Date: Tue, 2 Sep 2025 13:21:14 -0700
Subject: [PATCH] NCCL 2.28.3-1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Device API (Experimental)
 * Introduces device-side APIs to integrate NCCL communication directly into application kernels.
 * Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
 * Supports Multimem for hardware multicast using NVLink SHARP.
 * Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
 * Introduces device communicators created using ncclDevCommCreate.
 * Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
 * Experimental APIs - signatures and functionality may evolve in future releases.
 * No ABI compatibility is guaranteed — applications must be recompiled with each new NCCL release.

Symmetric memory improvements
 * Support for aggregating symmetric operations using ncclGroupStart/End APIs.
 * Reimplement symmetric kernels using device API.

New Host APIs
 * Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.
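
To make the data movement of the three new collectives concrete, here is a minimal single-process sketch in plain C that models each rank's buffer as one row of an array. It does not call NCCL: the names `model_alltoall`, `model_scatter`, `model_gather` and the sizes `NRANKS` and `COUNT` are illustrative assumptions, and the block layout follows the usual MPI-style convention, which these notes do not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative model only: NRANKS ranks simulated in one process.
 * These helpers are hypothetical and are not part of the NCCL API. */
#define NRANKS 4
#define COUNT  2   /* elements exchanged per rank pair */

/* All-to-all: block j of rank i's send buffer lands in block i of
 * rank j's receive buffer. */
static void model_alltoall(int send[NRANKS][NRANKS * COUNT],
                           int recv[NRANKS][NRANKS * COUNT]) {
  for (int src = 0; src < NRANKS; src++)
    for (int dst = 0; dst < NRANKS; dst++)
      memcpy(&recv[dst][src * COUNT], &send[src][dst * COUNT],
             COUNT * sizeof(int));
}

/* Scatter: block r of the root's send buffer is delivered to rank r. */
static void model_scatter(const int rootSend[NRANKS * COUNT],
                          int recv[NRANKS][COUNT]) {
  for (int r = 0; r < NRANKS; r++)
    memcpy(recv[r], &rootSend[r * COUNT], COUNT * sizeof(int));
}

/* Gather: rank r's send buffer becomes block r of the root's receive buffer. */
static void model_gather(int send[NRANKS][COUNT],
                         int rootRecv[NRANKS * COUNT]) {
  for (int r = 0; r < NRANKS; r++)
    memcpy(&rootRecv[r * COUNT], send[r], COUNT * sizeof(int));
}
```

In this model, gather is the inverse of scatter, and all-to-all behaves like every rank running a scatter at once.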

CE (Copy Engine) Collectives
 * Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
 * Free up SM capacity for the application to do computation at the same time.
 * To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, and ncclScatter, register the buffers in symmetric windows and set the NCCL_CTA_POLICY_ZERO flag in the communicator's ncclConfig_t.

NCCL Inspector Plugin
 * Introduces an Inspector plugin for always-on performance monitoring.
 * Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
 * Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
 * Easy to enable through the NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE environment variables.

CMake support (Experimental)
 * Adds a CMake build system as an alternative to existing Makefiles.
 * Known issues: pkg.build and Device API currently do not work with CMake.
 * The known issues will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
 * SM overhead is decreased by 50% with this improvement.
 * This may cause some performance drop on Blackwell because of the reduced SM usage.
 * If the extra SM capacity is not desired, two options restore the previous behavior: 1) set the environment variables NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32; 2) override the max CTA count to 32 in the communicator config.
 * Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.
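
The first option above can be sketched as a small C helper that sets both variables from inside the application. The helper name `pin_nccl_cta_count` is hypothetical, and the sketch assumes it runs before the first NCCL call in the process, on the expectation that NCCL reads these variables during initialization.

```c
#define _POSIX_C_SOURCE 200112L   /* for setenv() under strict C modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: pin NCCL's CTA range to a single value through the environment.
 * Hypothetical helper, not part of NCCL. Returns 0 on success, -1 on error. */
static int pin_nccl_cta_count(int ctas) {
  char value[16];
  if (ctas <= 0) return -1;
  snprintf(value, sizeof(value), "%d", ctas);
  /* overwrite=1 replaces any value inherited from the launching shell */
  if (setenv("NCCL_MIN_CTAS", value, 1) != 0) return -1;
  if (setenv("NCCL_MAX_CTAS", value, 1) != 0) return -1;
  return 0;
}
```

Calling `pin_nccl_cta_count(32)` early in `main` corresponds to option 1. Option 2 would instead set the CTA limits in the `ncclConfig_t` passed at communicator creation, which is not shown here because it requires the NCCL headers.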

Plugins
 * Network
   * App-aware Network plugin. NCCL passes information about communication operations to be executed on the network end point. This allows for better tuning of network end points and their use in the plugins.
   * Improve handling of physical and virtual network devices and load/unload.
   * Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
   * Add Multi-Request Net API. It lets NCCL anticipate multiple send/recv requests and optimize for them. See the maxMultiRequestSize field in ncclNetProperties_v11_t.
 * Profiler
   * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
   * Add Inspector Profiler Plugin (see section above).
   * Add a hook to Google’s CoMMA profiler on GitHub.
 * Tuner
   * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
   * Add NVL Domain Information API.
 * Support multiple plugin types from a single shared object.
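
The per-communicator context introduced with network plugin version 11 can be illustrated with a self-contained sketch. The types and functions below (`exResult_t`, `exCommConfig_t`, `exContext_t`, `exInit`, `exListen`, `exFinalize`) are stand-ins invented for this example so that it compiles without the plugin headers. A real plugin would use `ncclResult_t` and `ncclNetCommConfig_v11_t`, and the full v11 signatures carry more arguments than shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in types so the sketch compiles on its own (assumptions, not NCCL types). */
typedef enum { exSuccess = 0, exError = 1 } exResult_t;
typedef struct { int trafficClass; } exCommConfig_t;

/* Hypothetical per-communicator state owned by the plugin. */
typedef struct {
  uint64_t commId;
  int trafficClass;
  int nListens;
} exContext_t;

/* Mirrors the leading arguments of the v11 init: return an opaque context
 * tied to one communicator. */
static exResult_t exInit(void** ctx, uint64_t commId, const exCommConfig_t* config) {
  exContext_t* c;
  if (ctx == NULL) return exError;
  c = (exContext_t*)calloc(1, sizeof(*c));
  if (c == NULL) return exError;
  c->commId = commId;
  c->trafficClass = config ? config->trafficClass : -1;  /* -1: undefined */
  *ctx = c;
  return exSuccess;
}

/* listen receives the context first, so no global state lookup is needed. */
static exResult_t exListen(void* ctx, int dev, void** listenComm) {
  exContext_t* c = (exContext_t*)ctx;
  if (c == NULL || dev < 0 || listenComm == NULL) return exError;
  c->nListens++;
  *listenComm = c;  /* placeholder: a real plugin would allocate a listen object */
  return exSuccess;
}

/* Assumed counterpart of init: release the per-communicator state. */
static exResult_t exFinalize(void* ctx) {
  free(ctx);
  return exSuccess;
}
```

Keeping the configuration inside the context is one way for later calls such as `connect` to reach the traffic class without receiving the config again.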

New Parameterization and ncclConfig changes:
 * Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
 * Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
 * Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as an opt-in.
 * Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
 * Enable PxN over C2C by default
   * PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLink, C2C, and PCIe.
   * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
 * Allow FP8 support for non-reductive operations on pre-sm90 devices. (See https://github.com/pytorch/pytorch/pull/151594#discussion_r2135777776)
 * Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
 * Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (https://github.com/NVIDIA/nccl/issues/1798)
 * Modernize mutex management. Convert to std::mutex and std::lock_guard.
 * Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
 * Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
 * NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
 * Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
 * Move libvirt bridge interfaces to the end of the list of possible interfaces so that they are considered last. These interfaces are usually virtual bridges that relay traffic to containers running on the host; they cannot carry traffic to a remote node and are therefore unsuitable.
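
The two socket interface rules in this list (skip interfaces without IFF_RUNNING, consider libvirt bridges last) can be sketched as a filter over a list of interfaces. This is an illustration, not NCCL's implementation: the `exIface_t` record and the helper names are hypothetical, and recognising libvirt bridges by the `virbr` name prefix is an assumption.

```c
#define _DEFAULT_SOURCE   /* expose IFF_* flags under strict C modes on glibc */
#include <assert.h>
#include <net/if.h>       /* IFF_UP, IFF_RUNNING */
#include <stddef.h>
#include <string.h>

/* Illustrative record; NCCL's internal representation is not shown here. */
typedef struct {
  const char* name;
  unsigned int flags;     /* as reported in ifaddrs.ifa_flags */
} exIface_t;

/* An interface is a candidate only when the IFF_RUNNING bit is set. */
static int exIsRunning(const exIface_t* ifc) {
  return (ifc->flags & IFF_RUNNING) != 0;
}

/* Assumption: libvirt bridges are recognised by the "virbr" name prefix. */
static int exIsLibvirtBridge(const exIface_t* ifc) {
  return strncmp(ifc->name, "virbr", 5) == 0;
}

/* Keeps running interfaces in their original order, with libvirt bridges
 * moved to the end. Writes to out (capacity n) and returns the number kept. */
static size_t exSelectIfaces(const exIface_t* in, size_t n, exIface_t* out) {
  size_t kept = 0;
  for (int pass = 0; pass < 2; pass++)       /* pass 0: regular, pass 1: bridges */
    for (size_t i = 0; i < n; i++)
      if (exIsRunning(&in[i]) && exIsLibvirtBridge(&in[i]) == pass)
        out[kept++] = in[i];
  return kept;
}
```

On a live system the input list could be filled from `getifaddrs`, which reports the same flag bits in `ifa_flags`.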


[ROCm/rccl commit: f1308997d0420148b1be1c24d63f19d902ae589b]
---
 projects/rccl/ext-net/README.md               |   25 +-
 projects/rccl/ext-net/example/CMakeLists.txt  |   19 +
 projects/rccl/ext-net/example/nccl/net.h      |   10 +-
 .../rccl/ext-net/example/nccl/net_device.h    |    5 +-
 projects/rccl/ext-net/example/nccl/net_v10.h  |    3 +-
 projects/rccl/ext-net/example/nccl/net_v11.h  |  120 ++
 projects/rccl/ext-net/example/nccl/net_v9.h   |    3 +-
 projects/rccl/ext-net/example/plugin.c        |  101 +-
 projects/rccl/ext-profiler/README.md          |  121 +-
 .../rccl/ext-profiler/example/CMakeLists.txt  |   34 +
 projects/rccl/ext-profiler/example/Makefile   |   16 +-
 projects/rccl/ext-profiler/example/README.md  |  180 +-
 projects/rccl/ext-profiler/example/event.c    |   30 -
 projects/rccl/ext-profiler/example/event.h    |  155 +-
 .../rccl/ext-profiler/example/nccl/profiler.h |   33 +-
 .../ext-profiler/example/nccl/profiler_v5.h   |  152 ++
 .../example/{plugin.c => plugin.cc}           |  264 ++-
 projects/rccl/ext-profiler/example/plugin.h   |    5 +-
 .../example/{print_event.c => print_event.cc} |   99 +-
 projects/rccl/ext-profiler/example/queue.h    |   50 +
 .../rccl/ext-profiler/google-CoMMA/Makefile   |   22 +
 projects/rccl/ext-profiler/inspector/Makefile |   62 +
 .../rccl/ext-profiler/inspector/README.md     |  216 +++
 .../inspector/exporter/example/README.md      |  151 ++
 .../exporter/example/perf_summary_exporter.py |  548 ++++++
 .../exporter/example/requirements.txt         |    6 +
 .../rccl/ext-profiler/inspector/inspector.cc  | 1530 +++++++++++++++++
 .../rccl/ext-profiler/inspector/inspector.h   |  198 +++
 .../inspector/inspector_plugin.cc             |  493 ++++++
 projects/rccl/ext-profiler/inspector/json.cc  |  496 ++++++
 projects/rccl/ext-profiler/inspector/json.h   |   83 +
 .../rccl/ext-profiler/inspector/nccl/common.h |   73 +
 .../ext-profiler/inspector/nccl/profiler.h    |   85 +
 .../inspector/nccl/profiler_net.h             |   19 +
 .../ext-profiler/inspector/nccl/profiler_v1.h |  112 ++
 .../ext-profiler/inspector/nccl/profiler_v2.h |  108 ++
 .../ext-profiler/inspector/nccl/profiler_v3.h |  116 ++
 .../ext-profiler/inspector/nccl/profiler_v4.h |  127 ++
 .../ext-profiler/inspector/nccl/profiler_v5.h |  151 ++
 .../rccl/ext-profiler/inspector/nccl/types.h  |   21 +
 .../rccl/ext-profiler/inspector/version.h     |   12 +
 projects/rccl/ext-tuner/README.md             |    2 +-
 projects/rccl/ext-tuner/example/.gitignore    |   49 +
 .../rccl/ext-tuner/example/CMakeLists.txt     |   26 +
 projects/rccl/ext-tuner/example/nccl/tuner.h  |   51 +-
 projects/rccl/ext-tuner/example/plugin.c      |   36 +-
 .../rccl/ext-tuner/example/test/test_plugin.c |  178 +-
 projects/rccl/makefiles/common.mk             |    7 +-
 projects/rccl/makefiles/version.mk            |    4 +-
 projects/rccl/pkg/Makefile                    |    2 +-
 .../rccl/pkg/debian/libnccl-dev.install.in    |    2 +-
 projects/rccl/pkg/redhat/nccl.spec.in         |    4 +-
 projects/rccl/pkg/srctxz/Makefile             |    2 +-
 projects/rccl/pkg/srctxz/create_srctxz.sh.in  |   28 +-
 projects/rccl/src/CMakeLists.txt              |  180 ++
 projects/rccl/src/Makefile                    |   20 +-
 projects/rccl/src/allocator.cc                |  394 ++++-
 projects/rccl/src/bootstrap.cc                |   25 +-
 projects/rccl/src/ce_coll.cc                  |  615 +++++++
 projects/rccl/src/collectives.cc              |   42 +
 projects/rccl/src/debug.cc                    |   46 +-
 projects/rccl/src/dev_runtime.cc              |  995 +++++++++++
 projects/rccl/src/device/CMakeLists.txt       |   60 +
 projects/rccl/src/device/Makefile             |    8 +-
 projects/rccl/src/device/common.h             |   14 +-
 projects/rccl/src/device/generate.py          |   18 +-
 .../rccl/src/device/symmetric/all_gather.cuh  |  250 +--
 .../rccl/src/device/symmetric/all_reduce.cuh  |  331 ++--
 .../rccl/src/device/symmetric/generate.py     |   62 +-
 projects/rccl/src/device/symmetric/kernel.cuh |   24 +-
 .../rccl/src/device/symmetric/primitives.cuh  |  455 +----
 .../src/device/symmetric/reduce_scatter.cuh   |  251 +--
 projects/rccl/src/enqueue.cc                  |  598 ++++---
 projects/rccl/src/graph/CMakeLists.txt        |   14 +
 projects/rccl/src/graph/connect.cc            |   40 +-
 projects/rccl/src/graph/paths.cc              |   15 +-
 projects/rccl/src/graph/topo.cc               |  373 ++--
 projects/rccl/src/graph/topo.h                |   24 +-
 projects/rccl/src/graph/tuning.cc             |  173 +-
 projects/rccl/src/graph/xml.cc                |   44 +-
 projects/rccl/src/graph/xml.h                 |    7 +
 projects/rccl/src/group.cc                    |  265 +--
 projects/rccl/src/include/allocator.h         |   52 +-
 projects/rccl/src/include/bitops.h            |   29 +-
 projects/rccl/src/include/ce_coll.h           |   76 +
 projects/rccl/src/include/channel.h           |    7 +-
 projects/rccl/src/include/coll_net.h          |    3 +-
 projects/rccl/src/include/collectives.h       |    8 +-
 projects/rccl/src/include/comm.h              |   55 +-
 projects/rccl/src/include/core.h              |    1 +
 projects/rccl/src/include/cpuset.h            |   99 +-
 projects/rccl/src/include/cudawrap.h          |    6 +
 projects/rccl/src/include/debug.h             |   26 +-
 projects/rccl/src/include/dev_runtime.h       |   92 +
 projects/rccl/src/include/device.h            |   24 +-
 projects/rccl/src/include/graph.h             |    2 +
 projects/rccl/src/include/group.h             |    1 +
 projects/rccl/src/include/nccl_common.h       |   26 +-
 projects/rccl/src/include/nccl_device.h       |   15 +
 .../rccl/src/include/nccl_device/README.md    |   32 +
 projects/rccl/src/include/nccl_device/comm.h  |   10 +
 projects/rccl/src/include/nccl_device/coop.h  |  152 ++
 projects/rccl/src/include/nccl_device/core.h  |  150 ++
 .../include/nccl_device/impl/comm__funcs.h    |   10 +
 .../include/nccl_device/impl/comm__types.h    |   40 +
 .../include/nccl_device/impl/core__funcs.h    |  210 +++
 .../include/nccl_device/impl/core__types.h    |   26 +
 .../include/nccl_device/impl/ll_a2a__funcs.h  |  229 +++
 .../include/nccl_device/impl/ll_a2a__types.h  |   37 +
 .../nccl_device/impl/mem_barrier__funcs.h     |  126 ++
 .../nccl_device/impl/mem_barrier__types.h     |   46 +
 .../src/include/nccl_device/impl/ptr__funcs.h |  157 ++
 .../src/include/nccl_device/impl/ptr__types.h |   11 +
 .../rccl/src/include/nccl_device/ll_a2a.h     |   53 +
 .../src/include/nccl_device/mem_barrier.h     |   35 +
 projects/rccl/src/include/nccl_device/ptr.h   |   61 +
 .../rccl/src/include/nccl_device/utility.h    |  352 ++++
 projects/rccl/src/include/net.h               |    6 +
 projects/rccl/src/include/net_device.h        |    5 +-
 projects/rccl/src/include/nvmlwrap.h          |   21 +
 projects/rccl/src/include/nvtx.h              |  108 +-
 .../src/include/nvtx3/nvToolsExtCounters.h    |    2 +-
 .../nvtx3/nvToolsExtSemanticsCounters.h       |    2 +-
 .../include/nvtx3/nvToolsExtSemanticsScope.h  |    2 +-
 .../nvtx3/nvtxDetail/nvtxExtHelperMacros.h    |    2 +-
 .../include/nvtx3/nvtxDetail/nvtxExtImpl.h    |    2 +-
 .../nvtx3/nvtxDetail/nvtxExtImplCounters_v1.h |    2 +-
 .../nvtxDetail/nvtxExtPayloadHelperInternal.h |    2 +-
 .../nvtx3/nvtxDetail/nvtxExtPayloadTypeInfo.h |    2 +-
 .../include/nvtx3/nvtxDetail/nvtxExtTypes.h   |    2 +-
 .../rccl/src/include/nvtx_payload_schemas.h   |   23 +
 projects/rccl/src/include/plugin/nccl_net.h   |   23 +-
 .../rccl/src/include/plugin/nccl_profiler.h   |   32 +-
 projects/rccl/src/include/plugin/nccl_tuner.h |   41 +-
 .../rccl/src/include/plugin/net/net_v10.h     |    4 +-
 .../rccl/src/include/plugin/net/net_v11.h     |  188 ++
 projects/rccl/src/include/plugin/net/net_v9.h |    4 +-
 projects/rccl/src/include/plugin/plugin.h     |    2 +
 .../src/include/plugin/profiler/profiler_v5.h |  151 ++
 .../rccl/src/include/plugin/tuner/tuner_v5.h  |   87 +
 projects/rccl/src/include/profiler.h          |   36 +
 projects/rccl/src/include/proxy.h             |   11 +
 projects/rccl/src/include/register.h          |   24 +-
 projects/rccl/src/include/register_inline.h   |   17 +-
 projects/rccl/src/include/scheduler.h         |   17 +
 projects/rccl/src/include/shm.h               |    6 +
 projects/rccl/src/include/shmutils.h          |    4 +-
 projects/rccl/src/include/sym_kernels.h       |  112 ++
 projects/rccl/src/include/symmetric.h         |   90 -
 projects/rccl/src/include/transport.h         |   12 +-
 projects/rccl/src/include/utils.h             |    2 +
 projects/rccl/src/init.cc                     |  201 ++-
 projects/rccl/src/init_nvtx.cc                |   13 +
 projects/rccl/src/misc/CMakeLists.txt         |   20 +
 projects/rccl/src/misc/cudawrap.cc            |   17 +-
 projects/rccl/src/misc/gdrwrap.cc             |    5 +-
 projects/rccl/src/misc/ibvwrap.cc             |    5 +-
 projects/rccl/src/misc/mlx5dvwrap.cc          |   30 +-
 projects/rccl/src/misc/nvmlwrap.cc            |   17 +
 projects/rccl/src/misc/param.cc               |   10 +-
 projects/rccl/src/misc/shmutils.cc            |   45 +-
 projects/rccl/src/misc/socket.cc              |    8 +-
 projects/rccl/src/misc/strongstream.cc        |    9 +-
 projects/rccl/src/misc/utils.cc               |   30 +-
 projects/rccl/src/mnnvl.cc                    |    6 +-
 projects/rccl/src/nccl.h.in                   |   59 +-
 projects/rccl/src/nccl_device/CMakeLists.txt  |    9 +
 projects/rccl/src/nccl_device/core.cc         |   57 +
 projects/rccl/src/nccl_device/ll_a2a.cc       |   26 +
 projects/rccl/src/nccl_device/mem_barrier.cc  |   21 +
 projects/rccl/src/plugin/CMakeLists.txt       |   18 +
 projects/rccl/src/plugin/net.cc               |  125 +-
 projects/rccl/src/plugin/net/CMakeLists.txt   |   12 +
 projects/rccl/src/plugin/net/net_v10.cc       |  187 +-
 projects/rccl/src/plugin/net/net_v11.cc       |   31 +
 projects/rccl/src/plugin/net/net_v6.cc        |   68 +-
 projects/rccl/src/plugin/net/net_v7.cc        |   67 +-
 projects/rccl/src/plugin/net/net_v8.cc        |   67 +-
 projects/rccl/src/plugin/net/net_v9.cc        |  117 +-
 projects/rccl/src/plugin/plugin_open.cc       |   66 +-
 projects/rccl/src/plugin/profiler.cc          |  297 +++-
 .../rccl/src/plugin/profiler/CMakeLists.txt   |   11 +
 .../rccl/src/plugin/profiler/profiler_v1.cc   |   16 +-
 .../rccl/src/plugin/profiler/profiler_v2.cc   |   16 +-
 .../rccl/src/plugin/profiler/profiler_v3.cc   |   16 +-
 .../rccl/src/plugin/profiler/profiler_v4.cc   |  104 +-
 .../rccl/src/plugin/profiler/profiler_v5.cc   |   21 +
 projects/rccl/src/plugin/tuner.cc             |   30 +-
 projects/rccl/src/plugin/tuner/CMakeLists.txt |   10 +
 projects/rccl/src/plugin/tuner/tuner_v2.cc    |   14 +-
 projects/rccl/src/plugin/tuner/tuner_v3.cc    |   12 +-
 projects/rccl/src/plugin/tuner/tuner_v4.cc    |   22 +-
 projects/rccl/src/plugin/tuner/tuner_v5.cc    |   21 +
 projects/rccl/src/proxy.cc                    |   68 +-
 projects/rccl/src/ras/CMakeLists.txt          |   11 +
 projects/rccl/src/ras/ras.cc                  |    8 +-
 projects/rccl/src/register/CMakeLists.txt     |    9 +
 projects/rccl/src/register/coll_reg.cc        |    9 +-
 projects/rccl/src/register/register.cc        |  113 --
 projects/rccl/src/register/sendrecv_reg.cc    |    6 +
 projects/rccl/src/scheduler/CMakeLists.txt    |    7 +
 .../rccl/src/scheduler/symmetric_sched.cc     |  235 +++
 .../rccl/src/{symmetric.cc => sym_kernels.cc} |  179 +-
 projects/rccl/src/transport/CMakeLists.txt    |   15 +
 projects/rccl/src/transport/coll_net.cc       |   62 +-
 projects/rccl/src/transport/generic.cc        |    6 +
 projects/rccl/src/transport/net.cc            |  149 +-
 projects/rccl/src/transport/net_ib.cc         |  271 +--
 projects/rccl/src/transport/net_socket.cc     |   70 +-
 projects/rccl/src/transport/nvls.cc           |  136 +-
 projects/rccl/src/transport/p2p.cc            |   90 +-
 projects/rccl/src/transport/profiler.cc       |    2 +-
 212 files changed, 15535 insertions(+), 2938 deletions(-)
 create mode 100644 projects/rccl/ext-net/example/CMakeLists.txt
 create mode 100644 projects/rccl/ext-net/example/nccl/net_v11.h
 create mode 100644 projects/rccl/ext-profiler/example/CMakeLists.txt
 delete mode 100644 projects/rccl/ext-profiler/example/event.c
 create mode 100644 projects/rccl/ext-profiler/example/nccl/profiler_v5.h
 rename projects/rccl/ext-profiler/example/{plugin.c => plugin.cc} (68%)
 rename projects/rccl/ext-profiler/example/{print_event.c => print_event.cc} (76%)
 create mode 100644 projects/rccl/ext-profiler/example/queue.h
 create mode 100644 projects/rccl/ext-profiler/google-CoMMA/Makefile
 create mode 100644 projects/rccl/ext-profiler/inspector/Makefile
 create mode 100644 projects/rccl/ext-profiler/inspector/README.md
 create mode 100644 projects/rccl/ext-profiler/inspector/exporter/example/README.md
 create mode 100644 projects/rccl/ext-profiler/inspector/exporter/example/perf_summary_exporter.py
 create mode 100644 projects/rccl/ext-profiler/inspector/exporter/example/requirements.txt
 create mode 100644 projects/rccl/ext-profiler/inspector/inspector.cc
 create mode 100644 projects/rccl/ext-profiler/inspector/inspector.h
 create mode 100644 projects/rccl/ext-profiler/inspector/inspector_plugin.cc
 create mode 100644 projects/rccl/ext-profiler/inspector/json.cc
 create mode 100644 projects/rccl/ext-profiler/inspector/json.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/common.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler_net.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler_v1.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler_v2.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler_v3.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler_v4.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/profiler_v5.h
 create mode 100644 projects/rccl/ext-profiler/inspector/nccl/types.h
 create mode 100644 projects/rccl/ext-profiler/inspector/version.h
 create mode 100644 projects/rccl/ext-tuner/example/.gitignore
 create mode 100644 projects/rccl/ext-tuner/example/CMakeLists.txt
 create mode 100644 projects/rccl/src/CMakeLists.txt
 create mode 100644 projects/rccl/src/ce_coll.cc
 create mode 100644 projects/rccl/src/dev_runtime.cc
 create mode 100644 projects/rccl/src/device/CMakeLists.txt
 create mode 100644 projects/rccl/src/graph/CMakeLists.txt
 create mode 100644 projects/rccl/src/include/ce_coll.h
 create mode 100644 projects/rccl/src/include/dev_runtime.h
 create mode 100644 projects/rccl/src/include/nccl_device.h
 create mode 100644 projects/rccl/src/include/nccl_device/README.md
 create mode 100644 projects/rccl/src/include/nccl_device/comm.h
 create mode 100644 projects/rccl/src/include/nccl_device/coop.h
 create mode 100644 projects/rccl/src/include/nccl_device/core.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/comm__funcs.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/comm__types.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/core__funcs.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/core__types.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/ll_a2a__funcs.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/ll_a2a__types.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/mem_barrier__funcs.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/mem_barrier__types.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/ptr__funcs.h
 create mode 100644 projects/rccl/src/include/nccl_device/impl/ptr__types.h
 create mode 100644 projects/rccl/src/include/nccl_device/ll_a2a.h
 create mode 100644 projects/rccl/src/include/nccl_device/mem_barrier.h
 create mode 100644 projects/rccl/src/include/nccl_device/ptr.h
 create mode 100644 projects/rccl/src/include/nccl_device/utility.h
 create mode 100644 projects/rccl/src/include/plugin/net/net_v11.h
 create mode 100644 projects/rccl/src/include/plugin/profiler/profiler_v5.h
 create mode 100644 projects/rccl/src/include/plugin/tuner/tuner_v5.h
 create mode 100644 projects/rccl/src/include/scheduler.h
 create mode 100644 projects/rccl/src/include/sym_kernels.h
 delete mode 100644 projects/rccl/src/include/symmetric.h
 create mode 100644 projects/rccl/src/misc/CMakeLists.txt
 create mode 100644 projects/rccl/src/nccl_device/CMakeLists.txt
 create mode 100644 projects/rccl/src/nccl_device/core.cc
 create mode 100644 projects/rccl/src/nccl_device/ll_a2a.cc
 create mode 100644 projects/rccl/src/nccl_device/mem_barrier.cc
 create mode 100644 projects/rccl/src/plugin/CMakeLists.txt
 create mode 100644 projects/rccl/src/plugin/net/CMakeLists.txt
 create mode 100644 projects/rccl/src/plugin/net/net_v11.cc
 create mode 100644 projects/rccl/src/plugin/profiler/CMakeLists.txt
 create mode 100644 projects/rccl/src/plugin/profiler/profiler_v5.cc
 create mode 100644 projects/rccl/src/plugin/tuner/CMakeLists.txt
 create mode 100644 projects/rccl/src/plugin/tuner/tuner_v5.cc
 create mode 100644 projects/rccl/src/ras/CMakeLists.txt
 create mode 100644 projects/rccl/src/register/CMakeLists.txt
 create mode 100644 projects/rccl/src/scheduler/CMakeLists.txt
 create mode 100644 projects/rccl/src/scheduler/symmetric_sched.cc
 rename projects/rccl/src/{symmetric.cc => sym_kernels.cc} (52%)
 create mode 100644 projects/rccl/src/transport/CMakeLists.txt

diff --git a/projects/rccl/ext-net/README.md b/projects/rccl/ext-net/README.md
index 90fe89bf59..8bcaf3096d 100644
--- a/projects/rccl/ext-net/README.md
+++ b/projects/rccl/ext-net/README.md
@@ -60,36 +60,36 @@ of newer ones.
 The `nccl/` directory is populated with `net_vX.h` files extracting all relevant definitions
 from old API versions. It also provides error codes in `err.h`.
 
-# API (v10)
+# API (v11)
 
-Below is the main `ncclNet_v10` struct. Each function is explained in later sections.
+Below is the main `ncclNet_v11` struct. Each function is explained in later sections.
 
 ```
 typedef struct {
   // Name of the network (mainly for logs)
   const char* name;
   // Initialize the network.
-  ncclResult_t (*init)(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
+  ncclResult_t (*init)(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
   // Return the number of adapters.
   ncclResult_t (*devices)(int* ndev);
   // Get various device properties.
-  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v10_t* props);
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v11_t* props);
   // Create a receiving object and provide a handle to connect to it. The
   // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
   // between ranks to create a connection.
-  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
+  ncclResult_t (*listen)(void* ctx, int dev, void* handle, void** listenComm);
   // Connect to a handle and return a sending comm object for that peer.
   // This call must not block for the connection to be established, and instead
   // should return successfully with sendComm == NULL with the expectation that
   // it will be called again until sendComm != NULL.
   // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
-  ncclResult_t (*connect)(int dev, ncclNetCommConfig_v10_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_v10_t** sendDevComm);
+  ncclResult_t (*connect)(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v11_t** sendDevComm);
   // Finalize connection establishment after remote peer has called connect.
   // This call must not block for the connection to be established, and instead
   // should return successfully with recvComm == NULL with the expectation that
   // it will be called again until recvComm != NULL.
   // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
-  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v10_t** recvDevComm);
+  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v11_t** recvDevComm);
   // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
   // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
   ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
@@ -191,6 +191,12 @@ This will allow the plugin to discover network devices and make sure they are us
 `init` function does not return `ncclSuccess`, then NCCL will not use the plugin and fall back on
 internal ones.
 
+Every call to `init` returns an opaque context that the plugin uses internally to allocate resources
+and manage state. This context is passed to the other net plugin calls that create further resources,
+such as `listen` and `connect`. Each context is uniquely associated with a communicator
+through its `commId`. The network can also be initialized with a per-communicator configuration using
+the `config` argument.
+
 To allow the plugin logs to integrate into the NCCL logs seemlessly, NCCL provides a logging
 function to `init`. This function is typically used to allow for `INFO` and `WARN` macros within
 the plugin code adding the following definitions:
@@ -282,7 +288,7 @@ side.
 `listen`
 
 To create a connection, NCCL will start by calling `listen` on the receiver side. This function
-takes a device number as input argument, and should return a local `listenComm` object, and a
+takes the opaque plugin context returned by `init` and a device number as input arguments, and should return a local `listenComm` object and a
 `handle` to pass to the other side, so that the sender side can connect to the receiver.
 
 The `handle` is a buffer of size `NCCL_NET_HANDLE_MAXSIZE` and is provided by NCCL.
@@ -304,7 +310,8 @@ the `listen` call previously. If the sender did not connect yet, `accept` should
 should return `ncclSuccess`, setting `recvComm` to `NULL`. NCCL will call `accept` again until it
 succeeds.
 
-The `connect` API takes a `ncclNetCommConfig_t`, which contains a trafficClass field.
+The `connect` API takes the opaque plugin context returned by `init`. The plugin context can reference
+the `ncclNetCommConfig_t` that was passed to `init`, which contains a `trafficClass` field.
 This field can be used by the network plugin to specify the QoS level of the connection. By default,
 `trafficClass` is set to -1 but can be configured by the application during communicator initialization
 to select a plugin-supported QoS level.
diff --git a/projects/rccl/ext-net/example/CMakeLists.txt b/projects/rccl/ext-net/example/CMakeLists.txt
new file mode 100644
index 0000000000..d8af7fe362
--- /dev/null
+++ b/projects/rccl/ext-net/example/CMakeLists.txt
@@ -0,0 +1,19 @@
+set(SRC_FILES
+    ${CMAKE_CURRENT_SOURCE_DIR}/plugin.c
+)
+
+# Create shared library
+add_library(nccl-net-example SHARED ${SRC_FILES})
+
+# Set include directories
+target_include_directories(nccl-net-example PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}/nccl
+)
+
+# Set output name to match Makefile
+set_target_properties(nccl-net-example PROPERTIES
+    OUTPUT_NAME "nccl-net-example"
+    PREFIX "lib"
+    POSITION_INDEPENDENT_CODE ON
+    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/test/unit/plugins
+)
diff --git a/projects/rccl/ext-net/example/nccl/net.h b/projects/rccl/ext-net/example/nccl/net.h
index 4cc66915b4..9b3e6e03c5 100644
--- a/projects/rccl/ext-net/example/nccl/net.h
+++ b/projects/rccl/ext-net/example/nccl/net.h
@@ -22,7 +22,9 @@
 
 // Maximum number of requests per comm object
 #define NCCL_NET_MAX_REQUESTS 32
+#define NCCL_NET_MAX_DEVS_PER_NIC 4
 
+#include "net_v11.h"
 #include "net_v10.h"
 #include "net_v9.h"
 #include "net_v8.h"
@@ -33,9 +35,9 @@
 #include "net_v3.h"
 #include "net_v2.h"
 
-typedef ncclNet_v10_t ncclNet_t;
-typedef ncclNetProperties_v10_t ncclNetProperties_t;
-typedef ncclNetVDeviceProps_v10_t ncclNetVDeviceProps_t;
-typedef ncclNetCommConfig_v10_t ncclNetCommConfig_t;
+typedef ncclNet_v11_t ncclNet_t;
+typedef ncclNetProperties_v11_t ncclNetProperties_t;
+typedef ncclNetVDeviceProps_v11_t ncclNetVDeviceProps_t;
+typedef ncclNetCommConfig_v11_t ncclNetCommConfig_t;
 
 #endif // end include guard
diff --git a/projects/rccl/ext-net/example/nccl/net_device.h b/projects/rccl/ext-net/example/nccl/net_device.h
index d693101a34..56bcea83fd 100644
--- a/projects/rccl/ext-net/example/nccl/net_device.h
+++ b/projects/rccl/ext-net/example/nccl/net_device.h
@@ -12,7 +12,7 @@
 
 // Arbitrary version number - A given NCCL build will only be compatible with a single device networking plugin
 // version. NCCL will check the supplied version number from net->getProperties() and compare to its internal version.
-#define NCCL_NET_DEVICE_UNPACK_VERSION 0x7  
+#define NCCL_NET_DEVICE_UNPACK_VERSION 0x7
 
 typedef enum {NCCL_NET_DEVICE_HOST=0, NCCL_NET_DEVICE_UNPACK=1} ncclNetDeviceType;
 
@@ -27,6 +27,7 @@ typedef struct {
 typedef ncclNetDeviceHandle_v7_t ncclNetDeviceHandle_v8_t;
 typedef ncclNetDeviceHandle_v8_t ncclNetDeviceHandle_v9_t;
 typedef ncclNetDeviceHandle_v9_t ncclNetDeviceHandle_v10_t;
-typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_t;
+typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_v11_t;
+typedef ncclNetDeviceHandle_v11_t ncclNetDeviceHandle_t;
 
 #endif
diff --git a/projects/rccl/ext-net/example/nccl/net_v10.h b/projects/rccl/ext-net/example/nccl/net_v10.h
index 809e7c001d..bb0c661bb7 100644
--- a/projects/rccl/ext-net/example/nccl/net_v10.h
+++ b/projects/rccl/ext-net/example/nccl/net_v10.h
@@ -5,10 +5,9 @@
 #ifndef NET_V10_H_
 #define NET_V10_H_
 
-#define NCCL_NET_MAX_DEVS_PER_NIC_V10 4
 typedef struct {
   int ndevs;
-  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V10];
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
 } ncclNetVDeviceProps_v10_t;
 
 
diff --git a/projects/rccl/ext-net/example/nccl/net_v11.h b/projects/rccl/ext-net/example/nccl/net_v11.h
new file mode 100644
index 0000000000..1c8adc6c56
--- /dev/null
+++ b/projects/rccl/ext-net/example/nccl/net_v11.h
@@ -0,0 +1,120 @@
+/*
+ * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
+ */
+
+#ifndef NET_V11_H_
+#define NET_V11_H_
+
+typedef struct {
+  int ndevs;
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
+} ncclNetVDeviceProps_v11_t;
+
+#define NCCL_NET_TRAFFIC_CLASS_UNDEF -1
+
+typedef struct {
+  // Plugin-specific TC value
+  int trafficClass;
+} ncclNetCommConfig_v11_t;
+
+
+typedef struct {
+  char* name;                      // Used mostly for logging.
+  char* pciPath;                   // Path to the PCI device in /sys.
+  uint64_t guid;                   // Unique identifier for the NIC chip. Important for
+                                   // cards with multiple PCI functions (Physical or virtual).
+  int ptrSupport;                  // [NCCL_PTR_HOST|NCCL_PTR_CUDA|NCCL_PTR_DMABUF]
+  int regIsGlobal;                 // regMr is not tied to a particular comm
+  int forceFlush;                  // Force a flush on receives
+  int speed;                       // Port speed in Mbps.
+  int port;                        // Port number.
+  float latency;                   // Network latency
+  int maxComms;                    // Maximum number of comms we can create
+  int maxRecvs;                    // Maximum number of grouped receives.
+  ncclNetDeviceType netDeviceType; // Network offload type
+  int netDeviceVersion;            // Version number for network offload
+  ncclNetVDeviceProps_v11_t vProps;
+  size_t maxP2pBytes;              // Max transfer size for point-to-point operations
+  size_t maxCollBytes;             // Max transfer size for collective operations
+  int maxMultiRequestSize;         // Maximum number of requests supported in a single multi-request.
+} ncclNetProperties_v11_t;
+
+typedef struct {
+  int32_t maxConcurrentPeers;
+  int32_t minConcurrentPeers;
+  int32_t maxFlowsPerPeer;
+  int32_t minFlowsPerPeer;
+} ncclNetCommAttr_v11_t;
+
+typedef struct {
+  ncclNetCommAttr_v11_t sendCommAttr;
+  ncclNetCommAttr_v11_t recvCommAttr;
+  uint32_t op;
+  uint32_t algo;
+  uint32_t proto;
+} ncclNetAttr_v11_t;
+
+typedef struct {
+  // Name of the network (mainly for logs)
+  const char* name;
+  // Initialize the network.
+  ncclResult_t (*init)(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
+  // Return the number of adapters.
+  ncclResult_t (*devices)(int* ndev);
+  // Get various device properties.
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v11_t* props);
+  // Create a receiving object and provide a handle to connect to it. The
+  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
+  // between ranks to create a connection.
+  ncclResult_t (*listen)(void* ctx, int dev, void* handle, void** listenComm);
+  // Connect to a handle and return a sending comm object for that peer.
+  // This call must not block for the connection to be established, and instead
+  // should return successfully with sendComm == NULL with the expectation that
+  // it will be called again until sendComm != NULL.
+  // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
+  ncclResult_t (*connect)(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v11_t** sendDevComm);
+  // Finalize connection establishment after remote peer has called connect.
+  // This call must not block for the connection to be established, and instead
+  // should return successfully with recvComm == NULL with the expectation that
+  // it will be called again until recvComm != NULL.
+  // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
+  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v11_t** recvDevComm);
+  // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
+  // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
+  ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
+  /* DMA-BUF support */
+  ncclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
+  ncclResult_t (*deregMr)(void* comm, void* mhandle);
+  // Asynchronous send to a peer.
+  // May return request == NULL if the call cannot be performed (or would block)
+  ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* phandle, void** request);
+  // Asynchronous recv from a peer.
+  // May return request == NULL if the call cannot be performed (or would block)
+  ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** phandles, void** request);
+  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
+  // visible to the GPU
+  ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);
+  // Test whether a request is complete. If sizes is not NULL, it returns the
+  // number of bytes sent/received.
+  ncclResult_t (*test)(void* request, int* done, int* sizes);
+  // Close and free send/recv comm objects
+  ncclResult_t (*closeSend)(void* sendComm);
+  ncclResult_t (*closeRecv)(void* recvComm);
+  ncclResult_t (*closeListen)(void* listenComm);
+
+  // Copy the given mhandle to a dptr in a format usable by this plugin's device code
+  ncclResult_t (*getDeviceMr)(void* comm, void* mhandle, void** dptr_mhandle);
+
+  // Notify the plugin that a recv has been completed by the device
+  ncclResult_t (*irecvConsumed)(void* recvComm, int n, void* request);
+
+  // Virtual NIC APIs. makeVDevice will create a virtual NIC given the specified properties, and tell the caller
+  // what index this new vNIC exists at
+  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v11_t* props);
+  // Finalize the network.
+  ncclResult_t (*finalize)(void* ctx);
+
+  ncclResult_t (*setNetAttr)(void* ctx, ncclNetAttr_v11_t* netAttr);
+} ncclNet_v11_t;
+
+#endif // end include guard
diff --git a/projects/rccl/ext-net/example/nccl/net_v9.h b/projects/rccl/ext-net/example/nccl/net_v9.h
index ca60ad651e..9dea09cbd1 100644
--- a/projects/rccl/ext-net/example/nccl/net_v9.h
+++ b/projects/rccl/ext-net/example/nccl/net_v9.h
@@ -5,10 +5,9 @@
 #ifndef NET_V9_H_
 #define NET_V9_H_
 
-#define NCCL_NET_MAX_DEVS_PER_NIC_V9 4
 typedef struct {
   int ndevs;
-  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V9];
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
 } ncclNetVDeviceProps_v9_t;
 
 typedef struct {
diff --git a/projects/rccl/ext-net/example/plugin.c b/projects/rccl/ext-net/example/plugin.c
index 97a29875d0..b0a9a4c597 100644
--- a/projects/rccl/ext-net/example/plugin.c
+++ b/projects/rccl/ext-net/example/plugin.c
@@ -11,7 +11,7 @@
 
 int max_requests = NCCL_NET_MAX_REQUESTS;
 
-__hidden ncclResult_t pluginInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) { return ncclSuccess; }
+__hidden ncclResult_t pluginInit(void** ctx, uint64_t commId, ncclNetCommConfig_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) { return ncclSuccess; }
 __hidden ncclResult_t pluginDevices(int* ndev) { *ndev = 0; return ncclSuccess; }
 __hidden ncclResult_t pluginPciPath(int dev, char** path) { return ncclInternalError; }
 __hidden ncclResult_t pluginPtrSupport(int dev, int* supportedTypes) { return ncclInternalError; }
@@ -51,8 +51,8 @@ __hidden ncclResult_t pluginGetProperties(int dev, ncclNetProperties_t* props) {
   return ncclSuccess;
 }
 
-__hidden ncclResult_t pluginListen(int dev, void* handle, void** listenComm) { return ncclInternalError; }
-__hidden ncclResult_t pluginConnect(int dev, ncclNetCommConfig_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) { return ncclInternalError; }
+__hidden ncclResult_t pluginListen(void* ctx, int dev, void* handle, void** listenComm) { return ncclInternalError; }
+__hidden ncclResult_t pluginConnect(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) { return ncclInternalError; }
 __hidden ncclResult_t pluginAccept(void* listenComm, void** recvComm, ncclNetDeviceHandle_t** recvDevComm) { return ncclInternalError; }
 __hidden ncclResult_t pluginRegMr(void* collComm, void* data, size_t size, int type, void** mhandle) { return ncclInternalError; }
 __hidden ncclResult_t pluginRegMrDmaBuf(void* collComm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle) { return ncclInternalError; }
@@ -67,10 +67,11 @@ __hidden ncclResult_t pluginCloseListen(void* listenComm) { return ncclInternalE
 __hidden ncclResult_t pluginIrecvConsumed(void* recvComm, int n, void* request) { return ncclInternalError; }
 __hidden ncclResult_t pluginGetDeviceMr(void* comm, void* mhandle, void** dptr_mhandle) { return ncclInternalError; }
 __hidden ncclResult_t pluginMakeVDevice(int* d, ncclNetVDeviceProps_t* props) { return ncclInternalError; }
+__hidden ncclResult_t pluginFinalize(void* ctx) { return ncclSuccess; }
 
 #define PLUGIN_NAME "Plugin"
 
-const ncclNet_v10_t ncclNetPlugin_v10 = {
+const ncclNet_v11_t ncclNetPlugin_v11 = {
   .name = PLUGIN_NAME,
   .init = pluginInit,
   .devices = pluginDevices,
@@ -91,18 +92,84 @@ const ncclNet_v10_t ncclNetPlugin_v10 = {
   .getDeviceMr = pluginGetDeviceMr,
   .irecvConsumed = pluginIrecvConsumed,
   .makeVDevice   = pluginMakeVDevice,
+  .finalize = pluginFinalize,
 };
 
+__hidden ncclResult_t pluginInit_v10(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) { return ncclSuccess; }
+__hidden ncclResult_t pluginGetProperties_v10(int dev, ncclNetProperties_v10_t* props) {
+  // Below are default values, if unsure don't change.
+
+  props->name = "Example";
+  // Fill for proper topology detection, e.g. /sys/devices/pci0000:00/0000:00:10.0/0000:0b:00.0
+  props->pciPath = NULL;
+  // Only used to detect NICs with multiple PCI attachments.
+  props->guid = 0;
+  // Add NCCL_PTR_CUDA if GPU Direct RDMA is supported and regMr can take CUDA pointers.
+  props->ptrSupport = NCCL_PTR_HOST;
+  // If your regMr has a fast registration cache, set to 1. If set to 0, user buffer registration may be disabled.
+  props->regIsGlobal = 0;
+  // Force flush after receive. Needed if the control path and data path use a different path to the GPU
+  props->forceFlush = 0;
+  // Speed in *Mbps*. 100000 means 100G
+  props->speed = 100000;
+  // Port number, used in conjunction with guid
+  props->port = 0;
+  // Custom latency (used to help tuning if latency is high). If set to 0, use default NCCL values.
+  props->latency = 0;
+  // Maximum number of comm objects we can create.
+  props->maxComms = 1024*1024;
+  // Maximum number of receive operations taken by irecv().
+  props->maxRecvs = NCCL_PLUGIN_MAX_RECVS;
+  // Coupling with NCCL network device-side code.
+  props->netDeviceType = NCCL_NET_DEVICE_HOST;
+  props->netDeviceVersion = NCCL_NET_DEVICE_INVALID_VERSION;
+  // Used to tell NCCL core whether this is a virtual device fusing multiple physical devices.
+  props->vProps.ndevs = 1;
+  props->vProps.devs[0] = dev;
+  // maximum transfer sizes the plugin can handle
+  props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES;
+  props->maxCollBytes = NCCL_MAX_NET_SIZE_BYTES;
+  return ncclSuccess;
+}
+
+__hidden ncclResult_t pluginListen_v10(int d, void* handle, void** listenComm) { return ncclInternalError; }
+__hidden ncclResult_t pluginConnect_v10(int dev, ncclNetCommConfig_v10_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_v10_t** sendDevComm) { return ncclInternalError; }
+__hidden ncclResult_t pluginMakeVDevice_v10(int* d, ncclNetVDeviceProps_v10_t* props) { return ncclInternalError; }
+
+const ncclNet_v10_t ncclNetPlugin_v10 = {
+  .name = PLUGIN_NAME,
+  .init = pluginInit_v10,
+  .devices = pluginDevices,
+  .getProperties = pluginGetProperties_v10,
+  .listen = pluginListen_v10,
+  .connect = pluginConnect_v10,
+  .accept = pluginAccept,
+  .regMr = pluginRegMr,
+  .regMrDmaBuf = pluginRegMrDmaBuf,
+  .deregMr = pluginDeregMr,
+  .isend = pluginIsend,
+  .irecv = pluginIrecv,
+  .iflush = pluginIflush,
+  .test = pluginTest,
+  .closeSend = pluginCloseSend,
+  .closeRecv = pluginCloseRecv,
+  .closeListen = pluginCloseListen,
+  .getDeviceMr = pluginGetDeviceMr,
+  .irecvConsumed = pluginIrecvConsumed,
+  .makeVDevice   = pluginMakeVDevice_v10,
+};
+
+
 __hidden ncclResult_t pluginInit_v9(ncclDebugLogger_t logFunction) {
-  return pluginInit(logFunction, NULL);
+  return pluginInit_v10(logFunction, NULL);
 }
 
 __hidden ncclResult_t pluginGetProperties_v9(int dev, ncclNetProperties_v9_t* props) {
-  return pluginGetProperties(dev, (ncclNetProperties_t*)props);
+  return pluginGetProperties_v10(dev, (ncclNetProperties_v10_t*)props);
 }
 
 __hidden ncclResult_t pluginConnect_v9(int dev, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm){
-  return pluginConnect(dev, NULL, handle, sendComm, sendDevComm);
+  return pluginConnect_v10(dev, NULL, handle, sendComm, sendDevComm);
 }
 
 __hidden ncclResult_t pluginIsend_v9(void* sendComm, void* data, size_t size, int tag, void* mhandle, void** request) {
@@ -120,7 +187,7 @@ const ncclNet_v9_t ncclNetPlugin_v9 = {
   .init = pluginInit_v9,
   .devices = pluginDevices,
   .getProperties = pluginGetProperties_v9,
-  .listen = pluginListen,
+  .listen = pluginListen_v10,
   .connect = pluginConnect_v9,
   .accept = pluginAccept,
   .regMr = pluginRegMr,
@@ -172,7 +239,7 @@ const ncclNet_v8_t ncclNetPlugin_v8 = {
   .init = pluginInit_v9,
   .devices = pluginDevices,
   .getProperties = pluginGetProperties_v8,
-  .listen = pluginListen,
+  .listen = pluginListen_v10,
   .connect = pluginConnect_v9,
   .accept = pluginAccept,
   .regMr = pluginRegMr,
@@ -216,7 +283,7 @@ const ncclNet_v7_t ncclNetPlugin_v7 = {
   .init = pluginInit_v9,
   .devices = pluginDevices,
   .getProperties = pluginGetProperties_v7,
-  .listen = pluginListen,
+  .listen = pluginListen_v10,
   .connect = pluginConnect_v9,
   .accept = pluginAccept,
   .regMr = pluginRegMr_v7,
@@ -257,7 +324,7 @@ const ncclNet_v6_t ncclNetPlugin_v6 = {
   .init = pluginInit_v9,
   .devices = pluginDevices,
   .getProperties = pluginGetProperties_v6,
-  .listen = pluginListen,
+  .listen = pluginListen_v10,
   .connect = pluginConnect_v6,
   .accept = pluginAccept_v6,
   .regMr = pluginRegMr_v7,
@@ -278,7 +345,7 @@ const ncclNet_v5_t ncclNetPlugin_v5 = {
   .init = pluginInit_v9,
   .devices = pluginDevices,
   .getProperties = pluginGetProperties_v6,
-  .listen = pluginListen,
+  .listen = pluginListen_v10,
   .connect = pluginConnect_v6,
   .accept = pluginAccept_v6,
   .regMr = pluginRegMr_v7,
@@ -320,7 +387,7 @@ static ncclResult_t pluginConnect_v4(int dev, void* handle, void** sendComm) {
   ncclResult_t ret;
   do {
     ncclNetDeviceHandle_v7_t* handle = NULL;
-    ret = pluginConnect(dev, NULL, handle, sendComm, &handle);
+    ret = pluginConnect_v10(dev, NULL, handle, sendComm, &handle);
   } while (ret == ncclSuccess && *sendComm == NULL);
   return ret;
 }
@@ -337,7 +404,7 @@ const ncclNet_v4_t ncclNetPlugin_v4 = {
   .init = pluginInit_v9,
   .devices = pluginDevices,
   .getProperties = pluginGetProperties_v4,
-  .listen = pluginListen,
+  .listen = pluginListen_v10,
   .connect = pluginConnect_v4,
   .accept = pluginAccept_v4,
   .regMr = pluginRegMr_v7,
@@ -363,12 +430,12 @@ static ncclResult_t pluginFlush(void* recvComm, void* data, int size, void* mhan
 }
 static ncclResult_t pluginInit_v3(ncclDebugLogger_t logFunction) {
   max_requests = NCCL_NET_MAX_REQUESTS_V3;
-  return pluginInit(logFunction, NULL);
+  return pluginInit_v10(logFunction, NULL);
 }
 #include <string.h>
 static ncclResult_t pluginListen_v3(int dev, void* handle, void** listenComm) {
   char pluginHandle[NCCL_NET_HANDLE_MAXSIZE];
-  ncclResult_t ret = pluginListen(dev, &pluginHandle, listenComm);
+  ncclResult_t ret = pluginListen_v10(dev, &pluginHandle, listenComm);
   memcpy(handle, &pluginHandle, NCCL_NET_HANDLE_MAXSIZE_V4);
   return ret;
 }
@@ -403,7 +470,7 @@ const ncclNet_v2_t ncclNetPlugin_v2 = {
   .devices = pluginDevices,
   .pciPath = pluginPciPath,
   .ptrSupport = pluginPtrSupport,
-  .listen = pluginListen,
+  .listen = pluginListen_v3,
   .connect = pluginConnect_v4,
   .accept = pluginAccept_v4,
   .regMr = pluginRegMr_v7,
diff --git a/projects/rccl/ext-profiler/README.md b/projects/rccl/ext-profiler/README.md
index 27bd4e25ce..1d85213a65 100644
--- a/projects/rccl/ext-profiler/README.md
+++ b/projects/rccl/ext-profiler/README.md
@@ -49,9 +49,9 @@ of newer ones.
 The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
 from old API versions. It also provides error codes in `err.h`.
 
-# API (v4)
+# API (v5)
 
-Below is the main `ncclProfiler_v4` struct. Each function is explained in later sections.
+Below is the main `ncclProfiler_v5` struct. Each function is explained in later sections.
 
 ```
 typedef struct {
@@ -60,15 +60,15 @@ typedef struct {
   // init - initialize the profiler plugin
   // Input
   //  - context        : opaque profiler context object for separating profiler behavior across comms
+  //  - commId         : communicator id
   //  - commName       : user assigned communicator name
-  //  - commHash       : communicator id
   //  - nNodes         : number of nodes in communicator
   //  - nranks         : number of ranks in communicator
   //  - rank           : rank identifier in communicator
   //  - logfn          : logger function
   // Output
   //  - eActivationMask: bitmask of active events set by the plugin
-  ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
+  ncclResult_t (*init)(void** context, uint64_t commId, int* eActivationMask, const char* commName, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
 
   // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
   // Input
@@ -76,7 +76,7 @@ typedef struct {
   //  - eDescr : pointer to ncclProfilerEventDescr_t object
   // Output
   //  - eHandle: return event handle for supplied event descriptor object
-  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v5_t* eDescr);
 
   // stopEvent - stop/finalize an event inside and event set
   // Input
@@ -88,13 +88,13 @@ typedef struct {
   //  - eHandle   : handle to event object created through startEvent
   //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
   //  - eState    : event state transition
-  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v5_t eState, ncclProfilerEventStateArgs_v5_t* eStateArgs);
 
   // finalize - finalize the profiler plugin
   // Input
   //  - context: opaque profiler context object
   ncclResult_t (*finalize)(void* context);
-} ncclProfiler_v4_t;
+} ncclProfiler_v5_t;
 ```
 
 ## Error codes
@@ -148,10 +148,37 @@ is the `ncclProfilerEventDescr_t` struct.
 
 ```
 typedef struct {
-  uint8_t type;             // event type (e.g., ncclProfileGroup, ncclProfileColl, ...)
-  void* parentObj;          // pointer to parent event used to expose the event hierarchy to the profiler
-  int rank;                 // rank that generated the event
+  uint64_t type;             // event type descriptor: ncclProfileGroupApi, ncclProfileCollApi, ...
+  void* parentObj;           // pointer to parent event used to expose the event hierarchy to the profiler
+  int rank;                  // rank that generated the event
   union {
+    struct {                 // GroupAPI event metadata
+      bool graphCaptured;    // Set to true if the Group API event is emitted inside a CUDA graph capture
+      int groupDepth;        // Determines the depth of a ncclGroup. A depth of 1 implies that the Group API call is implicit (internal to NCCL)
+                             // and not called by the user. Any depth greater than 1 means that the user made the Group API call.
+    } groupApi;
+
+    struct {                 // Collective API call metadata
+      const char* func;      // string containing name of the collective operation
+      size_t count;          // data count
+      const char* datatype;  // string containing the name of the datatype
+      int root;              // root rank
+      void* stream;          // Opaque handle that points to the CUDA stream that the operation is enqueued in
+      bool graphCaptured;    // Set to true if the Collective API event is emitted inside a CUDA graph capture
+    } collApi;
+
+    struct {                // Point-to-point API call metadata
+      const char* func;     // string containing name of the p2p operation
+      size_t count;         // data count
+      const char* datatype; // string containing the name of the datatype
+      void* stream;         // Opaque handle that points to a CUDA stream object
+      bool graphCaptured;   // Set to true if the Point-to-point API event is emitted inside a CUDA graph capture
+    } p2pApi;
+
+    struct {                // Kernel Launch event metadata
+      void* stream;         // Opaque handle that points to the CUDA stream that the operation is enqueued in
+    } kernelLaunch;
+
     struct {                // collective events metadata
       uint64_t seqNumber;   // sequence number of this collective operation in the communicator
       const char* func;     // string containing name of the collective
@@ -164,6 +191,7 @@ typedef struct {
       uint8_t nWarps;       // number of GPU warps for this collective
       const char* algo;     // string containing name of the algorithm for this collective
       const char* proto;    // string containing name of the protocol for this collective
+      void* parentGroup;    // for backward compatibility with v4 - this points to the legacy v4 group parent
     } coll;
 
     struct {                // point-to-point events metadata
@@ -173,6 +201,7 @@ typedef struct {
       size_t count;
       int peer;             // peer rank for this point-to-point
       uint8_t nChannels;    // number of channels for this p2p
+      void* parentGroup;    // for backward compatibility with v4 - this points to the legacy v4 group parent
     } p2p;
 
     struct {                // proxyOp events metadata
@@ -198,12 +227,12 @@ typedef struct {
       void* data;           // pointer to network plugin defined event
     } netPlugin;
   };
-} ncclProfilerEventDescr_v4_t;
+} ncclProfilerEventDescr_v5_t;
 ```
 
-NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
-`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileProxyCtrl`, `ncclProfileKernelCh` and
-`ncclProfileNetPlugin`.
+NCCL defines the following events: `ncclProfileGroupApi`, `ncclProfileCollApi`, `ncclProfileP2pApi`, `ncclProfileKernelLaunch`,
+`ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`, `ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileProxyCtrl`,
+`ncclProfileKernelCh` and `ncclProfileNetPlugin`.
 
 #### stopEvent
 
@@ -213,10 +242,10 @@ handle after `eventStop` is undefined behavior.
 
 #### recordEventState
 
-Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
-`ncclProfileP2p`, cannot be updated through calls to `recordEventState`.
+Some events can only be started and stopped. For example, `ncclProfileP2pApi`, `ncclProfileCollApi`, `ncclProfileGroup`,
+`ncclProfileColl`, `ncclProfileP2p` cannot be updated through calls to `recordEventState`.
 
-`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
+`ncclProfileGroupApi`, `ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
 `ncclProfileProxyCtrl` can be updated through calls to `recordEventState`.
 
 The state of these events can be updated, along with event attributes, using `recordEventState`.
@@ -258,9 +287,21 @@ typedef enum {
 
   // ncclProfileKernelCh event states
   ncclProfilerKernelChStop             = 22,// state marks stop of kernelCh event and timestamp update
-} ncclProfilerEventState_v4_t;
+
+  // Group API States
+  ncclProfilerGroupStartApiStop        = 23,// state marks the end of a ncclGroupStart() API call
+  ncclProfilerEndGroupApiStart         = 24 // state marks the start of a ncclGroupEnd() API call
+} ncclProfilerEventState_v5_t;
 ```
 
+NCCL profile API events are generated when the API calls are made, right after NCCL checks
+for graph capture information. They parent collective, point-to-point and kernel launch events
+and persist across multiple operations in a group.
+
+`ncclProfileKernelLaunch` events are generated when the CUDA call to a kernel launch is made. In the
+case of graph capture, the event start indicates that the kernel launch operation has been recorded,
+not launched.
+
 `ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
 network requests for the GPU kernel. ProxyOp events are generated for every active channel and
 provide a summary of the activity of the proxy progress thread for that channel. Most of the
@@ -379,7 +420,7 @@ typedef union {
   struct {                // attribute to update for ncclProfileKernelCh events
     uint64_t pTimer;      // timestamp provided by the NCCL kernel
   } kernelCh;
-} ncclProfilerEventStateArgs_v4_t;
+} ncclProfilerEventStateArgs_v5_t;
 ```
 
 The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
@@ -389,27 +430,33 @@ The example profiler in `ext-profiler/example` contains details on how to captur
 NCCL core events (reported above) are organized into a hierarchy as reported below:
 
 ```
-Group event
+Group API event
    |
-   +- Collective event
+   +- Collective API event
    |  |
-   |  +- ProxyOp event
-   |  |  |
-   |  |  +- ProxyStep event
-   |  |     |
-   |  |     +- NetPlugin event
-   |  |
-   |  +- KernelCh event
+   |  +- Collective event
+   |     |
+   |     +- ProxyOp event
+   |     |  |
+   |     |  +- ProxyStep event
+   |     |     |
+   |     |     +- NetPlugin event
+   |     |
+   |     +- KernelCh event
    |
-   +- Point-to-point event
-      |
-      +- ProxyOp event
-      |  |
-      |  +- ProxyStep event
-      |     |
-      |     +- NetPlugin event
-      |
-      +- KernelCh event
+   +- Point-to-point API event
+   |  |
+   |  +- Point-to-point event
+   |     |
+   |     +- ProxyOp event
+   |     |  |
+   |     |  +- ProxyStep event
+   |     |     |
+   |     |     +- NetPlugin event
+   |     |
+   |     +- KernelCh event
+   |
+   +- Kernel Launch event
 
 ProxyCtrl event
 ```
diff --git a/projects/rccl/ext-profiler/example/CMakeLists.txt b/projects/rccl/ext-profiler/example/CMakeLists.txt
new file mode 100644
index 0000000000..fd2f04df65
--- /dev/null
+++ b/projects/rccl/ext-profiler/example/CMakeLists.txt
@@ -0,0 +1,34 @@
+# C++ source files for the example profiler plugin
+set(SRC_FILES
+    ${CMAKE_CURRENT_SOURCE_DIR}/plugin.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/print_event.cc
+)
+
+# Create shared library
+add_library(nccl-profiler-example SHARED ${SRC_FILES})
+
+# Set include directories
+target_include_directories(nccl-profiler-example PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}/nccl
+    ${CUDAToolkit_INCLUDE_DIRS}
+)
+
+# Set output name to match Makefile
+set_target_properties(nccl-profiler-example PROPERTIES
+    OUTPUT_NAME "nccl-profiler-example"
+    PREFIX "lib"
+    POSITION_INDEPENDENT_CODE ON
+    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib
+)
+
+add_custom_command(TARGET nccl-profiler-example POST_BUILD
+    COMMAND ${CMAKE_COMMAND} -E make_directory ${CMAKE_BINARY_DIR}/test/unit/plugins
+    COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_BINARY_DIR}/lib/libnccl-profiler-example.so ${CMAKE_BINARY_DIR}/test/unit/plugins
+)
+
+# Add custom target for clean (equivalent to Makefile clean target)
+add_custom_target(clean-profiler-lib
+    COMMAND ${CMAKE_COMMAND} -E remove -f ${CMAKE_BINARY_DIR}/lib/libnccl-profiler-example.so
+    COMMAND ${CMAKE_COMMAND} -E remove -f ${CMAKE_BINARY_DIR}/test/unit/plugins/libnccl-profiler-example.so
+    COMMENT "Cleaning libnccl-profiler-example.so"
+)
diff --git a/projects/rccl/ext-profiler/example/Makefile b/projects/rccl/ext-profiler/example/Makefile
index 777ff5badc..f6383e1b61 100644
--- a/projects/rccl/ext-profiler/example/Makefile
+++ b/projects/rccl/ext-profiler/example/Makefile
@@ -5,18 +5,20 @@
 #
 .DEFAULT_GOAL: build
 include ../../makefiles/common.mk
-SRCDIR   ?= $(abspath ../..)
 BUILDDIR ?= .
 NCCLDIR  := $(BUILDDIR)
 
-SRC_FILES := $(wildcard *.c)
+SRC_FILES := $(wildcard *.cc)
+DST_DIR   := $(BUILDDIR)
+OBJ_FILES := $(SRC_FILES:%.cc=${DST_DIR}/%.o)
+DEP_FILES := $(OBJ_FILES:%.o=%.dep)
 
-build: ${BUILDDIR}/libnccl-profiler-example.so
+build: ${DST_DIR}/libnccl-profiler-example.so
 
-${BUILDDIR}/libnccl-profiler-example.so: ${SRC_FILES}
+${DST_DIR}/libnccl-profiler-example.so: ${SRC_FILES}
 	@printf "Compiling  %-35s > %s\n" $< $@
-	@mkdir -p ${BUILDDIR}
-	$(CC) -Inccl -fPIC -shared -o $@ $^
+	@mkdir -p ${DST_DIR}
+	$(CXX) -Inccl -I${CUDA_INC} -fPIC -shared -o $@ $^
 
 clean:
-	rm -f ${BUILDDIR}/libnccl-profiler-example.so
+	rm -f ${DST_DIR}/libnccl-profiler-example.so
diff --git a/projects/rccl/ext-profiler/example/README.md b/projects/rccl/ext-profiler/example/README.md
index d98e58f157..abc11a57e6 100644
--- a/projects/rccl/ext-profiler/example/README.md
+++ b/projects/rccl/ext-profiler/example/README.md
@@ -13,8 +13,7 @@ change the size of the event window the profiler keeps track of.
 
 ## Building the profiler plugin
 
-To use the example plugin, just type `make`. You will need a NCCL build's include directory present.
-You can override `NCCL_HOME` to where the NCCL installation is on your system.
+To build the example plugin shipped as part of NCCL, just type `make`.
 
 ## Using the profiler plugin
 
@@ -27,13 +26,13 @@ You can override `NCCL_HOME` to where the NCCL installation is on your system.
 
    As an example, setting:
 
-   `NCCL_PROFILE_EVENT_MASK` to 1 (`ncclProfileGroup`) | 2 (`ncclProfileColl`) | 8 (`ncclProfileProxyOp`)
+   `NCCL_PROFILE_EVENT_MASK` to 256 (`ncclProfileGroupApi`) | 2 (`ncclProfileColl`) | 8 (`ncclProfileProxyOp`)
 
-   enables the profiling of the group, the collective and the proxy op events. The same events can be
+   enables the profiling of the group API, the collective and the proxy op events. The same events can be
    expressed more concisely by setting `NCCL_PROFILE_EVENT_MASK` to 8 (`ncclProfileProxyOp`). Indeed,
    in NCCL all the events above (in the event hierarchy) the one requested are also captured. The advantage
    is that the profiler can easily correlate events that belong to the same NCCL operation and present
-   them accordingly.
+   them accordingly. Setting `NCCL_PROFILE_EVENT_MASK` to 4095 enables all events supported by the v5 profiler.
 
 3. Set `NCCL_PROFILE_DUMP_FILE` to the name of the dump file for the collected traces. A file named
    ${NCCL_PROFILE_DUMP_FILE}-hostname-tid.txt is created. Profiler traces are saved using the chrome
@@ -57,11 +56,14 @@ The group, collective and p2p pools contain objects for the corresponding events
 contains objects for `ProxyCtrl` events and the `ProxyDetach` pool contains objects for `ProxyOp` events
 generated by remote proxies. A list of pools and their size is reported below:
 
-- `NCCL_PROFILE_GROUP_POOL_SIZE` (16)
-- `NCCL_PROFILE_COLL_POOL_SIZE` (16)
-- `NCCL_PROFILE_P2P_POOL_SIZE` (1024)
+- `NCCL_PROFILE_GROUP_API_POOL_SIZE` (256)
+- `NCCL_PROFILE_COLL_API_POOL_SIZE` (256)
+- `NCCL_PROFILE_P2P_API_POOL_SIZE` (256)
+- `NCCL_PROFILE_KERNEL_LAUNCH_POOL_SIZE` (256)
+- `NCCL_PROFILE_COLL_POOL_SIZE` (256)
+- `NCCL_PROFILE_P2P_POOL_SIZE` (256)
 - `NCCL_PROFILE_PROXY_CTRL_POOL_SIZE` (16)
-- `NCCL_PROFILE_PROXY_DETACH_POOL_SIZE` (128)
+- `NCCL_PROFILE_PROXY_DETACH_POOL_SIZE` (256)
 
 Remote proxy operations are generated when PXN is in use. Refer to this article for more information
 about PXN and how it works:
@@ -73,76 +75,58 @@ The example profiler generates traces using the json format. An example of trace
 
 ```
 [
-{"name": "Group", "cat": "GROUP", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764234.611328, "args": {"groupId": 0}},
-{"name": "AllReduce", "cat": "COLL", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764237.294922, "args": {"SeqNum": 0, "CommHash": 673864846479792718, "Rank": 1, "Count": 32768, "Datatype": "ncclFloat32", "Algorithm": "RING", "Protocol": "LL", "nMaxChannels": 2}},
-{"name": "Recv", "cat": "PROXY", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768464.936523, "args": {"Channel": 0, "Peer": 0, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 772020.300781}, "RECEIVED": {"step": 14, "ts": 772196.049805}, "TRANSMITTED": {"step": 14, "ts": 772197.326172}, "DONE": {"step": 14, "ts": 772201.538086}}},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768465.158203, "args": {"Step": 0}},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805},
-{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805, "args": {"Step": 0}},
-{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266, "args": {"Step": 0}},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805, "args": {"Step": 0}},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768568.276367},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768503.604492, "args": {"Step": 1}},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805},
-{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805, "args": {"Step": 1}},
-{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234, "args": {"Step": 1}},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695, "args": {"Step": 1}},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 770006.914062},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768506.941406, "args": {"Step": 2}},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547},
-{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547, "args": {"Step": 2}},
-{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133, "args": {"Step": 2}},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547, "args": {"Step": 2}},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771468.458008},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768509.484375, "args": {"Step": 3}},
-{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000},
-{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000, "args": {"Step": 3}},
-{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023, "args": {"Step": 3}},
-{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211, "args": {"Step": 3}},
-{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771910.500000},
-{"name": "Send", "cat": "PROXY", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768482.878906, "args": {"Channel": 0, "Peer": 2, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 771995.675781}, "REM_FIFO_WAIT": {"step": 14, "ts": 772190.692383}, "TRANSMITTED": {"step": 14, "ts": 772191.516602}, "DONE": {"step": 14, "ts": 772208.473633}}},
-{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.019531, "args": {"Step": 0}},
-{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781},
-{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781, "args": {"Step": 0}},
-{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234},
-{"name": "SendWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234, "args": {"Step": 0}},
-{"name": "SendWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769618.889648},
-{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.083008, "args": {"Step": 1}},
-{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086},
-{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086, "args": {"Step": 1}},
-{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664},
-{"name": "SendWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664, "args": {"Step": 1}},
-{"name": "SendWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769622.517578},
-{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768507.937500, "args": {"Step": 2}},
-{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578},
-{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578, "args": {"Step": 2}},
-{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883},
-{"name": "SendWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883, "args": {"Step": 2}},
-{"name": "SendWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770013.848633},
-{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.742188, "args": {"Step": 3}},
-{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266},
-{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266, "args": {"Step": 3}},
-{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477},
-{"name": "SendWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477, "args": {"Step": 3}},
-{"name": "SendWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771469.171875},
+{"name": "Group API", "cat": "GROUP_API", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 3433.595001, "args": {"groupApiId": 0, "groupDepth":1}},
+{"name": "KernelLaunch", "cat": "KERNEL_LAUNCH", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 0.000000, "args": {"groupId": 0, "Stream": 0x5020000567d0}},
+{"name": "KernelLaunch", "cat": "KERNEL_LAUNCH", "ph": "e", "id": 0, "pid": 225798, "tid": 1, "ts": 111991.558990},
+{"name": "AllReduce", "cat": "COLL_API", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 0.000000, "args": {"count": 262144, "datatype": ncclFloat32, "root": 0, "GraphCaptured":0, "Stream": 0x5020000567d0}},
+{"name": "AllReduce", "cat": "COLL", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 111994.477997, "args": {"SeqNum": 0, "CommHash": 1493613951195738943, "Rank": 0, "Count": 262144, "Datatype": "ncclFloat32", "Algorithm": "RING", "Protocol": "SIMPLE", "nChannels": 2}},
+{"name": "KernelCh", "cat": "GPU", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 119711.888000, "args": {"Channel": 0, "StartGpuClk": 1756135989724672000, "StopGpuClk": 1756135989732831232}},
+{"name": "ScheduleRecv", "cat": "PROXY", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 119652.709991, "args": {"Channel": 0, "Peer": 1, "Steps": 4, "ChunkSize": 4194304, "transSize": 524288}},
+{"name": "ScheduleRecv", "cat": "PROXY", "ph": "e", "id": 0, "pid": 225798, "tid": 1, "ts": 119686.300995},
+{"name": "ProgressRecv", "cat": "PROXY", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 119686.300995, "args": {"Channel": 0, "Peer": 1, "Steps": 4, "ChunkSize": 4194304, "transSize": 524288}},
+{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 119707.677979, "args": {"Step": 0}},
+{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 0, "pid": 225798, "tid": 1, "ts": 119807.691986},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 119807.691986, "args": {"Step": 0}},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 0, "pid": 225798, "tid": 1, "ts": 119867.338989},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 0, "pid": 225798, "tid": 1, "ts": 119867.338989, "args": {"Step": 0}},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 0, "pid": 225798, "tid": 1, "ts": 120120.983002},
+{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 1, "pid": 225798, "tid": 1, "ts": 119733.647980, "args": {"Step": 1}},
+{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 1, "pid": 225798, "tid": 1, "ts": 119844.401001},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 1, "pid": 225798, "tid": 1, "ts": 119844.401001, "args": {"Step": 1}},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 1, "pid": 225798, "tid": 1, "ts": 119890.567993},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 1, "pid": 225798, "tid": 1, "ts": 119890.567993, "args": {"Step": 1}},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 1, "pid": 225798, "tid": 1, "ts": 120121.129974},
+{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 2, "pid": 225798, "tid": 1, "ts": 119753.023987, "args": {"Step": 2}},
+{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 2, "pid": 225798, "tid": 1, "ts": 120038.847992},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 2, "pid": 225798, "tid": 1, "ts": 120038.847992, "args": {"Step": 2}},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 2, "pid": 225798, "tid": 1, "ts": 120085.685974},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 2, "pid": 225798, "tid": 1, "ts": 120085.685974, "args": {"Step": 2}},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 2, "pid": 225798, "tid": 1, "ts": 120121.244995},
+{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 3, "pid": 225798, "tid": 1, "ts": 119772.510986, "args": {"Step": 3}},
+{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 3, "pid": 225798, "tid": 1, "ts": 120062.944977},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 3, "pid": 225798, "tid": 1, "ts": 120062.944977, "args": {"Step": 3}},
+{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 3, "pid": 225798, "tid": 1, "ts": 120101.089996},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 3, "pid": 225798, "tid": 1, "ts": 120101.089996, "args": {"Step": 3}},
+{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 3, "pid": 225798, "tid": 1, "ts": 120165.115997},
+{"name": "ProgressRecv", "cat": "PROXY", "ph": "e", "id": 0, "pid": 225798, "tid": 1, "ts": 120165.356995},
+{"name": "ScheduleSend", "cat": "PROXY", "ph": "b", "id": 1, "pid": 225798, "tid": 1, "ts": 119656.950989, "args": {"Channel": 0, "Peer": 1, "Steps": 4, "ChunkSize": 4194304, "transSize": 524288}},
+{"name": "ScheduleSend", "cat": "PROXY", "ph": "e", "id": 1, "pid": 225798, "tid": 1, "ts": 119709.078979},
+{"name": "ProgressSend", "cat": "PROXY", "ph": "b", "id": 1, "pid": 225798, "tid": 1, "ts": 119709.078979, "args": {"Channel": 0, "Peer": 1, "Steps": 4, "ChunkSize": 4194304, "transSize": 524288}},
+{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 4, "pid": 225798, "tid": 1, "ts": 119710.632996, "args": {"Step": 0}},
+{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 4, "pid": 225798, "tid": 1, "ts": 119808.636993},
+{"name": "SendPeerWait", "cat": "NET", "ph": "b", "id": 4, "pid": 225798, "tid": 1, "ts": 119808.636993, "args": {"Step": 0}},
+{"name": "SendPeerWait", "cat": "NET", "ph": "e", "id": 4, "pid": 225798, "tid": 1, "ts": 119818.972992},
  ... [ trace truncated for brevity ]
-{"name": "AllReduce", "cat": "COLL", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.317383},
-{"name": "Group", "cat": "GROUP", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.418945},
+{"name": "AllReduce", "cat": "COLL", "ph": "e", "id": 17, "pid": 225798, "tid": 1, "ts": 170633.535980},
+{"name": "AllReduce", "cat": "COLL_API", "ph": "e", "id": 17, "pid": 225798, "tid": 1, "ts": 170582.923981},
+{"name": "Group API", "cat": "GROUP_API", "ph": "e", "id": 17, "pid": 225798, "tid": 1, "ts": 170637.582001},
 {}]
 ```
 
 Details about the fields used in the trace can be found at this link:
 https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw
 
-The trace above is obtained by running a `ncclAllReduce` operation on 8 GPUs, communicating with each other through
+The trace above is obtained by running a `ncclAllReduce` operation on 2 GPUs, communicating with each other through
 the network interface. The `Group` event encloses all traces that are related to the single `ncclAllReduce` call.
 (Note that for single collective invocations, where there are no explicit group calls, NCCL creates a group with only
 one collective and this is what is presented in the traces above).
@@ -161,38 +145,17 @@ The `AllReduce` entry presents information about the `ncclAllReduce` operation.
 - datatype    : NCCL datatype
 - algorithm   : algorithm used to process the ncclAllReduce
 - protocol    : protocol used to process the ncclAllReduce
-- nMaxChannels: max number of channels used to process the ncclAllReduce
+- nChannels   : number of channels used to process the ncclAllReduce
 
 If the proxy events are not active (e.g., the `ncclAllReduce` is intranode) the end timestamp will match the time
 consumed by the CPU to launch the collective. For more details refer to `ext-profiler/README.md`, section `Profiling
 of collective and p2p operations`.
 
-### Proxy Send
-The `Send` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
-info in the args field:
-
-- Channel      : id of the channel used by this proxy operation to send data to the peer
-- Peer         : peer rank
-- Steps        : number of network steps required to transfer transSize bytes to the peer
-- ChunkSize    : chunk size used by NCCL to pipeline data through the proxy thread
-- transSize    : bytes transferred across the channel by this proxy operation
-- POSTED       : struct containing the number of buffer posts to the GPU and the time stamp for the last post
-- REM_FIFO_WAIT: struct containing the number of remote buffer waits and the time stamp for the last wait
-- TRANSMITTED  : struct containing the number of network sends and the time stamp of the last send
-- DONE         : struct containing the number of network sends completed and the time stamp of the last send completed
-
-In case of a network problem the POSTED, REM_FIFO_WAIT, TRANSMITTED and DONE might all have partially updated steps,
-which could help identify at which point the network problem occurred.
-
 The Proxy send trace gives a summary of the proxy progress thread activity for the channel. If more details are
 needed, these can be obtained by enabling the proxy step event (`ncclProfileProxyStep`). In which case the trace
 entries below are also reported by the profiler.
 
-#### Proxy SendBufferWait
-
-Presents, for every network step, the time the CPU proxy spends waiting for the channel staging buffer to become available.
-
-#### Proxy SendGPUWait
+#### Proxy SendGpuWait
 
 Presents, for every network step, the time the CPU proxy spends waiting for the GPU to provide the data in the staging
 buffer.
@@ -201,31 +164,6 @@ buffer.
 
 Presents, for every network step, the time the CPU proxy spends waiting for the `isend` to complete
 
-### Proxy Recv
-
-The `Recv` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
-info in the args field:
-
-- Channel    : id of the channel used by this proxy operation to recv data from the peer
-- Peer       : peer rank
-- Steps      : number of network steps required to transfer transSize bytes from the peer
-- ChunkSize  : chunk size used by NCCL to pipeline data through the proxy thread
-- transSize  : bytes transferred across the channel by this proxy operation
-- POSTED     : struct containing the number of recvs posted and the time stamp for the last recv posted
-- RECEIVED   : struct containing the number of recvs completed and the time stamp for the last recv completed
-- TRANSMITTED: struct containing the number of recvs flushed to the GPU memory and the time stamp for the last recv flushed
-- DONE       : struct containing the number of flush completed and the time stamp for the last flush completed
-
-The Proxy Recv trace gives a summary of the proxy progress thread activity for the channel. If more details are
-needed, these can be obtained by enabling the proxy step event (`ncclProfileProxyStep`). In which case the trace
-entries below are also reported by the profiler.
-
-
-#### Proxy RecvBufferWait
-
-Presents, for every network step, the time the CPU proxy spends waiting for the staging buffer for the channel to
-become available.
-
 #### Proxy RecvWait
 
 Presents, for every network step, the time the CPU proxy spends waiting for a posted `irecv` to complete
@@ -234,6 +172,6 @@ Presents, for every network step, the time the CPU proxy spends waiting for a po
 
 Presents, for every network step, the time the CPU proxy spends waitng for the recv data to be flushed to the GPU
 
-#### Proxy RecvGPUWait
+#### Proxy RecvGpuWait
 
 Presents, for every network step, the time the CPU proxy spends waiting for the GPU to consume the recv data
diff --git a/projects/rccl/ext-profiler/example/event.c b/projects/rccl/ext-profiler/example/event.c
deleted file mode 100644
index 717fe86884..0000000000
--- a/projects/rccl/ext-profiler/example/event.c
+++ /dev/null
@@ -1,30 +0,0 @@
-/*************************************************************************
- * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
- *
- * See LICENSE.txt for license information
- ************************************************************************/
-
-#include <stdio.h>
-#include "event.h"
-
-int taskEventQueueEmpty(struct group* g) {
-  return g->eventHead == NULL;
-}
-
-void taskEventQueueEnqueue(struct group* g, struct taskEventBase* event) {
-  event->next = NULL;
-  if (g->eventHead) g->eventTail->next = event;
-  else g->eventHead = event;
-  g->eventTail = event;
-}
-
-struct taskEventBase* taskEventQueueHead(struct group* g) {
-  return g->eventHead;
-}
-
-struct taskEventBase* taskEventQueueDequeue(struct group* g) {
-  struct taskEventBase* tmp = g->eventHead;
-  g->eventHead = g->eventHead->next;
-  if (g->eventHead == NULL) g->eventTail = NULL;
-  return tmp;
-}
diff --git a/projects/rccl/ext-profiler/example/event.h b/projects/rccl/ext-profiler/example/event.h
index 4c1b8f53a9..ae830cd25f 100644
--- a/projects/rccl/ext-profiler/example/event.h
+++ b/projects/rccl/ext-profiler/example/event.h
@@ -10,10 +10,14 @@
 #include <sys/types.h>
 #include <stdint.h>
 #include <unistd.h>
+#include <cstring>
+#include "err.h"
 #include "profiler.h"
+#include "queue.h"
+#include <cuda_runtime.h>
 
 #define MAX_CHANNELS                     32
-#define MAX_STEPS                        16
+#define MAX_STEPS                        1024
 #define MAX_OPS                          16 // Up to 64K ranks for PAT
 #define MAX_EVENTS_PER_REQ               (8)
 
@@ -21,7 +25,7 @@ struct proxyOp;
 struct proxyStep;
 
 struct netPlugin {
-  uint8_t type;
+  uint64_t type;
   int pluginType;
   int pluginVer;
   uint8_t pluginEvent;
@@ -63,7 +67,7 @@ struct kernelCh {
 #define PROXY_STEP_MAX_STATES 3
 
 struct proxyStep {
-  uint8_t type;                     // type of event: network transfer
+  uint64_t type;                    // type of event: network transfer
   int state;
   int step;                         // network transfer id in given channel
   int isSend;                       // send/recv channel operation
@@ -76,7 +80,7 @@ struct proxyStep {
 };
 
 struct proxyOp {
-  uint8_t type;                     // type of event: proxy operation
+  uint64_t type;                    // type of event: proxy operation
   uint8_t channelId;                // channel id for this proxy operation
   pid_t pid;
   int rank;
@@ -97,7 +101,7 @@ struct group;
 struct context;
 
 struct proxyCtrl {
-  uint8_t type;
+  uint64_t type;
   struct context* ctx;              // profiler context
   double startTs;
   double stopTs;
@@ -107,12 +111,12 @@ struct proxyCtrl {
 
 // task level event base structure
 struct taskEventBase {
-  uint8_t type;                     // event type: collective/p2p
+  uint64_t type;                    // event type: collective/p2p
   int rank;                         // rank of the operation in NCCL communicator
   const char* func;                 // ncclFunc*
   int refCount;                     // number of references for this operation
-  struct group* parent;             // parent event group
-  struct taskEventBase* next;       // next top level event in group
+  void* parent;                     // parent API event
+  struct taskEventBase* next;       // next top level event
   double startTs;
   double stopTs;
 };
@@ -147,7 +151,7 @@ struct p2p {
 };
 
 struct group {
-  uint8_t type;
+  uint64_t type;
   struct context* ctx;              // profiler context
   int groupId;
   int refCount;
@@ -158,6 +162,70 @@ struct group {
   struct group* next;               // next group event in queue
 };
 
+struct collApi {
+  uint64_t type;
+  struct groupApi* parent;
+  struct context* ctx;              // profiler context
+  int collApiId;
+  int refCount;
+  cudaStream_t stream;
+  const char* func;
+  size_t count;
+  const char* datatype;
+  int root;
+  bool graphCaptured;
+  struct taskEventBase* eventHead;  // queue head for task events
+  struct taskEventBase* eventTail;  // queue tail for task events
+  double startTs;
+  double stopTs;
+  struct collApi* next;
+};
+
+struct p2pApi {
+  uint64_t type;
+  struct groupApi* parent;
+  struct context* ctx;              // profiler context
+  int p2pApiId;
+  int refCount;
+  const char* func;
+  cudaStream_t stream;
+  size_t count;
+  const char* datatype;
+  bool graphCaptured;
+  struct taskEventBase* eventHead;  // queue head for task events
+  struct taskEventBase* eventTail;  // queue tail for task events
+  double startTs;
+  double stopTs;
+  struct p2pApi* next;
+};
+
+struct kernelLaunch {
+  uint64_t type;
+  struct groupApi* parent;
+  cudaStream_t stream;
+  int kernelLaunchId;
+  double startTs;
+  double stopTs;
+  struct kernelLaunch* next;
+};
+
+struct groupApi {
+  uint64_t type;
+  struct context* ctx;
+  int groupApiId;
+  int refCount;
+  bool graphCaptured;
+  int groupDepth;
+  struct profilerQueue<struct p2pApi, &p2pApi::next> p2pApiEvents;
+  struct profilerQueue<struct collApi, &collApi::next> collApiEvents;
+  struct profilerQueue<struct kernelLaunch, &kernelLaunch::next> kernelLaunchEvents;
+  double endOfncclGroupStartTs;
+  double startOfncclGroupEndTs;
+  double startTs;
+  double stopTs;
+  struct groupApi* next;
+};
+
 // arrays for different event objects
 struct context {
   const char* commName;
@@ -165,6 +233,26 @@ struct context {
   int nranks;
   int rank;
 
+  int groupApiPoolSize;
+  int groupApiPoolBase;
+  int groupApiPoolIndex;
+  struct groupApi* groupApiPool;
+
+  int collApiPoolSize;
+  int collApiPoolBase;
+  int collApiPoolIndex;
+  struct collApi* collApiPool;
+
+  int p2pApiPoolSize;
+  int p2pApiPoolBase;
+  int p2pApiPoolIndex;
+  struct p2pApi* p2pApiPool;
+
+  int kernelLaunchPoolSize;
+  int kernelLaunchPoolBase;
+  int kernelLaunchPoolIndex;
+  struct kernelLaunch* kernelLaunchPool;
+
   int groupPoolSize;
   int groupPoolBase;
   int groupPoolIndex;
@@ -186,9 +274,50 @@ struct context {
   struct proxyCtrl* proxyCtrlPool;
 };
 
-int taskEventQueueEmpty(struct group* g);
-void taskEventQueueEnqueue(struct group* g, struct taskEventBase* event);
-struct taskEventBase* taskEventQueueHead(struct group* g);
-struct taskEventBase* taskEventQueueDequeue(struct group* g);
+template <typename T>
+inline int taskEventQueueEmpty(T *obj) {
+  return obj->eventHead == NULL;
+}
+
+template <typename T>
+inline void taskEventQueueEnqueue(T* obj, struct taskEventBase* event) {
+  event->next = NULL;
+  if (obj->eventHead) obj->eventTail->next = event;
+  else obj->eventHead = event;
+  obj->eventTail = event;
+}
+
+template <typename T>
+inline struct taskEventBase* taskEventQueueHead(T *obj) {
+  return obj->eventHead;
+}
+
+template <typename T>
+inline struct taskEventBase* taskEventQueueDequeue(T* obj) {
+  struct taskEventBase* tmp = obj->eventHead;
+  obj->eventHead = obj->eventHead->next;
+  if (obj->eventHead == NULL) obj->eventTail = NULL;
+  return tmp;
+}
+
+template <typename T>
+inline void resetTaskEvents(T *obj, struct context* ctx) {
+  while (!taskEventQueueEmpty(obj)) {
+    struct taskEventBase* base = taskEventQueueDequeue(obj);
+    if (base->type == ncclProfileColl) {
+      struct collective* c = (struct collective *)base;
+      // reset event proxyOps & proxySteps
+      memset(c->nProxyOps, 0, sizeof(int)*MAX_CHANNELS);
+      // release collective events in the group and return them to the collective pool
+      __atomic_fetch_add(&ctx->collPoolBase, 1, __ATOMIC_RELAXED);
+    } else if (base->type == ncclProfileP2p) {
+      struct p2p* p = (struct p2p *)base;
+      // reset event proxyOp and proxySteps
+      memset(&p->op, 0, sizeof(struct proxyOp)*MAX_CHANNELS);
+      // release p2p events in the group and return them to the p2p pool
+      __atomic_fetch_add(&ctx->p2pPoolBase, 1, __ATOMIC_RELAXED);
+    }
+  }
+}
 
 #endif
diff --git a/projects/rccl/ext-profiler/example/nccl/profiler.h b/projects/rccl/ext-profiler/example/nccl/profiler.h
index c911426d91..715885f72b 100644
--- a/projects/rccl/ext-profiler/example/nccl/profiler.h
+++ b/projects/rccl/ext-profiler/example/nccl/profiler.h
@@ -11,17 +11,20 @@
 #include <stdlib.h>
 
 #include "common.h"
-#include "err.h"
 
 enum {
-  ncclProfileGroup     = (1 << 0),  // group event type
-  ncclProfileColl      = (1 << 1),  // host collective call event type
-  ncclProfileP2p       = (1 << 2),  // host point-to-point call event type
-  ncclProfileProxyOp   = (1 << 3),  // proxy operation event type
-  ncclProfileProxyStep = (1 << 4),  // proxy step event type
-  ncclProfileProxyCtrl = (1 << 5),  // proxy control event type
-  ncclProfileKernelCh  = (1 << 6),  // kernel channel event type
-  ncclProfileNetPlugin = (1 << 7),  // network plugin-defined, events
+  ncclProfileGroup          = (1 << 0),  // group event type
+  ncclProfileColl           = (1 << 1),  // host collective call event type
+  ncclProfileP2p            = (1 << 2),  // host point-to-point call event type
+  ncclProfileProxyOp        = (1 << 3),  // proxy operation event type
+  ncclProfileProxyStep      = (1 << 4),  // proxy step event type
+  ncclProfileProxyCtrl      = (1 << 5),  // proxy control event type
+  ncclProfileKernelCh       = (1 << 6),  // kernel channel event type
+  ncclProfileNetPlugin      = (1 << 7),  // network plugin-defined events
+  ncclProfileGroupApi       = (1 << 8),  // Group API events
+  ncclProfileCollApi        = (1 << 9),  // Collective API events
+  ncclProfileP2pApi         = (1 << 10), // Point-to-Point API events
+  ncclProfileKernelLaunch   = (1 << 11), // Kernel launch events
 };
 
 typedef enum {
@@ -56,21 +59,27 @@ typedef enum {
 
   /* Kernel event states */
   ncclProfilerKernelChStop             = 22,
+
+  /* Group API States */
+  ncclProfilerEndGroupApiStart         = 23,
+  ncclProfilerBeginGroupApiEnd         = 24
 } ncclProfilerEventState_t;
 
 typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
 typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
 typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
 typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
+typedef ncclProfilerEventState_t ncclProfilerEventState_v5_t;
 
+#include "profiler_v5.h"
 #include "profiler_v4.h"
 #include "profiler_v3.h"
 #include "profiler_v2.h"
 #include "profiler_v1.h"
 #include "profiler_net.h"
 
-typedef ncclProfiler_v4_t ncclProfiler_t;
-typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
-typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
+typedef ncclProfiler_v5_t ncclProfiler_t;
+typedef ncclProfilerEventDescr_v5_t ncclProfilerEventDescr_t;
+typedef ncclProfilerEventStateArgs_v5_t ncclProfilerEventStateArgs_t;
 
 #endif // end include guard
diff --git a/projects/rccl/ext-profiler/example/nccl/profiler_v5.h b/projects/rccl/ext-profiler/example/nccl/profiler_v5.h
new file mode 100644
index 0000000000..8bbc85eeb3
--- /dev/null
+++ b/projects/rccl/ext-profiler/example/nccl/profiler_v5.h
@@ -0,0 +1,152 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V5_H_
+#define PROFILER_V5_H_
+#include <stdbool.h>
+
+typedef struct {
+  uint64_t type;                // event type descriptor: ncclProfileGroupApi, ...
+  void* parentObj;              // pointer to the profiler parent object
+  int rank;                     // originating rank
+  union {
+    struct {
+      int graphCaptured;
+      int groupDepth;
+    } groupApi;
+
+    struct {
+      const char* func;
+      size_t count;
+      const char* datatype;
+      int root;
+      void* stream;
+      bool graphCaptured;
+    } collApi;
+
+    struct {
+      const char* func;
+      size_t count;
+      const char* datatype;
+      void* stream;
+      bool graphCaptured;
+    } p2pApi;
+
+    struct {
+      void* stream;
+    } kernelLaunch;
+
+    struct {
+      uint64_t seqNumber;
+      const char* func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      const char* datatype;
+      uint8_t nChannels;
+      uint8_t nWarps;
+      const char* algo;
+      const char* proto;
+      void* parentGroup; // for backward compatibility with v4
+    } coll;
+
+    struct {
+      const char* func;
+      void* buff;
+      const char* datatype;
+      size_t count;
+      int peer;
+      uint8_t nChannels;
+      void* parentGroup; // for backward compatibility with v4
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+
+    struct {
+      uint8_t channelId;
+      uint64_t pTimer;          // start timestamp from GPU globaltimer
+    } kernelCh;
+
+    struct {
+      int64_t id;
+      void* data;
+    } netPlugin;
+  };
+} ncclProfilerEventDescr_v5_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+  } proxyStep;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+
+  struct {
+    void* data;
+  } netPlugin;
+
+  struct {
+    uint64_t pTimer;
+  } kernelCh;
+} ncclProfilerEventStateArgs_v5_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  //  - commId         : communicator id
+  //  - commName       : user assigned communicator name
+  //  - nNodes         : number of nodes in communicator
+  //  - nranks         : number of ranks in communicator
+  //  - rank           : rank identifier in communicator
+  //  - logfn          : logger function
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, uint64_t commId, int* eActivationMask, const char* commName, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v5_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v5_t eState, ncclProfilerEventStateArgs_v5_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v5_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/example/plugin.c b/projects/rccl/ext-profiler/example/plugin.cc
similarity index 68%
rename from projects/rccl/ext-profiler/example/plugin.c
rename to projects/rccl/ext-profiler/example/plugin.cc
index b89cd46275..f6d4956b3d 100644
--- a/projects/rccl/ext-profiler/example/plugin.c
+++ b/projects/rccl/ext-profiler/example/plugin.cc
@@ -6,7 +6,7 @@
 
 #include <stdio.h>
 #include <pthread.h>
-#include <string.h>
+#include <cstring>
 #include <linux/limits.h>
 #include <sys/time.h>
 #include <sys/types.h>
@@ -22,12 +22,20 @@ static int initialized;             // initialization counter for profiler
 static double startTime;            // profiler start time
 
 static const int defaultEActivationMask = ncclProfileColl | ncclProfileP2p;
-static const int defaultGroupPoolSize = 16;
-static const int defaultCollPoolSize = 16;
-static const int defaultP2pPoolSize = 1024;
+static const int defaultGroupApiPoolSize = 256;
+static const int defaultCollApiPoolSize = 256;
+static const int defaultP2pApiPoolSize = 256;
+static const int defaultKernelLaunchPoolSize = 256;
+static const int defaultGroupPoolSize = 256;
+static const int defaultCollPoolSize = 256;
+static const int defaultP2pPoolSize = 256;
 static const int defaultProxyCtrlPoolSize = 16;
-static const int defaultDetachPoolSize = 128;
+static const int defaultDetachPoolSize = 256;
 
+static int groupApiPoolSize;
+static int collApiPoolSize;
+static int p2pApiPoolSize;
+static int kernelLaunchPoolSize;
 static int groupPoolSize;
 static int collPoolSize;
 static int p2pPoolSize;
@@ -51,7 +59,7 @@ static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 static pid_t pid;
 static int* eActivationMaskPtr;
 
-__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
+__hidden ncclResult_t exampleProfilerInit(void** context, uint64_t commId, int* eActivationMask, const char* commName, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
   pthread_mutex_lock(&lock);
   if (__atomic_fetch_add(&initialized, 1, __ATOMIC_RELAXED) == 0) {
     // first thread initializes event mask, environment and detach pool
@@ -59,6 +67,18 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask,
     str = getenv("NCCL_PROFILE_EVENT_MASK");
     __atomic_store_n(eActivationMask, str ? atoi(str) : 0, __ATOMIC_RELAXED);
 
+    str = getenv("NCCL_PROFILE_GROUP_API_POOL_SIZE");
+    groupApiPoolSize = str ? atoi(str) : defaultGroupApiPoolSize;
+
+    str = getenv("NCCL_PROFILE_COLL_API_POOL_SIZE");
+    collApiPoolSize = str ? atoi(str) : defaultCollApiPoolSize;
+
+    str = getenv("NCCL_PROFILE_P2P_API_POOL_SIZE");
+    p2pApiPoolSize = str ? atoi(str) : defaultP2pApiPoolSize;
+
+    str = getenv("NCCL_PROFILE_KERNEL_LAUNCH_POOL_SIZE");
+    kernelLaunchPoolSize = str ? atoi(str) : defaultKernelLaunchPoolSize;
+
     str = getenv("NCCL_PROFILE_GROUP_POOL_SIZE");
     groupPoolSize = str ? atoi(str) : defaultGroupPoolSize;
 
@@ -96,11 +116,23 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask,
   // pre-allocate memory for event object pools in dedicated profiler context
   struct context* ctx = (struct context *)calloc(1, sizeof(*ctx));
   ctx->commName = commName;
-  ctx->commHash = commHash;
+  ctx->commHash = commId;
   ctx->nranks = nranks;
   ctx->rank = rank;
   logFn = logfn;
-  INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d", commName ? commName : "", commHash, nranks, rank);
+  INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d", commName ? commName : "", commId, nranks, rank);
+
+  ctx->groupApiPool = (struct groupApi *)calloc(groupApiPoolSize, sizeof(*ctx->groupApiPool));
+  if (ctx->groupApiPool == NULL) goto fail;
+
+  ctx->collApiPool = (struct collApi *)calloc(collApiPoolSize, sizeof(*ctx->collApiPool));
+  if (ctx->collApiPool == NULL) goto fail;
+
+  ctx->p2pApiPool = (struct p2pApi *)calloc(p2pApiPoolSize, sizeof(*ctx->p2pApiPool));
+  if (ctx->p2pApiPool == NULL) goto fail;
+
+  ctx->kernelLaunchPool = (struct kernelLaunch *)calloc(kernelLaunchPoolSize, sizeof(*ctx->kernelLaunchPool));
+  if (ctx->kernelLaunchPool == NULL) goto fail;
 
   ctx->groupPool = (struct group *)calloc(groupPoolSize, sizeof(*ctx->groupPool));
   if (ctx->groupPool == NULL) goto fail;
@@ -130,16 +162,22 @@ fail:
   if (ctx->p2pPool) free(ctx->p2pPool);
   if (ctx->collPool) free(ctx->collPool);
   if (ctx->groupPool) free(ctx->groupPool);
+  if (ctx->collApiPool) free(ctx->collApiPool);
+  if (ctx->p2pApiPool) free(ctx->p2pApiPool);
+  if (ctx->kernelLaunchPool) free(ctx->kernelLaunchPool);
+  if (ctx->groupApiPool) free(ctx->groupApiPool);
   free(ctx);
   if (detachPool) free(detachPool);
   return ncclSystemError;
 }
 
+static const char* profilerDumpFile;
+
 __hidden ncclResult_t exampleProfilerFinalize(void* context) {
   FILE* fh = NULL;
   char filename[PATH_MAX] = { 0 };
   struct context* ctx = (struct context *)context;
-  const char* dump = getenv("NCCL_PROFILE_DUMP_FILE");
+  const char* dump = profilerDumpFile ? profilerDumpFile : getenv("NCCL_PROFILE_DUMP_FILE");
   if (dump) {
     sprintf(filename, "%s_%lu_%d.json", dump, ctx->commHash, ctx->rank);
     fh = fopen(filename, "w");
@@ -148,10 +186,12 @@ __hidden ncclResult_t exampleProfilerFinalize(void* context) {
   INFO(NCCL_INIT, "PROFILER/Plugin: finalize commName: %s commHash: %lu nranks: %d rank: %d", ctx->commName ? ctx->commName : "", ctx->commHash, ctx->nranks, ctx->rank);
 
   // print last N groups/collectives/p2ps
-  int start = (ctx->groupPoolIndex - groupPoolSize >= 0) ? ctx->groupPoolIndex - groupPoolSize : 0;
-  int end = ctx->groupPoolIndex;
+  // Since v5 of the profiler interface, group API events sit at the top of the event hierarchy.
+  // Legacy v4 Group events are still emitted for compatibility when the v4 profiler is used, but they are excluded from this example.
+  int start = (ctx->groupApiPoolIndex - groupApiPoolSize >= 0) ? ctx->groupApiPoolIndex - groupApiPoolSize : 0;
+  int end = ctx->groupApiPoolIndex;
   for (int i = start; i < end; i++) {
-    printEvent(fh, &ctx->groupPool[i%groupPoolSize]);
+    printEvent(fh, &ctx->groupApiPool[i%groupApiPoolSize]);
   }
 
   start = (ctx->proxyCtrlPoolIndex - proxyCtrlPoolSize >= 0) ? ctx->proxyCtrlPoolIndex - proxyCtrlPoolSize : 0;
@@ -161,6 +201,10 @@ __hidden ncclResult_t exampleProfilerFinalize(void* context) {
   }
 
   free(ctx->groupPool);
+  free(ctx->collApiPool);
+  free(ctx->p2pApiPool);
+  free(ctx->kernelLaunchPool);
+  free(ctx->groupApiPool);
   free(ctx->collPool);
   free(ctx->p2pPool);
   free(ctx->proxyCtrlPool);
@@ -187,7 +231,113 @@ __hidden void updateEvent(void* handle);
 __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, ncclProfilerEventDescr_t* eDescr) {
   *eHandle = NULL;
   struct context* ctx = (struct context *)context;
-  if (eDescr->type == ncclProfileGroup) {
+  if (eDescr->type == ncclProfileGroupApi) {
+    struct groupApi* event;
+    int groupApiId = __atomic_fetch_add(&ctx->groupApiPoolIndex, 1, __ATOMIC_RELAXED);
+    if ((groupApiId - __atomic_load_n(&ctx->groupApiPoolBase, __ATOMIC_RELAXED)) < groupApiPoolSize) {
+      // if there are available group API events grab one
+      event = &ctx->groupApiPool[groupApiId%groupApiPoolSize];
+      // Make sure all child events of the picked group API event are cleared
+      while (!profilerQueueEmpty(&event->collApiEvents)) {
+        struct collApi *collApiEvent = profilerQueueDequeue(&event->collApiEvents);
+        resetTaskEvents(collApiEvent, ctx);
+        __atomic_fetch_add(&ctx->collApiPoolBase, 1, __ATOMIC_RELAXED);
+      }
+      while (!profilerQueueEmpty(&event->p2pApiEvents)) {
+        struct p2pApi *p2pApiEvent = profilerQueueDequeue(&event->p2pApiEvents);
+        resetTaskEvents(p2pApiEvent, ctx);
+        __atomic_fetch_add(&ctx->p2pApiPoolBase, 1, __ATOMIC_RELAXED);
+      }
+      while (!profilerQueueEmpty(&event->kernelLaunchEvents)) {
+        profilerQueueDequeue(&event->kernelLaunchEvents);
+        __atomic_fetch_add(&ctx->kernelLaunchPoolBase, 1, __ATOMIC_RELAXED);
+      }
+    } else {
+      // else drop this event
+      __atomic_fetch_sub(&ctx->groupApiPoolIndex, 1, __ATOMIC_RELAXED);
+      return ncclSuccess;
+    }
+    event->type = ncclProfileGroupApi;
+    event->ctx = ctx;
+    event->groupApiId = groupApiId;
+    event->graphCaptured = eDescr->groupApi.graphCaptured;
+    event->groupDepth = eDescr->groupApi.groupDepth;
+    event->startTs = gettime() - startTime;
+    *eHandle = event;
+  } else if (eDescr->type == ncclProfileCollApi) {
+    if (eDescr->parentObj == NULL) return ncclSuccess;
+    struct collApi* event;
+    int collApiId = __atomic_fetch_add(&ctx->collApiPoolIndex, 1, __ATOMIC_RELAXED);
+    if ((collApiId - __atomic_load_n(&ctx->collApiPoolBase, __ATOMIC_RELAXED)) < collApiPoolSize) {
+      // if there are available Coll API events grab one
+      event = &ctx->collApiPool[collApiId%collApiPoolSize];
+      resetTaskEvents(event, ctx);
+    } else {
+      // else drop this event
+      __atomic_fetch_sub(&ctx->collApiPoolIndex, 1, __ATOMIC_RELAXED);
+      return ncclSuccess;
+    }
+    event->type = ncclProfileCollApi;
+    event->collApiId = collApiId;
+    event->ctx = ctx;
+    event->func = eDescr->collApi.func;
+    event->stream = (cudaStream_t) eDescr->collApi.stream;
+    event->count = eDescr->collApi.count;
+    event->datatype = eDescr->collApi.datatype;
+    event->root = eDescr->collApi.root;
+    event->graphCaptured = eDescr->collApi.graphCaptured;
+    struct groupApi* parent = (struct groupApi *) eDescr->parentObj;
+    event->parent = parent;
+    profilerQueueEnqueue(&parent->collApiEvents, event);
+    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
+    *eHandle = event;
+  } else if (eDescr->type == ncclProfileP2pApi) {
+    if (eDescr->parentObj == NULL) return ncclSuccess;
+    struct p2pApi* event;
+    int p2pApiId = __atomic_fetch_add(&ctx->p2pApiPoolIndex, 1, __ATOMIC_RELAXED);
+    if ((p2pApiId - __atomic_load_n(&ctx->p2pApiPoolBase, __ATOMIC_RELAXED)) < p2pApiPoolSize) {
+      // if there are available p2p API events grab one
+      event = &ctx->p2pApiPool[p2pApiId%p2pApiPoolSize];
+      resetTaskEvents(event, ctx);
+    } else {
+      // else drop this event
+      __atomic_fetch_sub(&ctx->p2pApiPoolIndex, 1, __ATOMIC_RELAXED);
+      return ncclSuccess;
+    }
+    event->type = ncclProfileP2pApi;
+    event->p2pApiId = p2pApiId;
+    event->ctx = ctx;
+    event->func = eDescr->p2pApi.func;
+    event->stream = (cudaStream_t) eDescr->p2pApi.stream;
+    event->count = eDescr->p2pApi.count;
+    event->datatype = eDescr->p2pApi.datatype;
+    event->graphCaptured = eDescr->p2pApi.graphCaptured;
+    struct groupApi* parent = (struct groupApi *) eDescr->parentObj;
+    event->parent = parent;
+    profilerQueueEnqueue(&parent->p2pApiEvents, event);
+    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
+    *eHandle = event;
+  } else if (eDescr->type == ncclProfileKernelLaunch) {
+    if (eDescr->parentObj == NULL) return ncclSuccess;
+    struct kernelLaunch* event;
+    int kernelLaunchId = __atomic_fetch_add(&ctx->kernelLaunchPoolIndex, 1, __ATOMIC_RELAXED);
+    if ((kernelLaunchId - __atomic_load_n(&ctx->kernelLaunchPoolBase, __ATOMIC_RELAXED)) < kernelLaunchPoolSize) {
+      // if there are available kernel launch events grab one
+      event = &ctx->kernelLaunchPool[kernelLaunchId%kernelLaunchPoolSize];
+    } else {
+      // else drop this event
+      __atomic_fetch_sub(&ctx->kernelLaunchPoolIndex, 1, __ATOMIC_RELAXED);
+      return ncclSuccess;
+    }
+    event->type = ncclProfileKernelLaunch;
+    event->stream = (cudaStream_t) eDescr->kernelLaunch.stream;
+    struct groupApi* parent = (struct groupApi *) eDescr->parentObj;
+    event->parent = parent;
+    profilerQueueEnqueue(&parent->kernelLaunchEvents, event);
+    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
+    *eHandle = event;
+  } else if (eDescr->type == ncclProfileGroup) {
+    if (eDescr->parentObj == NULL) return ncclSuccess;
     struct group* event;
     int groupId = __atomic_fetch_add(&ctx->groupPoolIndex, 1, __ATOMIC_RELAXED);
     if ((groupId - __atomic_load_n(&ctx->groupPoolBase, __ATOMIC_RELAXED)) < groupPoolSize) {
@@ -222,7 +372,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
     debugEvent(event, "GroupStart");
   } else if (eDescr->type == ncclProfileColl) {
     // the parent might be null if we run out of events
-    struct group* parent = (struct group *)eDescr->parentObj;
+    struct collApi* parent = (struct collApi *)eDescr->parentObj;
     if (parent == NULL) return ncclSuccess;
 
     struct collective* event;
@@ -253,12 +403,12 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
     event->proto = eDescr->coll.proto;
     *eHandle = event;
     taskEventQueueEnqueue(parent, (struct taskEventBase *)event);
-    // increment the group ref counter so the event will staty open
+    // increment the group ref counter so the event will stay open
     __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
     debugEvent(event, "CollStart");
   } else if (eDescr->type == ncclProfileP2p) {
     // the parent might be null if we run out of events
-    struct group* parent = (struct group *)eDescr->parentObj;
+    struct p2pApi* parent = (struct p2pApi*) eDescr->parentObj;
     if (parent == NULL) return ncclSuccess;
 
     struct p2p* event;
@@ -458,8 +608,34 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
 }
 
 void updateEvent(void* handle) {
-  uint8_t type = *(uint8_t *)handle;
-  if (type == ncclProfileGroup) {
+  uint64_t type = *(uint64_t *)handle;
+  if (type == ncclProfileGroupApi) {
+    struct groupApi* event = (struct groupApi*) handle;
+    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
+      event->stopTs = gettime() - startTime;
+      __atomic_fetch_add(&event->ctx->groupApiPoolBase, 1, __ATOMIC_RELAXED);
+    }
+  } else if (type == ncclProfileCollApi) {
+    struct collApi* event = (struct collApi*) handle;
+    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
+      event->stopTs = gettime() - startTime;
+      __atomic_fetch_add(&event->ctx->collApiPoolBase, 1, __ATOMIC_RELAXED);
+    }
+    updateEvent(event->parent);
+    return;
+  } else if (type == ncclProfileP2pApi) {
+    struct p2pApi* event = (struct p2pApi*) handle;
+    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
+      event->stopTs = gettime() - startTime;
+      __atomic_fetch_add(&event->ctx->p2pApiPoolBase, 1, __ATOMIC_RELAXED);
+    }
+    updateEvent(event->parent);
+    event->stopTs = gettime() - startTime;
+  } else if (type == ncclProfileKernelLaunch) {
+    struct kernelLaunch* event = (struct kernelLaunch*) handle;
+    event->stopTs = gettime() - startTime;
+    updateEvent(event->parent);
+  } else if (type == ncclProfileGroup) {
     struct group* event = (struct group *)handle;
     if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
       event->stopTs = gettime() - startTime;
@@ -527,25 +703,35 @@ __hidden ncclResult_t exampleProfilerStopEvent(void* eHandle) {
   // the event handle might be null if we run out of events
   if (eHandle == NULL) return ncclSuccess;
 
-  uint8_t type = *(uint8_t *)eHandle;
-  if (type == ncclProfileGroup) {
-    // stopping the group event in NCCL core does not
-    // mean the group has completed. It means the group
-    // was submitted/enqueued so we need to keep the event open
+  uint64_t type = *(uint64_t *)eHandle;
+  // Stopping an API event, kernel launch event, or collective/p2p task event
+  // in NCCL core does not mean the operation has completed. It only means the
+  // operation was enqueued, so we need to keep the event open
+  if (type == ncclProfileGroupApi) {
+    struct groupApi* event = (struct groupApi*) eHandle;
+    event->stopTs = gettime() - startTime;
+    return ncclSuccess;
+  } else if (type == ncclProfileCollApi) {
+    struct collApi* event = (struct collApi*) eHandle;
+    event->stopTs = gettime() - startTime;
+    return ncclSuccess;
+  } else if (type == ncclProfileP2pApi) {
+    struct p2pApi* event = (struct p2pApi*) eHandle;
+    event->stopTs = gettime() - startTime;
+    return ncclSuccess;
+  } else if (type == ncclProfileKernelLaunch) {
+    struct kernelLaunch* event = (struct kernelLaunch*) eHandle;
+    event->stopTs = gettime() - startTime;
+    return ncclSuccess;
+  } else if (type == ncclProfileGroup) {
     struct group* event = (struct group *)eHandle;
     event->stopTs = gettime() - startTime;
     return ncclSuccess;
   } else if (type == ncclProfileColl) {
-    // stopping the collective event in NCCL core does not
-    // mean the collective has completed. It means the collective
-    // was submitted/enqueued so we need to keep the event open
     struct collective* event = (struct collective *)eHandle;
     event->base.stopTs = gettime() - startTime;
     return ncclSuccess;
   } else if (type == ncclProfileP2p) {
-    // stopping the p2p event in NCCL core does not
-    // mean the p2p has completed. It means the p2p
-    // was submitted/enqueued so we need to keep the event open
     struct p2p* event = (struct p2p *)eHandle;
     event->base.stopTs = gettime() - startTime;
     return ncclSuccess;
@@ -559,8 +745,15 @@ __hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfile
   // the event handle might be null if we run out of events
   if (eHandle == NULL) return ncclSuccess;
 
-  uint8_t type = *(uint8_t *)eHandle;
-  if (type == ncclProfileProxyOp) {
+  uint64_t type = *(uint64_t *)eHandle;
+  if (type == ncclProfileGroupApi) {
+    struct groupApi* event = (struct groupApi*) eHandle;
+    if (eState == ncclProfilerEndGroupApiStart) {
+      event->endOfncclGroupStartTs = gettime() - startTime;
+    } else if (eState == ncclProfilerBeginGroupApiEnd) {
+      event->startOfncclGroupEndTs = gettime() - startTime;
+    }
+  } else if (type == ncclProfileProxyOp) {
     struct proxyOp* event = (struct proxyOp *)eHandle;
     if (eState == ncclProfilerProxyOpInProgress_v4) {
       event->progrTs = gettime() - startTime;
@@ -592,6 +785,8 @@ __hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfile
       case ncclProfilerProxyStepRecvGPUWait:
         event->timestamp[PROXY_STEP_RECV_GPU_WAIT] = gettime() - startTime;
         break;
+      default:
+        break;
     }
   } else if (type == ncclProfileProxyCtrl) {
     struct proxyCtrl* event = (struct proxyCtrl *)eHandle;
@@ -609,7 +804,7 @@ __hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfile
   return ncclSuccess;
 }
 
-ncclProfiler_t ncclProfiler_v4 = {
+ncclProfiler_t ncclProfiler_v5 = {
   "Example-profiler",
   exampleProfilerInit,
   exampleProfilerStartEvent,
@@ -618,14 +813,15 @@ ncclProfiler_t ncclProfiler_v4 = {
   exampleProfilerFinalize,
 };
 
-int exampleProfilerStart(int eActivationMask) {
+__attribute__((visibility("default"))) int exampleProfilerStart(int eActivationMask, const char* name) {
+  profilerDumpFile = name;
   if (__atomic_load_n(&initialized, __ATOMIC_RELAXED)) {
     __atomic_store_n(eActivationMaskPtr, eActivationMask, __ATOMIC_RELAXED);
   }
   return ncclSuccess;
 }
 
-int exampleProfilerStop(void) {
+__attribute__((visibility("default"))) int exampleProfilerStop(void) {
   if (__atomic_load_n(&initialized, __ATOMIC_RELAXED)) {
     __atomic_store_n(eActivationMaskPtr, 0, __ATOMIC_RELAXED);
   }
diff --git a/projects/rccl/ext-profiler/example/plugin.h b/projects/rccl/ext-profiler/example/plugin.h
index b4d07060a3..9248ebf082 100644
--- a/projects/rccl/ext-profiler/example/plugin.h
+++ b/projects/rccl/ext-profiler/example/plugin.h
@@ -7,7 +7,8 @@
 #ifndef PLUGIN_H_
 #define PLUGIN_H_
 
-int exampleProfilerStart(int eActivationMask);
-int exampleProfilerStop(void);
+__attribute__((visibility("default"))) int exampleProfilerStart(int eActivationMask, const char* name);
+__attribute__((visibility("default"))) int exampleProfilerStop(void);
+
 
 #endif
diff --git a/projects/rccl/ext-profiler/example/print_event.c b/projects/rccl/ext-profiler/example/print_event.cc
similarity index 76%
rename from projects/rccl/ext-profiler/example/print_event.c
rename to projects/rccl/ext-profiler/example/print_event.cc
index a56106e102..ca3c7cfaee 100644
--- a/projects/rccl/ext-profiler/example/print_event.c
+++ b/projects/rccl/ext-profiler/example/print_event.cc
@@ -5,15 +5,59 @@
  ************************************************************************/
 
 #include <stdio.h>
+#include "err.h"
 #include "profiler.h"
 #include "event.h"
 #include "print_event.h"
+#include <cuda_runtime.h>
 
 #define __hidden __attribute__ ((visibility("hidden")))
 
 // FIXME: chrome tracing asynchronous events (following used) allow event nesting for events that have same id and category
 // It appears that nesting more than three events causes issues. Therefore, every event is given an increasing id and a
-// category that matches the type of event (GROUP, COLL, P2P, PROXY, NET)
+// category that matches the type of event (GROUP API, COLL API, P2P API, GROUP, COLL, P2P, PROXY, NET)
+static __thread int groupApiId;
+__hidden void printGroupApiEventHeader(FILE* fh, struct groupApi* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GROUP_API\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"groupApiId\": %d, \"groupDepth\":%d}},\n",
+          "Group API", groupApiId, getpid(), 1, event->startTs, event->groupApiId, event->groupDepth);
+}
+
+__hidden void printGroupApiEventTrailer(FILE* fh, struct groupApi* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GROUP_API\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
+          "Group API", groupApiId++, getpid(), 1, event->stopTs);
+}
+
+static __thread int p2pApiId;
+__hidden void printP2pApiEventHeader(FILE* fh, struct p2pApi* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P_API\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"count\": %lu, \"datatype\": \"%s\", \"GraphCaptured\":%d, \"Stream\": \"%p\"}},\n",
+      event->func, p2pApiId, getpid(), 1, event->startTs, event->count, event->datatype, event->graphCaptured, event->stream);
+}
+
+__hidden void printP2pApiEventTrailer(FILE* fh, struct p2pApi* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P_API\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
+          event->func, p2pApiId++, getpid(), 1, event->stopTs);
+}
+
+static __thread int collApiId;
+__hidden void printCollApiEventHeader(FILE* fh, struct collApi* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL_API\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"count\": %lu, \"datatype\": \"%s\", \"root\": %d, \"GraphCaptured\":%d, \"Stream\": \"%p\"}},\n",
+      event->func, collApiId, getpid(), 1, event->startTs, event->count, event->datatype, event->root, event->graphCaptured, event->stream);
+}
+
+__hidden void printCollApiEventTrailer(FILE* fh, struct collApi* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL_API\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
+      event->func, collApiId++, getpid(), 1, event->stopTs);
+}
+
+static __thread int kernelLaunchId;
+__hidden void printKernelLaunchEventHeader(FILE* fh, struct kernelLaunch* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"KERNEL_LAUNCH\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"groupId\": %d, \"Stream\": \"%p\"}},\n", "KernelLaunch", kernelLaunchId, getpid(), 1, event->startTs, event->kernelLaunchId, event->stream);
+}
+
+__hidden void printKernelLaunchEventTrailer(FILE* fh, struct kernelLaunch* event) {
+  fprintf(fh, "{\"name\": \"%s\", \"cat\": \"KERNEL_LAUNCH\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n", "KernelLaunch", kernelLaunchId++, getpid(), 1, event->stopTs);
+}
+
 static __thread int groupId;
 __hidden void printGroupEventHeader(FILE* fh, struct group* event) {
   fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GROUP\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"groupId\": %d}},\n",
@@ -28,7 +72,7 @@ __hidden void printGroupEventTrailer(FILE* fh, struct group* event) {
 static __thread int collId;
 __hidden void printCollEventHeader(FILE* fh, struct collective* event) {
   fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nChannels\": %d}},\n",
-          event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.parent->ctx->commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nChannels);
+          event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, ((struct collApi*)event->base.parent)->ctx->commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nChannels);
 }
 
 __hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
@@ -39,7 +83,7 @@ __hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
 static __thread int p2pId;
 __hidden void printP2pEventHeader(FILE* fh, struct p2p* event) {
   fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"nChannels\": %d}},\n",
-          event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.parent->ctx->commHash, event->base.rank, event->peer, event->count, event->datatype, event->nChannels);
+          event->base.func, p2pId, getpid(), 1, event->base.startTs, ((struct p2pApi*)event->base.parent)->ctx->commHash, event->base.rank, event->peer, event->count, event->datatype, event->nChannels);
 }
 
 __hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
@@ -173,7 +217,7 @@ void debugEvent(void* eHandle, const char* tag) {
   char filename[64] = { 0 };
   sprintf(filename, "EventDebug-%d", getpid());
   FILE* fh = fopen(filename, "a+");
-  uint8_t type = *(uint8_t *)eHandle;
+  uint64_t type = *(uint64_t *)eHandle;
   if (type == ncclProfileGroup) {
     struct group* event = (struct group *)eHandle;
     fprintf(fh, "Group event %p tag = %s {\n", event, tag);
@@ -241,8 +285,51 @@ void debugEvent(void* eHandle, const char* tag) {
 
 void printEvent(FILE* fh, void* handle) {
   if (handle == NULL || fh == NULL) return;
-  uint8_t type = *(uint8_t *)handle;
-  if (type == ncclProfileGroup) {
+  uint64_t type = *(uint64_t *)handle;
+  if (type == ncclProfileGroupApi) {
+    struct groupApi* g = (struct groupApi*) handle;
+    printGroupApiEventHeader(fh, g);
+    struct kernelLaunch* kernelLaunchHead = profilerQueueHead(&g->kernelLaunchEvents);
+    while (kernelLaunchHead != NULL) {
+      printEvent(fh, kernelLaunchHead);
+      kernelLaunchHead = kernelLaunchHead->next;
+    }
+    struct collApi* collApiHead = profilerQueueHead(&g->collApiEvents);
+    while (collApiHead != NULL) {
+      printEvent(fh, collApiHead);
+      collApiHead = collApiHead->next;
+    }
+    struct p2pApi* p2pApiHead = profilerQueueHead(&g->p2pApiEvents);
+    while (p2pApiHead != NULL) {
+      printEvent(fh, p2pApiHead);
+      p2pApiHead = p2pApiHead->next;
+    }
+    printGroupApiEventTrailer(fh, g);
+  } else if (type == ncclProfileCollApi) {
+    struct collApi* collApiEvent = (struct collApi *) handle;
+    printCollApiEventHeader(fh, collApiEvent);
+    struct taskEventBase* base = taskEventQueueHead(collApiEvent);
+    while (base) {
+      struct taskEventBase* next = base->next;
+      printEvent(fh, base);
+      base = next;
+    }
+    printCollApiEventTrailer(fh, collApiEvent);
+  } else if (type == ncclProfileP2pApi) {
+    struct p2pApi* p2pApiEvent = (struct p2pApi *) handle;
+    printP2pApiEventHeader(fh, p2pApiEvent);
+    struct taskEventBase* base = taskEventQueueHead(p2pApiEvent);
+    while (base) {
+      struct taskEventBase* next = base->next;
+      printEvent(fh, base);
+      base = next;
+    }
+    printP2pApiEventTrailer(fh, p2pApiEvent);
+  } else if (type == ncclProfileKernelLaunch) {
+    struct kernelLaunch* kernelLaunchEvent = (struct kernelLaunch *) handle;
+    printKernelLaunchEventHeader(fh, kernelLaunchEvent);
+    printKernelLaunchEventTrailer(fh, kernelLaunchEvent);
+  } else if (type == ncclProfileGroup) {
     struct group* g = (struct group *)handle;
     printGroupEventHeader(fh, g);
     struct taskEventBase* base = taskEventQueueHead(g);
diff --git a/projects/rccl/ext-profiler/example/queue.h b/projects/rccl/ext-profiler/example/queue.h
new file mode 100644
index 0000000000..dfb14f575f
--- /dev/null
+++ b/projects/rccl/ext-profiler/example/queue.h
@@ -0,0 +1,50 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+#ifndef QUEUE_H
+#define QUEUE_H
+
+template<typename T, T *T::*next>
+struct profilerQueue {
+  T *head, *tail;
+};
+
+template<typename T, T *T::*next>
+inline void profilerQueueConstruct(profilerQueue<T,next> *me) {
+  me->head = nullptr;
+  me->tail = nullptr;
+}
+
+template<typename T, T *T::*next>
+inline bool profilerQueueEmpty(profilerQueue<T,next> *me) {
+  return me->head == nullptr;
+}
+
+template<typename T, T *T::*next>
+inline T* profilerQueueHead(profilerQueue<T,next> *me) {
+  return me->head;
+}
+
+template<typename T, T *T::*next>
+inline T* profilerQueueTail(profilerQueue<T,next> *me) {
+  return me->tail;
+}
+
+template<typename T, T *T::*next>
+inline void profilerQueueEnqueue(profilerQueue<T,next> *me, T *x) {
+  x->*next = nullptr;
+  (me->head ? me->tail->*next : me->head) = x;
+  me->tail = x;
+}
+
+template<typename T, T *T::*next>
+inline T* profilerQueueDequeue(profilerQueue<T,next> *me) {
+  T *ans = me->head;
+  me->head = ans->*next;
+  if (me->head == nullptr) me->tail = nullptr;
+  return ans;
+}
+
+#endif
diff --git a/projects/rccl/ext-profiler/google-CoMMA/Makefile b/projects/rccl/ext-profiler/google-CoMMA/Makefile
new file mode 100644
index 0000000000..2da5169904
--- /dev/null
+++ b/projects/rccl/ext-profiler/google-CoMMA/Makefile
@@ -0,0 +1,22 @@
+.PHONY: all build-CoMMA clone-CoMMA clean delete
+
+all: build-CoMMA
+
+build-CoMMA: clone-CoMMA
+	cd CoMMA && cargo build
+
+clone-CoMMA:
+	@if [ ! -d CoMMA ] ; then \
+		git clone https://github.com/google/CoMMA.git; \
+		ln -s $(CURDIR)/.. CoMMA/third_party/nccl/ext-profiler; \
+	fi
+
+clean:
+	@if [ -d CoMMA ] ; then \
+		cd CoMMA && cargo clean; \
+	fi
+
+delete:
+	@if [ -d CoMMA ] ; then \
+		rm -rf CoMMA; \
+	fi
diff --git a/projects/rccl/ext-profiler/inspector/Makefile b/projects/rccl/ext-profiler/inspector/Makefile
new file mode 100644
index 0000000000..301c46b203
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/Makefile
@@ -0,0 +1,62 @@
+#
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# See LICENSE.txt for license information
+#
+
+# Variables
+NCCL_HOME := ../../build
+INC := -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
+PLUGIN_SO := libnccl-profiler-inspector.so
+VERSION_FILE := version.cc
+
+# Compiler and flags
+CXX := g++
+CXXFLAGS := -g -O3 -fPIC -shared -march=native -DNDEBUG -Wall -Wextra
+
+ifeq ($(DEBUG), 1)
+CXXFLAGS += -g2 -ggdb3 -rdynamic -funwind-tables -fno-omit-frame-pointer
+endif
+
+ifeq ($(ASAN), 1)
+CXXFLAGS += -fsanitize=address
+LDFLAGS += -fsanitize=address -static-libasan
+NVLDFLAGS += -Xcompiler -fsanitize=address,-static-libasan
+endif
+
+ifeq ($(UBSAN), 1)
+CXXFLAGS += -fsanitize=undefined
+LDFLAGS += -fsanitize=undefined -static-libubsan
+NVLDFLAGS += -Xcompiler -fsanitize=undefined,-static-libubsan
+endif
+
+# Source files
+SOURCES := inspector_plugin.cc inspector.cc json.cc
+
+# Default target
+all: $(PLUGIN_SO)
+
+# Rule to build the plugin
+$(PLUGIN_SO): $(VERSION_FILE) $(SOURCES)
+	@echo "Compiling to create $@ from $^"
+	$(CXX) $(INC) $(CXXFLAGS) -o $@ -Wl,-soname,$(PLUGIN_SO) $^
+
+# Rule to generate version.cc
+$(VERSION_FILE):
+	@GIT_INFO=$$(./utils/extract_git_version.sh); \
+	echo '#include "version.h"' > $(VERSION_FILE).tmp; \
+	echo 'const char* get_git_version_info() { return "'$$GIT_INFO'"; }' >> $(VERSION_FILE).tmp; \
+	if ! cmp $(VERSION_FILE).tmp $(VERSION_FILE); then \
+		echo "updating ${VERSION_FILE} file -> $$GIT_INFO"; \
+		mv $(VERSION_FILE).tmp $(VERSION_FILE); \
+	else \
+		echo "${VERSION_FILE} up to date -> $$GIT_INFO"; \
+		rm $(VERSION_FILE).tmp; \
+	fi
+
+# Clean target
+clean:
+	rm -f $(VERSION_FILE) $(PLUGIN_SO)
+
+# Phony targets
+.PHONY: all clean
diff --git a/projects/rccl/ext-profiler/inspector/README.md b/projects/rccl/ext-profiler/inspector/README.md
new file mode 100644
index 0000000000..daf26f7dd8
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/README.md
@@ -0,0 +1,216 @@
+# NCCL Inspector Plugin
+
+The NCCL Inspector is a plugin for the NVIDIA Collective Communications Library (NCCL) that provides detailed, per-communicator, per-collective performance and metadata logging. It is designed to help users analyze and debug NCCL collective operations by generating structured JSON output for each operation.
+
+## Related Documentation
+
+- **[Performance Exporter](exporter/example/README.md)** - Tool for analyzing and visualizing NCCL performance data from inspector logs
+
+## Folder Location
+
+The Inspector plugin source is located in:
+
+```
+ext-profiler/inspector/
+```
+
+## Building the Inspector Plugin
+
+To build the Inspector plugin, run:
+
+```bash
+make
+```
+
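+The Makefile reads `CUDA_HOME` from the environment and defaults `NCCL_HOME` to `../../build`. To use a different NCCL build, pass `NCCL_HOME` as a make argument (for example, `make NCCL_HOME=/path/to/nccl/build`); because the Makefile assigns `NCCL_HOME` directly, setting it only as an environment variable does not override the default.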
+
+### Build Options
+
+The Makefile supports several build options:
+
+- **DEBUG=1**: Enable debug build with additional debugging information
+- **ASAN=1**: Enable Address Sanitizer for memory error detection
+- **UBSAN=1**: Enable Undefined Behavior Sanitizer
+
+Example debug build:
+```bash
+make DEBUG=1
+```
+
+### Build Output
+
+The build process creates:
+- `libnccl-profiler-inspector.so`: The main inspector plugin library
+- `version.cc`: Auto-generated version information from git
+
+## Using NCCL Inspector
+
+### Key Differences from Normal NCCL Usage
+
+The main difference between running NCCL with the Inspector plugin versus running NCCL normally is the addition of environment variables that enable detailed performance logging:
+
+**Normal NCCL Run:**
+```bash
+# Standard NCCL execution
+./your_nccl_application
+```
+
+**NCCL Inspector Run:**
+```bash
+# NCCL Inspector enabled execution
+export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
+export NCCL_INSPECTOR_ENABLE=1
+export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
+./your_nccl_application
+```
+
+### Environment Variables
+
+- `NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so`
+  Loads the Inspector plugin into NCCL.
+- `NCCL_INSPECTOR_ENABLE=1`
+  Enables the Inspector plugin.
+- `NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=<interval>`
+  Sets the interval (in microseconds) for the internal dump thread to write output. Example: `500`.
+- `NCCL_INSPECTOR_DUMP_DIR=<output_dir>` (optional)
+  Sets the output directory for logs. If not set, the default is `nccl-inspector-<slurm_job_id>` when running under SLURM and `nccl-inspector-unknown-jobid` otherwise.
+- `NCCL_INSPECTOR_DUMP_VERBOSE=<0|1>` (optional)
+  Enables verbose output including event trace information. Set to `1` to enable, `0` to disable (default).
+
+### Example Usage
+
+**Single Node:**
+```bash
+export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
+export NCCL_INSPECTOR_ENABLE=1
+export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
+./build/test/perf/all_reduce_perf -b 8 -e 16G -f 2 -g 8
+```
+
+**Multi-Node (SLURM):**
+```bash
+# Add these environment variables to your SLURM script
+export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
+export NCCL_INSPECTOR_ENABLE=1
+export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
+export NCCL_INSPECTOR_DUMP_DIR=/path/to/logs/${SLURM_JOB_ID}/
+
+# Then run your normal NCCL application
+srun your_nccl_application
+```
+
+## Example Scripts
+
+For detailed example scripts showing how to integrate NCCL Inspector with different workloads, see the **[test/examples/](test/examples/)** directory:
+
+- **Single Node Example**: Basic NCCL performance testing with inspector
+- **Multi-Node SLURM Example**: Comprehensive multi-node testing with various collective operations
+- **Training Workload Example**: Integration with distributed training workloads
+
+## Output Example
+
+Each output file contains JSON objects with the following structure:
+
+```json
+{
+  "header": {
+    "id": "0x7f8c496ae9f661",
+    "rank": 2,
+    "n_ranks": 8,
+    "nnodes": 1
+  },
+  "metadata": {
+    "inspector_output_format_version": "v4.0",
+    "git_rev": "",
+    "rec_mechanism": "profiler_plugin",
+    "dump_timestamp_us": 1748030377748202,
+    "hostname": "example-hostname",
+    "pid": 1639453
+  },
+  "coll_perf": {
+    "coll": "AllReduce",
+    "coll_sn": 1407,
+    "coll_msg_size_bytes": 17179869184,
+    "coll_exec_time_us": 61974,
+    "coll_algobw_gbs": 277.210914,
+    "coll_busbw_gbs": 485.119099
+  }
+}
+```
+
+## Verbose Output Example
+
+To enable verbose output with event trace information, set the `NCCL_INSPECTOR_DUMP_VERBOSE=1` environment variable:
+
+```bash
+export NCCL_INSPECTOR_DUMP_VERBOSE=1
+```
+
+This will include additional event trace information in the JSON output, showing the sequence of callbacks and timestamps for each individual event.
+
+```json
+{
+  "header": {
+    "id": "0xe62dedaa97644a",
+    "rank": 4,
+    "n_ranks": 8,
+    "nnodes": 1
+  },
+  "metadata": {
+    "inspector_output_format_version": "v4.0",
+    "git_rev": "9019a1912-dirty",
+    "rec_mechanism": "nccl_profiler_interface",
+    "dump_timestamp_us": 1752867229276385,
+    "hostname": "example-hostname",
+    "pid": 438776
+  },
+  "coll_perf": {
+    "coll": "ReduceScatter",
+    "coll_sn": 1231,
+    "coll_msg_size_bytes": 2147483648,
+    "coll_exec_time_us": 41057,
+    "coll_timing_source": "kernel_gpu",
+    "coll_algobw_gbs": 418.439467,
+    "coll_busbw_gbs": 366.134533,
+    "event_trace_sn": {
+      "coll_start_sn": 1,
+      "coll_stop_sn": 2,
+      "kernel_events": [
+        {
+          "channel_id": 0,
+          "kernel_start_sn": 3,
+          "kernel_stop_sn": 48,
+          "kernel_record_sn": 47
+        }
+      ]
+    },
+    "event_trace_ts": {
+      "coll_start_ts": 1752867229235059,
+      "coll_stop_ts": 1752867229235064,
+      "kernel_events": [
+        {
+          "channel_id": 0,
+          "kernel_start_ts": 1752867229235181,
+          "kernel_stop_ts": 1752867229275811,
+          "kernel_record_ts": 1752867229275811
+        }
+      ]
+    }
+  }
+}
+```
+
+Multiple such JSON objects are written, one per collective operation per communicator.
+
+## Output Directory
+
+- By default, output files are written to:
+  - `nccl-inspector-unknown-jobid` (if no SLURM job ID is present)
+  - `nccl-inspector-<slurm_job_id>` (if running under SLURM)
+- You can override this with the `NCCL_INSPECTOR_DUMP_DIR` environment variable.
+
+## Additional Notes
+
+- The plugin is compatible with standard NCCL workflows and can be used in both single-node and multi-node (SLURM) environments.
+- For more details, see the source code and comments in `ext-profiler/inspector/`.
+
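The bandwidth fields in the sample records of the Inspector README above are consistent with the other fields of each record, which gives a quick sanity check when reading logs. The sketch below is an illustration, not the plugin's code: it assumes the bus-bandwidth convention used by nccl-tests (AllReduce scaled by 2(n-1)/n, ReduceScatter by (n-1)/n, with the ReduceScatter message size taken as per rank), and the helper names are hypothetical. Under those assumptions it reproduces `coll_algobw_gbs` and `coll_busbw_gbs` for both README samples to within rounding.

```python
# Hypothetical helpers; the formulas are assumed from the nccl-tests
# bus-bandwidth convention, not taken from the Inspector source.

def algo_bw_gbs(coll, msg_size_bytes, exec_time_us, n_ranks):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = msg_size_bytes
    if coll == "ReduceScatter":
        # Assumption: the logged size is per rank, so the total is n_ranks times larger.
        total_bytes *= n_ranks
    # Bytes per microsecond equals MB/s; divide by 1000 for GB/s.
    return total_bytes / exec_time_us / 1000.0


def bus_bw_gbs(coll, algo_bw, n_ranks):
    """Bus bandwidth in GB/s derived from algorithm bandwidth."""
    factors = {
        "AllReduce": 2.0 * (n_ranks - 1) / n_ranks,
        "ReduceScatter": (n_ranks - 1) / n_ranks,
    }
    return algo_bw * factors[coll]


# Values copied from the two sample records in the README.
allreduce_algo = algo_bw_gbs("AllReduce", 17179869184, 61974, 8)
allreduce_bus = bus_bw_gbs("AllReduce", allreduce_algo, 8)
rs_algo = algo_bw_gbs("ReduceScatter", 2147483648, 41057, 8)
rs_bus = bus_bw_gbs("ReduceScatter", rs_algo, 8)

print(f"AllReduce:     algobw={allreduce_algo:.6f} busbw={allreduce_bus:.6f}")
print(f"ReduceScatter: algobw={rs_algo:.6f} busbw={rs_bus:.6f}")
```

A record whose logged bandwidth differs noticeably from this recomputation would suggest that a different size or scaling convention applies to that collective.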
diff --git a/projects/rccl/ext-profiler/inspector/exporter/example/README.md b/projects/rccl/ext-profiler/inspector/exporter/example/README.md
new file mode 100644
index 0000000000..26e4b2e573
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/exporter/example/README.md
@@ -0,0 +1,151 @@
+# NCCL Inspector Performance Summary Exporter
+
+This tool processes NCCL Inspector log files and generates comprehensive performance analysis reports including visualizations and statistical summaries.
+It also serves as an example: similar exporters can be built to feed observability systems such as Elastic, Prometheus, or a custom metrics pipeline.
+
+## Features
+
+- **Performance Analysis**: Generates statistical summaries for collective operations
+- **Communication Type Classification**: Automatically categorizes communication patterns
+- **Visualizations**: Creates scatter plots, histograms, and box plots for performance metrics
+- **Data Export**: Converts logs to Parquet format for efficient processing
+- **Multi-format Log Support**: Processes `.log`, `.log.gz`, `.jsonl`, and `.jsonl.gz` files
+- **Parallel Processing**: Utilizes multi-core processing for faster analysis
+
+## Requirements
+
+- Python 3.7+
+- Access to NCCL Inspector log files
+
+## Installation
+
+### Clone the Repository
+
+```bash
+git clone https://github.com/NVIDIA/nccl.git
+cd nccl/ext-profiler/inspector/exporter/example
+```
+
+Install the required dependencies using the provided `requirements.txt` file:
+
+```bash
+pip install -r requirements.txt
+```
+
+## Usage
+
+The script processes NCCL Inspector log files from a specified directory.
+
+**Note:** To generate NCCL Inspector log files, run your NCCL application with the inspector plugin enabled. The log files are written to the directory given by the `NCCL_INSPECTOR_DUMP_DIR` environment variable, or to the plugin's default directory if it is unset. For detailed setup instructions and environment variable configuration, see the [Inspector README](../../README.md).
+
+### Basic Usage
+
+```bash
+python perf_summary_exporter.py --input_dir /path/to/nccl/inspector/logs
+```
+
+This mode processes all log files in the specified directory and its subdirectories recursively.
+
+### Command Line Arguments
+
+- `--input_dir <path>`: **Required**. Directory containing NCCL Inspector log files (searches recursively in subdirectories)
+- `--output_dir <name>`: **Optional**. Custom output directory name (default: `<input_directory_name>-analysis`)
+
+## Output
+
+The tool generates:
+
+1. **Parquet Files**: One per log file containing processed log data (stored in `parquet_files/` subdirectory)
+2. **Summary Directory**: Contains comprehensive analysis results
+3. **Visualizations**: Scatter plots, histograms, and box plots for each message size
+4. **CSV Files**: Detailed summaries for each message size and collective type
+5. **Log File**: Processing log with detailed information
+
+## Example Output Structure
+
+```
+<output_dir_name>/
+├── output.log
+├── parquet_files/
+│   ├── <filename1>.parquet
+│   ├── <filename2>.parquet
+│   └── ...
+└── summary/
+    ├── scatter_plot_<comm_type>_<coll_type>.png
+    ├── combined_scatter_plot_<comm_type>_<coll_type>.png
+    └── msg_size_<human_readable_size>/
+        ├── histograms/
+        │   └── histogram_<comm_type>_<coll_type>_<size>.png
+        ├── boxplots/
+        │   └── boxplot_<comm_type>_<coll_type>_<size>.png
+        └── summary_<comm_type>_<coll_type>_<size>.csv
+```
+
+## Supported Communicator Types
+
+- `single-rank`
+- `nvlink-only`
+- `hca-only`
+- `mixed`
+
+## Supported Collective Types
+
+- `AllReduce`
+- `AllGather`
+- `ReduceScatter`
+- `Broadcast`
+
+## Log File Formats
+
+### Supported Formats
+
+- `.log` - Plain text JSON lines
+- `.log.gz` - Compressed JSON lines
+- `.jsonl` - JSON lines format
+- `.jsonl.gz` - Compressed JSON lines
+
+### Expected JSON Structure
+
+```json
+{
+  "header": {
+    "id": "0x9e7a479f95a66c",
+    "rank": 31,
+    "n_ranks": 32,
+    "nnodes": 4
+  },
+  "metadata": {
+    "inspector_output_format_version": "v4.0",
+    "git_rev": "75e61acda-dirty",
+    "rec_mechanism": "nccl_profiler_interface",
+    "dump_timestamp_us": 1749490229087081,
+    "hostname": "example-hostname",
+    "pid": 468528
+  },
+  "coll_perf": {
+    "coll": "ReduceScatter",
+    "coll_sn": 129,
+    "coll_msg_size_bytes": 65536,
+    "coll_exec_time_us": 110,
+    "coll_timing_source": "kernel_gpu",
+    "coll_algobw_gbs": 19.065018,
+    "coll_busbw_gbs": 18.469236
+  }
+}
+```
+
+## Troubleshooting
+
+### Common Issues
+
+1. **No log files found**: Ensure the log directory path is correct and contains valid log files
+2. **Missing dependencies**: Ensure every package in `requirements.txt` is installed in the Python environment used to run the script
+3. **Mixed file formats**: The tool will exit if it detects mixed `.log`, `.log.gz`, `.jsonl`, and `.jsonl.gz` files in the same directory. This is typically indicative of corrupt input directories caused by multiple overlapping NCCL Inspector runs with different output format options. Clean the directory and re-run with consistent settings.
+
+### Log Files
+
+The tool creates detailed logs in the output directory. Check `output.log` for processing information and any error messages.
+
+## Support
+
+Please refer to the GitHub issues page at https://github.com/NVIDIA/nccl/issues. Your question may already have been asked by another user. If not, feel free to create a new issue and mention the "inspector plugin" in the title.
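The communicator types listed in the exporter README are derived from the `header` fields of each record. The sketch below mirrors the rules of `get_comm_type` in `perf_summary_exporter.py` and applies them to the README's sample header; `classify_record` is a hypothetical name introduced only for this illustration.

```python
import json


def classify_record(line):
    """Return the communicator type for one JSON-lines record."""
    header = json.loads(line)["header"]
    if header["n_ranks"] == 1:
        return "single-rank"
    if header["nnodes"] == 1:
        return "nvlink-only"  # every rank is on the same node
    if header["n_ranks"] == header["nnodes"]:
        return "hca-only"  # exactly one rank per node
    return "mixed"  # several nodes with several ranks each


# Header copied from the "Expected JSON Structure" sample (32 ranks on 4 nodes).
sample = '{"header": {"id": "0x9e7a479f95a66c", "rank": 31, "n_ranks": 32, "nnodes": 4}}'
print(classify_record(sample))
```

The order of the checks matters: a one-rank communicator also has `nnodes == 1`, so the single-rank test has to come first.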
diff --git a/projects/rccl/ext-profiler/inspector/exporter/example/perf_summary_exporter.py b/projects/rccl/ext-profiler/inspector/exporter/example/perf_summary_exporter.py
new file mode 100644
index 0000000000..5913152ce0
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/exporter/example/perf_summary_exporter.py
@@ -0,0 +1,548 @@
+from pathlib import Path
+import argparse
+import glob
+import gzip
+import sys
+import pandas as pd
+from concurrent.futures import ProcessPoolExecutor
+import json
+from tqdm.auto import tqdm
+import duckdb
+import math
+import matplotlib.pyplot as plt
+import matplotlib.dates
+from matplotlib.gridspec import GridSpec
+import os
+import logging
+import contextlib
+from datetime import datetime
+import numpy as np
+
+def setup_logging(output_dir):
+    log_file = output_dir / "output.log"
+    logging.basicConfig(
+        filename=log_file,
+        level=logging.INFO,
+        format="%(asctime)s - %(levelname)s - %(message)s",
+    )
+
+
+@contextlib.contextmanager
+def smart_open(filename, mode="r"):
+    if filename.endswith(".gz"):
+        opener = gzip.open
+    else:
+        opener = open
+
+    with opener(filename, mode) as f:
+        yield f
+
+
+def get_log_files_and_output_dir():
+    parser = argparse.ArgumentParser(description="Process log files in a directory.")
+    parser.add_argument(
+        "--input_dir",
+        required=True,  # documented as required; argparse parses values as str by default
+        help="The directory containing NCCL Inspector log files to process.",
+    )
+    parser.add_argument(
+        "--output_dir",
+        type=str,
+        help="Custom output directory name (default: auto-generated from input directory)."
+    )
+    args = parser.parse_args()
+
+    if args.input_dir:
+        # Use the provided input directory
+        root_dir = Path(args.input_dir)
+        if not root_dir.exists():
+            raise FileNotFoundError(f"Input directory not found: {root_dir}")
+
+    logfiles = list(glob.iglob(str(Path(root_dir) / "**" / "*.log"), recursive=True))
+    gzlogfiles = list(
+        glob.iglob(str(Path(root_dir) / "**" / "*.log.gz"), recursive=True)
+    )
+    jsonlfiles = list(
+        glob.iglob(str(Path(root_dir) / "**" / "*.jsonl"), recursive=True)
+    )
+    gzjsonlfiles = list(
+        glob.iglob(str(Path(root_dir) / "**" / "*.jsonl.gz"), recursive=True)
+    )
+    if (
+            sum((1 for x in [logfiles, gzlogfiles, jsonlfiles, gzjsonlfiles] if len(x) > 0))
+            > 1
+    ):
+        ### TODO: we could probably generate some logic to pick the "right" file to load, but for now, bail
+        logging.critical("Appear to have mixed .log/.log.gz/.jsonl/.jsonl.gz; bailing!")
+        sys.exit(1)
+
+    files = logfiles + gzlogfiles + jsonlfiles + gzjsonlfiles
+
+    if not files:
+        print("No inspector logs found")
+        sys.exit(1)
+
+    # Generate output directory name from input directory
+    if args.output_dir:
+        output_dir_name = args.output_dir
+    else:
+        output_dir_name = f"{root_dir.name}-analysis"
+
+    return files, output_dir_name
+
+def bytes_to_human_readable(size_bytes):
+    """
+    Convert bytes to human-readable format using decimal (SI) units.
+
+    Uses powers of 1000 (decimal/SI standard):
+    - 1 KB = 1,000 bytes
+    - 1 MB = 1,000,000 bytes
+    - 1 GB = 1,000,000,000 bytes
+
+    Not binary units (powers of 1024):
+    - Does NOT use KiB, MiB, GiB (1024-based)
+
+    Args:
+        size_bytes: Number of bytes to convert
+
+    Returns:
+        Human-readable string (e.g., "1.50MB", "2.34GB")
+    """
+    if size_bytes == 0:
+        return "0B"
+    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
+    i = int(math.log10(int(size_bytes)) / 3)
+    s = round(size_bytes * math.pow(10, -3 * i), 2)
+    return f"{s:.2f}{size_name[i]}"
+
+def timestamp_to_datetime(timestamp_us):
+    """Convert microsecond timestamp to datetime string"""
+    return datetime.fromtimestamp(timestamp_us / 1000000).strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
+
+def microseconds_to_human_readable(microseconds):
+    """Convert microseconds to human readable format"""
+    if microseconds < 1000:
+        return f"{microseconds:.1f}μs"
+    elif microseconds < 1000000:
+        return f"{microseconds/1000:.1f}ms"
+    else:
+        return f"{microseconds/1000000:.1f}s"
+
+def get_comm_type(row) -> str:
+    if row["n_ranks"] == 1:
+        return "single-rank"
+    elif row["nnodes"] == 1:
+        return "nvlink-only"
+    elif row["n_ranks"] == row["nnodes"]:
+        return "hca-only"
+    else:
+        return "mixed"
+
+def parse_file(filepath: Path, output_dir):
+    filename = Path(filepath).stem
+    parquet_file = output_dir / f"{filename}.parquet"
+
+    # Check if parquet file exists and is newer than source file
+    if parquet_file.exists():
+        source_mtime = Path(filepath).stat().st_mtime
+        parquet_mtime = parquet_file.stat().st_mtime
+        if parquet_mtime >= source_mtime:
+            logging.info(f"Parquet file {parquet_file} is up to date. Skipping...")
+            return
+        else:
+            logging.info(f"Source file {filepath} is newer than parquet. Regenerating...")
+
+    # Check if file is empty or too small
+    file_size = Path(filepath).stat().st_size
+    if file_size == 0:
+        logging.warning(f"Skipping empty file: {filepath}")
+        return
+
+    recs = []
+    try:
+        with smart_open(filepath, "r") as infile:
+            for lineno, line in enumerate(infile):
+                try:
+                    json_recs = json.loads(line)
+                except json.JSONDecodeError:
+                    logging.error(f"Failed to parse line {filepath}:{lineno}")
+                    continue
+
+                # Validate that required fields exist
+                if not all(key in json_recs for key in ["header", "metadata", "coll_perf"]):
+                    logging.error(f"Missing required fields in {filepath}:{lineno}")
+                    continue
+
+                header = json_recs["header"]
+                metadata = json_recs["metadata"]
+                comm_type = get_comm_type(header)
+                coll_perf = json_recs["coll_perf"]
+                recs.append(
+                    dict(
+                        **header,
+                        comm_type=comm_type,
+                        **coll_perf,
+                        **metadata,
+                    )
+                )
+    except Exception as e:
+        logging.error(f"Error reading file {filepath}: {e}")
+        return
+
+    # Skip files with no valid records
+    if not recs:
+        logging.warning(f"No valid records found in file: {filepath}. Skipping...")
+        return
+
+    df = pd.DataFrame(recs)
+    df.to_parquet(parquet_file)
+    logging.info(f"Created parquet file {parquet_file} with {len(recs)} records")
+
+def create_per_node_parquet_files(files, output_dir):
+    output_dir = Path(output_dir) / "parquet_files"
+    output_dir.mkdir(parents=True, exist_ok=True)
+    max_workers = min(64, len(files), os.cpu_count() or 1)
+    with ProcessPoolExecutor(max_workers=max_workers) as executor:
+        list(
+            tqdm(
+                executor.map(parse_file, files, [output_dir] * len(files)),
+                total=len(files),
+                desc="Processing files",
+                unit="file",
+            )
+        )
+    return output_dir
+
+def generate_scatter_plot(df, comm_type, coll_type, output_file):
+    plt.figure(figsize=(10, 6), dpi=100)
+    distinct_msg_sizes = df["coll_msg_size_bytes"].unique()
+
+    for msg_size in distinct_msg_sizes:
+        df_msg_size = df[df["coll_msg_size_bytes"] == msg_size]
+        mean_busbw = df_msg_size["mean_coll_busbw_gbs"].mean()
+        plt.scatter(
+            df_msg_size["coll_sn"],
+            df_msg_size["mean_coll_busbw_gbs"],
+            label=f"MsgSize: {bytes_to_human_readable(msg_size)} (Mean: {mean_busbw:.2f} GB/s)",
+            alpha=0.5,
+        )
+
+    plt.xlabel("Operation Sequence Number")
+    plt.ylabel("Mean Collective Bus BW (GB/s)")
+    plt.title(f"Comm Type: {comm_type}, Coll Type: {coll_type}")
+    plt.legend(title="Message Size", loc="upper right")
+    plt.tight_layout()
+    plt.savefig(output_file)
+    plt.close()
+    logging.info(f"Scatter plot saved to {output_file}")
+
+def generate_combined_scatter_plot(df, comm_type, coll_type, output_file, max_cols=3):
+    distinct_msg_sizes = df["coll_msg_size_bytes"].unique()
+    num_plots = len(distinct_msg_sizes)
+
+    # Compute number of rows and columns
+    num_cols = min(max_cols, num_plots)  # Limit max columns
+    num_rows = (num_plots + num_cols - 1) // num_cols  # Calculate rows dynamically
+
+    # Create figure with GridSpec
+    fig = plt.figure(figsize=(5 * num_cols, 5 * num_rows), dpi=100)
+    gs = GridSpec(num_rows, num_cols, figure=fig)
+
+    for i, msg_size in enumerate(distinct_msg_sizes):
+        row, col = divmod(i, num_cols)  # Determine row & column index
+        ax = fig.add_subplot(gs[row, col])  # Create subplot at position
+
+        df_msg_size = df[df["coll_msg_size_bytes"] == msg_size]
+        mean_busbw = df_msg_size["mean_coll_busbw_gbs"].mean()
+        ax.scatter(
+            df_msg_size["coll_sn"],
+            df_msg_size["mean_coll_busbw_gbs"],
+            label=f"MsgSize: {bytes_to_human_readable(msg_size)} (Mean: {mean_busbw:.2f} GB/s)",
+            alpha=0.5,
+        )
+        ax.set_xlabel("Op Seq No")
+        ax.set_ylabel("Mean Collective Bus BW (GB/s)")
+        ax.set_title(f"Message Size: {bytes_to_human_readable(msg_size)}({msg_size})")
+        ax.legend(loc="upper right")
+
+    fig.suptitle(f"Comm Type: {comm_type}, Coll Type: {coll_type}", ha="center", y=0.98)
+
+    plt.tight_layout()
+    plt.savefig(output_file)
+    plt.close()
+    logging.info(f"Combined scatter plot saved to {output_file}")
+
+def generate_histogram(df, comm_type, coll_type, output_file, message_size):
+    plt.figure(figsize=(10, 6), dpi=100)
+    data_range = df["mean_coll_busbw_gbs"].max() - df["mean_coll_busbw_gbs"].min()
+    num_bins = min(50, int(data_range) + 1)
+    plt.hist(
+        df["mean_coll_busbw_gbs"],
+        bins=num_bins,
+        alpha=0.7,
+        color="b",
+        edgecolor="black",
+        linewidth=1.2,
+    )
+    plt.xlabel("Mean Collective Bus BW (GB/s)")
+    plt.ylabel("Frequency")
+    plt.title(
+        f"Comm Type: {comm_type}, Coll Type: {coll_type} Mean Collective Bus BW Histogram\nMsg Size: {message_size}"
+    )
+    plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f"{y:.0f}"))
+    plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{x:.2f} GB/s"))
+    plt.gca().xaxis.get_offset_text().set_visible(False)
+    plt.tight_layout()
+    plt.savefig(output_file)
+    plt.close()
+    logging.info(f"Histogram saved to {output_file}")
+
+def generate_boxplot(df, comm_type, coll_type, output_file, message_size):
+    plt.figure(figsize=(10, 6))
+    boxprops = dict(linestyle="-", linewidth=2, color="blue")
+    flierprops = dict(marker="o", color="red", alpha=0.5)
+    medianprops = dict(linestyle="-", linewidth=2.5, color="orange")
+    whiskerprops = dict(linestyle="--", linewidth=2, color="green")
+    capprops = dict(linestyle="-", linewidth=2, color="black")
+
+    plt.boxplot(
+        df["mean_coll_busbw_gbs"],
+        vert=False,
+        patch_artist=True,
+        boxprops=boxprops,
+        flierprops=flierprops,
+        medianprops=medianprops,
+        whiskerprops=whiskerprops,
+        capprops=capprops,
+    )
+
+    plt.xlabel("Mean Coll Bus BW (GB/s)")
+    plt.title(
+        f"Box Plot of Coll Bus BW (CommType: {comm_type} - Coll Type: {coll_type} - Msg Size: {message_size})"
+    )
+
+    # Adding labels for min, max, and median
+    stats = df["mean_coll_busbw_gbs"].describe(percentiles=[0.5])
+    plt.annotate(
+        f"Min: {stats['min']:.2f}",
+        xy=(stats["min"], 1),
+        xytext=(stats["min"], 1.1),
+        arrowprops=dict(facecolor="black", shrink=0.05),
+    )
+    plt.annotate(
+        f"Median: {stats['50%']:.2f}",
+        xy=(stats["50%"], 1),
+        xytext=(stats["50%"], 1.1),
+        arrowprops=dict(facecolor="black", shrink=0.05),
+    )
+    plt.annotate(
+        f"Max: {stats['max']:.2f}",
+        xy=(stats["max"], 1),
+        xytext=(stats["max"], 1.1),
+        arrowprops=dict(facecolor="black", shrink=0.05),
+    )
+
+    plt.tight_layout()
+    plt.savefig(output_file)
+    plt.close()
+    logging.info(f"Box plot saved to {output_file}")
+
+
+def summarize_data_per_comm_coll_type(output_root, comm_type, coll_type, output_dir_name):
+    """Summarize parquet data per communication and collective type using DuckDB"""
+    logging.info(f"Summarizing data per comm/coll type for {output_dir_name}, {comm_type} and {coll_type}")
+
+    # Check if there are any parquet files
+    parquet_dir = output_root / "parquet_files"
+    parquet_files = list(parquet_dir.glob("*.parquet"))
+    if not parquet_files:
+        logging.warning(f"No parquet files found for {comm_type} and {coll_type}")
+        return None
+
+    # Clean up invalid/empty parquet files by moving them to a separate directory
+    invalid_dir = parquet_dir / "invalid"
+    invalid_dir.mkdir(exist_ok=True)
+
+    invalid_count = 0
+    for pf in parquet_files:
+        try:
+            # Check file size first
+            if pf.stat().st_size == 0:
+                logging.warning(f"Moving zero-byte parquet file {pf} to invalid directory")
+                pf.rename(invalid_dir / pf.name)
+                invalid_count += 1
+                continue
+
+            # Use pyarrow to check parquet metadata without reading data
+            import pyarrow.parquet as pq
+            parquet_file = pq.ParquetFile(pf)
+            if parquet_file.metadata.num_rows == 0:
+                logging.warning(f"Moving empty parquet file {pf} (0 rows) to invalid directory")
+                pf.rename(invalid_dir / pf.name)
+                invalid_count += 1
+        except Exception as e:
+            logging.warning(f"Moving invalid parquet file {pf} to invalid directory: {e}")
+            pf.rename(invalid_dir / pf.name)
+            invalid_count += 1
+
+    # Check if any valid files remain
+    remaining_files = list(parquet_dir.glob("*.parquet"))
+    if not remaining_files:
+        logging.warning(f"No valid parquet files found for {comm_type} and {coll_type} (moved {invalid_count} invalid files)")
+        return None
+
+    logging.info(f"Found {len(remaining_files)} valid parquet files (moved {invalid_count} invalid files)")
+
+    try:
+        duckdb.execute(
+            f"CREATE OR REPLACE VIEW logs AS SELECT * FROM read_parquet('{parquet_dir}/*.parquet')"
+        )
+        df = duckdb.execute(f"""
+            SELECT
+                id,
+                coll_sn,
+                coll_msg_size_bytes,
+                AVG(coll_busbw_gbs) as mean_coll_busbw_gbs,
+                COUNT(*) as log_count,
+                ARRAY_DISTINCT(LIST(n_ranks)) as n_ranks,
+                ARRAY_DISTINCT(LIST(nnodes)) as nnodes,
+                MIN(dump_timestamp_us) as coll_start_timestamp_us,
+                MAX(dump_timestamp_us) as coll_end_timestamp_us,
+                (MAX(dump_timestamp_us) - MIN(dump_timestamp_us)) as coll_duration_us
+            FROM logs
+            WHERE coll = '{coll_type}' and comm_type = '{comm_type}'
+            GROUP BY id, coll_sn, coll_msg_size_bytes
+            ORDER BY coll_sn
+        """).df()
+    except Exception as e:
+        logging.error(f"Error executing DuckDB query for {comm_type} and {coll_type}: {e}")
+        return None
+
+    if df.empty:
+        logging.info(f"No data for {comm_type} and {coll_type}")
+        return None
+
+    # Add human-readable formatting
+    df["human_readable_coll_msg_size_bytes"] = df["coll_msg_size_bytes"].apply(
+        bytes_to_human_readable
+    )
+
+    # Log an example time range taken from the first row
+    if len(df) > 0:
+        sample_row = df.iloc[0]
+        start_time = timestamp_to_datetime(sample_row['coll_start_timestamp_us'])
+        end_time = timestamp_to_datetime(sample_row['coll_end_timestamp_us'])
+        duration = microseconds_to_human_readable(sample_row['coll_duration_us'])
+        logging.info(f"Example time range - ID: {sample_row['id']}, Coll_SN: {sample_row['coll_sn']}, "
+                     f"Start: {start_time}, End: {end_time}, Duration: {duration}")
+
+    return df
+
+
+def generate_visualizations(df, output_root, comm_type, coll_type):
+    """Generate all visualizations and save CSV files for the processed data"""
+    logging.info(f"Generating visualizations for {comm_type} and {coll_type}")
+
+    summary_dir = output_root / "summary"
+    summary_dir.mkdir(parents=True, exist_ok=True)
+
+    # Scatter Plot for all message sizes
+    output_file = summary_dir / f"scatter_plot_{comm_type}_{coll_type}.png"
+    generate_scatter_plot(df, comm_type, coll_type, output_file)
+
+    # Combined Scatter Plot for all message sizes
+    output_file = summary_dir / f"combined_scatter_plot_{comm_type}_{coll_type}.png"
+    generate_combined_scatter_plot(df, comm_type, coll_type, output_file)
+
+    distinct_msg_sizes = df["coll_msg_size_bytes"].unique()
+    for msg_size in distinct_msg_sizes:
+        hr_msg_size = bytes_to_human_readable(msg_size)
+        msg_size_dir = summary_dir / f"msg_size_{msg_size}_{hr_msg_size}"
+        msg_size_hist_dir = msg_size_dir / "histograms"
+        msg_size_boxplot_dir = msg_size_dir / "boxplots"
+        msg_size_dir.mkdir(parents=True, exist_ok=True)
+        msg_size_hist_dir.mkdir(parents=True, exist_ok=True)
+        msg_size_boxplot_dir.mkdir(parents=True, exist_ok=True)
+
+        df_msg_size = df[df["coll_msg_size_bytes"] == msg_size]
+
+        # Add human-readable time formatting
+        df_msg_size = df_msg_size.copy()
+        df_msg_size["coll_start_datetime"] = df_msg_size["coll_start_timestamp_us"].apply(timestamp_to_datetime)
+        df_msg_size["coll_end_datetime"] = df_msg_size["coll_end_timestamp_us"].apply(timestamp_to_datetime)
+        df_msg_size["coll_duration_human"] = df_msg_size["coll_duration_us"].apply(microseconds_to_human_readable)
+
+        # Histogram
+        output_file = (
+            msg_size_hist_dir / f"histogram_{comm_type}_{coll_type}_{msg_size}.png"
+        )
+        generate_histogram(
+            df_msg_size,
+            comm_type,
+            coll_type,
+            output_file,
+            hr_msg_size,
+        )
+
+        # Box Plot
+        output_file = (
+            msg_size_boxplot_dir / f"boxplot_{comm_type}_{coll_type}_{msg_size}.png"
+        )
+        generate_boxplot(
+            df_msg_size,
+            comm_type,
+            coll_type,
+            output_file,
+            hr_msg_size,
+        )
+
+        output_file = msg_size_dir / f"summary_{comm_type}_{coll_type}_{msg_size}.csv"
+        df_msg_size.to_csv(output_file, index=False)
+        logging.info(
+            f"Summary for {comm_type}, {coll_type}, and msg_size {msg_size} written to {output_file}"
+        )
+
+
+def generate_summary(output_root, comm_type, coll_type, output_dir_name):
+    """Generate summary by summarizing data per comm/coll type and creating visualizations"""
+    logging.info(f"Generating summary for {output_dir_name}, {comm_type} and {coll_type}")
+
+    # Step 1: Summarize data per communication and collective type
+    df = summarize_data_per_comm_coll_type(output_root, comm_type, coll_type, output_dir_name)
+
+    # Step 2: Generate visualizations if data exists
+    if df is not None:
+        generate_visualizations(df, output_root, comm_type, coll_type)
+    else:
+        logging.warning(f"No data found for {comm_type} and {coll_type} - skipping visualization generation")
+
+
+def generate_summary_wrapper(args):
+    return generate_summary(*args)
+
+
+if __name__ == "__main__":
+    files, output_dir_name = get_log_files_and_output_dir()
+    print(f"Number of log files found: {len(files)}")
+    print(f"Output directory: {output_dir_name}")
+    output_dir = Path(output_dir_name)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    setup_logging(output_dir)
+    create_per_node_parquet_files(files, output_dir)
+    comm_types = ["single-rank", "nvlink-only", "hca-only", "mixed"]
+    coll_types = ["AllReduce", "AllGather", "ReduceScatter", "Broadcast"]
+    summary_args = [
+        (output_dir, comm_type, coll_type, output_dir_name)
+        for comm_type in comm_types
+        for coll_type in coll_types
+    ]
+    max_workers = min(64, len(summary_args), os.cpu_count() or 1)
+    with ProcessPoolExecutor(max_workers=max_workers) as executor:
+        list(
+            tqdm(
+                executor.map(generate_summary_wrapper, summary_args),
+                total=len(summary_args),
+                desc="Generating summaries",
+            )
+        )
+        print("Done!")
diff --git a/projects/rccl/ext-profiler/inspector/exporter/example/requirements.txt b/projects/rccl/ext-profiler/inspector/exporter/example/requirements.txt
new file mode 100644
index 0000000000..8a47aae519
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/exporter/example/requirements.txt
@@ -0,0 +1,6 @@
+pandas>=1.3.0
+tqdm>=4.60.0
+duckdb>=0.8.0
+matplotlib>=3.3.0
+pyarrow>=5.0.0
+numpy>=1.21.0
diff --git a/projects/rccl/ext-profiler/inspector/inspector.cc b/projects/rccl/ext-profiler/inspector/inspector.cc
new file mode 100644
index 0000000000..0cb9371d56
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/inspector.cc
@@ -0,0 +1,1530 @@
+#include "inspector.h"
+
+#include <assert.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+#include <errno.h>
+#include <cstring>
+
+#include "common.h"
+
+#define JSON_CHK(expr)                                                  \
+  do {                                                                  \
+    const jsonResult_t res = (expr);                                    \
+    if (res != jsonSuccess) {                                           \
+      INFO(NCCL_INSPECTOR, "jsonError: %s\n", jsonErrorString(res));    \
+      return inspectorJsonError;                                        \
+    }                                                                   \
+  } while (0)
+
+#define INS_CHK(call)                                                   \
+  do {                                                                  \
+    inspectorResult_t res = call;                                       \
+    if (inspectorSuccess != res) {                                      \
+      INFO(NCCL_INSPECTOR, "%s:%d -> error %d: %s", __FILE__, __LINE__, res, \
+           inspectorErrorString(res));                                  \
+      return res;                                                       \
+    }                                                                   \
+  } while (0)
+
+#define JSON_CHK_GOTO(expr, res, label)                                 \
+  do {                                                                  \
+    const jsonResult_t macro_res = (expr);                              \
+    if (macro_res != jsonSuccess) {                                     \
+      INFO(NCCL_INSPECTOR, "jsonError: %s\n", jsonErrorString(macro_res)); \
+      res = inspectorJsonError;                                         \
+      goto label;                                                       \
+    }                                                                   \
+  } while (0)
+
+#define INS_CUDA_CHK(cmd)                                               \
+  do {                                                                  \
+    cudaError_t err = cmd;                                              \
+    if (err != cudaSuccess) {                                           \
+      INFO(NCCL_INSPECTOR, "Cuda failure '%s'", cudaGetErrorString(err)); \
+      return inspectorCudaError;                                        \
+    }                                                                   \
+  } while (false)
+
+
+// Global flag to control inspector use
+static bool enableNcclInspector = false;
+// Global flag to control starting internal dump thread
+static bool enableNcclInspectorDumpThread = false;
+// Global flag to control verbose dumping (event_trace)
+static bool enableNcclInspectorDumpVerbose = false;
+// Extra guard to prevent spurious messages for eager pollers that try to dump
+// out results before we have initialized
+static bool ncclInspectorInit = false;
+
+// Define the global logFn variable
+ncclDebugLogger_t logFn = nullptr;
+
+/*
+ * Description:
+ *
+ *   Returns the current time in microseconds since the epoch.
+ *
+ * Thread Safety:
+ *
+ *   Thread-safe (uses gettimeofday).
+ *
+ * Input:
+ *
+ *   None.
+ *
+ * Output:
+ *
+ *   None.
+ *
+ * Return:
+ *   uint64_t - current time in microseconds.
+ *
+ * Error Handling:
+ *   This function uses gettimeofday() which rarely fails. In case of
+ *   failure, the function returns 0. Callers should check for 0 return
+ *   value if precise error handling is required.
+ *
+ */
+uint64_t inspectorGetTime() {
+  uint64_t ts = 0;
+  timeval tv;
+
+  if (gettimeofday(&tv, 0) != 0) return 0;  // documented failure value
+  ts = (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
+  return ts;
+}
+
+/*
+ * Description:
+ *
+ *   Converts a string to the corresponding ncclDataType_t enum value.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only string input).
+ *
+ * Input:
+ *
+ *   const char* str - string representation of the datatype.
+ *
+ * Output:
+ *
+ *   None.
+ *
+ * Return:
+ *
+ *   ncclDataType_t - corresponding enum value, or -1 if unknown.
+ *
+ */
+ncclDataType_t inspectorStringToDatatype(const char* str) {
+  if (strcmp(str, "ncclInt8") == 0) return ncclInt8;
+  if (strcmp(str, "ncclInt32") == 0) return ncclInt32;
+  if (strcmp(str, "ncclUint32") == 0) return ncclUint32;
+  if (strcmp(str, "ncclInt64") == 0) return ncclInt64;
+  if (strcmp(str, "ncclUint64") == 0) return ncclUint64;
+  if (strcmp(str, "ncclFloat16") == 0) return ncclFloat16;
+  if (strcmp(str, "ncclFloat32") == 0) return ncclFloat32;
+  if (strcmp(str, "ncclFloat64") == 0) return ncclFloat64;
+  if (strcmp(str, "ncclBfloat16") == 0) return ncclBfloat16;
+  if (strcmp(str, "ncclFloat8e4m3") == 0) return ncclFloat8e4m3;
+  if (strcmp(str, "ncclFloat8e5m2") == 0) return ncclFloat8e5m2;
+  return (ncclDataType_t)-1;  // Unknown datatype string
+}
+
+/*
+ * Description:
+ *
+ *   Converts a string to the corresponding ncclFunc_t enum value.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only string input).
+ *
+ * Input:
+ *   const char* str - string representation of the function (must not be NULL).
+ *
+ * Output:
+ *   None.
+ *
+ * Return:
+ *   ncclFunc_t - corresponding enum value, or ncclNumFuncs if unknown.
+ *
+ * Preconditions:
+ *   - str must not be NULL
+ */
+ncclFunc_t ncclStringToFunc(const char* str) {
+  if (strcmp(str, "AllGather") == 0) return ncclFuncAllGather;
+  if (strcmp(str, "AllReduce") == 0) return ncclFuncAllReduce;
+  if (strcmp(str, "Broadcast") == 0) return ncclFuncBroadcast;
+  if (strcmp(str, "Recv") == 0) return ncclFuncRecv;
+  if (strcmp(str, "Reduce") == 0) return ncclFuncReduce;
+  if (strcmp(str, "ReduceScatter") == 0) return ncclFuncReduceScatter;
+  if (strcmp(str, "SendRecv") == 0) return ncclFuncSendRecv;
+  if (strcmp(str, "Send") == 0) return ncclFuncSend;
+  return ncclNumFuncs; // Invalid / unknown
+}
+
+const char* ncclFuncToString(ncclFunc_t fn) {
+  switch (fn) {
+  case ncclFuncAllGather: return "AllGather";
+  case ncclFuncAllReduce: return "AllReduce";
+  case ncclFuncBroadcast: return "Broadcast";
+  case ncclFuncRecv: return "Recv";
+  case ncclFuncReduce: return "Reduce";
+  case ncclFuncReduceScatter: return "ReduceScatter";
+  case ncclFuncSendRecv: return "SendRecv";
+  case ncclFuncSend: return "Send";
+  default: return "Invalid";
+  }
+}
+
+struct inspectorDumpThread;
+static inspectorDumpThread* dumper = nullptr;
+
+#define UNUSED(x) (void)(x)
+
+inspectorResult_t inspectorLockInit(pthread_rwlock_t* lockRef) {
+  if (0 != pthread_rwlock_init(lockRef, nullptr)) {
+    return inspectorLockError;
+  } else {
+    return inspectorSuccess;
+  }
+}
+
+inspectorResult_t inspectorLockDestroy(pthread_rwlock_t* lockRef) {
+  if (0 != pthread_rwlock_destroy(lockRef)) {
+    return inspectorLockError;
+  } else {
+    return inspectorSuccess;
+  }
+}
+
+inspectorResult_t inspectorLockRd(pthread_rwlock_t* lockRef) {
+  if (0 != pthread_rwlock_rdlock(lockRef)) {
+    return inspectorLockError;
+  } else {
+    return inspectorSuccess;
+  }
+}
+
+inspectorResult_t inspectorLockWr(pthread_rwlock_t* lockRef) {
+  if (0 != pthread_rwlock_wrlock(lockRef)) {
+    return inspectorLockError;
+  } else {
+    return inspectorSuccess;
+  }
+}
+
+inspectorResult_t inspectorUnlockRWLock(pthread_rwlock_t* lockRef) {
+  if (0 != pthread_rwlock_unlock(lockRef)) {
+    return inspectorLockError;
+  } else {
+    return inspectorSuccess;
+  }
+}
+
+// TODO inspect these retvals
+#define INSPECTOR_LOCK_RD_FLAG(lockRef, lockFlag, debug)        \
+  do {                                                          \
+    if (!lockFlag) {                                            \
+      INS_CHK(inspectorLockRd(lockRef));                 \
+    }                                                           \
+    lockFlag = true;                                            \
+  } while (0);
+
+#define INSPECTOR_LOCK_WR_FLAG(lockRef, lockFlag, debug)        \
+  do {                                                          \
+    if (!lockFlag) {                                            \
+      INS_CHK(inspectorLockWr(lockRef));                 \
+    }                                                           \
+    lockFlag = true;                                            \
+  } while (0);
+
+#define INSPECTOR_UNLOCK_RW_LOCK_FLAG(lockRef, lockFlag, debug) \
+  do {                                                          \
+    if (lockFlag) {                                             \
+      INS_CHK(inspectorUnlockRWLock(lockRef));           \
+    }                                                           \
+    lockFlag = false;                                           \
+  } while (0);
+
+struct inspectorCommInfoList {
+  struct inspectorCommInfo* comms;
+  uint32_t ncomms;
+  pthread_rwlock_t guard;
+};
+
+struct inspectorState {
+  struct inspectorCommInfoList liveComms;
+  struct inspectorCommInfoList deletedComms;
+};
+
+
+static inspectorState g_state;
+
+static inspectorResult_t inspectorCommInfoListInit(struct inspectorCommInfoList* commList) {
+  if (commList->comms) {
+    return inspectorGlobalInitError;
+  }
+  commList->comms = nullptr;
+  commList->ncomms = 0;
+  INS_CHK(inspectorLockInit(&commList->guard));
+  return inspectorSuccess;
+}
+
+static inspectorResult_t inspectorGlobalStateInit() {
+  memset(&g_state, 0, sizeof(struct inspectorState));
+  INS_CHK(inspectorCommInfoListInit(&g_state.liveComms));
+  INS_CHK(inspectorCommInfoListInit(&g_state.deletedComms));
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Converts inspectorTimingSource_t enum to a string representation.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only operation).
+ *
+ * Input:
+ *   inspectorTimingSource_t timingSource - timing source enum value.
+ *
+ * Output:
+ *   None.
+ *
+ * Return:
+ *   const char* - string representation of the timing source.
+ */
+static const char* inspectorTimingSourceToString(inspectorTimingSource_t timingSource) {
+  switch (timingSource) {
+  case inspectorTimingSourceKernelGpu:
+    return "kernel_gpu";
+  case inspectorTimingSourceKernelCpu:
+    return "kernel_cpu";
+  case inspectorTimingSourceCollectiveCpu:
+    return "collective_cpu";
+  default:
+    return "unknown";
+  }
+}
+
+/*
+ * Description:
+ *
+ *   Writes the header information for a communicator to the JSON output.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called with proper locking).
+ *
+ * Input:
+ *   jsonFileOutput* jfo - JSON output handle.
+ *   struct inspectorCommInfo* commInfo - communicator info.
+ *
+ * Output:
+ *   Header is written to JSON output.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inspectorResult_t inspectorCommInfoHeader(jsonFileOutput* jfo,
+                                                 struct inspectorCommInfo* commInfo) {
+  JSON_CHK(jsonStartObject(jfo));
+  JSON_CHK(jsonKey(jfo, "id")); JSON_CHK(jsonStr(jfo, commInfo->commHashStr));
+  JSON_CHK(jsonKey(jfo, "rank")); JSON_CHK(jsonInt(jfo, commInfo->rank));
+  JSON_CHK(jsonKey(jfo, "n_ranks")); JSON_CHK(jsonInt(jfo, commInfo->nranks));
+  JSON_CHK(jsonKey(jfo, "nnodes")); JSON_CHK(jsonUint64(jfo, commInfo->nnodes));
+  JSON_CHK(jsonFinishObject(jfo));
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Writes metadata header information to the JSON output.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called with proper locking).
+ *
+ * Input:
+ *   jsonFileOutput* jfo - JSON output handle.
+ *
+ * Output:
+ *   Metadata header is written to JSON output.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inspectorResult_t inspectorCommInfoMetaHeader(jsonFileOutput* jfo) {
+  JSON_CHK(jsonStartObject(jfo));
+  {
+    JSON_CHK(jsonKey(jfo, "inspector_output_format_version")); JSON_CHK(jsonStr(jfo, "v4.0"));
+    JSON_CHK(jsonKey(jfo, "git_rev")); JSON_CHK(jsonStr(jfo, get_git_version_info()));
+    JSON_CHK(jsonKey(jfo, "rec_mechanism")); JSON_CHK(jsonStr(jfo, "nccl_profiler_interface"));
+    JSON_CHK(jsonKey(jfo, "dump_timestamp_us")); JSON_CHK(jsonUint64(jfo, inspectorGetTime()));
+    char hostname[256] = {0};  // zero-filled so a truncated name stays NUL-terminated
+    gethostname(hostname, 255);
+    JSON_CHK(jsonKey(jfo, "hostname")); JSON_CHK(jsonStr(jfo, hostname));
+    JSON_CHK(jsonKey(jfo, "pid")); JSON_CHK(jsonUint64(jfo, getpid()));
+  }
+  JSON_CHK(jsonFinishObject(jfo));
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Writes verbose information (event_trace) for a completed
+ *   collective operation to the JSON output.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called with proper locking).
+ *
+ * Input:
+ *   jsonFileOutput* jfo - JSON output handle.
+ *   struct inspectorCompletedCollInfo* collInfo - completed
+ *   collective info.
+ *
+ * Output:
+ *   Verbose collective info is written to JSON output.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inline inspectorResult_t inspectorCompletedCollVerbose(jsonFileOutput* jfo,
+                                                              struct inspectorCompletedCollInfo* collInfo) {
+  // Add event trace information
+  JSON_CHK(jsonKey(jfo, "event_trace_sn"));
+  JSON_CHK(jsonStartObject(jfo));
+  {
+    // Collective events
+    JSON_CHK(jsonKey(jfo, "coll_start_sn")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.evntTrace[NCCL_INSP_EVT_TRK_COLL_START].sn));
+    JSON_CHK(jsonKey(jfo, "coll_stop_sn")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.evntTrace[NCCL_INSP_EVT_TRK_COLL_STOP].sn));
+
+    // Kernel events
+    JSON_CHK(jsonKey(jfo, "kernel_events"));
+    JSON_CHK(jsonStartList(jfo));
+    for (uint32_t ch = 0; ch < collInfo->collEvtTrk.nChannels; ch++) {
+      JSON_CHK(jsonStartObject(jfo));
+      JSON_CHK(jsonKey(jfo, "channel_id")); JSON_CHK(jsonInt(jfo, ch));
+      JSON_CHK(jsonKey(jfo, "kernel_start_sn")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.kernelCh[ch].evntTrace[NCCL_INSP_EVT_TRK_KERNEL_START].sn));
+      JSON_CHK(jsonKey(jfo, "kernel_stop_sn")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.kernelCh[ch].evntTrace[NCCL_INSP_EVT_TRK_KERNEL_STOP].sn));
+      JSON_CHK(jsonKey(jfo, "kernel_record_sn")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.kernelCh[ch].evntTrace[NCCL_INSP_EVT_TRK_KERNEL_RECORD].sn));
+      JSON_CHK(jsonFinishObject(jfo));
+    }
+    JSON_CHK(jsonFinishList(jfo));
+  }
+  JSON_CHK(jsonFinishObject(jfo));
+
+  JSON_CHK(jsonKey(jfo, "event_trace_ts"));
+  JSON_CHK(jsonStartObject(jfo));
+  {
+    // Collective events
+    JSON_CHK(jsonKey(jfo, "coll_start_ts")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.evntTrace[NCCL_INSP_EVT_TRK_COLL_START].ts));
+    JSON_CHK(jsonKey(jfo, "coll_stop_ts")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.evntTrace[NCCL_INSP_EVT_TRK_COLL_STOP].ts));
+
+    // Kernel events
+    JSON_CHK(jsonKey(jfo, "kernel_events"));
+    JSON_CHK(jsonStartList(jfo));
+    for (uint32_t ch = 0; ch < collInfo->collEvtTrk.nChannels; ch++) {
+      JSON_CHK(jsonStartObject(jfo));
+      JSON_CHK(jsonKey(jfo, "channel_id")); JSON_CHK(jsonInt(jfo, ch));
+      JSON_CHK(jsonKey(jfo, "kernel_start_ts")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.kernelCh[ch].evntTrace[NCCL_INSP_EVT_TRK_KERNEL_START].ts));
+      JSON_CHK(jsonKey(jfo, "kernel_stop_ts")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.kernelCh[ch].evntTrace[NCCL_INSP_EVT_TRK_KERNEL_STOP].ts));
+      JSON_CHK(jsonKey(jfo, "kernel_record_ts")); JSON_CHK(jsonUint64(jfo, collInfo->collEvtTrk.kernelCh[ch].evntTrace[NCCL_INSP_EVT_TRK_KERNEL_RECORD].ts));
+      JSON_CHK(jsonFinishObject(jfo));
+    }
+    JSON_CHK(jsonFinishList(jfo));
+  }
+  JSON_CHK(jsonFinishObject(jfo));
+
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Writes completed collective operation information to the JSON
+ *   output.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called with proper locking).
+ *
+ * Input:
+ *   jsonFileOutput* jfo - JSON output handle.
+ *   struct inspectorCompletedCollInfo* collInfo - completed
+ *   collective info.
+ *
+ * Output:
+ *   Collective info is written to JSON output.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inline inspectorResult_t inspectorCompletedColl(jsonFileOutput* jfo,
+                                                        struct inspectorCompletedCollInfo* collInfo) {
+  JSON_CHK(jsonStartObject(jfo));
+  {
+
+    JSON_CHK(jsonKey(jfo, "coll")); JSON_CHK(jsonStr(jfo, ncclFuncToString(collInfo->func)));
+
+    JSON_CHK(jsonKey(jfo, "coll_sn")); JSON_CHK(jsonUint64(jfo, collInfo->sn));
+
+    JSON_CHK(jsonKey(jfo, "coll_msg_size_bytes")); JSON_CHK(jsonUint64(jfo, collInfo->msgSizeBytes));
+
+    JSON_CHK(jsonKey(jfo, "coll_exec_time_us")); JSON_CHK(jsonUint64(jfo, collInfo->execTimeUsecs));
+
+    JSON_CHK(jsonKey(jfo, "coll_timing_source")); JSON_CHK(jsonStr(jfo, inspectorTimingSourceToString(collInfo->timingSource)));
+
+    JSON_CHK(jsonKey(jfo, "coll_algobw_gbs")); JSON_CHK(jsonDouble(jfo, collInfo->algoBwGbs));
+
+    JSON_CHK(jsonKey(jfo, "coll_busbw_gbs")); JSON_CHK(jsonDouble(jfo, collInfo->busBwGbs));
+
+    if (enableNcclInspectorDumpVerbose) {
+      INS_CHK(inspectorCompletedCollVerbose(jfo, collInfo));
+    }
+  }
+  JSON_CHK(jsonFinishObject(jfo));
+
+  return inspectorSuccess;
+}
+
+
+/*
+ * Description:
+ *
+ *   Dumps the state of a communicator to the JSON output if needed.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called with proper locking).
+ *
+ * Input:
+ *   jsonFileOutput* jfo - JSON output handle.
+ *   inspectorCommInfo* commInfo - communicator info.
+ *   bool* needs_writing - set to true if output was written.
+ *
+ * Output:
+ *   State is dumped to JSON output if needed.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inspectorResult_t inspectorCommInfoDump(jsonFileOutput* jfo,
+                                               inspectorCommInfo* commInfo,
+                                               bool* needs_writing) {
+  *needs_writing = false;
+
+  if (commInfo == nullptr)
+    return inspectorSuccess;
+
+  struct inspectorCompletedCollInfo collInfo;
+  memset(&collInfo, 0, sizeof(struct inspectorCompletedCollInfo));
+
+  inspectorLockWr(&commInfo->guard);
+  if (commInfo->dump) {
+    *needs_writing = true;
+    memcpy(&collInfo,
+           &commInfo->completedCollInfo,
+           sizeof(struct inspectorCompletedCollInfo));
+    commInfo->dump = false;
+  }
+  inspectorUnlockRWLock(&commInfo->guard);
+
+  if (*needs_writing) {
+    JSON_CHK(jsonLockOutput(jfo));
+    JSON_CHK(jsonStartObject(jfo));
+    {
+      JSON_CHK(jsonKey(jfo, "header"));
+      INS_CHK(inspectorCommInfoHeader(jfo, commInfo));
+
+      JSON_CHK(jsonKey(jfo, "metadata"));
+      INS_CHK(inspectorCommInfoMetaHeader(jfo));
+
+      JSON_CHK(jsonKey(jfo, "coll_perf"));
+      INS_CHK(inspectorCompletedColl(jfo, &collInfo));
+    }
+    JSON_CHK(jsonFinishObject(jfo));
+    JSON_CHK(jsonNewline(jfo));
+    JSON_CHK(jsonUnlockOutput(jfo));
+  }
+  return inspectorSuccess;
+}
+
+
+/*
+ * Description:
+ *
+ *   Dumps the state of all communicators in a commList to the JSON
+ *   output.
+ *
+ * Thread Safety:
+ *   Thread-safe - assumes no locks are taken and acquires all necessary
+ *   locks to iterate through all communicator objects and dump their state.
+ *
+ * Input:
+ *   jsonFileOutput* jfo - JSON output handle (must not be NULL).
+ *   struct inspectorCommInfoList* commList - list of communicators (must not be NULL).
+ *
+ * Output:
+ *   State of all communicators is dumped to JSON output.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inspectorResult_t inspectorCommInfoListDump(jsonFileOutput* jfo,
+                                                   struct inspectorCommInfoList* commList) {
+  bool flush = false;
+  INS_CHK(inspectorLockRd(&commList->guard));
+  inspectorResult_t res = inspectorSuccess;
+  if (commList->ncomms > 0) {
+    for (struct inspectorCommInfo* itr = commList->comms;
+         itr != nullptr;
+         itr = itr->next) {
+      bool needs_writing;
+      INS_CHK_GOTO(inspectorCommInfoDump(jfo, itr, &needs_writing), res, finalize);
+      if (needs_writing) {
+        flush = true;
+      }
+    }
+    if (flush) {
+      JSON_CHK_GOTO(jsonLockOutput(jfo), res, finalize);
+      JSON_CHK_GOTO(jsonFlushOutput(jfo), res, finalize);
+      JSON_CHK_GOTO(jsonUnlockOutput(jfo), res, finalize);
+    }
+  }
+finalize:
+  INS_CHK(inspectorUnlockRWLock(&commList->guard));
+  return res;
+}
+
+/*
+ * Description:
+ *   Finalizes and cleans up a commList, freeing all communicators.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called with proper locking).
+ *
+ * Input:
+ *   struct inspectorCommInfoList* commList - list of communicators.
+ *
+ * Output:
+ *   All communicators are freed.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+static inspectorResult_t inspectorCommInfoListFinalize(struct inspectorCommInfoList* commList) {
+  struct inspectorCommInfo* nextComm = nullptr;
+  INS_CHK(inspectorLockWr(&commList->guard));
+  while (commList->comms != nullptr && commList->ncomms != 0) {
+    INFO(NCCL_INSPECTOR, "NCCL Inspector: comm %lu still in tracker",
+         commList->comms->commHash);
+    nextComm = commList->comms->next;
+    INS_CHK(inspectorLockDestroy(&commList->comms->guard));
+    free(commList->comms);
+    commList->comms = nextComm;
+    commList->ncomms--;
+  }
+  INS_CHK(inspectorUnlockRWLock(&commList->guard));
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Ensures the given directory exists and is writable, creating it
+ *   if necessary.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called during initialization).
+ *
+ * Input:
+ *   char* workdir - directory path.
+ *
+ * Output:
+ *   Directory is created if needed.
+ *
+ * Return:
+ *
+ *   bool - true if directory exists and is writable, false otherwise.
+ *
+ */
+static bool ensureDir(char* workdir) {
+  struct stat st;
+
+  // Check if directory exists
+  if (stat(workdir, &st) == 0) {
+    if (S_ISDIR(st.st_mode)) {
+      // Directory exists, check if it's writable
+      if (access(workdir, W_OK) == 0) {
+        return true; // Directory exists and is writable
+      } else {
+        INFO(NCCL_INSPECTOR,
+             "NCCL Inspector: dump directory %s exists, but is not "
+             "writable",
+             workdir);
+        return false;
+      }
+    } else {
+      INFO(NCCL_INSPECTOR,
+           "NCCL Inspector: dump location %s exists, but is not a "
+           "directory",
+           workdir);
+      return false;
+    }
+  } else {
+    // Directory doesn't exist, try to create it
+    const mode_t mode = 0777;
+    if (mkdir(workdir, mode) == 0) {
+      return true; // Directory created successfully
+    } else {
+      INFO(NCCL_INSPECTOR,
+           "NCCL Inspector: failed to create dump directory %s: %s", workdir,
+           strerror(errno));
+      return false;
+    }
+  }
+}
+
+/*
+ * Description:
+ *
+ *   Generates the output dump directory path based on environment
+ *   variables.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called during initialization).
+ *
+ * Input:
+ *   char** workdir - pointer to output directory string.
+ *
+ * Output:
+ *   workdir is set to the generated directory path.
+ *
+ * Return:
+ *   None.
+ */
+static void genDumpDir(char** workdir) {
+  char* dumpdir = getenv("NCCL_INSPECTOR_DUMP_DIR");
+  if (dumpdir != NULL) {
+    *workdir = strdup(dumpdir);
+    // TODO check errors here
+    return;
+  }
+
+  char* jobid = getenv("SLURM_JOBID");
+  bool badJobId = true;
+  if (jobid != NULL) {
+    errno = 0;
+    const long intid = strtol(jobid, NULL, 10);
+    if (errno == 0) {
+      char tmp[2048];
+      snprintf(tmp, sizeof(tmp), "nccl-inspector-%ld", intid);
+      *workdir = strdup(tmp);
+      badJobId = false;
+    }
+  }
+
+  if (badJobId) {
+    *workdir = strdup("nccl-inspector-unknown-jobid");
+  }
+}
+
+struct inspectorDumpThread {
+  bool run{false};
+  jsonFileOutput* jfo;
+  char* outputRoot;
+  uint64_t sampleIntervalUsecs;
+  pthread_t pthread;
+  pthread_rwlock_t guard;
+
+  inspectorDumpThread(const char* outputRoot, uint64_t sampleIntervalUsecs)
+    : jfo(nullptr), outputRoot(strdup(outputRoot)), sampleIntervalUsecs(sampleIntervalUsecs) {
+    if (inspectorLockInit(&guard) != inspectorSuccess) {
+      INFO(NCCL_INSPECTOR, "NCCL Inspector inspectorDumpThread: couldn't init lock");
+    }
+  }
+
+  ~inspectorDumpThread() {
+    if (jfo != nullptr) {
+      jsonFinalizeFileOutput(jfo);
+      jfo = nullptr;
+    }
+    if (outputRoot != nullptr) {
+      free(outputRoot);
+      outputRoot = nullptr;
+    }
+    if (inspectorLockDestroy(&guard) != inspectorSuccess) {
+      INFO(NCCL_INSPECTOR, "NCCL Inspector inspectorDumpThread: couldn't destroy lock");
+    }
+  }
+
+  void startThread() {
+    inspectorLockWr(&guard);
+    run = true;
+    inspectorUnlockRWLock(&guard);
+    if (pthread_create(&pthread, NULL, dumpMain, this) != 0) {
+      INFO(NCCL_INSPECTOR,
+           "NCCL Inspector inspectorDumpThread: couldn't create dump thread!");
+      return;
+    }
+    INFO(NCCL_INSPECTOR, "NCCL Inspector inspectorDumpThread: created");
+  }
+
+  void stopThread() {
+    INFO(NCCL_INSPECTOR, "NCCL Inspector Stopping Dump thread");
+    inspectorLockWr(&guard);
+    run = false;
+    inspectorUnlockRWLock(&guard);
+    struct timespec ts;
+    ts.tv_sec = 0;
+    ts.tv_nsec = 1000000; // 1ms
+    nanosleep(&ts, NULL);
+    INFO(NCCL_INSPECTOR, "NCCL Inspector inspectorDumpThread: stopped");
+  }
+
+  inspectorResult_t inspectorStateDump(const char* output_root) {
+    if (!ncclInspectorInit) {
+      return inspectorUninitializedError;
+    }
+    if (!enableNcclInspector) {
+      INFO(NCCL_INSPECTOR, "NCCL Inspector is not enabled, skipping state dump");
+      return inspectorDisabledError;
+    }
+
+    if (jfo == 0) {
+      char hostname[256] = {0}; // zero-filled so a truncated name stays NUL-terminated
+      gethostname(hostname, 255);
+      char tmp[2048];
+      snprintf(tmp, sizeof(tmp), "%s/%s-pid%d.log", output_root, hostname, getpid());
+      jsonResult_t result = jsonInitFileOutput(&jfo, tmp);
+      if (jsonSuccess != result) {
+        INFO(NCCL_INSPECTOR, "Cannot open %s for writing: %s", tmp, jsonErrorString(result));
+        return inspectorFileOpenError;
+      }
+      chmod(tmp, 0666);
+    }
+
+    if (jfo != nullptr) {
+      inspectorCommInfoListDump(jfo, &g_state.liveComms);
+      inspectorCommInfoListDump(jfo, &g_state.deletedComms);
+    }
+
+    if (g_state.deletedComms.ncomms > 0) {
+      inspectorCommInfoListFinalize(&g_state.deletedComms);
+    }
+    return inspectorSuccess;
+  }
+
+  static void* dumpMain(void* arg) {
+    inspectorDumpThread* dumper = (inspectorDumpThread*)arg;
+    inspectorResult_t res = inspectorSuccess;
+    struct timespec ts;
+    ts.tv_sec = dumper->sampleIntervalUsecs / 1000000;
+    ts.tv_nsec = (dumper->sampleIntervalUsecs % 1000000) * 1000; // usec remainder -> nsec
+
+    while (dumper->run) {
+      inspectorLockWr(&dumper->guard);
+      if (!dumper->run) {
+        inspectorUnlockRWLock(&dumper->guard);
+        break;
+      }
+      res = dumper->inspectorStateDump(dumper->outputRoot);
+      if (res == inspectorFileOpenError || res == inspectorDisabledError) {
+        inspectorUnlockRWLock(&dumper->guard);
+        break;
+      }
+      inspectorUnlockRWLock(&dumper->guard);
+
+      nanosleep(&ts, NULL);
+    }
+
+    return 0;
+  }
+};
+
+/*
+ * Description:
+ *
+ *   Logs the NCCL Inspector plugin version, as reported by
+ *   get_git_version_info(), in a format similar to NCCL's
+ *   showVersion function.
+ *
+ * Thread Safety:
+ *   Thread-safe (does not modify shared state).
+ *
+ * Input:
+ *   None.
+ *
+ * Output:
+ *   Logs the plugin version to debug output.
+ *
+ * Return:
+ *   None.
+ */
+static void showInspectorVersion() {
+  VERSION("NCCL Inspector Plugin - Version: %s", get_git_version_info());
+}
+
+/*
+ * Description:
+ *
+ *   Shows all NCCL Inspector environment variables and their values
+ *   in a structured format.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only environment variable access).
+ *
+ * Input:
+ *   None.
+ *
+ * Output:
+ *   Logs environment variables to debug output.
+ *
+ * Return:
+ *   None.
+ */
+static void showInspectorEnvVars() {
+  struct {
+    const char* name;
+    const char* value;
+    const char* defaultVal;
+    const char* description;
+  } envVars[] = {
+    {"NCCL_INSPECTOR_ENABLE", getenv("NCCL_INSPECTOR_ENABLE"), "0", "Enable/disable inspector plugin"},
+    {"NCCL_INSPECTOR_DUMP_THREAD_ENABLE", getenv("NCCL_INSPECTOR_DUMP_THREAD_ENABLE"), "1", "Enable/disable dump thread"},
+    {"NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS", getenv("NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS"), "0", "Dump thread interval in microseconds"},
+    {"NCCL_INSPECTOR_DUMP_DIR", getenv("NCCL_INSPECTOR_DUMP_DIR"), "(auto-generated)", "Output directory for inspector logs"},
+    {"NCCL_INSPECTOR_DUMP_VERBOSE", getenv("NCCL_INSPECTOR_DUMP_VERBOSE"), "0", "Enable/disable verbose dumping (event_trace)"}
+  };
+
+  const int numEnvVars = sizeof(envVars) / sizeof(envVars[0]);
+
+  VERSION("NCCL Inspector Environment Variables:");
+  for (int i = 0; i < numEnvVars; i++) {
+    VERSION("  %s = %s%s%s",
+            envVars[i].name,
+            envVars[i].value ? envVars[i].value : "(not set)",
+            envVars[i].value ? "" : ", default=",
+            envVars[i].value ? "" : envVars[i].defaultVal);
+  }
+}
+
+/*
+ * Description:
+ *
+ *   Initializes the global inspector state and starts the dump thread
+ *   if enabled.
+ *
+ * Thread Safety:
+ *
+ *   Not thread-safe (should be called during initialization).
+ *
+ * Input:
+ *   None.
+ *
+ * Output:
+ *   Global state is initialized and dump thread may be started.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ */
+inspectorResult_t inspectorGlobalInit(int rank) {
+  char* str = getenv("NCCL_INSPECTOR_ENABLE");
+  int enable = str ? atoi(str) : 0; // default disable
+  enableNcclInspector = enable == 0 ? false : true;
+  ncclInspectorInit = true;
+
+  // Show version and environment configuration (similar to NCCL's showVersion)
+  if (rank == 0) {
+    showInspectorVersion();
+    showInspectorEnvVars();
+  }
+
+  if (enableNcclInspector == false) {
+    VERSION("NCCL Inspector Plugin DISABLED (NCCL_INSPECTOR_ENABLE=%s)",
+            str ? str : "0");
+    return inspectorDisabledError;
+  }
+
+  INS_CHK(inspectorGlobalStateInit());
+
+  str = getenv("NCCL_INSPECTOR_DUMP_THREAD_ENABLE");
+  enable = str ? atoi(str) : 1; // default enable
+  enableNcclInspectorDumpThread = enable == 0 ? false : true;
+
+  str = getenv("NCCL_INSPECTOR_DUMP_VERBOSE");
+  enable = str ? atoi(str) : 0; // default disable
+  enableNcclInspectorDumpVerbose = enable == 0 ? false : true;
+
+  if (enableNcclInspectorDumpThread) {
+    str = getenv("NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS");
+    const uint64_t interval = str ? strtoull(str, 0, 0) : 0;
+
+    if (interval == 0) {
+      INFO(NCCL_INSPECTOR, "NCCL Inspector: dump thread enabled but "
+           "NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS is 0; not "
+           "starting internal dump "
+           "thread.");
+      return inspectorSuccess;
+    }
+
+    char* dumpdir;
+    genDumpDir(&dumpdir);
+
+    if (dumpdir != nullptr) {
+      if (!ensureDir(dumpdir)) {
+        free(dumpdir);
+        INFO(NCCL_INSPECTOR, "NCCL Inspector: failed to generate a dump dir; not "
+             "starting internal dump thread.");
+        return inspectorSuccess;
+      }
+
+      dumper = new inspectorDumpThread(dumpdir, interval);
+      dumper->startThread();
+
+      INFO(NCCL_INSPECTOR,
+           "NCCL Inspector enabled with polling interval %lu us and "
+           "output directory %s",
+           interval, dumpdir);
+      free(dumpdir);
+    } else {
+      INFO(NCCL_INSPECTOR, "NCCL Inspector: failed to generate a dump "
+           "dir; not starting internal dump thread.");
+    }
+  } else {
+    INFO(NCCL_INSPECTOR,
+         "NCCL Inspector: NCCL_INSPECTOR_DUMP_THREAD_ENABLE set to 0; not "
+         "starting internal dump "
+         "thread.");
+  }
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Returns a string describing the given inspectorResult_t error
+ *   code.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only operation).
+ *
+ * Input:
+ *   inspectorResult_t result - error code.
+ *
+ * Output:
+ *   None.
+ *
+ * Return:
+ *   const char* - error string.
+ */
+const char* inspectorErrorString(inspectorResult_t result) {
+  switch (result) {
+  case inspectorSuccess:
+    return "Success";
+  case inspectorUninitializedError:
+    return "Inspector is not initialized";
+  case inspectorMemoryError:
+    return "Inspector encountered issue allocating memory";
+  case inspectorFileOpenError:
+    return "Inspector could not open file";
+  case inspectorDisabledError:
+    return "Inspector is disabled";
+  case inspectorLockError:
+    return "Inspector encountered error with lock";
+  case inspectorPthreadError:
+    return "Inspector encountered error with pthreads";
+  case inspectorJsonError:
+    return "Inspector encountered error while emitting JSON";
+  case inspectorCudaError:
+    return "Inspector encountered CUDA error";
+  case inspectorBadHash:
+    return "Inspector encountered bad communicator hash";
+  case inspectorDeleteUnknownCommError:
+    return "Inspector was asked to delete a communicator that it is not "
+      "tracking";
+  case inspectorAddDuplicateCommError:
+    return "Inspector was asked to add a communicator it was already "
+      "tracking";
+  case inspectorNop:
+    return "Inspector NOP";
+  case inspectorNullTally:
+    return "Inspector encountered a null OpTally";
+  case inspectorGlobalInitError:
+    return "Inspector encountered a repeated global init";
+  case inspectorReturn:
+    return "Inspector Unconditional Return";
+  default:
+    return "Unknown error";
+  }
+}
+
+/*
+ * Description:
+ *   Converts a communicator hash to a string.
+ *
+ * Thread Safety:
+ *   Thread-safe (writes to provided buffer).
+ *
+ * Input:
+ *   uint64_t commHash - communicator hash.
+ *   char hashStr[NCCL_COMM_HASH_LENGTH] - output buffer.
+ *
+ * Output:
+ *   hashStr is set to the string representation of commHash.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ */
+inspectorResult_t inspectorCommGetHashStr(uint64_t commHash,
+                                          char hashStr[NCCL_COMM_HASH_LENGTH]) {
+  snprintf(hashStr, NCCL_COMM_HASH_LENGTH, "0x%lx",
+           commHash);
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *   Compares two communicator configurations for equality.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only comparison).
+ *
+ * Input:
+ *   uint64_t lCommHash - left communicator hash.
+ *   uint64_t rCommHash - right communicator hash.
+ *   int lRank - left rank.
+ *   int rRank - right rank.
+ *
+ * Output:
+ *   None.
+ *
+ * Return:
+ *   bool - true if communicators are equal (same hash and rank), false otherwise.
+ */
+static bool comm_eq(uint64_t lCommHash, uint64_t rCommHash,
+                    int lRank, int rRank) {
+  return lCommHash == rCommHash && lRank == rRank;
+}
+
+/*
+ * Description:
+ *   Initializes a communicator info structure with the provided parameters.
+ *
+ * Thread Safety:
+ *   Not thread-safe - should be called during communicator initialization.
+ *
+ * Input:
+ *   struct inspectorCommInfo* commInfo - communicator info structure to initialize (must not be NULL).
+ *   const char* commName - communicator name (can be NULL).
+ *   uint64_t commHash - communicator hash.
+ *   int nnodes - number of nodes (must be > 0).
+ *   int nranks - number of ranks (must be > 0).
+ *   int rank - rank (must be >= 0 and < nranks).
+ *
+ * Output:
+ *   commInfo is initialized with the provided parameters.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ * Preconditions:
+ *   - commInfo must not be NULL
+ *   - nnodes must be positive
+ *   - nranks must be positive
+ *   - rank must be non-negative and less than nranks
+ */
+static inspectorResult_t inspectorFillCommInfo(struct inspectorCommInfo* commInfo,
+                                               const char* commName, uint64_t commHash,
+                                               int nnodes, int nranks, int rank) {
+  commInfo->commName = commName;
+  commInfo->commHash = commHash;
+  inspectorCommGetHashStr(commHash, commInfo->commHashStr);
+  commInfo->rank = rank;
+  commInfo->nranks = nranks;
+  commInfo->nnodes = nnodes;
+  commInfo->dump = false;
+  INS_CHK(inspectorLockInit(&commInfo->guard));
+  commInfo->next = nullptr;
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *   Adds a communicator to the global state.
+ *
+ * Thread Safety:
+ *   Thread-safe (uses locks internally).
+ *
+ * Input:
+ *   struct inspectorCommInfo **commInfo - pointer to output struct (must not be NULL).
+ *   const char* commName - communicator name (can be NULL).
+ *   uint64_t commHash - communicator hash.
+ *   int nNodes - number of nodes (must be > 0).
+ *   int nranks - number of ranks (must be > 0).
+ *   int rank - rank (must be >= 0 and < nranks).
+ *
+ * Output:
+ *   commInfo is set to the new communicator struct.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ * Preconditions:
+ *   - commInfo must not be NULL
+ *   - nNodes must be positive
+ *   - nranks must be positive
+ *   - rank must be non-negative and less than nranks
+ */
+inspectorResult_t inspectorAddComm(struct inspectorCommInfo **commInfo,
+                                   const char* commName, uint64_t commHash,
+                                   int nNodes, int nranks, int rank) {
+  struct inspectorCommInfoList* liveCommInfoList = &g_state.liveComms;
+  struct inspectorCommInfo* commInfoPtr = nullptr;
+
+  inspectorResult_t res = inspectorSuccess;
+  bool locked = false;
+  INSPECTOR_LOCK_RD_FLAG(&liveCommInfoList->guard, locked,
+                         "inspectorAddComm: commList::guard -rd");
+  for (struct inspectorCommInfo* itr = liveCommInfoList->comms;
+       itr != nullptr;
+       itr = itr->next) {
+    if (comm_eq(commHash, itr->commHash, rank, itr->rank)) {
+      INFO(NCCL_INSPECTOR, "NCCL Inspector: comm 0x%lx already in tracker",
+           commHash);
+      res = inspectorAddDuplicateCommError;
+      goto finalize;
+    }
+  }
+  INSPECTOR_UNLOCK_RW_LOCK_FLAG(&liveCommInfoList->guard, locked,
+                                "inspectorAddComm: commList::guard");
+  commInfoPtr
+    = (struct inspectorCommInfo*)calloc(1, sizeof(struct inspectorCommInfo));
+  if (0 == commInfoPtr) {
+    res = inspectorMemoryError;
+    goto finalize;
+  }
+  INS_CHK_GOTO(inspectorFillCommInfo(commInfoPtr,
+                                     commName,
+                                     commHash,
+                                     nNodes,
+                                     nranks,
+                                     rank),
+               res, fail);
+
+  INSPECTOR_LOCK_WR_FLAG(&liveCommInfoList->guard, locked,
+                         "inspectorAddComm: commList::guard -wr");
+  ++liveCommInfoList->ncomms;
+  commInfoPtr->next = liveCommInfoList->comms;
+  liveCommInfoList->comms = commInfoPtr;
+
+finalize:
+  INSPECTOR_UNLOCK_RW_LOCK_FLAG(&liveCommInfoList->guard, locked,
+                                "inspectorAddComm: commList::guard");
+  *commInfo = commInfoPtr;
+  return res;
+fail:
+  if (commInfoPtr) {
+    free(commInfoPtr);
+    commInfoPtr = nullptr;
+  }
+  goto finalize;
+}
+
+/*
+ * Description:
+ *
+ *   Removes a communicator from the global state and moves it to the
+ *   deleted list.
+ *
+ * Thread Safety:
+ *   Thread-safe (uses locks internally).
+ *
+ * Input:
+ *   struct inspectorCommInfo *commInfo - communicator to remove.
+ *
+ * Output:
+ *   Communicator is removed from live list and added to deleted list.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ */
+inspectorResult_t inspectorDelComm(struct inspectorCommInfo *commInfo) {
+  struct inspectorCommInfoList* liveCommInfoList = &g_state.liveComms;
+  struct inspectorCommInfoList* deletedCommInfoList = &g_state.deletedComms;
+  struct inspectorCommInfo* commInfoPtr = nullptr;
+  bool locked = false;
+
+  INFO(NCCL_INSPECTOR, "NCCL Inspector: DelComm removing 0x%lx",
+       commInfo->commHash);
+
+  INSPECTOR_LOCK_WR_FLAG(&liveCommInfoList->guard, locked,
+                         "inspectorDelComm: liveCommInfoList::guard -wr");
+  struct inspectorCommInfo** prev_ptr = &liveCommInfoList->comms;
+  for (struct inspectorCommInfo* itr = liveCommInfoList->comms;
+       itr != nullptr;
+       itr = itr->next) {
+    if (comm_eq(commInfo->commHash, itr->commHash, commInfo->rank, itr->rank)) {
+      *prev_ptr = itr->next;
+      liveCommInfoList->ncomms--;
+
+      commInfoPtr = itr;
+      break;
+    }
+    prev_ptr = &itr->next;
+  }
+  INSPECTOR_UNLOCK_RW_LOCK_FLAG(&liveCommInfoList->guard, locked,
+                                "inspectorDelComm: liveCommInfoList::guard -unlock");
+
+  if (!commInfoPtr) {
+    INFO(NCCL_INSPECTOR, "NCCL Inspector: DelComm can't remove 0x%lx, not present",
+         commInfo->commHash);
+    return inspectorDeleteUnknownCommError;
+  }
+
+  inspectorLockWr(&commInfoPtr->guard);
+  commInfoPtr->dump = false;
+  inspectorUnlockRWLock(&commInfoPtr->guard);
+
+  INSPECTOR_LOCK_WR_FLAG(&deletedCommInfoList->guard, locked,
+                         "inspectorDelComm: deletedCommInfoList::guard -wr");
+  commInfoPtr->next = deletedCommInfoList->comms;
+  deletedCommInfoList->comms = commInfoPtr;
+  deletedCommInfoList->ncomms++;
+  INSPECTOR_UNLOCK_RW_LOCK_FLAG(&deletedCommInfoList->guard, locked,
+                                "inspectorDelComm: deletedCommInfoList::guard -unlock");
+
+  return inspectorSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Computes the algorithmic and bus bandwidth (in GB/s) for a given
+ *   NCCL collective operation, based on the communication info and
+ *   completed collective details. The calculation uses the message
+ *   size, execution time, and the type of collective operation to
+ *   determine the effective bandwidths. The 'factor' variable adjusts
+ *   the bus bandwidth calculation according to the communication
+ *   pattern of each collective, as described in the NCCL performance
+ *   documentation:
+ *   https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
+ *
+ * Thread Safety:
+ *
+ *   This function does not perform any locking and assumes the caller
+ *   ensures thread safety if required.
+ *
+ * Input:
+ *
+ *   commInfo - Pointer to inspectorCommInfo structure containing
+ *   communicator details.
+ *
+ *   completedColl - Pointer to inspectorCompletedCollInfo structure
+ *   containing completed collective info.
+ *
+ *   collType - The type of NCCL collective operation (ncclFunc_t).
+ *
+ * Output:
+ *   Updates the algoBwGbs and busBwGbs fields of the completedColl
+ *   structure.
+ *
+ * Return:
+ *   N.A. (void function)
+ */
+void inspectorComputeCollBw(struct inspectorCommInfo *commInfo,
+                            struct inspectorCompletedCollInfo *completedColl,
+                            ncclFunc_t collType) {
+  double timeInSec = completedColl->execTimeUsecs / 1000000.0;
+  double factor = 0.0;
+  double trafficSize = 0.0;
+  switch (collType) {
+  case ncclFuncReduce:
+  case ncclFuncBroadcast:
+    trafficSize = (double)completedColl->msgSizeBytes;
+    factor = 1;
+    break;
+  case ncclFuncAllReduce:
+    trafficSize = (double)completedColl->msgSizeBytes;
+    factor = ((double)(2 * (commInfo->nranks - 1))) / ((double)commInfo->nranks);
+    break;
+  case ncclFuncReduceScatter:
+    trafficSize = (double)(completedColl->msgSizeBytes * commInfo->nranks);
+    factor = ((double)(commInfo->nranks - 1)) / ((double)commInfo->nranks);
+    break;
+  case ncclFuncAllGather:
+    trafficSize = (double)(completedColl->msgSizeBytes * commInfo->nranks);
+    factor = ((double)(commInfo->nranks - 1)) / ((double)commInfo->nranks);
+    break;
+  case ncclFuncSendRecv:
+  case ncclFuncSend:
+  case ncclFuncRecv:
+    trafficSize = (double)completedColl->msgSizeBytes;
+    factor = 1;
+    break;
+  default:
+    trafficSize = 0;
+    factor = 0.0;
+  }
+  completedColl->algoBwGbs = timeInSec != 0 ? (trafficSize / 1.0E9 / timeInSec) : 0;
+  completedColl->busBwGbs = completedColl->algoBwGbs * factor;
+}
+
+/*
+ * Description:
+ *
+ *   Helper function to calculate kernel execution time using GPU
+ *   clock values.  The GPU clock values are measured in nanoseconds
+ *   from the globaltimer register.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only operations on kernel info).
+ *
+ * Input:
+ *   struct inspectorKernelChInfo *kernelCh - kernel channel info
+ *   containing GPU clock values.
+ *
+ * Output:
+ *   None.
+ *
+ * Return:
+ *   uint64_t - execution time in microseconds, or 0 if invalid timing
+ *   data.
+ */
+static uint64_t calculateKernelGpuExecTimeUsecs(struct inspectorKernelChInfo *kernelCh) {
+  if (kernelCh->startGpuClk != 0 && kernelCh->stopGpuClk != 0) {
+    if (kernelCh->stopGpuClk > kernelCh->startGpuClk) {
+      uint64_t execTimeNanosecs = kernelCh->stopGpuClk - kernelCh->startGpuClk;
+      return execTimeNanosecs / 1000;
+    }
+  }
+  return 0;
+}
+
+/*
+ * Description:
+ *
+ *   Calculates the maximum kernel execution time across all kernel
+ *   channels in a collective operation, using GPU clock values when
+ *   available and falling back to CPU timestamps when necessary.
+ *
+ * Thread Safety:
+ *   Thread-safe (read-only operations on collective info).
+ *
+ * Input:
+ *   struct inspectorCollInfo *collInfo - collective operation info
+ *   containing kernel channels.
+ *   inspectorTimingSource_t *timingSource - pointer to store the timing source used.
+ *
+ * Output:
+ *   timingSource is set to indicate whether GPU, CPU, or collective timing was used.
+ *
+ * Return:
+ *
+ *   uint64_t - maximum execution time in microseconds across all
+ *              kernels, or collective execution time if no kernel
+ *              timing is available.
+ *
+ */
+static uint64_t calculateMaxKernelExecTimeUsecs(struct inspectorCollInfo *collInfo,
+                                                inspectorTimingSource_t *timingSource) {
+  uint64_t maxKernelExecTimeUsecs = 0;
+  inspectorTimingSource_t bestTimingSource = inspectorTimingSourceCollectiveCpu;
+
+  for (uint32_t i = 0; i < collInfo->nChannels; i++) {
+    struct inspectorKernelChInfo *kernelCh = &collInfo->kernelCh[i];
+    uint64_t gpuExecTimeUsecs = calculateKernelGpuExecTimeUsecs(kernelCh);
+    if (gpuExecTimeUsecs > 0) {
+      if (gpuExecTimeUsecs > maxKernelExecTimeUsecs) {
+        maxKernelExecTimeUsecs = gpuExecTimeUsecs;
+        bestTimingSource = inspectorTimingSourceKernelGpu;
+      }
+    } else {
+      if (kernelCh->tsCompletedUsec > kernelCh->tsStartUsec) {
+        uint64_t cpuExecTimeUsecs = kernelCh->tsCompletedUsec - kernelCh->tsStartUsec;
+        if (cpuExecTimeUsecs > maxKernelExecTimeUsecs) {
+          maxKernelExecTimeUsecs = cpuExecTimeUsecs;
+          bestTimingSource = inspectorTimingSourceKernelCpu;
+        }
+      }
+    }
+  }
+
+  if (maxKernelExecTimeUsecs > 0) {
+    *timingSource = bestTimingSource;
+    return maxKernelExecTimeUsecs;
+  } else {
+    *timingSource = inspectorTimingSourceCollectiveCpu;
+    return collInfo->tsCompletedUsec - collInfo->tsStartUsec;
+  }
+}
+
+/*
+ * Description:
+ *
+ *   Updates the performance information for a completed collective
+ *   operation.
+ *
+ * Thread Safety:
+ *   Not thread-safe (no internal locking; the caller must synchronize).
+ *
+ * Input:
+ *   struct inspectorCompletedCollInfo *completedColl - record to fill in.
+ *   struct inspectorCollInfo *collInfo - collective that has completed.
+ *
+ * Output:
+ *   completedColl is updated with the function, size, and timing from collInfo.
+ *
+ * Return:
+ *   None.
+ *
+ */
+void inspectorUpdateCollPerf(struct inspectorCompletedCollInfo *completedColl,
+                             struct inspectorCollInfo *collInfo) {
+  completedColl->func = ncclStringToFunc(collInfo->func);
+  completedColl->sn = collInfo->sn;
+  completedColl->msgSizeBytes = collInfo->msgSizeBytes;
+  completedColl->execTimeUsecs =
+    calculateMaxKernelExecTimeUsecs(collInfo, &completedColl->timingSource);
+  completedColl->collEvtTrk = collInfo->collEvtTrk;
+}
+
+/*
+ * Description:
+ *
+ *   Finalizes the global inspector state and stops the dump thread if
+ *   running.
+ *
+ * Thread Safety:
+ *   Not thread-safe (should be called during teardown).
+ *
+ * Input:
+ *   None.
+ *
+ * Output:
+ *   Global state is finalized and dump thread is stopped.
+ *
+ * Return:
+ *   inspectorResult_t - success or error code.
+ *
+ */
+inspectorResult_t inspectorGlobalFinalize() {
+  if (dumper) {
+    dumper->stopThread();
+    delete dumper;
+    dumper = nullptr;
+  }
+  return inspectorSuccess;
+}
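The bandwidth arithmetic documented for inspectorComputeCollBw in the hunk above can be tried in isolation. The following is only a minimal sketch under stated assumptions: `Coll`, `busBwFactor`, and `algoBwGbs` are hypothetical names introduced for illustration and are not part of the plugin. The factors follow the nccl-tests PERFORMANCE.md conventions that the source comment cites.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the plugin's ncclFunc_t; illustration only.
enum class Coll { Broadcast, Reduce, AllGather, ReduceScatter, AllReduce };

// Bus-bandwidth correction factor for a collective over nranks ranks.
double busBwFactor(Coll coll, int nranks) {
  assert(nranks > 0);
  const double n = static_cast<double>(nranks);
  switch (coll) {
  case Coll::AllReduce:
    return 2.0 * (n - 1.0) / n;
  case Coll::AllGather:
  case Coll::ReduceScatter:
    return (n - 1.0) / n;
  case Coll::Broadcast:
  case Coll::Reduce:
    return 1.0;
  }
  return 0.0; // not reached for the enumerators above
}

// Algorithmic bandwidth in GB/s (1 GB = 1e9 bytes). The caller passes the
// traffic size in bytes; a zero execution time yields 0 instead of dividing.
double algoBwGbs(double trafficBytes, uint64_t execTimeUsecs) {
  const double seconds = static_cast<double>(execTimeUsecs) / 1.0e6;
  return seconds != 0.0 ? trafficBytes / 1.0e9 / seconds : 0.0;
}
```

For example, a 1e9-byte all-reduce across 8 ranks that completes in 10,000 us has an algorithmic bandwidth of about 100 GB/s, and the bus bandwidth is that value times 2 * 7 / 8 = 1.75, or about 175 GB/s.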
diff --git a/projects/rccl/ext-profiler/inspector/inspector.h b/projects/rccl/ext-profiler/inspector/inspector.h
new file mode 100644
index 0000000000..98e050f97c
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/inspector.h
@@ -0,0 +1,198 @@
+#pragma once
+
+#include <pthread.h>
+
+#include "json.h"
+#include "common.h"
+#include "version.h"
+
+#define MAX_CHANNELS                     64
+
+#define INS_CHK_GOTO(call, res, label)                                  \
+  do {                                                                  \
+    res = call;                                                         \
+    if (inspectorSuccess != res) {                                      \
+      INFO(NCCL_INSPECTOR, "%s:%d -> error %d: %s", __FILE__, __LINE__, res, \
+           inspectorErrorString(res));                                  \
+      goto label;                                                       \
+    }                                                                   \
+  } while (0);
+
+
+typedef enum {
+  ncclFuncBroadcast = 0,
+  ncclFuncReduce = 1,
+  ncclFuncAllGather = 2,
+  ncclFuncReduceScatter = 3,
+  ncclFuncAllReduce = 4,
+  ncclFuncSendRecv = 5,
+  ncclFuncSend = 6,
+  ncclFuncRecv = 7,
+  ncclNumFuncs = 8
+} ncclFunc_t;
+
+typedef enum {
+  inspectorSuccess = 0,
+  inspectorUninitializedError,
+  inspectorMemoryError,
+  inspectorFileOpenError,
+  inspectorDisabledError,
+  inspectorLockError,
+  inspectorPthreadError,
+  inspectorJsonError,
+  inspectorCudaError,
+  inspectorBadHash,
+  inspectorDeleteUnknownCommError,
+  inspectorAddDuplicateCommError,
+  inspectorNop,
+  inspectorNullTally,
+  inspectorGlobalInitError,
+  inspectorReturn,
+} inspectorResult_t;
+
+typedef enum {
+  inspectorTimingSourceKernelGpu = 0,
+  inspectorTimingSourceKernelCpu = 1,
+  inspectorTimingSourceCollectiveCpu = 2,
+} inspectorTimingSource_t;
+
+struct inspectorEventTraceInfo {
+  uint64_t ts;
+  uint64_t sn;
+};
+
+typedef enum {
+  NCCL_INSP_EVT_TRK_COLL_START = 0,
+  NCCL_INSP_EVT_TRK_COLL_STOP = 1,
+  NCCL_INSP_EVT_TRK_COLL_NEVT = 2,
+} inspectorEventTrkColl_t;
+
+typedef enum {
+  NCCL_INSP_EVT_TRK_KERNEL_START = 0,
+  NCCL_INSP_EVT_TRK_KERNEL_STOP = 1,
+  NCCL_INSP_EVT_TRK_KERNEL_RECORD = 2,
+  NCCL_INSP_EVT_TRK_KERNEL_NEVT = 3,
+} inspectorEventTrkKernel_t;
+
+struct inspectorEventTrkKernelInfo {
+  struct inspectorEventTraceInfo evntTrace[NCCL_INSP_EVT_TRK_KERNEL_NEVT];
+};
+
+struct inspectorEventTrkCollInfo {
+  int sn;
+  uint32_t nChannels;
+  struct inspectorEventTraceInfo evntTrace[NCCL_INSP_EVT_TRK_COLL_NEVT];
+  struct inspectorEventTrkKernelInfo kernelCh[MAX_CHANNELS];
+};
+
+struct inspectorCompletedCollInfo {
+  ncclFunc_t func;
+  uint64_t sn;
+  size_t msgSizeBytes;
+  uint64_t execTimeUsecs;
+  inspectorTimingSource_t timingSource;
+  double algoBwGbs;
+  double busBwGbs;
+  // Event trace information
+  struct inspectorEventTrkCollInfo collEvtTrk;
+};
+
+enum {
+  NCCL_COMM_HASH_LENGTH = 19 // "0x" + up to 16 hex digits + terminating NUL
+};
+
+struct inspectorCommInfo {
+  struct inspectorCommInfo* next;
+
+  const char* commName;
+  uint64_t commHash;
+  char commHashStr[NCCL_COMM_HASH_LENGTH];
+  int rank;
+  int nranks;
+  int nnodes;
+
+  bool dump;
+  struct inspectorCompletedCollInfo completedCollInfo;
+  pthread_rwlock_t guard;
+};
+
+struct inspectorKernelChInfo {
+  uint64_t type;
+  int refCount; /*unused*/
+  struct inspectorCollInfo *collInfo;
+  uint8_t channelId;
+  uint64_t tsStartUsec;
+  uint64_t tsCompletedUsec;
+  uint64_t startGpuClk;
+  uint64_t stopGpuClk;
+};
+
+struct inspectorCollInfo {
+  uint64_t type;
+  int refCount;
+  struct inspectorCommInfo *commInfo;
+  const char* func;
+  uint64_t sn;
+  size_t msgSizeBytes;
+  uint64_t tsStartUsec;
+  uint64_t tsCompletedUsec;
+  uint32_t nChannels;
+  uint32_t nKernelChStarted;
+  uint32_t nKernelChCompleted;
+  pthread_rwlock_t guard;
+  struct inspectorKernelChInfo kernelCh[MAX_CHANNELS];
+  struct inspectorEventTrkCollInfo collEvtTrk;
+};
+
+
+
+extern ncclDebugLogger_t logFn;
+#define VERSION(...) logFn(NCCL_LOG_VERSION, NCCL_ALL, __FILE__, __LINE__, __VA_ARGS__)
+#define INFO(FLAGS, ...) logFn(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
+#define WARN(...) logFn(NCCL_LOG_WARN, NCCL_ALL, __FILE__, __LINE__, __VA_ARGS__)
+
+inline int ncclTypeSize(ncclDataType_t type) {
+  switch (type) {
+  case ncclInt8:
+  case ncclUint8:
+  case ncclFloat8e4m3:
+  case ncclFloat8e5m2:
+    return 1;
+  case ncclFloat16:
+  case ncclBfloat16:
+    return 2;
+  case ncclInt32:
+  case ncclUint32:
+  case ncclFloat32:
+    return 4;
+  case ncclInt64:
+  case ncclUint64:
+  case ncclFloat64:
+    return 8;
+  default:
+    return -1;
+  }
+}
+
+const char* inspectorErrorString(inspectorResult_t result);
+
+inspectorResult_t inspectorLockInit(pthread_rwlock_t* lockRef);
+inspectorResult_t inspectorLockDestroy(pthread_rwlock_t* lockRef);
+inspectorResult_t inspectorLockRd(pthread_rwlock_t* lockRef);
+inspectorResult_t inspectorLockWr(pthread_rwlock_t* lockRef);
+inspectorResult_t inspectorUnlockRWLock(pthread_rwlock_t* lockRef);
+inspectorResult_t inspectorGlobalInit(int rank);
+inspectorResult_t inspectorGlobalFinalize();
+uint64_t inspectorGetTime();
+inspectorResult_t inspectorAddComm(struct inspectorCommInfo **commInfo,
+                                   const char* commName, uint64_t commHash,
+                                   int nNodes, int nranks, int rank);
+inspectorResult_t inspectorDelComm(struct inspectorCommInfo *commInfo);
+
+void inspectorUpdateCollPerf(struct inspectorCompletedCollInfo *completedColl,
+                             struct inspectorCollInfo *collInfo);
+ncclDataType_t inspectorStringToDatatype(const char* str);
+
+void inspectorComputeCollBw(struct inspectorCommInfo *commInfo,
+                            struct inspectorCompletedCollInfo *completedColl,
+                            ncclFunc_t collType);
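The commHashStr field declared in the header above holds a communicator hash rendered as `0x` followed by hex digits. As a minimal sketch of the buffer-sizing arithmetic, assuming the hash can use all 64 bits: `kHashStrBytes` and `hashToString` are hypothetical names for illustration, not plugin APIs.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Widest rendering of a 64-bit value: "0x" + 16 hex digits + terminating NUL.
constexpr std::size_t kHashStrBytes = 2 + 16 + 1;

// Hypothetical helper: formats a 64-bit hash as "0x" plus lowercase hex.
std::string hashToString(uint64_t hash) {
  char buf[kHashStrBytes];
  // PRIx64 is the portable conversion for uint64_t.
  const int n = std::snprintf(buf, sizeof(buf), "0x%" PRIx64, hash);
  // snprintf returns the length it wanted to write, so a result that is
  // >= sizeof(buf) would mean truncation; the buffer is sized to rule it out.
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof(buf));
  return std::string(buf);
}
```

Comparing the snprintf return value against the buffer size is a cheap way to detect a buffer that is too small for the widest hash.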
diff --git a/projects/rccl/ext-profiler/inspector/inspector_plugin.cc b/projects/rccl/ext-profiler/inspector/inspector_plugin.cc
new file mode 100644
index 0000000000..b1872157d5
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/inspector_plugin.cc
@@ -0,0 +1,493 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include <stdio.h>
+#include <pthread.h>
+#include <string.h>
+#include <linux/limits.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include "profiler.h"
+#include "inspector.h"
+
+#define __hidden __attribute__ ((visibility("hidden")))
+
+static int gInitialized;
+
+static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;
+
+
+/*
+ * Description:
+ *   Records an event trace with timestamp and sequence number
+ *
+ * Thread Safety:
+ *   Not thread-safe - must be called with proper locking. This function
+ *   is designed to be called from within locked sections where the
+ *   collective info structure is already protected.
+ *
+ * Input:
+ *   struct inspectorEventTraceInfo* evtTrace - event trace array
+ *   int eventIndex - index in the event trace array (must be valid)
+ *   struct inspectorCollInfo* collInfo - collective info structure (must not be NULL)
+ *
+ * Output:
+ *   Event trace is updated with current timestamp and next sequence
+ *   number from collective
+ *
+ * Return:
+ *   uint64_t - the sequence number assigned to this event
+ *
+ * Preconditions:
+ *   - collInfo must not be NULL
+ *   - eventIndex must be within valid bounds for evtTrace array
+ *   - Function must be called from within a locked section
+ */
+static uint64_t inspectorRecordEventTrace(struct inspectorEventTraceInfo* evtTrace,
+                                          int eventIndex,
+                                          struct inspectorCollInfo* collInfo) {
+  evtTrace[eventIndex].ts = inspectorGetTime();
+  evtTrace[eventIndex].sn = ++collInfo->collEvtTrk.sn; // Increment coll sequence counter
+
+  return evtTrace[eventIndex].sn;
+}
+
+/*
+ * Description:
+ *
+ *   Initializes the NCCL Inspector plugin and global state for a
+ *   communicator.
+ *
+ * Thread Safety:
+ *   Thread-safe (uses mutex for initialization).
+ *
+ * Input:
+ *   void** context - pointer to plugin context.
+ *   uint64_t commHash - communicator hash.
+ *   int* eActivationMask - pointer to activation mask output.
+ *   const char* commName - communicator name.
+ *   int nNodes - number of nodes.
+ *   int nranks - number of ranks.
+ *   int rank - rank.
+ *   ncclDebugLogger_t logfn - logger function pointer.
+ *
+ * Output:
+ *   context is set to plugin context; eActivationMask is set.
+ *
+ * Return:
+ *   ncclResult_t - success or error code.
+ *
+ */
+__hidden ncclResult_t inspectorPluginInit(void** context, uint64_t commHash,
+                                          int* eActivationMask,
+                                          const char* commName,
+                                          int nNodes, int nranks, int rank,
+                                          ncclDebugLogger_t logfn) {
+  inspectorResult_t res = inspectorSuccess;
+  *context = nullptr;
+  logFn = logfn;
+
+  pthread_mutex_lock(&gLock);
+  if (++gInitialized == 1) {
+    res = inspectorGlobalInit(rank);
+    if (res != inspectorSuccess) {
+      WARN("Inspector Init Failed %s:%d -> error %d: %s",__FILE__, __LINE__, res,
+           inspectorErrorString(res));
+      gInitialized = 0;
+      pthread_mutex_unlock(&gLock);
+      return ncclInternalError;
+    }
+  }
+  pthread_mutex_unlock(&gLock);
+
+  INS_CHK_GOTO(inspectorAddComm((struct inspectorCommInfo **)context,
+                                commName, commHash,
+                                nNodes, nranks, rank), res, success);
+  *eActivationMask = ncclProfileColl | ncclProfileKernelCh;
+  INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d",
+       commName ? commName : "", commHash, nranks, rank);
+success:
+  if (res != inspectorSuccess) {
+    return ncclInternalError;
+  } else {
+    return ncclSuccess;
+  }
+}
+
+/*
+ * Description:
+ *
+ *   Finalizes the NCCL Inspector plugin and global state for a
+ *   communicator.
+ *
+ * Thread Safety:
+ *   Thread-safe (uses mutex for finalization).
+ *
+ * Input:
+ *   void* context - plugin context.
+ *
+ * Output:
+ *   Plugin context is finalized and cleaned up.
+ *
+ * Return:
+ *   ncclResult_t - success or error code.
+ *
+ */
+__hidden ncclResult_t inspectorPluginFinalize(void* context) {
+  inspectorDelComm((struct inspectorCommInfo *)context);
+  pthread_mutex_lock(&gLock);
+  if (--gInitialized == 0) {
+    inspectorGlobalFinalize();
+  }
+  pthread_mutex_unlock(&gLock);
+  return ncclSuccess;
+}
+
+inspectorResult_t inspectorPluginCollInfoRef(struct inspectorCollInfo *collInfo) {
+  collInfo->refCount += 1;
+  return inspectorSuccess;
+}
+
+inspectorResult_t inspectorPluginCollInfoRefSafe(struct inspectorCollInfo *collInfo) {
+  inspectorLockWr(&collInfo->guard);
+  inspectorPluginCollInfoRef(collInfo);
+  inspectorUnlockRWLock(&collInfo->guard);
+  return inspectorSuccess;
+}
+
+inspectorResult_t inspectorPluginCollInfoDeRef(struct inspectorCollInfo *collInfo) {
+  collInfo->refCount -= 1;
+  if (collInfo->refCount == 0) {
+    inspectorLockDestroy(&collInfo->guard);
+    memset(collInfo, 0, sizeof(struct inspectorCollInfo));
+    free(collInfo);
+    return inspectorReturn;
+  }
+  return inspectorSuccess;
+}
+
+inspectorResult_t inspectorPluginCollInfoDeRefSafe(struct inspectorCollInfo *collInfo) {
+  inspectorLockWr(&collInfo->guard);
+  inspectorResult_t res = inspectorPluginCollInfoDeRef(collInfo);
+  if (res != inspectorReturn) inspectorUnlockRWLock(&collInfo->guard); // collInfo is already freed on inspectorReturn
+  return res;
+}
+
+/*
+ * Description:
+ *   Initializes a new inspectorCollInfo structure for a collective
+ *   event.
+ *
+ * Thread Safety:
+ *   Not thread-safe (allocates and initializes a new collective info
+ *   structure).
+ *
+ * Input:
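+ *   struct inspectorCollInfo **collInfo - pointer to output
+ *   collective info struct (set to nullptr if allocation fails).
+ *   ncclProfilerEventDescr_t *eDescr - event descriptor.
+ *   struct inspectorCommInfo *commInfo - owning communicator info.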
+ *
+ * Output:
+ *   collInfo is set to the new collective info struct.
+ *
+ * Return:
+ *   None.
+ */
+static void inspectorPluginCollInfoInit(struct inspectorCollInfo **collInfo,
+                                        ncclProfilerEventDescr_t *eDescr,
+                                        struct inspectorCommInfo *commInfo) {
+  struct inspectorCollInfo *collInfoPtr
+    = (struct inspectorCollInfo*)calloc(1, sizeof(struct inspectorCollInfo));
+  if (collInfoPtr == nullptr) {
+    WARN("Inspector: Failed to allocate memory for collective info structure");
+    *collInfo = nullptr;
+    return;
+  }
+  collInfoPtr->type = ncclProfileColl;
+  collInfoPtr->refCount = 0;
+  inspectorPluginCollInfoRef(collInfoPtr); //self ref; no locks needed
+  collInfoPtr->func = eDescr->coll.func;
+  collInfoPtr->sn = eDescr->coll.seqNumber;
+  collInfoPtr->nChannels = eDescr->coll.nChannels;
+  if (collInfoPtr->nChannels > 0) {
+    inspectorPluginCollInfoRef(collInfoPtr); //extra ref for kernel completion
+  }
+  collInfoPtr->tsStartUsec = inspectorGetTime();
+  collInfoPtr->msgSizeBytes =
+    ncclTypeSize(inspectorStringToDatatype(eDescr->coll.datatype)) * eDescr->coll.count;
+
+
+  collInfoPtr->commInfo = commInfo;
+  collInfoPtr->collEvtTrk.sn = 0;
+  collInfoPtr->collEvtTrk.nChannels = collInfoPtr->nChannels;
+  inspectorRecordEventTrace(collInfoPtr->collEvtTrk.evntTrace,
+                            NCCL_INSP_EVT_TRK_COLL_START, collInfoPtr);
+
+  inspectorLockInit(&collInfoPtr->guard);
+  *collInfo = collInfoPtr;
+}
+
+/*
+ * Description:
+ *
+ *   Initializes a new inspectorKernelChInfo structure for a kernel
+ *   channel event.
+ *
+ * Thread Safety:
+ *   Not thread-safe (initializes kernel channel info within a
+ *   collective info structure).
+ *
+ * Input:
+ *   struct inspectorKernelChInfo **kernelChInfo - pointer to output
+ *   kernel channel info struct.
+ *   ncclProfilerEventDescr_t *eDescr - event descriptor.
+ *
+ * Output:
+ *
+ *   kernelChInfo is set to the new kernel channel info struct.
+ *
+ * Return:
+ *   None.
+ */
+static void inspectorPluginKernelChInfoInit(struct inspectorKernelChInfo **kernelChInfo,
+                                            ncclProfilerEventDescr_t *eDescr) {
+  if (eDescr->parentObj) {
+    uint64_t parentType=*(uint64_t*)eDescr->parentObj;
+    if (parentType == ncclProfileColl) {
+      struct inspectorCollInfo *collInfo = (struct inspectorCollInfo*)eDescr->parentObj;
+      if (collInfo && collInfo->type == ncclProfileColl) {
+        inspectorLockWr(&collInfo->guard);
+        struct inspectorEventTraceInfo *krnlEvtTrk =
+          collInfo->collEvtTrk.kernelCh[eDescr->kernelCh.channelId].evntTrace;
+        inspectorRecordEventTrace(krnlEvtTrk,
+                                  NCCL_INSP_EVT_TRK_KERNEL_START,
+                                  collInfo);
+        struct inspectorKernelChInfo *kernelChInfoPtr
+          = &collInfo->kernelCh[eDescr->kernelCh.channelId];
+        kernelChInfoPtr->type = ncclProfileKernelCh;
+        kernelChInfoPtr->channelId = eDescr->kernelCh.channelId;
+        kernelChInfoPtr->startGpuClk = eDescr->kernelCh.pTimer;
+        if (kernelChInfoPtr->stopGpuClk == 0) {
+          inspectorPluginCollInfoRef(collInfo); //Pairs with Record Kernel Stop event
+        }
+        kernelChInfoPtr->tsStartUsec = inspectorGetTime();
+        if (collInfo->nKernelChStarted == 0) {
+          collInfo->tsStartUsec = kernelChInfoPtr->tsStartUsec;
+        }
+        collInfo->nKernelChStarted += 1;
+        inspectorPluginCollInfoRef(collInfo); //Pairs with Stop Kernel Event
+        kernelChInfoPtr->collInfo = collInfo;
+
+        *kernelChInfo = kernelChInfoPtr;
+        inspectorUnlockRWLock(&collInfo->guard);
+      }
+    }
+  }
+}
+/*
+ * Description:
+ *
+ *   Starts a profiling event for the NCCL Inspector plugin.
+ *
+ * Thread Safety:
+ *   Thread-safe (allocates and initializes event structures).
+ *
+ * Input:
+ *   void* context - plugin context.
+ *   void** eHandle - pointer to event handle output.
+ *   ncclProfilerEventDescr_t* eDescr - event descriptor.
+ *
+ * Output:
+ *   eHandle is set to the new event structure.
+ *
+ * Return:
+ *   ncclResult_t - success or error code.
+ *
+ */
+__hidden ncclResult_t inspectorPluginStartEvent(void* context,
+                                                void** eHandle,
+                                                ncclProfilerEventDescr_t* eDescr) {
+  if (context == nullptr || eDescr == nullptr) {
+    INFO(NCCL_INIT, "Profiler/Plugin: context/eDescr NULL for start event %s", __func__);
+    return ncclSuccess;
+  }
+  *eHandle = nullptr;
+  if (eDescr->type == ncclProfileColl) {
+    struct inspectorCollInfo *collEvent = nullptr;
+    struct inspectorCommInfo *commInfoCtx = (struct inspectorCommInfo*)context;
+    inspectorPluginCollInfoInit(&collEvent, eDescr, commInfoCtx);
+    *eHandle = collEvent;
+  } else if (eDescr->type == ncclProfileKernelCh) {
+    struct inspectorKernelChInfo *kernelChEvent = nullptr;
+    inspectorPluginKernelChInfoInit(&kernelChEvent, eDescr);
+    *eHandle = kernelChEvent;
+  } else {
+    return ncclSuccess;
+  }
+  return ncclSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Stops a profiling event for the NCCL Inspector plugin.
+ *
+ * Thread Safety:
+ *
+ *   Thread-safe (updates event state and performance info).
+ *
+ * Input:
+ *
+ *   void *eHandle - event handle.
+ *
+ * Output:
+ *
+ *   Event is stopped and performance info may be updated.
+ *
+ * Return:
+ *   ncclResult_t - success or error code.
+ *
+ */
+__hidden ncclResult_t inspectorPluginStopEvent(void *eHandle) {
+
+  if (eHandle == nullptr) {
+    INFO(NCCL_INIT,
+         "Profiler/Plugin: Event Handle NULL for stop event %s", __func__);
+    return ncclSuccess;
+  }
+  uint64_t type = *(uint64_t *)eHandle;
+  inspectorResult_t res = inspectorSuccess;
+
+  if (type == ncclProfileColl) {
+    struct inspectorCollInfo *collInfo = (struct inspectorCollInfo *)eHandle;
+    // Record collective stop event
+    inspectorLockWr(&collInfo->guard);
+    inspectorRecordEventTrace(collInfo->collEvtTrk.evntTrace,
+                              NCCL_INSP_EVT_TRK_COLL_STOP,
+                              collInfo);
+    res = inspectorPluginCollInfoDeRef(collInfo);
+    if (res == inspectorReturn) {
+      // WARN("NCCL Inspector unnatural return: inspectorPluginStopEvent:ncclProfileColl");
+      return ncclSuccess;
+    }
+    inspectorUnlockRWLock(&collInfo->guard);
+    return ncclSuccess;
+  } else if (type == ncclProfileKernelCh) {
+    struct inspectorKernelChInfo *kernelChInfo
+      = (struct inspectorKernelChInfo *)eHandle;
+    struct inspectorCollInfo *collInfo = kernelChInfo->collInfo;
+    if (collInfo && collInfo->type == ncclProfileColl) {
+      inspectorLockWr(&collInfo->guard);
+      struct inspectorEventTraceInfo *krnlEvtTrk =
+        collInfo->collEvtTrk.kernelCh[kernelChInfo->channelId].evntTrace;
+      inspectorRecordEventTrace(krnlEvtTrk,
+                                NCCL_INSP_EVT_TRK_KERNEL_STOP,
+                                collInfo);
+      kernelChInfo->tsCompletedUsec = inspectorGetTime();
+      collInfo->nKernelChCompleted += 1;
+
+      res = inspectorPluginCollInfoDeRef(collInfo);
+      if (res == inspectorReturn) {
+        WARN("NCCL Inspector unnatural return: inspectorPluginStopEvent:ncclProfileKernelCh");
+        return ncclSuccess;
+      }
+      if ((collInfo->nKernelChCompleted == collInfo->nKernelChStarted)
+          && (collInfo->nKernelChCompleted == collInfo->nChannels)) {
+        struct inspectorCompletedCollInfo completedColl;
+        struct inspectorCommInfo *commInfo = collInfo->commInfo;
+        collInfo->tsCompletedUsec = kernelChInfo->tsCompletedUsec;
+        inspectorUpdateCollPerf(&completedColl, collInfo);
+
+        res = inspectorPluginCollInfoDeRef(collInfo);
+        if (res != inspectorReturn) {
+          inspectorUnlockRWLock(&collInfo->guard);
+        }
+        if (commInfo != nullptr) {
+          inspectorLockWr(&commInfo->guard);
+          inspectorComputeCollBw(commInfo,
+                                 &completedColl,
+                                 completedColl.func);
+          memcpy(&commInfo->completedCollInfo,
+                 &completedColl,
+                 sizeof(struct inspectorCompletedCollInfo));
+          commInfo->dump = true;
+          inspectorUnlockRWLock(&commInfo->guard);
+        }
+        return ncclSuccess;
+      }
+      inspectorUnlockRWLock(&collInfo->guard);
+    }
+    return ncclSuccess;
+  }
+  return ncclSuccess;
+}
+
+/*
+ * Description:
+ *
+ *   Records the state of a profiling event for the NCCL Inspector
+ *   plugin.
+ *
+ * Thread Safety:
+ *
+ *   Thread-safe (updates event state as needed).
+ *
+ * Input:
+ *   void* eHandle - event handle.
+ *   ncclProfilerEventState_t eState - event state.
+ *   ncclProfilerEventStateArgs_t* eStateArgs - event state arguments.
+ *
+ * Output:
+ *   Event state is updated as needed.
+ *
+ * Return:
+ *   ncclResult_t - success or error code.
+ *
+ */
+__hidden ncclResult_t inspectorPluginRecordEventState(void* eHandle,
+                                                      ncclProfilerEventState_t eState,
+                                                      ncclProfilerEventStateArgs_t* eStateArgs) {
+  if (eHandle == nullptr || eStateArgs == nullptr)
+    return ncclSuccess;
+
+  uint64_t type = *(uint64_t *)eHandle;
+
+  if (type == ncclProfileKernelCh && eState == ncclProfilerKernelChStop) {
+    struct inspectorKernelChInfo *kernelChInfo = (struct inspectorKernelChInfo *)eHandle;
+    struct inspectorCollInfo *collInfo = kernelChInfo->collInfo;
+    inspectorResult_t res = inspectorSuccess;
+    if (collInfo && collInfo->type == ncclProfileColl) {
+      inspectorLockWr(&collInfo->guard);
+      struct inspectorEventTraceInfo *krnlEvtTrk
+        = collInfo->collEvtTrk.kernelCh[kernelChInfo->channelId].evntTrace;
+      inspectorRecordEventTrace(krnlEvtTrk,
+                                NCCL_INSP_EVT_TRK_KERNEL_RECORD,
+                                collInfo);
+      kernelChInfo->stopGpuClk = eStateArgs->kernelCh.pTimer;
+      if (kernelChInfo->startGpuClk != 0) {
+        res = inspectorPluginCollInfoDeRef(collInfo);
+        if (res == inspectorReturn) {
+          WARN("NCCL Inspector unnatural return: inspectorPluginRecordEventState");
+          return ncclSuccess;
+        }
+      }
+      inspectorUnlockRWLock(&collInfo->guard);
+    }
+  }
+  return ncclSuccess;
+}
+
+ncclProfiler_t ncclProfiler_v5 = {
+  "Inspector",
+  inspectorPluginInit,
+  inspectorPluginStartEvent,
+  inspectorPluginStopEvent,
+  inspectorPluginRecordEventState,
+  inspectorPluginFinalize,
+};
diff --git a/projects/rccl/ext-profiler/inspector/json.cc b/projects/rccl/ext-profiler/inspector/json.cc
new file mode 100644
index 0000000000..e95d98d180
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/json.cc
@@ -0,0 +1,496 @@
+#include "json.h"
+#include <assert.h>
+#include <math.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+const char* jsonErrorString(jsonResult_t res) {
+  switch (res) {
+  case jsonSuccess:
+    return "jsonSuccess";
+  case jsonFileError:
+    return "jsonFileError";
+  case jsonUnknownStateError:
+    return "jsonUnknownStateError";
+  case jsonEmptyStateError:
+    return "jsonEmptyStateError";
+  case jsonExpectedNonNoneStateError:
+    return "jsonExpectedNonNoneStateError";
+  case jsonMemoryError:
+    return "jsonMemoryError";
+  case jsonStringOverflowError:
+    return "jsonStringOverflowError";
+  case jsonStringBadChar:
+    return "jsonStringBadChar";
+  case jsonLockError:
+    return "jsonLockError";
+  default:
+    return "unknown json error";
+  }
+}
+
+// This struct maintains a stack of states tracking where we are writing.
+typedef struct jsonFileOutput {
+  jsonState_t* states;
+  size_t state_cap; // Allocated stack capacity
+  size_t state_n;   // # of items in the stack.
+  FILE* fp;
+  pthread_mutex_t mutex;
+} jsonFileOutput;
+
+jsonResult_t jsonInitFileOutput(jsonFileOutput** jfo, const char* outfile) {
+  jsonFileOutput* new_jfo = (jsonFileOutput*)malloc(sizeof(jsonFileOutput));
+  if (new_jfo == NULL) {
+    return jsonMemoryError;
+  }
+  if (pthread_mutex_init(&new_jfo->mutex, NULL) != 0) {
+    free(new_jfo);
+    *jfo = 0;
+    return jsonLockError;
+  }
+  new_jfo->states = NULL;
+  new_jfo->state_cap = 0;
+  new_jfo->state_n = 0;
+  new_jfo->fp = fopen(outfile, "w");
+  if (new_jfo->fp == NULL) {
+    free(new_jfo);
+    *jfo = 0;
+    return jsonFileError;
+  }
+  *jfo = new_jfo;
+  return jsonSuccess;
+}
+
+jsonResult_t jsonNewline(jsonFileOutput* jfo) {
+  fprintf(jfo->fp, "\n");
+  return jsonSuccess;
+}
+
+jsonResult_t jsonFlushOutput(jsonFileOutput* jfo) {
+  fflush(jfo->fp);
+  return jsonSuccess;
+}
+
+jsonResult_t jsonLockOutput(jsonFileOutput* jfo) {
+  if (pthread_mutex_lock(&jfo->mutex) != 0) {
+    return jsonLockError;
+  }
+  return jsonSuccess;
+}
+
+jsonResult_t jsonUnlockOutput(jsonFileOutput* jfo) {
+  if (pthread_mutex_unlock(&jfo->mutex) != 0) {
+    return jsonLockError;
+  }
+  return jsonSuccess;
+}
+
+jsonResult_t jsonFinalizeFileOutput(jsonFileOutput* jfo) {
+  // Really should probably complain if we aren't in a valid state
+
+  if (pthread_mutex_destroy(&jfo->mutex) != 0) {
+    free(jfo);
+    return jsonLockError;
+  }
+  if (jfo->states != NULL) {
+    free(jfo->states);
+  }
+  jfo->states = NULL;
+  jfo->state_cap = 0;
+  jfo->state_n = 0;
+  if (jfo->fp) {
+    fclose(jfo->fp);
+    jfo->fp = 0;
+  }
+
+  free(jfo);
+  return jsonSuccess;
+}
+
+static int utf8copy(unsigned char* out, int out_lim, const unsigned char* in) {
+  int copy_len;
+  if ((in[0] & 0xE0) == 0xC0) {
+    // 2-byte sequence
+    if ((in[1] & 0xC0) != 0x80 || out_lim < 2) {
+      return 0;
+    }
+    copy_len = 2;
+  } else if ((in[0] & 0xF0) == 0xE0) {
+    // 3-byte sequence
+    if ((in[1] & 0xC0) != 0x80 || (in[2] & 0xC0) != 0x80 || out_lim < 3) {
+      return 0;
+    }
+    copy_len = 3;
+  } else if ((in[0] & 0xF8) == 0xF0) {
+    // 4-byte sequence
+    if ((in[1] & 0xC0) != 0x80 || (in[2] & 0xC0) != 0x80 || (in[3] & 0xC0) != 0x80 || out_lim < 4) {
+      return 0;
+    }
+    copy_len = 4;
+  } else {
+    // Invalid start byte
+    return 0;
+  }
+
+  for (int i = 0; i < copy_len; ++i) {
+    out[i] = in[i];
+  }
+
+  return copy_len;
+}
+
+// This tries to sanitize/quote a string from 'in' into 'out',
+// assuming 'out' has length 'lim'.  We quote ",/,\,\t,\n, pass multi-byte
+// UTF-8 sequences through, and bail on control chars or malformed UTF-8.
+// 'in' should be null-terminated, of course.
+//
+// We return an error code if we were not able to copy all of 'in', either
+// for length reasons or for unhandled characters.
+static jsonResult_t sanitizeJson(unsigned char out[], int lim, const unsigned char* in) {
+  int c = 0;
+  while (*in) {
+    if (c + 1 >= lim) {
+      out[c] = 0;
+      return jsonStringOverflowError;
+    }
+    switch (*in) {
+    case '"':
+    case '\\':
+    case '/':
+    case '\t':
+    case '\n':
+      if (c + 2 > lim) {
+        out[c] = 0;
+        return jsonStringOverflowError;
+      }
+
+      out[c++] = '\\';
+      if (*in == '\n') {
+        out[c++] = 'n';
+      } else if (*in == '\t') {
+        out[c++] = 't';
+      } else {
+        out[c++] = *in;
+      }
+      ++in;
+      break;
+    default:
+      if (*in <= 0x1F) {
+        out[c] = 0;
+        return jsonStringBadChar;
+      } else if (*in <= 0x7F) {
+        out[c++] = *in;
+        ++in;
+      } else {
+        const int utf8len = utf8copy(out + c, lim - c - 1, in);
+        if (utf8len == 0) {
+          out[c] = 0;
+          return jsonStringBadChar;
+        }
+        c += utf8len;
+        in += utf8len;
+      }
+      break;
+    }
+  }
+  out[c] = 0;
+  return jsonSuccess;
+}
+
+static size_t max(size_t a, size_t b) {
+  if (a < b) {
+    return b;
+  }
+  return a;
+}
+
+// Push state onto the state stack. Reallocate for extra storage if needed.
+// Because JSON_NONE is a pseudo-state, don't allow it to be pushed.
+static jsonResult_t jsonPushState(jsonFileOutput* jfo, jsonState_t state) {
+  if (state == JSON_NONE) {
+    return jsonExpectedNonNoneStateError;
+  }
+  if (jfo->state_cap <= (jfo->state_n + 1)) {
+    jfo->state_cap = max((size_t)16, jfo->state_cap * 2);
+    jfo->states = (jsonState_t*)realloc(jfo->states, sizeof(jsonState_t) * jfo->state_cap);
+    if (jfo->states == 0) {
+      return jsonMemoryError;
+    }
+  }
+  jfo->states[jfo->state_n++] = state;
+  return jsonSuccess;
+}
+
+// Return the current state at the top of the stack
+static jsonState_t jsonCurrState(const jsonFileOutput* jfo) {
+  if (jfo->state_n == 0) {
+    return JSON_NONE;
+  }
+  return jfo->states[jfo->state_n - 1];
+}
+
+// Replace the stack with state (equivalent to a pop & push if stack is not empty)
+static jsonResult_t jsonReplaceState(jsonFileOutput* jfo, jsonState_t state) {
+  if (state == JSON_NONE) {
+    return jsonExpectedNonNoneStateError;
+  }
+  if (jfo->state_n == 0) {
+    return jsonEmptyStateError;
+  }
+  jfo->states[jfo->state_n - 1] = state;
+  return jsonSuccess;
+}
+
+// Pop the top state off the stack, or return that the state is empty
+static jsonState_t jsonPopState(jsonFileOutput* jfo) {
+  if (jfo->state_n == 0) {
+    return JSON_NONE;
+  }
+  return jfo->states[--jfo->state_n];
+}
+
+// Emit a key and separator. Sanitize the key.
+// This is only acceptable if the top state is an object.
+// Emit a ',' separator if we aren't the first item.
+jsonResult_t jsonKey(jsonFileOutput* jfo, const char* name) {
+  switch (jsonCurrState(jfo)) {
+  case JSON_OBJECT_EMPTY:
+    jsonReplaceState(jfo, JSON_OBJECT_SOME);
+    break;
+  case JSON_OBJECT_SOME:
+    fprintf(jfo->fp, ",");
+    break;
+  default:
+    return jsonUnknownStateError;
+  }
+  unsigned char tmp[2048];
+  const jsonResult_t res = sanitizeJson(tmp, sizeof(tmp), (const unsigned char*)name);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "\"%s\":", tmp);
+  jsonPushState(jfo, JSON_KEY);
+  return jsonSuccess;
+}
+
+// Helper function for inserting values.
+// Only acceptable after keys, top-level, or in lists.
+// Emit preceding ',' if in a list and not first item.
+static jsonResult_t jsonValHelper(jsonFileOutput* jfo) {
+  switch (jsonCurrState(jfo)) {
+  case JSON_LIST_EMPTY:
+    jsonReplaceState(jfo, JSON_LIST_SOME);
+    break;
+  case JSON_LIST_SOME:
+    fprintf(jfo->fp, ",");
+    break;
+  case JSON_KEY:
+    jsonPopState(jfo);
+    break;
+  case JSON_NONE:
+    break;
+  default:
+    return jsonUnknownStateError;
+  }
+  return jsonSuccess;
+}
+
+// Start an object
+jsonResult_t jsonStartObject(jsonFileOutput* jfo) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "{");
+  return jsonPushState(jfo, JSON_OBJECT_EMPTY);
+}
+
+// Close an object
+jsonResult_t jsonFinishObject(jsonFileOutput* jfo) {
+  switch (jsonPopState(jfo)) {
+  case JSON_OBJECT_EMPTY:
+  case JSON_OBJECT_SOME:
+    break;
+  default:
+    return jsonUnknownStateError;
+  }
+  fprintf(jfo->fp, "}");
+  return jsonSuccess;
+}
+
+// Start a list
+jsonResult_t jsonStartList(jsonFileOutput* jfo) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "[");
+  return jsonPushState(jfo, JSON_LIST_EMPTY);
+}
+
+// Close a list
+jsonResult_t jsonFinishList(jsonFileOutput* jfo) {
+  switch (jsonPopState(jfo)) {
+  case JSON_LIST_EMPTY:
+  case JSON_LIST_SOME:
+    break;
+  default:
+    return jsonUnknownStateError;
+  }
+  fprintf(jfo->fp, "]");
+  return jsonSuccess;
+}
+
+// Write a null value
+jsonResult_t jsonNull(jsonFileOutput* jfo) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "null");
+  return jsonSuccess;
+}
+
+// Write a (sanitized) string
+jsonResult_t jsonStr(jsonFileOutput* jfo, const char* str) {
+  if (str == NULL) {
+    // A NULL string is written as a JSON null; propagate its result.
+    return jsonNull(jfo);
+  }
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  unsigned char tmp[2048];
+  const jsonResult_t san_res = sanitizeJson(tmp, sizeof(tmp), (const unsigned char*)str);
+  if (san_res != jsonSuccess) {
+    return san_res;
+  }
+  fprintf(jfo->fp, "\"%s\"", tmp);
+  return jsonSuccess;
+}
+
+// Write a bool as "true" or "false" strings.
+jsonResult_t jsonBool(jsonFileOutput* jfo, bool val) {
+  return jsonStr(jfo, val ? "true" : "false");
+}
+
+// Write an integer value
+jsonResult_t jsonInt(jsonFileOutput* jfo, const int val) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "%d", val);
+  return jsonSuccess;
+}
+
+// Write an unsigned 32-bit integer value
+jsonResult_t jsonUint32(jsonFileOutput* jfo, const uint32_t val) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "%u", val);
+  return jsonSuccess;
+}
+
+
+// Write an unsigned 64-bit integer value
+jsonResult_t jsonUint64(jsonFileOutput* jfo, const uint64_t val) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "%lu", val);
+  return jsonSuccess;
+}
+
+// Write a size_t value
+jsonResult_t jsonSize_t(jsonFileOutput* jfo, const size_t val) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  fprintf(jfo->fp, "%zu", val);
+  return jsonSuccess;
+}
+
+// Write a double value
+jsonResult_t jsonDouble(jsonFileOutput* jfo, const double val) {
+  const jsonResult_t res = jsonValHelper(jfo);
+  if (res != jsonSuccess) {
+    return res;
+  }
+  if (val != val) {
+    fprintf(jfo->fp, "\"nan\"");
+  } else {
+    fprintf(jfo->fp, "%lf", val);
+  }
+  return jsonSuccess;
+}
+
+#ifdef DO_JSON_TEST
+// compile with
+// gcc json.cc -Iinclude/ -DDO_JSON_TEST -o json_test
+// run with:
+// ./json_test
+// if something fails, it will print out the error
+// if it all works, print out "output matches reference"
+#define JSONCHECK(expr)                                                                            \
+  do {                                                                                             \
+    const jsonResult_t res = (expr);                                                               \
+    if (res != jsonSuccess) {                                                                      \
+      fprintf(stderr, "jsonError: %s\n", jsonErrorString(res));                                    \
+      exit(1);                                                                                     \
+    }                                                                                              \
+  } while (0)
+
+int main() {
+
+  const char refstr[] =
+      "{\"number\":123,\"utfstring\":\"∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ "
+      "¬β = ¬(¬α ∨ β),\",\"list\":[\"true\",null,9423812381231,3123111,0.694234]}";
+
+  jsonFileOutput* jfo;
+  JSONCHECK(jsonInitFileOutput(&jfo, "test.json"));
+  JSONCHECK(jsonStartObject(jfo));
+  JSONCHECK(jsonKey(jfo, "number"));
+  JSONCHECK(jsonInt(jfo, 123));
+  JSONCHECK(jsonKey(jfo, "utfstring"));
+  JSONCHECK(
+      jsonStr(jfo, "∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),"));
+  JSONCHECK(jsonKey(jfo, "list"));
+  JSONCHECK(jsonStartList(jfo));
+  JSONCHECK(jsonBool(jfo, true));
+  JSONCHECK(jsonNull(jfo));
+  JSONCHECK(jsonUint64(jfo, 9423812381231ULL));
+  JSONCHECK(jsonSize_t(jfo, 3123111));
+  JSONCHECK(jsonDouble(jfo, 0.69423413));
+  JSONCHECK(jsonFinishList(jfo));
+  JSONCHECK(jsonFinishObject(jfo));
+  JSONCHECK(jsonFinalizeFileOutput(jfo));
+
+  FILE* fp = fopen("test.json", "r");
+
+  const size_t reflen = sizeof(refstr) / sizeof(char);
+
+  char buffer[sizeof(refstr)] = {0}; // zero-filled: the file has no trailing NUL, but refstr does
+
+  fread(buffer, sizeof(char), reflen, fp);
+
+  fclose(fp);
+
+  if (memcmp(buffer, refstr, reflen) == 0) {
+    printf("output matches reference\n");
+  } else {
+    printf("output    %s\nreference %s\n", buffer, refstr);
+    return 1;
+  }
+
+  return 0;
+}
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/json.h b/projects/rccl/ext-profiler/inspector/json.h
new file mode 100644
index 0000000000..a0b6848437
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/json.h
@@ -0,0 +1,83 @@
+#pragma once
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <stddef.h>
+
+typedef enum {
+  JSON_NONE, // A pseudo-state meaning that the document is empty
+  JSON_KEY,
+  JSON_OBJECT_EMPTY,
+  JSON_OBJECT_SOME,
+  JSON_LIST_EMPTY,
+  JSON_LIST_SOME,
+} jsonState_t;
+
+typedef enum {
+  jsonSuccess,
+  jsonFileError,
+  jsonUnknownStateError,
+  jsonEmptyStateError,
+  jsonExpectedNonNoneStateError,
+  jsonStringOverflowError,
+  jsonStringBadChar,
+  jsonMemoryError,
+  jsonLockError,
+} jsonResult_t;
+
+const char *jsonErrorString(jsonResult_t res);
+
+typedef struct jsonFileOutput jsonFileOutput;
+
+jsonResult_t jsonLockOutput(jsonFileOutput *jfo);
+
+jsonResult_t jsonUnlockOutput(jsonFileOutput *jfo);
+
+jsonResult_t jsonInitFileOutput(jsonFileOutput **jfo,
+                                const char *outfile);
+
+jsonResult_t jsonFinalizeFileOutput(jsonFileOutput *jfo);
+
+jsonResult_t jsonNewline(jsonFileOutput *jfo);
+jsonResult_t jsonFlushOutput(jsonFileOutput *jfo);
+
+// Emit a key and separator. Sanitize the key.
+// This is only acceptable if the top state is an object.
+// Emit a ',' separator if we aren't the first item.
+jsonResult_t jsonKey(jsonFileOutput *jfo, const char *name);
+
+// Start an object
+jsonResult_t jsonStartObject(jsonFileOutput *jfo);
+
+// Close an object
+jsonResult_t jsonFinishObject(jsonFileOutput *jfo);
+
+// Start a list
+jsonResult_t jsonStartList(jsonFileOutput *jfo);
+
+// Close a list
+jsonResult_t jsonFinishList(jsonFileOutput *jfo);
+
+// Emit a null value
+jsonResult_t jsonNull(jsonFileOutput *jfo);
+
+// Write a (sanitized) string
+jsonResult_t jsonStr(jsonFileOutput *jfo, const char *str);
+
+// Write a bool as "true" or "false" strings.
+jsonResult_t jsonBool(jsonFileOutput *jfo, bool val);
+
+// Write an integer value
+jsonResult_t jsonInt(jsonFileOutput *jfo, const int val);
+
+// Write an unsigned 32-bit integer value
+jsonResult_t jsonUint32(jsonFileOutput *jfo, const uint32_t val);
+
+// Write an unsigned 64-bit integer value
+jsonResult_t jsonUint64(jsonFileOutput *jfo, const uint64_t val);
+
+// Write a size_t value
+jsonResult_t jsonSize_t(jsonFileOutput *jfo, const size_t val);
+
+// Write a double value
+jsonResult_t jsonDouble(jsonFileOutput *jfo, const double val);
diff --git a/projects/rccl/ext-profiler/inspector/nccl/common.h b/projects/rccl/ext-profiler/inspector/nccl/common.h
new file mode 100644
index 0000000000..f8ab7e9e64
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/common.h
@@ -0,0 +1,73 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef COMMON_H_
+#define COMMON_H_
+
+/* typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel; */
+/* typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys; */
+
+/* Data types */
+typedef enum { ncclInt8       = 0, ncclChar       = 0,
+               ncclUint8      = 1,
+               ncclInt32      = 2, ncclInt        = 2,
+               ncclUint32     = 3,
+               ncclInt64      = 4,
+               ncclUint64     = 5,
+               ncclFloat16    = 6, ncclHalf       = 6,
+               ncclFloat32    = 7, ncclFloat      = 7,
+               ncclFloat64    = 8, ncclDouble     = 8,
+               ncclBfloat16   = 9,
+               ncclFloat8e4m3 = 10,
+               ncclFloat8e5m2 = 11,
+               ncclNumTypes   = 12
+} ncclDataType_t;
+
+typedef enum {
+  NCCL_LOG_NONE = 0,
+  NCCL_LOG_VERSION = 1,
+  NCCL_LOG_WARN = 2,
+  NCCL_LOG_INFO = 3,
+  NCCL_LOG_ABORT = 4,
+  NCCL_LOG_TRACE = 5
+} ncclDebugLogLevel;
+
+typedef enum { ncclSuccess                 =  0,
+               ncclUnhandledCudaError      =  1,
+               ncclSystemError             =  2,
+               ncclInternalError           =  3,
+               ncclInvalidArgument         =  4,
+               ncclInvalidUsage            =  5,
+               ncclRemoteError             =  6,
+               ncclInProgress              =  7,
+               ncclNumResults              =  8 } ncclResult_t;
+
+
+typedef enum {
+  NCCL_INIT = 0x1,
+  NCCL_COLL = 0x2,
+  NCCL_P2P = 0x4,
+  NCCL_SHM = 0x8,
+  NCCL_NET = 0x10,
+  NCCL_GRAPH = 0x20,
+  NCCL_TUNING = 0x40,
+  NCCL_ENV = 0x80,
+  NCCL_ALLOC = 0x100,
+  NCCL_CALL = 0x200,
+  NCCL_PROXY = 0x400,
+  NCCL_NVLS = 0x800,
+  NCCL_BOOTSTRAP = 0x1000,
+  NCCL_REG = 0x2000,
+  NCCL_PROFILE = 0x4000,
+  NCCL_RAS = 0x8000,
+  NCCL_INSPECTOR = 0x100000, // big number to avoid short-term conflicts
+  NCCL_ALL = ~0
+} ncclDebugLogSubSys;
+
+
+typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler.h b/projects/rccl/ext-profiler/inspector/nccl/profiler.h
new file mode 100644
index 0000000000..715885f72b
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler.h
@@ -0,0 +1,85 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_H_
+#define PROFILER_H_
+
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "common.h"
+
+enum {
+  ncclProfileGroup          = (1 << 0),  // group event type
+  ncclProfileColl           = (1 << 1),  // host collective call event type
+  ncclProfileP2p            = (1 << 2),  // host point-to-point call event type
+  ncclProfileProxyOp        = (1 << 3),  // proxy operation event type
+  ncclProfileProxyStep      = (1 << 4),  // proxy step event type
+  ncclProfileProxyCtrl      = (1 << 5),  // proxy control event type
+  ncclProfileKernelCh       = (1 << 6),  // kernel channel event type
+  ncclProfileNetPlugin      = (1 << 7),  // network plugin-defined events
+  ncclProfileGroupApi       = (1 << 8),  // Group API events
+  ncclProfileCollApi        = (1 << 9),  // Collective API events
+  ncclProfileP2pApi         = (1 << 10), // Point-to-Point API events
+  ncclProfileKernelLaunch   = (1 << 11), // Kernel launch events
+};
+
+typedef enum {
+  ncclProfilerProxyOpSendPosted        = 0,  // deprecated in v4
+  ncclProfilerProxyOpSendRemFifoWait   = 1,  // deprecated in v4
+  ncclProfilerProxyOpSendTransmitted   = 2,  // deprecated in v4
+  ncclProfilerProxyOpSendDone          = 3,  // deprecated in v4
+  ncclProfilerProxyOpRecvPosted        = 4,  // deprecated in v4
+  ncclProfilerProxyOpRecvReceived      = 5,  // deprecated in v4
+  ncclProfilerProxyOpRecvTransmitted   = 6,  // deprecated in v4
+  ncclProfilerProxyOpRecvDone          = 7,  // deprecated in v4
+  ncclProfilerProxyOpInProgress_v4     = 19,
+
+  /* Legacy proxy profiler states */
+  ncclProfilerProxyStepSendGPUWait     = 8,
+  ncclProfilerProxyStepSendPeerWait_v4 = 20,
+  ncclProfilerProxyStepSendWait        = 9,
+  ncclProfilerProxyStepRecvWait        = 10,
+  ncclProfilerProxyStepRecvFlushWait   = 11,
+  ncclProfilerProxyStepRecvGPUWait     = 12,
+
+  /* Legacy proxy control states */
+  ncclProfilerProxyCtrlIdle            = 13,
+  ncclProfilerProxyCtrlActive          = 14,
+  ncclProfilerProxyCtrlSleep           = 15,
+  ncclProfilerProxyCtrlWakeup          = 16,
+  ncclProfilerProxyCtrlAppend          = 17,
+  ncclProfilerProxyCtrlAppendEnd       = 18,
+
+  /* Network-defined event states */
+  ncclProfilerNetPluginUpdate          = 21,
+
+  /* Kernel event states */
+  ncclProfilerKernelChStop             = 22,
+
+  /* Group API States */
+  ncclProfilerEndGroupApiStart         = 23,
+  ncclProfilerBeginGroupApiEnd         = 24
+} ncclProfilerEventState_t;
+
+typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
+typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
+typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
+typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
+typedef ncclProfilerEventState_t ncclProfilerEventState_v5_t;
+
+#include "profiler_v5.h"
+#include "profiler_v4.h"
+#include "profiler_v3.h"
+#include "profiler_v2.h"
+#include "profiler_v1.h"
+#include "profiler_net.h"
+
+typedef ncclProfiler_v5_t ncclProfiler_t;
+typedef ncclProfilerEventDescr_v5_t ncclProfilerEventDescr_t;
+typedef ncclProfilerEventStateArgs_v5_t ncclProfilerEventStateArgs_t;
+
+#endif // end include guard
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler_net.h b/projects/rccl/ext-profiler/inspector/nccl/profiler_net.h
new file mode 100644
index 0000000000..4f0a4182c8
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler_net.h
@@ -0,0 +1,19 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_NET_H_
+#define PROFILER_NET_H_
+
+#define NCCL_PROFILER_NET_VER_BITS  (16)
+#define NCCL_PROFILER_NET_VER_MASK  (~0U >> NCCL_PROFILER_NET_VER_BITS)
+#define NCCL_PROFILER_NET_TYPE_MASK (~0U << NCCL_PROFILER_NET_VER_BITS)
+
+typedef enum {
+  NCCL_PROFILER_NET_TYPE_IB   = (1U << NCCL_PROFILER_NET_VER_BITS),
+  NCCL_PROFILER_NET_TYPE_SOCK = (2U << NCCL_PROFILER_NET_VER_BITS),
+} ncclProfilerNetType;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler_v1.h b/projects/rccl/ext-profiler/inspector/nccl/profiler_v1.h
new file mode 100644
index 0000000000..9abcea76d6
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler_v1.h
@@ -0,0 +1,112 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V1_H_
+#define PROFILER_V1_H_
+
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/types.h>
+
+
+typedef struct {
+  uint8_t type;                 // event type descriptor: ncclProfileColl, ...
+  void* parentObj;              // pointer to the profiler parent object (for coll is the group)
+  int rank;                     // originating rank
+  union {
+    struct {
+      const char* name;
+      uint64_t commHash;
+      uint64_t seqNumber;
+      uint8_t func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      uint8_t datatype;
+      uint32_t op;
+      size_t trafficBytes;
+      uint8_t nMaxChannels;
+      uint8_t nWarps;
+      uint8_t algo;
+      uint8_t proto;
+      int isCollnet;
+      int isNvls;
+    } coll;
+
+    struct {
+      const char* name;
+      uint64_t commHash;
+      uint8_t func;
+      void* buff;
+      uint8_t datatype;
+      size_t count;
+      int peer;
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+  };
+} ncclProfilerEventDescr_v1_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+    int steps;
+  } proxyOp;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+} ncclProfilerEventStateArgs_v1_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, int* eActivationMask);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v1_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v1_t eState, ncclProfilerEventStateArgs_v1_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v1_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler_v2.h b/projects/rccl/ext-profiler/inspector/nccl/profiler_v2.h
new file mode 100644
index 0000000000..6a2699b588
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler_v2.h
@@ -0,0 +1,108 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V2_H_
+#define PROFILER_V2_H_
+
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/types.h>
+
+typedef struct {
+  uint8_t type;                 // event type descriptor: ncclProfileColl, ...
+  void* parentObj;              // pointer to the profiler parent object (for coll is the group)
+  int rank;                     // originating rank
+  union {
+    struct {
+      const char* name;
+      uint64_t commHash;
+      uint64_t seqNumber;
+      const char* func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      const char* datatype;
+      size_t trafficBytes;
+      uint8_t nMaxChannels;
+      uint8_t nWarps;
+      const char* algo;
+      const char* proto;
+    } coll;
+
+    struct {
+      const char* name;
+      uint64_t commHash;
+      const char* func;
+      void* buff;
+      const char* datatype;
+      size_t count;
+      int peer;
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+  };
+} ncclProfilerEventDescr_v2_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+    int steps;
+  } proxyOp;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+} ncclProfilerEventStateArgs_v2_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, int* eActivationMask);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v2_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v2_t eState, ncclProfilerEventStateArgs_v2_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v2_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler_v3.h b/projects/rccl/ext-profiler/inspector/nccl/profiler_v3.h
new file mode 100644
index 0000000000..d4def08e14
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler_v3.h
@@ -0,0 +1,116 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V3_H_
+#define PROFILER_V3_H_
+
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/types.h>
+
+typedef struct {
+  uint8_t type;                 // event type descriptor: ncclProfileColl, ...
+  void* parentObj;              // pointer to the profiler parent object (for coll is the group)
+  int rank;                     // originating rank
+  union {
+    struct {
+      const char* name;
+      uint64_t commHash;
+      uint64_t seqNumber;
+      const char* func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      const char* datatype;
+      uint8_t nMaxChannels;
+      uint8_t nWarps;
+      const char* algo;
+      const char* proto;
+    } coll;
+
+    struct {
+      const char* name;
+      uint64_t commHash;
+      const char* func;
+      void* buff;
+      const char* datatype;
+      size_t count;
+      int peer;
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+
+    struct {
+      uint8_t channelId;
+    } kernelCh;
+
+    struct {
+      int64_t id;
+      void* data;
+    } netPlugin;
+  };
+} ncclProfilerEventDescr_v3_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+    int steps;
+  } proxyOp;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+} ncclProfilerEventStateArgs_v3_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, int* eActivationMask);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v3_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v3_t eState, ncclProfilerEventStateArgs_v3_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v3_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler_v4.h b/projects/rccl/ext-profiler/inspector/nccl/profiler_v4.h
new file mode 100644
index 0000000000..75f00548d9
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler_v4.h
@@ -0,0 +1,127 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V4_H_
+#define PROFILER_V4_H_
+
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/types.h>
+
+typedef struct {
+  uint8_t type;                 // event type descriptor: ncclProfileColl, ...
+  void* parentObj;              // pointer to the profiler parent object (for coll is the group)
+  int rank;                     // originating rank
+  union {
+    struct {
+      uint64_t seqNumber;
+      const char* func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      const char* datatype;
+      uint8_t nChannels;
+      uint8_t nWarps;
+      const char* algo;
+      const char* proto;
+    } coll;
+
+    struct {
+      const char* func;
+      void* buff;
+      const char* datatype;
+      size_t count;
+      int peer;
+      uint8_t nChannels;
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+
+    struct {
+      uint8_t channelId;
+      uint64_t pTimer;          // start timestamp from GPU globaltimer
+    } kernelCh;
+
+    struct {
+      int64_t id;
+      void* data;
+    } netPlugin;
+  };
+} ncclProfilerEventDescr_v4_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+  } proxyStep;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+
+  struct {
+    void* data;
+  } netPlugin;
+
+  struct {
+    uint64_t pTimer;
+  } kernelCh;
+} ncclProfilerEventStateArgs_v4_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  //  - commName       : user assigned communicator name
+  //  - commHash       : communicator id
+  //  - nNodes         : number of nodes in communicator
+  //  - nranks         : number of ranks in communicator
+  //  - rank           : rank identifier in communicator
+  //  - logfn          : logger function
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v4_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/profiler_v5.h b/projects/rccl/ext-profiler/inspector/nccl/profiler_v5.h
new file mode 100644
index 0000000000..dab1db9e1a
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/profiler_v5.h
@@ -0,0 +1,151 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V5_H_
+#define PROFILER_V5_H_
+
+typedef struct {
+  uint64_t type;                // event type descriptor: ncclProfileColl, ...
+  void* parentObj;              // pointer to the profiler parent object (for coll is the group)
+  int rank;                     // originating rank
+  union {
+    struct {
+      bool graphCaptured;
+      int groupDepth;
+    } groupApi;
+
+    struct {
+      const char* func;
+      size_t count;
+      const char* datatype;
+      int root;
+      void* stream;
+      bool graphCaptured;
+    } collApi;
+
+    struct {
+      const char* func;
+      size_t count;
+      const char* datatype;
+      void* stream;
+      bool graphCaptured;
+    } p2pApi;
+
+    struct {
+      void* stream;
+    } kernelLaunch;
+
+    struct {
+      uint64_t seqNumber;
+      const char* func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      const char* datatype;
+      uint8_t nChannels;
+      uint8_t nWarps;
+      const char* algo;
+      const char* proto;
+      void* parentGroup; // for backward compatibility with v4
+    } coll;
+
+    struct {
+      const char* func;
+      void* buff;
+      const char* datatype;
+      size_t count;
+      int peer;
+      uint8_t nChannels;
+      void* parentGroup; // for backward compatibility with v4
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+
+    struct {
+      uint8_t channelId;
+      uint64_t pTimer;          // start timestamp from GPU globaltimer
+    } kernelCh;
+
+    struct {
+      int64_t id;
+      void* data;
+    } netPlugin;
+  };
+} ncclProfilerEventDescr_v5_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+  } proxyStep;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+
+  struct {
+    void* data;
+  } netPlugin;
+
+  struct {
+    uint64_t pTimer;
+  } kernelCh;
+} ncclProfilerEventStateArgs_v5_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  //  - commId         : communicator id
+  //  - commName       : user assigned communicator name
+  //  - nNodes         : number of nodes in communicator
+  //  - nranks         : number of ranks in communicator
+  //  - rank           : rank identifier in communicator
+  //  - logfn          : logger function
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, uint64_t commId, int* eActivationMask, const char* commName, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v5_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v5_t eState, ncclProfilerEventStateArgs_v5_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v5_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/nccl/types.h b/projects/rccl/ext-profiler/inspector/nccl/types.h
new file mode 100644
index 0000000000..f43fdc1636
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/nccl/types.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
+ */
+
+#ifndef NCCL_TYPES_H_
+#define NCCL_TYPES_H_
+
+/* Data types */
+typedef enum { ncclInt8       = 0, ncclChar       = 0,
+               ncclUint8      = 1,
+               ncclInt32      = 2, ncclInt        = 2,
+               ncclUint32     = 3,
+               ncclInt64      = 4,
+               ncclUint64     = 5,
+               ncclFloat16    = 6, ncclHalf       = 6,
+               ncclFloat32    = 7, ncclFloat      = 7,
+               ncclFloat64    = 8, ncclDouble     = 8,
+               ncclBfloat16   = 9,
+} ncclDataType_t;
+
+#endif
diff --git a/projects/rccl/ext-profiler/inspector/version.h b/projects/rccl/ext-profiler/inspector/version.h
new file mode 100644
index 0000000000..347757dfca
--- /dev/null
+++ b/projects/rccl/ext-profiler/inspector/version.h
@@ -0,0 +1,12 @@
+#ifndef VERSION_H
+#define VERSION_H
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+const char* get_git_version_info();
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif // VERSION_H
diff --git a/projects/rccl/ext-tuner/README.md b/projects/rccl/ext-tuner/README.md
index 67a743a12b..7595e03ba9 100644
--- a/projects/rccl/ext-tuner/README.md
+++ b/projects/rccl/ext-tuner/README.md
@@ -179,4 +179,4 @@ When developing new tuner plugins:
 - [NCCL Documentation](https://docs.nvidia.com/deeplearning/nccl/)
 - Example plugin implementations in this directory
 
-For questions and support, refer to the NCCL community resources and documentation.
\ No newline at end of file
+For questions and support, refer to the NCCL community resources and documentation.
diff --git a/projects/rccl/ext-tuner/example/.gitignore b/projects/rccl/ext-tuner/example/.gitignore
new file mode 100644
index 0000000000..a3d6f635f5
--- /dev/null
+++ b/projects/rccl/ext-tuner/example/.gitignore
@@ -0,0 +1,49 @@
+# Compiled shared objects and binaries
+*.so
+*.o
+*.a
+*.out
+*.exe
+*.dll
+*.dylib
+*.bin
+*.elf
+
+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+
+# Build and test artifacts
+/build/
+*.log
+*.tmp
+*.swp
+
+# Ignore all CSV files except scripts/sample_performance_data.csv
+*.csv
+!scripts/sample_performance_data.csv
+
+# Ignore all .conf files except nccl_tuner.conf
+*.conf
+!nccl_tuner.conf
+
+my_configs
+
+# Ignore test binary
+test/test_plugin
+
+# Editor/OS files
+.DS_Store
+Thumbs.db
+
+# Backup files
+*~
+*.bak
+
+# Ignore by convention
+*.old
+*.orig
+
+# Git
+.git/
diff --git a/projects/rccl/ext-tuner/example/CMakeLists.txt b/projects/rccl/ext-tuner/example/CMakeLists.txt
new file mode 100644
index 0000000000..1c116b4467
--- /dev/null
+++ b/projects/rccl/ext-tuner/example/CMakeLists.txt
@@ -0,0 +1,26 @@
+# C source files for the example tuner plugin
+set(SRC_FILES
+    ${CMAKE_CURRENT_SOURCE_DIR}/plugin.c
+)
+
+# Create shared library
+add_library(nccl-tuner-example SHARED ${SRC_FILES})
+
+# Set include directories
+target_include_directories(nccl-tuner-example PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}/nccl
+)
+
+# Set output name to match Makefile
+set_target_properties(nccl-tuner-example PROPERTIES
+    OUTPUT_NAME "nccl-tuner-example"
+    PREFIX "lib"
+    POSITION_INDEPENDENT_CODE ON
+    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/test/unit/plugins
+)
+
+# Add custom target for clean (equivalent to Makefile clean target)
+add_custom_target(clean-tuner-lib
+    COMMAND ${CMAKE_COMMAND} -E remove -f ${CMAKE_BINARY_DIR}/test/unit/plugins/libnccl-tuner-example.so
+    COMMENT "Cleaning libnccl-tuner-example.so"
+)
diff --git a/projects/rccl/ext-tuner/example/nccl/tuner.h b/projects/rccl/ext-tuner/example/nccl/tuner.h
index 77b543d12c..dc956b1c04 100644
--- a/projects/rccl/ext-tuner/example/nccl/tuner.h
+++ b/projects/rccl/ext-tuner/example/nccl/tuner.h
@@ -45,6 +45,40 @@ typedef enum {
 
 #define NCCL_ALGO_PROTO_IGNORE -1.0
 
+#define NCCL_HW_NVLINK 0
+#define NCCL_HW_PCI 1
+#define NCCL_HW_NET 2
+#define NCCL_NUM_HW_LINKS 3
+
+#define NCCL_VOLTA_COMPCAP_IDX 0
+#define NCCL_AMPERE_COMPCAP_IDX 1
+#define NCCL_HOPPER_COMPCAP_IDX 2
+#define NCCL_BLACKWELL_COMPCAP_IDX 3
+#define NCCL_NUM_COMPCAPS 4
+
+#define NCCL_TUNING_SCALE_1NODE 0
+#define NCCL_TUNING_SCALE_2NODES 1
+#define NCCL_TUNING_SCALE_4NODES 2
+#define NCCL_NUM_TUNING_SCALES 3
+
+typedef struct {
+  int nNvlDomains;                    // number of NVLink domains
+  int minRanksPerNvlDomain;           // smallest rank count in any NVLink domain
+  int maxRanksPerNvlDomain;           // largest rank count in any NVLink domain
+} ncclNvlDomainInfo_v5_t;
+
+typedef struct {
+  double baseLatencies [NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
+  double hwLatencies [NCCL_NUM_HW_LINKS][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
+
+  double llMaxBws [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES];
+  double perChMaxRingLL128Bws [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES];
+  double perChMaxTreeLL128Bws [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES];
+  double perChMaxTreeBws [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES];
+
+
+} ncclTunerConstants_v5_t;
+
 // API to be implemented by external tuner
 typedef struct {
   // Name of the tuner
@@ -52,12 +86,17 @@ typedef struct {
 
   // Initializes tuner states.
   // Inputs:
+  //   - commId: communicator identifier
   //   - nRanks: number of ranks in current communicator. Each communicator initialize its own tuner.
   //   - nNodes: number of nodes in current communicator.
   //   - logFunction: a logFunction can be useful to integrate logging together with NCCL core.
+  //   - nvlDomainInfo: NVL domain information struct
   // Outputs:
   //   - context: tuner context object
-  ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);
+  // Input/Output:
+  //   - constants: tuner constants
+  ncclResult_t (*init)(void** ctx, uint64_t commId, size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction,
+                      ncclNvlDomainInfo_v5_t* nvlDomainInfo, ncclTunerConstants_v5_t* constants);
 
   // Gets info (algo, protocol, number of ctas and threads) for a given collective.
   // Inputs:
@@ -87,11 +126,13 @@ typedef struct {
 
   // Terminates the plugin and cleans up any resources that the plugin allocated.
   // context: tuner context object
-  ncclResult_t (*destroy)(void* context);
-} ncclTuner_v4_t;
+  ncclResult_t (*finalize)(void* context);
+} ncclTuner_v5_t;
 
-typedef ncclTuner_v4_t ncclTuner_t;
+typedef ncclTuner_v5_t ncclTuner_t;
+typedef ncclNvlDomainInfo_v5_t ncclNvlDomainInfo_t;
+typedef ncclTunerConstants_v5_t ncclTunerConstants_t;
 
-#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v4"
+#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v5"
 
 #endif
diff --git a/projects/rccl/ext-tuner/example/plugin.c b/projects/rccl/ext-tuner/example/plugin.c
index 1b8031ed11..af813495a1 100644
--- a/projects/rccl/ext-tuner/example/plugin.c
+++ b/projects/rccl/ext-tuner/example/plugin.c
@@ -51,6 +51,7 @@ typedef struct {
   size_t nRanks;
   size_t nNodes;
   ncclDebugLogger_t logFunction;
+  ncclNvlDomainInfo_v5_t nvlDomainInfo;
 } TunerContext;
 
 // Parse collective type from string
@@ -289,7 +290,25 @@ static ncclResult_t loadConfig(TunerContext* ctx, const char* filename) {
   return ncclSuccess;
 }
 
-__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) {
+__hidden ncclResult_t pluginInit(void** context, uint64_t commId, size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction,
+                                 ncclNvlDomainInfo_v5_t* nvlDomainInfo, ncclTunerConstants_v5_t* constants) {
+
+  if (NULL != constants) {
+    // NCCL constants tuning
+    // Tune NCCL's internal tuning model to improve base algo/proto selection.
+    // Note: Example numbers are for reference only.
+    //       Actual numbers may vary depending on the hardware and network topology.
+    //       These numbers are not guaranteed to be optimal for all cases.
+    // Limit the per-channel tree bandwidth to 15GB/s (Blackwell, 4-node scale)
+    constants->perChMaxTreeBws[NCCL_BLACKWELL_COMPCAP_IDX][NCCL_TUNING_SCALE_4NODES] = 15.0;
+
+    // Limit the per-channel ring LL128 bandwidth to 20GB/s (Blackwell, 4-node scale)
+    constants->perChMaxRingLL128Bws[NCCL_BLACKWELL_COMPCAP_IDX][NCCL_TUNING_SCALE_4NODES] = 20.0;
+
+    // Set the NVLS network hardware latency to 24us (Simple protocol)
+    constants->hwLatencies[NCCL_HW_NET][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] = 24.0;
+  }
+
   TunerContext* ctx = (TunerContext*)malloc(sizeof(TunerContext));
   if (!ctx) return ncclSystemError;
 
@@ -299,10 +318,16 @@ __hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t
   ctx->nRanks = nRanks;
   ctx->nNodes = nNodes;
   ctx->logFunction = logFunction;
+  if (nvlDomainInfo) {
+    ctx->nvlDomainInfo = *nvlDomainInfo;
+  } else {
+    memset(&ctx->nvlDomainInfo, 0, sizeof(ncclNvlDomainInfo_v5_t));
+  }
 
   if (logFunction) {
     logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
-                "TUNER/ExamplePlugin: Initializing tuner for %zu nodes, %zu ranks", nNodes, nRanks);
+                "TUNER/ExamplePlugin: Initializing tuner for %zu nodes, %zu ranks, %d NVL domains",
+                nNodes, nRanks, ctx->nvlDomainInfo.nNvlDomains);
   }
 
   // Try to load config file from environment variable or default location
@@ -432,7 +457,7 @@ __hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size
   return ncclSuccess;
 }
 
-__hidden ncclResult_t pluginDestroy(void* context) {
+__hidden ncclResult_t pluginFinalize(void* context) {
   if (context) {
     TunerContext* ctx = (TunerContext*)context;
     if (ctx->configs) {
@@ -443,11 +468,12 @@ __hidden ncclResult_t pluginDestroy(void* context) {
   return ncclSuccess;
 }
 
+
 #define PLUGIN_NAME "Example"
 
-const ncclTuner_v4_t ncclTunerPlugin_v4 = {
+const ncclTuner_v5_t ncclTunerPlugin_v5 = {
   .name = PLUGIN_NAME,
   .init = pluginInit,
   .getCollInfo = pluginGetCollInfo,
-  .destroy = pluginDestroy
+  .finalize = pluginFinalize
 };
diff --git a/projects/rccl/ext-tuner/example/test/test_plugin.c b/projects/rccl/ext-tuner/example/test/test_plugin.c
index 28897c4490..c0300d51c2 100644
--- a/projects/rccl/ext-tuner/example/test/test_plugin.c
+++ b/projects/rccl/ext-tuner/example/test/test_plugin.c
@@ -97,12 +97,12 @@ int test_plugin_init() {
   void* context = NULL;
 
   // Test successful initialization
-  ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
+  ncclResult_t result = pluginInit(&context, 0, 8, 2, mock_logger, NULL, NULL);
   TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed");
   TEST_ASSERT(context != NULL, "Context should be allocated");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   TEST_PASS();
 }
 
@@ -122,11 +122,11 @@ int test_config_parsing_valid() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_valid.conf", 1);
 
   void* context = NULL;
-  ncclResult_t result = pluginInit(16, 2, mock_logger, &context);
+  ncclResult_t result = pluginInit(&context, 0, 16, 2, mock_logger, NULL, NULL);
   TEST_ASSERT(result == ncclSuccess, "Plugin init with valid config should succeed");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_valid.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -143,12 +143,12 @@ int test_config_parsing_invalid() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_invalid.conf", 1);
 
   void* context = NULL;
-  ncclResult_t result = pluginInit(8, 1, mock_logger, &context);
+  ncclResult_t result = pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
   // Should still succeed but with no valid configs loaded
   TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed even with invalid config");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_invalid.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -164,7 +164,7 @@ int test_collective_matching() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_match.conf", 1);
 
   void* context = NULL;
-  pluginInit(8, 1, mock_logger, &context);
+  pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
 
   // Create mock cost table
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
@@ -208,7 +208,7 @@ int test_collective_matching() {
   TEST_ASSERT(nChannels == 4, "Should set 4 channels");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_match.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -225,7 +225,7 @@ int test_size_matching() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_size.conf", 1);
 
   void* context = NULL;
-  pluginInit(8, 1, mock_logger, &context);
+  pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
 
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
@@ -279,7 +279,7 @@ int test_size_matching() {
   TEST_ASSERT(nChannels == 8, "Large: Should set 8 channels");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_size.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -297,7 +297,7 @@ int test_topology_matching() {
 
   // Test with single node setup
   void* context1 = NULL;
-  pluginInit(8, 1, mock_logger, &context1);  // 8 ranks, 1 node
+  pluginInit(&context1, 0, 8, 1, mock_logger, NULL, NULL);  // 8 ranks, 1 node
 
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
@@ -315,11 +315,11 @@ int test_topology_matching() {
   TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single node: Should match tree config");
   TEST_ASSERT(nChannels == 2, "Single node: Should set 2 channels");
 
-  pluginDestroy(context1);
+  pluginFinalize(context1);
 
   // Test with 4 nodes, 32 ranks setup
   void* context2 = NULL;
-  pluginInit(32, 4, mock_logger, &context2);  // 32 ranks, 4 nodes
+  pluginInit(&context2, 0, 32, 4, mock_logger, NULL, NULL);  // 32 ranks, 4 nodes
 
   for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
     for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
@@ -348,7 +348,7 @@ int test_default_channels() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_default.conf", 1);
 
   void* context = NULL;
-  pluginInit(8, 1, mock_logger, &context);
+  pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
 
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
@@ -368,7 +368,7 @@ int test_default_channels() {
   TEST_ASSERT(nChannels == 1, "Should keep default channels (1) when config has -1");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_default.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -385,7 +385,7 @@ int test_regbuff_matching() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_regbuff.conf", 1);
 
   void* context = NULL;
-  pluginInit(8, 1, mock_logger, &context);
+  pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
 
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
@@ -436,7 +436,7 @@ int test_regbuff_matching() {
   TEST_ASSERT(nChannels == 8, "Any regBuff: Should set 8 channels");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_regbuff.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -453,7 +453,7 @@ int test_pipeops_matching() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_pipeops.conf", 1);
 
   void* context = NULL;
-  pluginInit(8, 1, mock_logger, &context);
+  pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
 
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
@@ -503,7 +503,7 @@ int test_pipeops_matching() {
   TEST_ASSERT(nChannels == 8, "Any pipeOps: Should set 8 channels");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_pipeops.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -518,7 +518,7 @@ int test_no_match_fallback() {
   setenv("NCCL_TUNER_CONFIG_FILE", "test_fallback.conf", 1);
 
   void* context = NULL;
-  pluginInit(8, 1, mock_logger, &context);
+  pluginInit(&context, 0, 8, 1, mock_logger, NULL, NULL);
 
   float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
@@ -542,7 +542,7 @@ int test_no_match_fallback() {
   TEST_ASSERT(nChannels == 1, "Should use default channels");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink("test_fallback.conf");
   unsetenv("NCCL_TUNER_CONFIG_FILE");
   TEST_PASS();
@@ -592,7 +592,7 @@ int test_large_config() {
 
   // Initialize plugin with large config
   void* context = NULL;
-  ncclResult_t result = pluginInit(16, 4, mock_logger, &context);
+  ncclResult_t result = pluginInit(&context, 0, 16, 4, mock_logger, NULL, NULL);
   TEST_ASSERT(result == ncclSuccess, "Plugin init with large config should succeed");
   TEST_ASSERT(context != NULL, "Context should be allocated");
 
@@ -651,7 +651,7 @@ int test_large_config() {
   TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with large config set");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink(large_config_file);
   unsetenv("NCCL_TUNER_CONFIG_FILE");
 
@@ -683,7 +683,7 @@ int test_very_large_config_stress() {
 
   // Test initialization with stress config
   void* context = NULL;
-  ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
+  ncclResult_t result = pluginInit(&context, 0, 8, 2, mock_logger, NULL, NULL);
   TEST_ASSERT(result == ncclSuccess, "Plugin should handle very large config files");
 
   TunerContext* ctx = (TunerContext*)context;
@@ -704,7 +704,7 @@ int test_very_large_config_stress() {
   }
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink(stress_config_file);
   unsetenv("NCCL_TUNER_CONFIG_FILE");
 
@@ -725,7 +725,7 @@ int test_empty_config() {
   setenv("NCCL_TUNER_CONFIG_FILE", empty_config_file, 1);
 
   void* context = NULL;
-  ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
+  ncclResult_t result = pluginInit(&context, 0, 8, 2, mock_logger, NULL, NULL);
   TEST_ASSERT(result == ncclSuccess, "Plugin should handle empty config files");
 
   TunerContext* ctx = (TunerContext*)context;
@@ -750,13 +750,134 @@ int test_empty_config() {
   TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with empty config");
 
   // Clean up
-  pluginDestroy(context);
+  pluginFinalize(context);
   unlink(empty_config_file);
   unsetenv("NCCL_TUNER_CONFIG_FILE");
 
   TEST_PASS();
 }
 
+// Test NVLink domain info handling
+int test_nvl_domain_info() {
+  printf("Testing NVLink domain info handling...\n");
+
+  // Test NVLink domain structure with min/max ranks per domain
+  ncclNvlDomainInfo_v5_t nvl_domain = {
+    .nNvlDomains = 2, // 2 nodes = 2 domains
+    .minRanksPerNvlDomain = 3, // minimum ranks across all domains (bottleneck)
+    .maxRanksPerNvlDomain = 5  // maximum ranks across all domains (capacity)
+  };
+  
+  void* context = NULL;
+  ncclResult_t result = pluginInit(&context, 0, 8, 2, mock_logger, &nvl_domain, NULL);
+  TEST_ASSERT(result == ncclSuccess, "Plugin init with NVLink domains should succeed");
+  
+  // Validate that the plugin copied the NVL domain info into its context
+  TEST_ASSERT(((TunerContext*)context)->nvlDomainInfo.nNvlDomains == 2, "Should have 2 domains (nodes)");
+  TEST_ASSERT(((TunerContext*)context)->nvlDomainInfo.minRanksPerNvlDomain == 3, "Should have minimum 3 ranks per domain");
+  TEST_ASSERT(((TunerContext*)context)->nvlDomainInfo.maxRanksPerNvlDomain == 5, "Should have maximum 5 ranks per domain");
+  
+  // Clean up
+  pluginFinalize(context);
+  printf("NVLink domain info test passed!\n");
+  TEST_PASS();
+}
+
+int test_tuner_constants() {
+  // Initialize constants to -1.0 for testing purposes
+  ncclTunerConstants_v5_t constants = {
+    // Base latencies: [NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS]
+    .baseLatencies = {
+      {-1.0, -1.0, -1.0},    // NCCL_ALGO_TREE: LL, LL128, Simple
+      {-1.0, -1.0, -1.0},    // NCCL_ALGO_RING: LL, LL128, Simple
+      {-1.0, -1.0, -1.0},   // NCCL_ALGO_COLLNET_DIRECT
+      {-1.0, -1.0, -1.0},   // NCCL_ALGO_COLLNET_CHAIN
+      {-1.0, -1.0, -1.0},    // NCCL_ALGO_NVLS
+      {-1.0, -1.0, -1.0},    // NCCL_ALGO_NVLS_TREE
+      {-1.0, -1.0, -1.0}     // NCCL_ALGO_PAT
+    },
+
+    // Hardware latencies: [NCCL_NUM_HW_LINKS][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS]
+    .hwLatencies = {
+      // NCCL_HW_NVLINK
+      {
+        {-1.0, -1.0, -1.0},    // TREE
+        {-1.0, -1.0, -1.0},    // RING
+        {-1.0, -1.0, -1.0},    // COLLNET_DIRECT
+        {-1.0, -1.0, -1.0},    // COLLNET_CHAIN
+        {-1.0, -1.0, -1.0},    // NVLS
+        {-1.0, -1.0, -1.0},    // NVLS_TREE
+        {-1.0, -1.0, -1.0}     // PAT
+      },
+      // NCCL_HW_PCI
+      {
+        {-1.0, -1.0, -1.0},   // TREE
+        {-1.0, -1.0, -1.0},    // RING
+        {-1.0, -1.0, -1.0},  // COLLNET_DIRECT
+        {-1.0, -1.0, -1.0},  // COLLNET_CHAIN
+        {-1.0, -1.0, -1.0},     // NVLS
+        {-1.0, -1.0, -1.0},   // NVLS_TREE
+        {-1.0, -1.0, -1.0}   // PAT
+      },
+      // NCCL_HW_NET
+      {
+        {-1.0, -1.0, -1.0},  // TREE
+        {-1.0, -1.0, -1.0},  // RING
+        {-1.0, -1.0, -1.0},  // COLLNET_DIRECT
+        {-1.0, -1.0, -1.0},  // COLLNET_CHAIN
+        {-1.0, -1.0, -1.0},  // NVLS
+        {-1.0, -1.0, -1.0},  // NVLS_TREE
+        {-1.0, -1.0, -1.0}   // PAT
+      }
+    },
+
+    // LL maximum bandwidths: [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES]
+    .llMaxBws = {
+      {-1.0, -1.0, -1.0},  // Volta: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Ampere: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Hopper: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0}   // Blackwell: 1node, 2nodes, 4nodes
+    },
+
+    // Per-channel maximum Ring LL128 bandwidths: [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES]
+    .perChMaxRingLL128Bws = {
+      {-1.0, -1.0, -1.0},   // Volta: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Ampere: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Hopper: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0}   // Blackwell: 1node, 2nodes, 4nodes
+    },
+
+    // Per-channel maximum Tree LL128 bandwidths: [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES]
+    .perChMaxTreeLL128Bws = {
+      {-1.0, -1.0, -1.0},    // Volta: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},   // Ampere: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Hopper: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0}   // Blackwell: 1node, 2nodes, 4nodes
+    },
+
+    // Per-channel maximum Tree bandwidths: [NCCL_NUM_COMPCAPS][NCCL_NUM_TUNING_SCALES]
+    .perChMaxTreeBws = {
+      {-1.0, -1.0, -1.0},  // Volta: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Ampere: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0},  // Hopper: 1node, 2nodes, 4nodes
+      {-1.0, -1.0, -1.0}   // Blackwell: 1node, 2nodes, 4nodes
+    }
+  };
+
+  void* context = NULL;
+  ncclResult_t result = pluginInit(&context, 0, 8, 2, mock_logger, NULL, &constants);
+  TEST_ASSERT(result == ncclSuccess, "Plugin init with constants should succeed");
+
+  // Test that the constants were set correctly
+  TEST_ASSERT(constants.perChMaxTreeBws[NCCL_BLACKWELL_COMPCAP_IDX][NCCL_TUNING_SCALE_4NODES] == 15.0, "Tree bandwidth should be 15GB/s");
+  TEST_ASSERT(constants.perChMaxRingLL128Bws[NCCL_BLACKWELL_COMPCAP_IDX][NCCL_TUNING_SCALE_4NODES] == 20.0, "Ring LL128 bandwidth should be 20GB/s");
+  TEST_ASSERT(constants.hwLatencies[NCCL_HW_NET][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] == 24.0, "NVLS network latency should be 24us");
+
+  // Clean up
+  pluginFinalize(context);
+  TEST_PASS();
+}
+
 // Test runner function pointer type
 typedef int (*TestFunction)(void);
 
@@ -782,6 +903,8 @@ TestCase test_cases[] = {
   {"large-config", test_large_config, "Large configuration files (dynamic allocation)"},
   {"stress-config", test_very_large_config_stress, "Very large configuration stress test"},
   {"empty-config", test_empty_config, "Empty configuration file handling"},
+  {"nvl-domain", test_nvl_domain_info, "NVL domain info handling"},
+  {"constants", test_tuner_constants, "Tuner constants initialization"},
   {NULL, NULL, NULL} // End marker
 };
 
@@ -825,6 +948,7 @@ int main(int argc, char* argv[]) {
   if (argc == 1) {
     // No arguments - run all tests
     for (int i = 0; test_cases[i].name != NULL; i++) {
+      printf("Running test: %s\n", test_cases[i].name);
       total++;
       passed += test_cases[i].func();
     }
diff --git a/projects/rccl/makefiles/common.mk b/projects/rccl/makefiles/common.mk
index 0f01671b67..f8f455dec6 100644
--- a/projects/rccl/makefiles/common.mk
+++ b/projects/rccl/makefiles/common.mk
@@ -32,13 +32,8 @@ CUDA_MINOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 2)
 
 # You should define NVCC_GENCODE in your environment to the minimal set
 # of archs to reduce compile time.
-CUDA8_GENCODE = -gencode=arch=compute_50,code=sm_50 \
-                -gencode=arch=compute_60,code=sm_60 \
+CUDA8_GENCODE = -gencode=arch=compute_60,code=sm_60 \
                 -gencode=arch=compute_61,code=sm_61
-ifeq ($(shell test "0$(CUDA_MAJOR)" -lt 12; echo $$?),0)
-# SM35 is deprecated from CUDA12.0 onwards
-CUDA8_GENCODE += -gencode=arch=compute_35,code=sm_35
-endif
 CUDA9_GENCODE = -gencode=arch=compute_70,code=sm_70
 CUDA10_GENCODE = -gencode=arch=compute_75,code=sm_75
 CUDA11_GENCODE = -gencode=arch=compute_80,code=sm_80
diff --git a/projects/rccl/makefiles/version.mk b/projects/rccl/makefiles/version.mk
index 3b182d61bd..d0e97c0657 100644
--- a/projects/rccl/makefiles/version.mk
+++ b/projects/rccl/makefiles/version.mk
@@ -1,6 +1,6 @@
 ##### version
 NCCL_MAJOR   := 2
-NCCL_MINOR   := 27
-NCCL_PATCH   := 7
+NCCL_MINOR   := 28
+NCCL_PATCH   := 3
 NCCL_SUFFIX  :=
 PKG_REVISION := 1
diff --git a/projects/rccl/pkg/Makefile b/projects/rccl/pkg/Makefile
index ab6487be9b..cffd5d76fd 100644
--- a/projects/rccl/pkg/Makefile
+++ b/projects/rccl/pkg/Makefile
@@ -10,7 +10,7 @@ build : debian.build txz.build
 
 BUILDDIR ?= $(abspath ../build)
 ABSBUILDDIR := $(abspath $(BUILDDIR))
-TARGETS := debian txz
+TARGETS := debian txz doc
 all:   ${TARGETS:%=%.build}
 prep:  ${TARGETS:%=%.prep}
 build: ${TARGETS:%=%.build}
diff --git a/projects/rccl/pkg/debian/libnccl-dev.install.in b/projects/rccl/pkg/debian/libnccl-dev.install.in
index 45120e6de4..b656e63abb 100644
--- a/projects/rccl/pkg/debian/libnccl-dev.install.in
+++ b/projects/rccl/pkg/debian/libnccl-dev.install.in
@@ -1,4 +1,4 @@
 bin/ncclras /usr/bin
-include/nccl.h /usr/include
+include/* /usr/include
 lib/libnccl.so /usr/lib/${pkg:MultiArch}
 lib/libnccl_static.a /usr/lib/${pkg:MultiArch}
diff --git a/projects/rccl/pkg/redhat/nccl.spec.in b/projects/rccl/pkg/redhat/nccl.spec.in
index d629555924..30ddbf19fa 100644
--- a/projects/rccl/pkg/redhat/nccl.spec.in
+++ b/projects/rccl/pkg/redhat/nccl.spec.in
@@ -47,8 +47,8 @@ ln -s libnccl.so.${nccl:Major}.${nccl:Minor}.${nccl:Patch} $RPM_BUILD_ROOT/%{_li
 # devel
 install -m 755 -d $RPM_BUILD_ROOT/%{_bindir}
 install -m 755 -d $RPM_BUILD_ROOT/%{_includedir}
+cp -a include/* $RPM_BUILD_ROOT/%{_includedir}/
 install -m 755 bin/ncclras $RPM_BUILD_ROOT/%{_bindir}
-install -m 644 include/nccl.h $RPM_BUILD_ROOT/%{_includedir}
 ln -s libnccl.so.${nccl:Major} $RPM_BUILD_ROOT/%{_libdir}/libnccl.so
 
 # static
@@ -67,7 +67,7 @@ rm -rf $RPM_BUILD_ROOT
 %doc LICENSE.txt
 %defattr(-,root,root,-)
 %{_bindir}/ncclras
-%{_includedir}/nccl.h
+%{_includedir}/*
 %{_libdir}/libnccl.so
 
 %files static
diff --git a/projects/rccl/pkg/srctxz/Makefile b/projects/rccl/pkg/srctxz/Makefile
index 01cab95a43..a8d9e0da99 100644
--- a/projects/rccl/pkg/srctxz/Makefile
+++ b/projects/rccl/pkg/srctxz/Makefile
@@ -22,7 +22,7 @@ prep: $(TXZTARGETS)
 build: prep
 	$(MAKE) -C ../../src clean
 	@printf "Building source tar.xz package\n"
-	(cd $(BUILDDIR); bash srctxz/create_srctxz.sh)
+	(cd $(BUILDDIR); SRCTXZ_APITESTS=$(SRCTXZ_APITESTS) bash srctxz/create_srctxz.sh)
 	mkdir -p $(PKGDIR)
 	mv $(BUILDDIR)/../../nccl-src*.txz $(PKGDIR)
 
diff --git a/projects/rccl/pkg/srctxz/create_srctxz.sh.in b/projects/rccl/pkg/srctxz/create_srctxz.sh.in
index 11bdd52db7..0e627dd257 100644
--- a/projects/rccl/pkg/srctxz/create_srctxz.sh.in
+++ b/projects/rccl/pkg/srctxz/create_srctxz.sh.in
@@ -28,8 +28,34 @@ NCCL_SUFFIX=${nccl:Suffix}
 NCCL_BUILD=${pkg:Revision}
 
 NCCLNAME="nccl-src_${NCCL_MAJOR}.${NCCL_MINOR}.${NCCL_PATCH}${NCCL_SUFFIX}-${NCCL_BUILD}"
+if [ "${SRCTXZ_APITESTS}" = "1" ]; then
+  NCCLNAME+="-apitest"
+fi
 
-tar --exclude build \
+
+INCLUDE_TEST_ENTRIES=("apitest" "googletest" "gtest.mk")
+
+if [ "${SRCTXZ_APITESTS}" = "1" ]; then
+  # Exclude all entries inside test folder except those in INCLUDE_TEST_ENTRIES
+  for entry in $(ls $NCCLDIR/test); do
+    if [[ ! " ${INCLUDE_TEST_ENTRIES[@]} " =~ " $entry " ]]; then
+      EXCLUDE_TEST+=" --exclude $NCCLDIR/test/$entry"
+    fi
+  done
+else
+  # Exclude the entire test directory
+  EXCLUDE_TEST+=" --exclude test"
+fi
+
+tar --exclude fortran \
+    --exclude doc \
+    --exclude plc \
+    --exclude build \
     --exclude ".git*" \
+    --exclude share \
+    --exclude ompi \
+    --exclude ext-net \
     --exclude pkg/srctxz \
+    --exclude docker \
+    $EXCLUDE_TEST \
     --transform "s/^$NCCLDIR/$NCCLNAME/" -Jcf $NCCLNAME.txz --owner=0 --group=0 $NCCLDIR
diff --git a/projects/rccl/src/CMakeLists.txt b/projects/rccl/src/CMakeLists.txt
new file mode 100644
index 0000000000..5ab69dc925
--- /dev/null
+++ b/projects/rccl/src/CMakeLists.txt
@@ -0,0 +1,180 @@
+# Source files
+set(LIBSRCFILES
+    bootstrap.cc
+    channel.cc
+    ce_coll.cc
+    collectives.cc
+    debug.cc
+    enqueue.cc
+    group.cc
+    init.cc
+    init_nvtx.cc
+    proxy.cc
+    transport.cc
+    mnnvl.cc
+    allocator.cc
+    sym_kernels.cc
+    dev_runtime.cc
+)
+
+# Add compatibility shim if using static cudart
+if(CUDARTLIB STREQUAL "cudart_static")
+    list(APPEND LIBSRCFILES enhcompat.cc)
+endif()
+
+# Configure pkg-config file
+configure_file(
+    ${CMAKE_CURRENT_SOURCE_DIR}/nccl.pc.in
+    ${CMAKE_BINARY_DIR}/lib/pkgconfig/nccl.pc
+    @ONLY
+)
+
+# Add files from subdirectories
+add_subdirectory(transport)
+add_subdirectory(misc)
+add_subdirectory(register)
+add_subdirectory(graph)
+add_subdirectory(plugin)
+add_subdirectory(device)
+add_subdirectory(nccl_device)
+add_subdirectory(ras)
+add_subdirectory(scheduler)
+
+add_compile_options(-fmacro-prefix-map=${CMAKE_CURRENT_SOURCE_DIR}/=)
+
+# Add all source files
+list(APPEND LIBSRCFILES
+    ${TRANSPORT_SOURCES}
+    ${MISC_SOURCES}
+    ${REGISTER_SOURCES}
+    ${GRAPH_SOURCES}
+    ${PLUGIN_SOURCES}
+    ${RAS_SOURCES}
+    ${SYM_SOURCES}
+    ${SCHEDULER_SOURCES}
+)
+
+###################### Create a shared NCCL library ############################
+add_library(nccl SHARED)
+
+target_sources(nccl PRIVATE ${LIBSRCFILES})
+
+# Include directories
+target_include_directories(nccl PUBLIC
+    ${CMAKE_CURRENT_SOURCE_DIR}/device
+    ${CMAKE_CURRENT_SOURCE_DIR}/include
+    ${CMAKE_CURRENT_SOURCE_DIR}/include/plugin
+    ${CUDAToolkit_INCLUDE_DIRS}
+    ${CUDAToolkit_INCLUDE_DIRS}/cccl
+)
+
+add_custom_command(
+    OUTPUT ${CMAKE_BINARY_DIR}/include/nccl.h
+    COMMAND ${CMAKE_COMMAND} -E make_directory ${CMAKE_BINARY_DIR}/include
+    COMMAND sed -e "s/\\\$$\\{nccl:Major\\}/${NCCL_MAJOR}/g"
+                -e "s/\\\$$\\{nccl:Minor\\}/${NCCL_MINOR}/g"
+                -e "s/\\\$$\\{nccl:Patch\\}/${NCCL_PATCH}/g"
+                -e "s/\\\$$\\{nccl:Suffix\\}/${NCCL_SUFFIX}/g"
+                -e "s/\\\$$\\{nccl:Version\\}/${NCCL_VERSION_CODE}/g"
+                ${CMAKE_CURRENT_SOURCE_DIR}/nccl.h.in > ${CMAKE_BINARY_DIR}/include/nccl.h
+    BYPRODUCTS ${CMAKE_BINARY_DIR}/include/nccl.h
+)
+
+add_custom_target(nccl_header DEPENDS ${CMAKE_BINARY_DIR}/include/nccl.h)
+
+add_dependencies(nccl nccl_header)
+
+# Set version and output name
+set_target_properties(nccl PROPERTIES
+    VERSION ${NCCL_MAJOR}.${NCCL_MINOR}.${NCCL_PATCH}
+    SOVERSION ${NCCL_MAJOR}
+    OUTPUT_NAME "nccl"
+    PREFIX "lib"
+)
+
+# Set CUDA specific flags
+set_target_properties(nccl PROPERTIES
+    CUDA_SEPARABLE_COMPILATION ON
+    CUDA_RESOLVE_DEVICE_SYMBOLS ON
+    CUDA_ARCHITECTURES "${CMAKE_CUDA_ARCHITECTURES}"
+    POSITION_INDEPENDENT_CODE ON
+)
+
+# Link libraries
+target_link_libraries(nccl
+    PRIVATE
+    nccl_device
+    pthread
+    rt
+    dl
+    ${CUDAToolkit_LIBRARIES}
+    ${EXTRA_LIBS}
+)
+
+# Set output directories for nccl shared library
+set_target_properties(nccl PROPERTIES
+    LIBRARY_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib"
+)
+
+###################### Create a ras binary executable ############################
+set(RAS_BINSRCFILES ras/client.cc)
+
+add_executable(ncclras ${RAS_BINSRCFILES})
+
+target_include_directories(ncclras PUBLIC
+    ${CMAKE_BINARY_DIR}/include
+    ${CUDAToolkit_INCLUDE_DIRS}
+)
+
+add_dependencies(ncclras nccl_header)
+
+target_link_libraries(ncclras
+    PRIVATE
+    pthread
+    rt
+    dl
+)
+
+# Set output directory for ncclras executable
+set_target_properties(ncclras PROPERTIES
+    RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/bin"
+)
+
+###################### Create a static NCCL library ############################
+add_library(nccl_static STATIC ${LIBSRCFILES})
+
+# Include directories
+target_include_directories(nccl_static PUBLIC
+    ${CMAKE_CURRENT_SOURCE_DIR}/device
+    ${CMAKE_CURRENT_SOURCE_DIR}/include
+    ${CMAKE_CURRENT_SOURCE_DIR}/include/plugin
+    ${CUDAToolkit_INCLUDE_DIRS}
+    ${CUDAToolkit_INCLUDE_DIRS}/cccl
+)
+
+# Add dependency on nccl_header
+add_dependencies(nccl_static nccl_header)
+
+# Link libraries
+target_link_libraries(nccl_static
+    PRIVATE
+    nccl_device
+    pthread
+    rt
+    dl
+    ${CUDAToolkit_LIBRARIES}
+    ${EXTRA_LIBS}
+)
+
+# Set CUDA specific flags
+set_target_properties(nccl_static PROPERTIES
+    CUDA_SEPARABLE_COMPILATION ON
+    CUDA_RESOLVE_DEVICE_SYMBOLS ON
+    CUDA_ARCHITECTURES "${CMAKE_CUDA_ARCHITECTURES}"
+    POSITION_INDEPENDENT_CODE ON
+)
+
+# Set output directory for nccl_static library
+set_target_properties(nccl_static PROPERTIES
+    ARCHIVE_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib"
+)
diff --git a/projects/rccl/src/Makefile b/projects/rccl/src/Makefile
index eab662ef98..be026cc267 100644
--- a/projects/rccl/src/Makefile
+++ b/projects/rccl/src/Makefile
@@ -7,10 +7,12 @@ include ../makefiles/common.mk
 include ../makefiles/version.mk
 
 ##### src files
-INCEXPORTS  := nccl.h
+INCEXPORTS  := nccl.h nccl_device.h \
+	$(patsubst include/%,%,$(wildcard include/nccl_device/*.h include/nccl_device/impl/*.h))
+
 LIBSRCFILES := \
 	bootstrap.cc channel.cc collectives.cc debug.cc enqueue.cc group.cc \
-	init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc allocator.cc symmetric.cc \
+	init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc allocator.cc dev_runtime.cc sym_kernels.cc ce_coll.cc \
 	$(wildcard graph/*.cc) \
 	$(wildcard misc/*.cc) \
 	$(wildcard transport/*.cc) \
@@ -19,6 +21,8 @@ LIBSRCFILES := \
 	$(wildcard plugin/net/*.cc) \
 	$(wildcard plugin/tuner/*.cc) \
 	$(wildcard plugin/profiler/*.cc) \
+	$(wildcard nccl_device/*.cc) \
+	$(wildcard scheduler/*.cc) \
 	$(filter-out ras/client.cc,$(wildcard ras/*.cc))
 BINSRCFILES := ras/client.cc
 
@@ -123,6 +127,16 @@ $(INCDIR)/nccl_%.h : include/nccl_%.h
 	mkdir -p $(INCDIR)
 	install -m 644 $< $@
 
+$(INCDIR)/nccl_device/%.h: include/nccl_device/%.h
+	@printf "Grabbing   %-35s > %s\n" $< $@
+	mkdir -p $(INCDIR)/nccl_device
+	install -m 644 $< $@
+
+$(INCDIR)/nccl_device/impl/%.h: include/nccl_device/impl/%.h
+	@printf "Grabbing   %-35s > %s\n" $< $@
+	mkdir -p $(INCDIR)/nccl_device/impl
+	install -m 644 $< $@
+
 $(PKGDIR)/%.pc : %.pc
 	@printf "Grabbing   %-35s > %s\n" $< $@
 	mkdir -p $(PKGDIR)
@@ -149,7 +163,7 @@ install : build
 	mkdir -p $(PREFIX)/bin
 	cp -P -v $(BUILDDIR)/lib/lib* $(PREFIX)/lib/
 	cp -P -v $(BUILDDIR)/lib/pkgconfig/* $(PREFIX)/lib/pkgconfig/
-	cp -v $(BUILDDIR)/include/* $(PREFIX)/include/
+	cp -v -r $(BUILDDIR)/include/* $(PREFIX)/include/
 	cp -v $(BUILDDIR)/bin/ncclras $(PREFIX)/bin/
 
 FILESTOFORMAT := $(shell find . -name ".\#*" -prune -o \( -name "*.cc" -o -name "*.h" \) -print | grep -v -E 'ibvwrap.h|nvmlwrap.h|gdrwrap.h|nccl.h')
diff --git a/projects/rccl/src/allocator.cc b/projects/rccl/src/allocator.cc
index c58181948a..f5638b92d5 100644
--- a/projects/rccl/src/allocator.cc
+++ b/projects/rccl/src/allocator.cc
@@ -7,10 +7,11 @@
 #include "comm.h"
 #include "transport.h"
 #include "group.h"
+#include "nvtx.h"
 
 NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
 ncclResult_t  ncclMemAlloc(void **ptr, size_t size) {
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
   ncclResult_t ret = ncclSuccess;
 
 #if CUDART_VERSION >= 12010
@@ -98,7 +99,7 @@ fail:
 
 NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
 ncclResult_t  ncclMemFree(void *ptr) {
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
   ncclResult_t ret = ncclSuccess;
   int saveDevice;
 
@@ -127,70 +128,339 @@ fail:
   goto exit;
 }
 
-// This is a collective function and should be called by all ranks in the communicator
-ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr) {
-  ncclResult_t ret = ncclSuccess;
-  void* regSymAddr = NULL;
-  size_t allocSize = size;
-  size_t granularity;
-  CUdevice cuDev;
-  CUmemAllocationProp memprop = {};
-  CUmemGenericAllocationHandle memHandle;
-  int bit = 0, cnt = 0;
+////////////////////////////////////////////////////////////////////////////////
+// ncclSpace:
+//
+// This data structure "cuts" the line of non-negative integers into segments
+// which alternate between "full" (allocated) and "empty" (not allocated). The
+// cuts are sorted ascending. The segment after the last cut must be empty
+// (the unallocated frontier). Knowing this, we can deduce whether the segment
+// ending at cuts[i] is full or empty with this formula:
+//   isFull(i) = (i%2 != ncuts%2)
 
-  // aligment must be power of 2 as an input
-  while (bit < sizeof(size_t) * 8) {
-    if (alignment & (1L << bit)) cnt++;
-    if (cnt == 2) {
-      WARN("rank %d alignment %ld is not power of 2", comm->rank, alignment);
-      goto fail;
+void ncclSpaceConstruct(struct ncclSpace* a) {
+  memset(a, 0, sizeof(*a));
+}
+
+void ncclSpaceDestruct(struct ncclSpace* a) {
+  free(a->cuts);
+}
+
+static void insertSegment(struct ncclSpace* a, int index, int64_t lo, int64_t hi) {
+  // Insert space for two cuts in `a->cuts[]` before `index`.
+  if (a->count + 2 > a->capacity) {
+    a->capacity *= 2;
+    if (a->capacity == 0) a->capacity = 16;
+    int64_t* cuts1 = (int64_t*)malloc(a->capacity*sizeof(int64_t));
+    for (int i=0; i < index; i++) cuts1[i] = a->cuts[i];
+    for (int i=index; i < a->count; i++) cuts1[i+2] = a->cuts[i];
+    free(a->cuts);
+    a->cuts = cuts1;
+  } else {
+    for (int i=a->count-1; index <= i; i--) a->cuts[i+2] = a->cuts[i];
+  }
+  a->cuts[index+0] = lo;
+  a->cuts[index+1] = hi;
+  a->count += 2;
+
+  // Filter pairs of adjacent repeated values from cuts[]. Since these mark
+  // boundaries where segments transition between full<->empty, dropping such a
+  // pair fuses two adjacent segments together. Examples:
+  //   [1,2,3,3,4] -> [1,2,4]
+  //   [1,2,3,3,3,4] -> [1,2,3,4] // have to leave one 3 because it's a full<->empty transition
+  //   [1,2,3,3,3,3,4] -> [1,2,4]
+  // Leading zeros don't have to be in pairs, they are always dropped:
+  //   [0,1,2] -> [1,2]
+  //   [0,0,1,2] -> [1,2]
+  int r = index, w = index; // Read and write cursors.
+  int64_t prev = r==0 ? 0 : a->cuts[r-1];
+  while (r < a->count) {
+    int64_t cur = a->cuts[r++];
+    a->cuts[w++] = cur;
+    if (prev == cur) { // Repeated value is an empty segment which can be deleted.
+      // Erase last two cuts or just one if we're at the start.
+      w -= w==1 ? 1 : 2;
+      // Zeros can only occur at the beginning (due to being sorted). We want to
+      // drop any number of zeros, but only even numbers of other repeated values.
+      // So set to zero here, which will make prev=0, thus if next value is zero
+      // it will be dropped but if its not zero then it will need to begin a new
+      // pair to be dropped.
+      cur = 0;
     }
-    bit++;
+    prev = cur;
   }
-  // temporarily align the alignment to NCCL_REC_PAGE_SIZE
-  ALIGN_SIZE(alignment, NCCL_REC_PAGE_SIZE);
-
-  CUCHECKGOTO(cuDeviceGet(&cuDev, comm->cudaDev), ret, fail);
-  memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
-  memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
-  memprop.requestedHandleTypes = ncclCuMemHandleType;
-  memprop.location.id = cuDev;
-  CUCHECKGOTO(cuMemGetAllocationGranularity(&granularity, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED), ret, fail);
-  ALIGN_SIZE(allocSize, granularity);
-
-  CUCHECKGOTO(cuMemCreate(&memHandle, allocSize, &memprop, 0), ret, fail);
-  ALIGN_SIZE(comm->symAllocHead, alignment);
-  NCCLCHECKGOTO(ncclIpcSymmetricMap(comm, comm->symAllocHead, allocSize, memHandle, &regSymAddr), ret, fail);
-  NCCLCHECKGOTO(ncclNvlsSymmetricMap(comm, comm->symAllocHead, allocSize, regSymAddr), ret, fail);
-  NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
-  comm->symAllocHead += allocSize;
-  *symPtr = regSymAddr;
-
-exit:
-  return ret;
-fail:
-  *symPtr = NULL;
-  goto exit;
+  a->count = w;
 }
 
-ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr) {
-  CUmemGenericAllocationHandle handle;
-  size_t size = 0;
-  ncclResult_t ret = ncclSuccess;
-  int saveDev = comm->cudaDev;
-  CUDACHECKGOTO(cudaGetDevice(&saveDev), ret, fail);
-  if (ncclCuMemEnable()) {
-    CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
-    CUCHECKGOTO(cuMemRetainAllocationHandle(&handle, symPtr), ret, fail);
-    CUCHECKGOTO(cuMemRelease(handle), ret, fail);
-    CUCHECKGOTO(cuMemGetAddressRange(NULL, &size, (CUdeviceptr)symPtr), ret, fail);
-    NCCLCHECKGOTO(ncclNvlsSymmetricFree(comm, size, symPtr), ret, fail);
-    NCCLCHECKGOTO(ncclIpcSymmetricFree(comm, size, symPtr), ret, fail);
-    CUCHECKGOTO(cuMemRelease(handle), ret, fail);
+ncclResult_t ncclSpaceAlloc(
+    struct ncclSpace* a, int64_t limit, int64_t size, int align,
+    int64_t* outOffset
+  ) {
+  // When allocating we try to locate the first empty segment which can hold
+  // the allocation and move its lower cut upward.
+  int i = a->count%2; // First empty segment ends at cuts[i]
+  size_t off;
+  while (i <= a->count) {
+    size_t lo = i == 0 ? 0 : a->cuts[i-1];
+    size_t hi = i == a->count ? limit : a->cuts[i];
+    off = alignUp(lo, align);
+    if (off + size <= hi) {
+      *outOffset = off;
+      if (i == 0 || off + size == hi) { // Slow path required.
+        insertSegment(a, i, off, off+size);
+      } else { // We can just append to the end of a full segment.
+        a->cuts[i-1] = off + size;
+      }
+      return ncclSuccess;
+    }
+    i += 2; // Next empty segment
   }
-exit:
-  CUDACHECK(cudaSetDevice(saveDev));
-  return ret;
-fail:
-  goto exit;
+  WARN("Allocation failed. No suitable space found to accommodate size=0x%lx within limit=0x%lx", (long)size, (long)limit);
+  return ncclInternalError;
+}
+
+ncclResult_t ncclSpaceFree(struct ncclSpace* a, int64_t offset, int64_t size) {
+  if (a->count == 0 || a->cuts[a->count-1] <= offset) {
+    WARN("No allocation found at offset=0x%lx", (long)offset);
+    return ncclInternalError;
+  }
+
+  // This could be binary search, but since allocate is linear there's no point.
+  int i = 1 - a->count%2; // First full segment ends at cuts[i]
+  while (a->cuts[i] <= offset) i += 2;
+
+  int64_t lo = i==0 ? 0 : a->cuts[i-1];
+  int64_t hi = a->cuts[i];
+
+  if (offset < lo || hi < offset + size) {
+    WARN("Given size=0x%lx extends beyond allocation.", (long)size);
+    return ncclInternalError;
+  }
+
+  // First try the two fast cases which just shrink a segment from one side.
+  if (i != 0 && lo == offset && offset + size != hi) {
+    a->cuts[i-1] = offset + size; // Bring bottom up.
+  } else if (lo != offset && offset + size == hi) {
+    a->cuts[i] = offset; // Bring top down.
+  } else { // Slow path.
+    insertSegment(a, i, offset, offset+size);
+  }
+  return ncclSuccess;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// ncclShadowPool:
+
+struct ncclShadowPage { // A contiguous block of (at most) 64 objects
+  struct ncclShadowPage* next;
+  int objSize;
+  uint64_t freeMask;
+  void* devObjs;
+};
+struct ncclShadowObject {
+  struct ncclShadowObject* next;
+  void* devObj;
+  void* hostObj;
+  struct ncclShadowPage* page; // null if not allocated in page but directly in CUDA mempool.
+};
+
+void ncclShadowPoolConstruct(struct ncclShadowPool* pool) {
+  pool->hbits = 0;
+  pool->count = 0;
+  pool->table = nullptr;
+  pool->pages = nullptr;
+}
+
+ncclResult_t ncclShadowPoolDestruct(struct ncclShadowPool* pool) {
+  if (pool->hbits != 0) {
+    cudaStream_t stream;
+    CUDACHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
+
+    if (pool->count != 0) {
+      for (int i=0; i < 1<<pool->hbits; i++) {
+        struct ncclShadowObject* obj = pool->table[i];
+        while (obj != nullptr) {
+          struct ncclShadowPage* page = obj->page;
+          if (page != nullptr) {
+            if (page->freeMask == 0) { // Put full pages back into page list.
+              page->freeMask = 1;
+              page->next = pool->pages;
+              pool->pages = page;
+            }
+          } else {
+            cudaFreeAsync(obj->devObj, stream);
+          }
+          struct ncclShadowObject* next = obj->next;
+          free(obj);
+          obj = next;
+        }
+      }
+    }
+    free(pool->table);
+
+    while (pool->pages != nullptr) {
+      cudaFreeAsync(pool->pages->devObjs, stream);
+      struct ncclShadowPage* next = pool->pages->next;
+      free(pool->pages);
+      pool->pages = next;
+    }
+
+    cudaStreamSynchronize(stream);
+    cudaStreamDestroy(stream);
+    cudaMemPoolDestroy(pool->memPool);
+  }
+  return ncclSuccess;
+}
+
+static int hashBucket(int hbits, void* devObj) {
+  uintptr_t h = reinterpret_cast<uintptr_t>(devObj);
+  h ^= h>>32;
+  h *= 0x9e3779b97f4a7c13;
+  return (uint64_t)h >> (64-hbits);
+}
+
+static void hashInsert(struct ncclShadowPool* pool, struct ncclShadowObject* obj) {
+  int b = hashBucket(pool->hbits, obj->devObj);
+  obj->next = pool->table[b];
+  pool->table[b] = obj;
+}
+
+ncclResult_t ncclShadowPoolAlloc(
+    struct ncclShadowPool* pool, size_t size, void** outDevObj, void** outHostObj,
+    cudaStream_t stream
+  ) {
+  if (size == 0) {
+    if (outDevObj) *outDevObj = nullptr;
+    if (outHostObj) *outHostObj = nullptr;
+    return ncclSuccess;
+  }
+
+  int hbits = pool->hbits;
+  if (hbits == 0) {
+    cudaMemPoolProps props = {};
+    props.allocType = cudaMemAllocationTypePinned;
+    props.handleTypes = cudaMemHandleTypeNone;
+    props.location.type = cudaMemLocationTypeDevice;
+    cudaGetDevice(&props.location.id);
+    CUDACHECK(cudaMemPoolCreate(&pool->memPool, &props));
+
+    pool->hbits = hbits = 4;
+    pool->table = (struct ncclShadowObject**)malloc(sizeof(struct ncclShadowObject*)<<hbits);
+    for (int i=0; i < 1<<hbits; i++) pool->table[i] = nullptr;
+  }
+
+  // Check for hash table size increase before inserting. Maintain 2:1 object:bucket ratio.
+  if (pool->count+1 > 2<<hbits) {
+    struct ncclShadowObject** table0 = pool->table;
+    struct ncclShadowObject** table1 = (struct ncclShadowObject**)malloc(sizeof(struct ncclShadowObject*)<<(hbits+1));
+    pool->table = table1;
+    pool->hbits = hbits+1;
+    for (int i1=0; i1 < 2<<hbits; i1++) table1[i1] = nullptr;
+    for (int i0=0; i0 < 1<<hbits; i0++) {
+      struct ncclShadowObject* obj = table0[i0];
+      while (obj) {
+        struct ncclShadowObject* next = obj->next;
+        hashInsert(pool, obj);
+        obj = next;
+      }
+    }
+    hbits += 1; // match pool->hbits
+    free(table0);
+  }
+
+  struct ncclShadowPage* page;
+  void *devObj;
+  if ((64<<10)/size >= 3) {
+    int shift = std::max<int>(0, (int)log2Down(size) + 1 - 4);
+    int pageObjSize = ((size + (1<<shift)-1)>>shift)<<shift;
+    struct ncclShadowPage** pagePtr = &pool->pages;
+    while (true) {
+      page = *pagePtr;
+      if (page == nullptr) {
+        size_t pageSize = std::min<size_t>(64<<10, 64*pageObjSize);
+        page = (struct ncclShadowPage*)malloc(sizeof(struct ncclShadowPage));
+        page->objSize = pageObjSize;
+        page->freeMask = uint64_t(-1)>>(64 - pageSize/pageObjSize);
+        page->next = pool->pages;
+        pool->pages = page;
+        CUDACHECK(cudaMallocFromPoolAsync(&page->devObjs, pageSize, pool->memPool, stream));
+        CUDACHECK(cudaMemsetAsync(page->devObjs, 0, pageSize, stream));
+        // fall through...
+      }
+      if (page->objSize == pageObjSize) {
+        int slot = popFirstOneBit(&page->freeMask);
+        devObj = (char*)page->devObjs + slot*pageObjSize;
+        if (page->freeMask == 0) *pagePtr = page->next; // Remove full page from list.
+        break;
+      }
+      pagePtr = &page->next;
+    }
+  } else {
+    page = nullptr;
+    CUDACHECK(cudaMallocFromPoolAsync(&devObj, size, pool->memPool, stream));
+    CUDACHECK(cudaMemsetAsync(devObj, 0, size, stream));
+  }
+
+  struct ncclShadowObject* obj = (struct ncclShadowObject*)malloc(
+    sizeof(struct ncclShadowObject) + /*padding=*/alignof(max_align_t)-1 + size
+  );
+  obj->page = page;
+  obj->devObj = devObj;
+  obj->hostObj = alignUp((char*)(obj+1), alignof(max_align_t));
+  memset(obj->hostObj, 0, size);
+  hashInsert(pool, obj);
+  pool->count += 1;
+  if (outDevObj) *outDevObj = devObj;
+  if (outHostObj) *outHostObj = obj->hostObj;
+  return ncclSuccess;
+}
+
+ncclResult_t ncclShadowPoolFree(struct ncclShadowPool* pool, void* devObj, cudaStream_t stream) {
+  if (devObj == nullptr) return ncclSuccess;
+
+  int b = hashBucket(pool->hbits, devObj);
+  struct ncclShadowObject** pobj = &pool->table[b];
+  while (true) {
+    if (*pobj == nullptr) {
+      WARN("Device object does not exist in shadow pool.");
+      return ncclInternalError;
+    }
+    if ((*pobj)->devObj == devObj) break;
+    pobj = &(*pobj)->next;
+  }
+  struct ncclShadowObject* obj = *pobj;
+  *pobj = obj->next;
+  if (obj->page != nullptr) {
+    if (obj->page->freeMask == 0) {
+      obj->page->next = pool->pages;
+      pool->pages = obj->page;
+    }
+    int slot = ((char*)obj->devObj - (char*)obj->page->devObjs)/obj->page->objSize;
+    obj->page->freeMask |= uint64_t(1)<<slot;
+  } else {
+    CUDACHECK(cudaFreeAsync(devObj, stream));
+  }
+  free(obj);
+  pool->count -= 1;
+  return ncclSuccess;
+}
+
+ncclResult_t ncclShadowPoolToHost(struct ncclShadowPool* pool, void* devObj, void** hostObj) {
+  if (devObj == nullptr) {
+    *hostObj = nullptr;
+    return ncclSuccess;
+  }
+
+  int b = hashBucket(pool->hbits, devObj);
+  struct ncclShadowObject* obj = pool->table[b];
+  while (true) {
+    if (obj == nullptr) {
+      WARN("Device object does not exist in shadow pool.");
+      return ncclInternalError;
+    }
+    if (obj->devObj == devObj) break;
+    obj = obj->next;
+  }
+  *hostObj = obj->hostObj;
+  return ncclSuccess;
 }
diff --git a/projects/rccl/src/bootstrap.cc b/projects/rccl/src/bootstrap.cc
index f053372491..7615b9c528 100644
--- a/projects/rccl/src/bootstrap.cc
+++ b/projects/rccl/src/bootstrap.cc
@@ -14,6 +14,7 @@
 #include "proxy.h"
 #include "param.h"
 #include "ras.h"
+#include <mutex>
 
 #define BOOTSTRAP_N_CHECK_ABORT           10000
 #define BOOTSTRAP_TAG_CONNECT             (0x1 << 31)
@@ -85,13 +86,13 @@ struct bootstrapRootArgs {
 static char bootstrapNetIfName[MAX_IF_NAME_SIZE+1];
 static union ncclSocketAddress bootstrapNetIfAddr;
 static int bootstrapNetInitDone = 0;
-pthread_mutex_t bootstrapNetLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex bootstrapNetMutex;
 
 NCCL_PARAM(BootstrapNetEnable,"OOB_NET_ENABLE", 0);
 
 ncclResult_t bootstrapNetInit() {
   if (bootstrapNetInitDone == 0) {
-    pthread_mutex_lock(&bootstrapNetLock);
+    std::lock_guard<std::mutex> lock(bootstrapNetMutex);
     if (bootstrapNetInitDone == 0) {
       const char* env = ncclGetEnv("NCCL_COMM_ID");
       int nIfs = 0;
@@ -99,21 +100,18 @@ ncclResult_t bootstrapNetInit() {
         union ncclSocketAddress remoteAddr;
         if (ncclSocketGetAddrFromString(&remoteAddr, env) != ncclSuccess) {
           WARN("Invalid NCCL_COMM_ID, please use format: <ipv4>:<port> or [<ipv6>]:<port> or <hostname>:<port>");
-          pthread_mutex_unlock(&bootstrapNetLock);
           return ncclInvalidArgument;
         }
         NCCLCHECK(ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE,
                                                &nIfs));
         if (nIfs <= 0) {
           WARN("NET/Socket : No usable listening interface found");
-          pthread_mutex_unlock(&bootstrapNetLock);
           return ncclSystemError;
         }
       } else {
         NCCLCHECK(ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1, &nIfs));
         if (nIfs <= 0) {
           WARN("Bootstrap : no socket interface found");
-          pthread_mutex_unlock(&bootstrapNetLock);
           return ncclInvalidUsage;
         }
       }
@@ -123,7 +121,6 @@ ncclResult_t bootstrapNetInit() {
       INFO(NCCL_BOOTSTRAP, "Bootstrap: Using%s", line);
       bootstrapNetInitDone = 1;
     }
-    pthread_mutex_unlock(&bootstrapNetLock);
   }
   return ncclSuccess;
 }
@@ -485,7 +482,7 @@ static ncclResult_t getUDS(uint64_t* peerUDS) {
 static ncclResult_t netGetDevice(int rank, struct ncclComm* comm, int* dev) {
   static int devOOB = -1;
   if (devOOB < 0) {
-    pthread_mutex_lock(&bootstrapNetLock);
+    std::lock_guard<std::mutex> lock(bootstrapNetMutex);
     if (devOOB < 0) {
       const char* userIfEnv = ncclGetEnv("NCCL_OOB_NET_IFNAME");
       if (userIfEnv && strlen(userIfEnv) > 0) {
@@ -516,7 +513,6 @@ static ncclResult_t netGetDevice(int rank, struct ncclComm* comm, int* dev) {
             WARN("no device found matching %s%s, verify NCCL_OOB_NET_IFNAME", searchExact ? "exactly " : "", userIfEnv);
           else
             WARN("no device found after excluding %s%s, verify NCCL_OOB_NET_IFNAME", searchExact ? "exactly " : "", userIfEnv);
-          pthread_mutex_unlock(&bootstrapNetLock);
           return ncclInvalidArgument;
         }
       } else {
@@ -529,13 +525,12 @@ static ncclResult_t netGetDevice(int rank, struct ncclComm* comm, int* dev) {
       bool hasProp = res == ncclSuccess;
       INFO(NCCL_BOOTSTRAP, "Bootstrap: Using %s:%d", (hasProp) ? props.name : "N/A", (hasProp) ? props.port : -1);
     }
-    pthread_mutex_unlock(&bootstrapNetLock);
   }
   *dev = devOOB;
   return ncclSuccess;
 }
 
-static ncclResult_t netRingConnect(ncclNet_t* net, struct bootstrapListen_t* listen, char peerHandle[NCCL_NET_HANDLE_MAXSIZE],
+static ncclResult_t netRingConnect(void* ctx, ncclNet_t* net, struct bootstrapListen_t* listen, char peerHandle[NCCL_NET_HANDLE_MAXSIZE],
                                    void** sendComm, ncclNetDeviceHandle_t** sendDevHandle,
                                    void** recvComm, ncclNetDeviceHandle_t** recvDevHandle, volatile uint32_t* abortFlag) {
 
@@ -543,7 +538,7 @@ static ncclResult_t netRingConnect(ncclNet_t* net, struct bootstrapListen_t* lis
   do {
     NCCLCHECK(checkAbort(abortFlag, &abortCounter));
     if (!*sendComm)
-      NCCLCHECK(net->connect(listen->net.dev, NULL, peerHandle, sendComm, sendDevHandle));
+      NCCLCHECK(net->connect(ctx, listen->net.dev, peerHandle, sendComm, sendDevHandle));
     if (!*recvComm)
       NCCLCHECK(net->accept(listen->net.comm, recvComm, recvDevHandle));
   } while (!*sendComm || !*recvComm);
@@ -655,7 +650,7 @@ ncclResult_t bootstrapInit(int nHandles, void* handles, struct ncclComm* comm) {
   if (ncclParamBootstrapNetEnable()) {
     // Create net interface for other ranks to contact me (all gather)
     NCCLCHECK(netGetDevice(rank, comm, &STATE_LISTEN(state, net.dev)));
-    NCCLCHECK(state->net->listen(STATE_LISTEN(state, net.dev), STATE_LISTEN(state, net.handle), &STATE_LISTEN(state, net.comm)));
+    NCCLCHECK(state->net->listen(comm->netContext, STATE_LISTEN(state, net.dev), STATE_LISTEN(state, net.handle), &STATE_LISTEN(state, net.comm)));
     memcpy(info.connectInfo.handle, STATE_LISTEN(state, net.handle), NCCL_NET_HANDLE_MAXSIZE);
   } else {
     // create socket for ring neightbor to contact mee
@@ -709,7 +704,7 @@ ncclResult_t bootstrapInit(int nHandles, void* handles, struct ncclComm* comm) {
 
   // accept and connect the ring network
   if (ncclParamBootstrapNetEnable()) {
-    NCCLCHECK(netRingConnect(state->net, &state->listen, nextPeer.handle,
+    NCCLCHECK(netRingConnect(comm->netContext, state->net, &state->listen, nextPeer.handle,
                              &STATE_RING(state, net.sendComm), &STATE_RING(state, net.sendDevHandle),
                              &STATE_RING(state, net.recvComm), &STATE_RING(state, net.recvDevHandle), state->abortFlag));
   } else {
@@ -802,7 +797,7 @@ ncclResult_t bootstrapSplit(uint64_t magic, struct ncclComm* comm, struct ncclCo
   // create a handle for the others to reach out to me
   if (ncclParamBootstrapNetEnable()) {
     NCCLCHECKGOTO(netGetDevice(rank, comm, &STATE_LISTEN(state, net.dev)), ret, fail);
-    NCCLCHECKGOTO(state->net->listen(STATE_LISTEN(state, net.dev), STATE_LISTEN(state, net.handle), &STATE_LISTEN(state, net.comm)), ret, fail);
+    NCCLCHECKGOTO(state->net->listen(comm->netContext, STATE_LISTEN(state, net.dev), STATE_LISTEN(state, net.handle), &STATE_LISTEN(state, net.comm)), ret, fail);
     memcpy(info.handle, STATE_LISTEN(state, net.handle), NCCL_NET_HANDLE_MAXSIZE);
   } else {
     // create socket for ring neightbor to contact mee
@@ -821,7 +816,7 @@ ncclResult_t bootstrapSplit(uint64_t magic, struct ncclComm* comm, struct ncclCo
   NCCLCHECKGOTO(bootstrapSend(parent->bootstrap, prev, BOOTSTRAP_TAG_COMMSPLIT, &info, sizeof(union ringConnectInfo)), ret, fail);
   NCCLCHECKGOTO(bootstrapRecv(parent->bootstrap, next, BOOTSTRAP_TAG_COMMSPLIT, &nextPeer, sizeof(union ringConnectInfo)), ret, fail);
   if (ncclParamBootstrapNetEnable()) {
-    NCCLCHECKGOTO(netRingConnect(state->net, &state->listen, nextPeer.handle,
+    NCCLCHECKGOTO(netRingConnect(comm->netContext, state->net, &state->listen, nextPeer.handle,
                                  &STATE_RING(state, net.sendComm), &STATE_RING(state, net.sendDevHandle),
                                  &STATE_RING(state, net.recvComm), &STATE_RING(state, net.recvDevHandle), state->abortFlag),
                   ret, fail);
diff --git a/projects/rccl/src/ce_coll.cc b/projects/rccl/src/ce_coll.cc
new file mode 100644
index 0000000000..3f3dcbd7f6
--- /dev/null
+++ b/projects/rccl/src/ce_coll.cc
@@ -0,0 +1,615 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "comm.h"
+#include "register_inline.h"
+#include <cuda.h>
+#include "cudawrap.h"
+#include "ce_coll.h"
+#include "alloc.h"
+
+// Static constant for graph synchronization
+static const uint32_t GRAPH_SYNC_VALUE = 1;
+
+// Static constants for intra-batch synchronization to improve CE collective performance at large scale
+// Frequency of intra-batch synchronization
+static const uint32_t CE_COLL_INTRA_BATCH_SYNC_FREQ = 8;
+// Message threshold for intra-batch synchronization
+static const uint64_t CE_COLL_INTRA_BATCH_SYNC_MSG_THRESHOLD = 512*1024*1024;
+
+ncclResult_t ncclCeInit(struct ncclComm* comm) {
+  ncclResult_t ret = ncclSuccess;
+
+  uint8_t* ceDevBase;
+  size_t ceDevBaseSize = alignUp(comm->nRanks*sizeof(uint32_t), 16) * 2;
+  ncclWindow_vidmem* ceWinDev;
+  ncclWindow_vidmem* ceWinDevHost;
+
+  // Ensure symmetric memory runtime is initialized
+  NCCLCHECKGOTO(ncclDevrInitOnce(comm), ret, fail);
+  // Allocate the sync buffer and register it as a symmetric window
+  NCCLCHECKGOTO(ncclMemAlloc((void**)&ceDevBase, ceDevBaseSize), ret, fail);
+  NCCLCHECKGOTO(ncclDevrWindowRegisterInGroup(comm, ceDevBase, ceDevBaseSize, NCCL_WIN_COLL_SYMMETRIC, &ceWinDev), ret, fail);
+  NCCLCHECKGOTO(ncclShadowPoolToHost(&comm->devrState.shadows, ceWinDev, &ceWinDevHost), ret, fail);
+  // Get the ncclDevrWindow from the winHost field
+  comm->ceColl.ceSyncWin = (struct ncclDevrWindow*)ceWinDevHost->winHost;
+
+  comm->ceColl.baseUCSymReadyOffset = 0;
+  comm->ceColl.baseUCSymComplOffset = alignUp(comm->nRanks*sizeof(uint32_t), 16);
+  comm->ceColl.baseUCSymReadyPtr = (uint8_t*)comm->ceColl.ceSyncWin->userPtr + comm->ceColl.baseUCSymReadyOffset;
+  comm->ceColl.baseUCSymComplPtr = (uint8_t*)comm->ceColl.ceSyncWin->userPtr + comm->ceColl.baseUCSymComplOffset;
+  comm->ceColl.ceSeqNum = 0;
+  comm->ceColl.useCompletePtr = false;
+  comm->ceColl.intraBatchSyncFreq = CE_COLL_INTRA_BATCH_SYNC_FREQ;
+  comm->ceColl.intraBatchSyncMsgThreshold = CE_COLL_INTRA_BATCH_SYNC_MSG_THRESHOLD;
+  INFO(NCCL_INIT, "Init CE, rank %d baseUCSymReadyPtr %p, baseUCSymComplPtr %p, seq num %d", comm->rank, comm->ceColl.baseUCSymReadyPtr, comm->ceColl.baseUCSymComplPtr, comm->ceColl.ceSeqNum);
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclCeFinalize(struct ncclComm* comm) {
+  ncclResult_t ret = ncclSuccess;
+  
+  // Clean up ceInitTaskQueue
+  while (!ncclIntruQueueEmpty(&comm->ceInitTaskQueue)) {
+    struct ncclCeInitTask* task = ncclIntruQueueDequeue(&comm->ceInitTaskQueue);
+    free(task);
+  }
+  
+  // Clean up CE resources
+  if (comm->ceColl.baseUCSymReadyPtr != NULL) {
+    if (comm->ceColl.ceSyncWin && comm->ceColl.ceSyncWin->vidmem) {
+      NCCLCHECKGOTO(ncclCommWindowDeregister(comm, comm->ceColl.ceSyncWin->vidmem), ret, fail);
+      NCCLCHECKGOTO(ncclMemFree(comm->ceColl.baseUCSymReadyPtr), ret, fail);
+    }
+    comm->ceColl.baseUCSymReadyPtr = NULL;
+    comm->ceColl.baseUCSymComplPtr = NULL;
+    comm->ceColl.ceSyncWin = NULL;
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+bool ncclCeImplemented(ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty) {
+  int driverVersion;
+  if (ncclCudaDriverVersion(&driverVersion) != ncclSuccess) return false;
+
+  // CE is supported in CUDA 12.5 and later
+  if (driverVersion >= 12050) {
+    switch (coll) {
+    case ncclFuncAllGather:
+    case ncclFuncAlltoAll:
+    case ncclFuncScatter:
+    case ncclFuncGather:
+      return true;
+    default:
+      return false;
+    }
+  }
+  return false;
+}
+
+ncclResult_t ncclPrepMCSync(struct ncclComm* comm, bool isComplete, CUstreamBatchMemOpParams* batchParams, size_t* opIdx, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  uint32_t* readyPtrs    = (uint32_t*)comm->ceColl.baseUCSymReadyPtr;
+  uint32_t* completePtrs = (uint32_t*)comm->ceColl.baseUCSymComplPtr;
+
+  bool capturing = ncclCudaGraphValid(comm->planner.capturingGraph);
+  uint32_t currentSeq = ++comm->ceColl.ceSeqNum;
+
+  // Source pointer is either the constant graph sync value or the sequence number
+  void* srcPtr = capturing ? (void*)&GRAPH_SYNC_VALUE : (void*)&currentSeq;
+  // Wait value is either the constant graph sync value or the sequence number
+  uint32_t waitValue = capturing ? GRAPH_SYNC_VALUE : currentSeq;
+
+  // Use the multicast address as the destination pointer
+  void* mcDstPtr;
+  void* dstPtr = isComplete ? (void*)&completePtrs[comm->rank] : (void*)&readyPtrs[comm->rank];
+  size_t offset = (uint8_t*)dstPtr - (uint8_t*)comm->ceColl.ceSyncWin->userPtr;
+  NCCLCHECKGOTO(ncclDevrGetLsaTeamPtrMC(comm, comm->ceColl.ceSyncWin, offset, ncclTeamLsa(comm), &mcDstPtr), ret, fail);
+  
+  // Write our own ready/complete flag to the multicast address
+  CUDACHECKGOTO(cudaMemcpyAsync(
+    mcDstPtr,
+    srcPtr,
+    sizeof(uint32_t),
+    cudaMemcpyHostToDevice,
+    stream), ret, fail);
+
+  // Add local wait operations for every other rank
+  for (int r = 0; r < comm->nRanks; ++r) {
+    if (r == comm->rank) continue;
+    batchParams[*opIdx] = {};
+    batchParams[*opIdx].waitValue.operation = CU_STREAM_MEM_OP_WAIT_VALUE_32;
+    batchParams[*opIdx].waitValue.address = (CUdeviceptr)(isComplete ? (void*)&completePtrs[r] : (void*)&readyPtrs[r]);
+    batchParams[*opIdx].waitValue.value = waitValue;
+    batchParams[*opIdx].waitValue.flags = CU_STREAM_WAIT_VALUE_EQ;
+    (*opIdx)++;
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclPrepUCSync(struct ncclComm* comm, bool isComplete,
+                               CUstreamBatchMemOpParams* batchParams,
+                               size_t* opIdx) {
+  ncclResult_t ret = ncclSuccess;
+
+  uint32_t* readyPtrs    = (uint32_t*)comm->ceColl.baseUCSymReadyPtr;
+  uint32_t* completePtrs = (uint32_t*)comm->ceColl.baseUCSymComplPtr;
+
+  bool capturing = ncclCudaGraphValid(comm->planner.capturingGraph);
+  uint32_t currentSeq = ++comm->ceColl.ceSeqNum;
+
+  // Write our own ready/complete flag to remote ranks
+  uint32_t waitValue = capturing ? GRAPH_SYNC_VALUE : currentSeq;
+  for (int r = 0; r < comm->nRanks; ++r) {
+    if (r == comm->rank) continue;
+    void * peerDstPtr;
+    void* dstPtr = isComplete ? (void*)&completePtrs[comm->rank] : (void*)&readyPtrs[comm->rank];
+    size_t offset = (uint8_t*)dstPtr - (uint8_t*)comm->ceColl.ceSyncWin->userPtr;
+    NCCLCHECKGOTO(ncclDevrGetLsaRankPtr(comm, comm->ceColl.ceSyncWin, offset, r, &peerDstPtr), ret, fail);
+    batchParams[*opIdx] = {};
+    batchParams[*opIdx].writeValue.operation = CU_STREAM_MEM_OP_WRITE_VALUE_32;
+    batchParams[*opIdx].writeValue.address  = (CUdeviceptr)peerDstPtr;
+    batchParams[*opIdx].writeValue.value = waitValue;
+    batchParams[*opIdx].writeValue.flags = CU_STREAM_WRITE_VALUE_DEFAULT;
+    (*opIdx)++;
+  }
+
+  // Add local wait operations for every other rank
+  for (int r = 0; r < comm->nRanks; ++r) {
+    if (r == comm->rank) continue;
+    batchParams[*opIdx] = {};
+    batchParams[*opIdx].waitValue.operation = CU_STREAM_MEM_OP_WAIT_VALUE_32;
+    batchParams[*opIdx].waitValue.address  = (CUdeviceptr)(isComplete ? (void*)&completePtrs[r] : (void*)&readyPtrs[r]);
+    batchParams[*opIdx].waitValue.value = waitValue;
+    batchParams[*opIdx].waitValue.flags = CU_STREAM_WAIT_VALUE_EQ;
+    (*opIdx)++;
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+
+ncclResult_t ncclMemOpSync(struct ncclComm* comm, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  // Get pointers to the ready and complete synchronization arrays
+  uint32_t* readyPtrs = (uint32_t*)comm->ceColl.baseUCSymReadyPtr;
+  uint32_t* completePtrs = (uint32_t*)comm->ceColl.baseUCSymComplPtr;
+  
+  // Allocate enough slots for all possible ops
+  size_t batchSize = (comm->nvlsSupport ? NCCL_CE_SYNC_OPS_PER_RANK_MC : NCCL_CE_SYNC_OPS_PER_RANK_UC) * comm->nRanks;
+  size_t opIdx = 0;
+
+  // Prepare batch memory operations for synchronization
+  CUstreamBatchMemOpParams* batchParams = nullptr;
+  NCCLCHECKGOTO(ncclCalloc(&batchParams, batchSize), ret, fail);
+
+  if (comm->nvlsSupport) {
+    NCCLCHECKGOTO(ncclPrepMCSync(comm, comm->ceColl.useCompletePtr, batchParams, &opIdx, stream), ret, fail);
+  } else {
+    NCCLCHECKGOTO(ncclPrepUCSync(comm, comm->ceColl.useCompletePtr, batchParams, &opIdx), ret, fail);
+  }
+
+  // For CUDA graph capture, add reset operation
+  if (ncclCudaGraphValid(comm->planner.capturingGraph)) {
+    for (int i = 0; i < comm->nRanks; i++) {
+      batchParams[opIdx] = {};
+      batchParams[opIdx].writeValue.operation = CU_STREAM_MEM_OP_WRITE_VALUE_32;
+      batchParams[opIdx].writeValue.address = (CUdeviceptr)(comm->ceColl.useCompletePtr ? (void*)&completePtrs[i] : (void*)&readyPtrs[i]);
+      batchParams[opIdx].writeValue.value = 0;
+      batchParams[opIdx].writeValue.flags = CU_STREAM_WRITE_VALUE_DEFAULT;
+      opIdx++;
+    }
+  }
+
+  // Execute all memory operations in a single batch
+  CUCHECKGOTO(cuStreamBatchMemOp(stream, opIdx, batchParams, 0), ret, fail);
+
+  // Toggle the flag for next call
+  comm->ceColl.useCompletePtr = !comm->ceColl.useCompletePtr;
+
+exit:
+  if (batchParams) free(batchParams);
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclCeInitBatchOpsParams(struct ncclCeBatchOpsParams* params, int nRanks) {
+  ncclResult_t ret = ncclSuccess;
+
+  params->srcs = nullptr;
+  params->dsts = nullptr;
+  params->sizes = nullptr;
+  params->numOps = 0;
+  params->intraBatchSync = false;
+#if CUDART_VERSION >= 12080
+  params->attrs = nullptr;
+  params->attrIdxs = nullptr;
+  params->numAttrs = 0;
+#endif
+
+  NCCLCHECKGOTO(ncclCalloc(&params->srcs, nRanks), ret, fail);
+  NCCLCHECKGOTO(ncclCalloc(&params->dsts, nRanks), ret, fail);
+  NCCLCHECKGOTO(ncclCalloc(&params->sizes, nRanks), ret, fail);
+#if CUDART_VERSION >= 12080
+  NCCLCHECKGOTO(ncclCalloc(&params->attrs, nRanks), ret, fail);
+  NCCLCHECKGOTO(ncclCalloc(&params->attrIdxs, nRanks), ret, fail);
+#endif
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+void ncclCeFreeBatchOpsParams(struct ncclCeBatchOpsParams* params) {
+  if (params->srcs) free(params->srcs);
+  if (params->dsts) free(params->dsts);
+  if (params->sizes) free(params->sizes);
+#if CUDART_VERSION >= 12080
+  if (params->attrs) free(params->attrs);
+  if (params->attrIdxs) free(params->attrIdxs);
+#endif
+}
+
+ncclResult_t ncclCeLaunchBatchOps(struct ncclComm* comm, struct ncclCeBatchOpsParams* params, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  // Check if there are any operations to perform
+  if (params->numOps == 0) {
+    return ncclSuccess;
+  }
+
+  // Check if we are in a CUDA graph capture
+  bool capturing = ncclCudaGraphValid(comm->planner.capturingGraph);
+
+  int driverVersion;
+  NCCLCHECKGOTO(ncclCudaDriverVersion(&driverVersion), ret, fail);
+
+  //--------------Graph capture--------------
+  // cudaMemcpyBatchAsync is not supported during CUDA graph capture
+  if (capturing) {
+    for (int i = 0; i < params->numOps; i++) {
+      CUDACHECKGOTO(cudaMemcpyAsync(
+        (void*)params->dsts[i],
+        (void*)params->srcs[i],
+        params->sizes[i],
+        cudaMemcpyDeviceToDevice,
+        stream), ret, fail);
+
+      if (params->intraBatchSync && ((i+1) % comm->ceColl.intraBatchSyncFreq == 0) && ((i+1) < params->numOps)) {
+        NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+      }
+    }
+  }
+  //--------------No graph capture--------------
+  else {
+    if (CUDART_VERSION >= 12080 && driverVersion >= 12080) {
+#if CUDART_VERSION >= 12080
+    // For CUDA 12.8+, use batch memory copy for better performance
+    params->attrs[0] = {};
+    params->attrs[0].srcAccessOrder = cudaMemcpySrcAccessOrderStream;
+    params->attrs[0].flags = cudaMemcpyFlagPreferOverlapWithCompute;
+    params->attrIdxs[0] = 0;
+    params->numAttrs = 1;
+
+    if (params->intraBatchSync) {
+      // Break into multiple batches with sync between them
+      int batchSize = comm->ceColl.intraBatchSyncFreq;
+      for (int i = 0; i < params->numOps; i += batchSize) {
+        int currentBatchSize = (i + batchSize <= params->numOps) ? batchSize : params->numOps - i;
+
+        #if CUDART_VERSION >= 13000
+        CUDACHECKGOTO(cudaMemcpyBatchAsync(
+          &params->dsts[i], &params->srcs[i], &params->sizes[i], currentBatchSize,
+          params->attrs, params->attrIdxs, params->numAttrs, stream), ret, fail);
+        #else
+        CUDACHECKGOTO(cudaMemcpyBatchAsync(
+          &params->dsts[i], &params->srcs[i], &params->sizes[i], currentBatchSize,
+          params->attrs, params->attrIdxs, params->numAttrs, nullptr, stream), ret, fail);
+        #endif
+
+        // Sync after each batch
+        if (i + batchSize < params->numOps) {
+          NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+        }
+      }
+    } else {
+      // Use single batch for all operations
+      #if CUDART_VERSION >= 13000
+      CUDACHECKGOTO(cudaMemcpyBatchAsync(
+        params->dsts, params->srcs, params->sizes, params->numOps,
+        params->attrs, params->attrIdxs, params->numAttrs, stream), ret, fail);
+      #else
+      CUDACHECKGOTO(cudaMemcpyBatchAsync(
+        params->dsts, params->srcs, params->sizes, params->numOps,
+        params->attrs, params->attrIdxs, params->numAttrs, nullptr, stream), ret, fail);
+      #endif
+    }
+#endif
+    } else {
+      // For older CUDA versions, fall back to individual transfers
+      for (int i = 0; i < params->numOps; i++) {
+        CUDACHECKGOTO(cudaMemcpyAsync(
+          (void*)params->dsts[i],
+          (void*)params->srcs[i],
+          params->sizes[i],
+          cudaMemcpyDeviceToDevice,
+          stream), ret, fail);
+
+        if (params->intraBatchSync && ((i+1) % comm->ceColl.intraBatchSyncFreq == 0) && ((i+1) < params->numOps)) {
+          NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+        }
+      }
+    }
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+
+ncclResult_t ncclCeAllGather(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  // Calculate the size of each rank's data chunk
+  const size_t chunkBytes = args->nElts * args->eltSize;
+  uint8_t* mySendBuff = (uint8_t*)args->sendBuff;
+  uint8_t* myRecvBuff = (uint8_t*)args->recvBuff + comm->rank * chunkBytes;
+  void* peerRecvBuff;
+  size_t offset;
+
+  struct ncclCeBatchOpsParams batchOpsParams = {};
+  NCCLCHECKGOTO(ncclCeInitBatchOpsParams(&batchOpsParams, comm->nRanks), ret, fail);
+
+  // Ensure all ranks are ready before starting transfers
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+  // Copy own data to receive buffer if operation is out-of-place
+  if (myRecvBuff != mySendBuff) {
+    batchOpsParams.srcs[batchOpsParams.numOps] = (void*)mySendBuff;
+    batchOpsParams.dsts[batchOpsParams.numOps] = (void*)myRecvBuff;
+    batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+    batchOpsParams.numOps++;
+  }
+
+  // Copy data to other ranks
+  for (int r = 1; r < comm->nRanks; r++) {
+    int targetRank = (comm->rank + r) % comm->nRanks;
+    offset = myRecvBuff - (uint8_t*)args->recvWin->userPtr;
+    NCCLCHECKGOTO(ncclDevrGetLsaRankPtr(comm, args->recvWin, offset, targetRank, &peerRecvBuff), ret, fail);
+    batchOpsParams.srcs[batchOpsParams.numOps] = (void*)mySendBuff;
+    batchOpsParams.dsts[batchOpsParams.numOps] = (void*)peerRecvBuff;
+    batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+    batchOpsParams.numOps++;
+  }
+
+  // Check if we need to perform intra-batch synchronization
+  batchOpsParams.intraBatchSync = (batchOpsParams.numOps > comm->ceColl.intraBatchSyncFreq && chunkBytes*batchOpsParams.numOps >= comm->ceColl.intraBatchSyncMsgThreshold);
+
+  // Launch the batch operations
+  NCCLCHECKGOTO(ncclCeLaunchBatchOps(comm, &batchOpsParams, stream), ret, fail);
+
+  // Ensure all transfers are complete across all ranks
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+exit:
+  ncclCeFreeBatchOpsParams(&batchOpsParams);
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclCeAlltoAll(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  // Calculate the size of data each rank sends to every other rank
+  const size_t chunkBytes = args->nElts * args->eltSize;
+  uint8_t* mySendBuff = (uint8_t*)args->sendBuff;
+  uint8_t* myRecvBuff = (uint8_t*)args->recvBuff;
+  void* peerRecvBuff;
+  size_t offset;
+
+  struct ncclCeBatchOpsParams batchOpsParams = {};
+  NCCLCHECKGOTO(ncclCeInitBatchOpsParams(&batchOpsParams, comm->nRanks * comm->nRanks), ret, fail);
+
+  // Ensure all ranks are ready before starting transfers
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+  // Copy data to other ranks: send data chunk for each destination rank
+  for (int r = 0; r < comm->nRanks; r++) {
+    int dstRank = (comm->rank + r) % comm->nRanks;
+    uint8_t* srcPtr = mySendBuff + dstRank * chunkBytes;
+    uint8_t* dstPtr = myRecvBuff + comm->rank * chunkBytes;
+
+    if (dstRank == comm->rank) {
+      // Local copy for own data
+      batchOpsParams.srcs[batchOpsParams.numOps] = (void*)srcPtr;
+      batchOpsParams.dsts[batchOpsParams.numOps] = (void*)dstPtr;
+      batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+      batchOpsParams.numOps++;
+    } else {
+      // Remote copy to other ranks: send to rank dstRank's receive buffer at position comm->rank
+      offset = dstPtr - (uint8_t*)args->recvWin->userPtr;
+      NCCLCHECKGOTO(ncclDevrGetLsaRankPtr(comm, args->recvWin, offset, dstRank, &peerRecvBuff), ret, fail);
+      batchOpsParams.srcs[batchOpsParams.numOps] = (void*)srcPtr;
+      batchOpsParams.dsts[batchOpsParams.numOps] = (void*)peerRecvBuff;
+      batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+      batchOpsParams.numOps++;
+    }
+  }
+
+  // Check if we need to perform intra-batch synchronization
+  batchOpsParams.intraBatchSync = (batchOpsParams.numOps > comm->ceColl.intraBatchSyncFreq && chunkBytes*batchOpsParams.numOps >= comm->ceColl.intraBatchSyncMsgThreshold);
+
+  // Launch the batch operations
+  NCCLCHECKGOTO(ncclCeLaunchBatchOps(comm, &batchOpsParams, stream), ret, fail);
+
+  // Ensure all transfers are complete across all ranks
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+exit:
+  ncclCeFreeBatchOpsParams(&batchOpsParams);
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclCeScatter(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  // Calculate the size of data root sends to each rank
+  const size_t chunkBytes = args->nElts * args->eltSize;
+  uint8_t* mySendBuff = (uint8_t*)args->sendBuff;
+  uint8_t* myRecvBuff = (uint8_t*)args->recvBuff;
+  int rootRank = args->rootRank;
+  void* peerDstPtr;
+  size_t offset;
+
+  struct ncclCeBatchOpsParams batchOpsParams = {};
+  NCCLCHECKGOTO(ncclCeInitBatchOpsParams(&batchOpsParams, comm->nRanks), ret, fail);
+
+  // Ensure all ranks are ready before starting transfers
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+  if (comm->rank == rootRank) {
+    // Check if this is an in-place scatter operation
+    bool isInPlace = (myRecvBuff == mySendBuff + comm->rank * chunkBytes);
+
+    // Copy root's own data first if not in-place
+    if (!isInPlace) {
+      uint8_t* srcPtr = mySendBuff + comm->rank * chunkBytes;
+      uint8_t* dstPtr = myRecvBuff;
+      batchOpsParams.srcs[batchOpsParams.numOps] = (void*)srcPtr;
+      batchOpsParams.dsts[batchOpsParams.numOps] = (void*)dstPtr;
+      batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+      batchOpsParams.numOps++;
+    }
+
+    // Root rank distributes data to other ranks
+    for (int r = 1; r < comm->nRanks; r++) {
+      int dstRank = (comm->rank + r) % comm->nRanks;
+      uint8_t* srcPtr = mySendBuff + dstRank * chunkBytes;
+      uint8_t* dstPtr = isInPlace ? myRecvBuff + dstRank * chunkBytes : myRecvBuff;
+
+      offset = dstPtr - (uint8_t*)args->recvWin->userPtr;
+      NCCLCHECKGOTO(ncclDevrGetLsaRankPtr(comm, args->recvWin, offset, dstRank, &peerDstPtr), ret, fail);
+      batchOpsParams.srcs[batchOpsParams.numOps] = (void*)srcPtr;
+      batchOpsParams.dsts[batchOpsParams.numOps] = (void*)peerDstPtr;
+      batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+      batchOpsParams.numOps++;
+    }
+  }
+  // Non-root ranks don't need to perform any copy operations
+
+  // Launch the batch operations
+  NCCLCHECKGOTO(ncclCeLaunchBatchOps(comm, &batchOpsParams, stream), ret, fail);
+
+  // Ensure all transfers are complete across all ranks
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+exit:
+  ncclCeFreeBatchOpsParams(&batchOpsParams);
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclCeGather(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+
+  // Calculate the size of data each rank sends to root
+  const size_t chunkBytes = args->nElts * args->eltSize;
+  uint8_t* mySendBuff = (uint8_t*)args->sendBuff;
+  uint8_t* myRecvBuff = (uint8_t*)args->recvBuff;
+  int rootRank = args->rootRank;
+  void* peerRecvBuff;
+  size_t offset;
+
+  struct ncclCeBatchOpsParams batchOpsParams = {};
+  NCCLCHECKGOTO(ncclCeInitBatchOpsParams(&batchOpsParams, 1), ret, fail);
+
+  // Ensure all ranks are ready before starting transfers
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+  if (comm->rank == rootRank) {
+    // Root rank copies its own data to the correct position in receive buffer
+    uint8_t* dstPtr = myRecvBuff + comm->rank * chunkBytes;
+    if (mySendBuff != dstPtr) {
+      batchOpsParams.srcs[batchOpsParams.numOps] = (void*)mySendBuff;
+      batchOpsParams.dsts[batchOpsParams.numOps] = (void*)dstPtr;
+      batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+      batchOpsParams.numOps++;
+    }
+  } else {
+    // Non-root ranks send their data to root's receive buffer
+    uint8_t* rootRecvPtr = (uint8_t*)args->recvBuff + comm->rank * chunkBytes;
+    offset = rootRecvPtr - (uint8_t*)args->recvWin->userPtr;
+    NCCLCHECKGOTO(ncclDevrGetLsaRankPtr(comm, args->recvWin, offset, rootRank, &peerRecvBuff), ret, fail);
+    batchOpsParams.srcs[batchOpsParams.numOps] = (void*)mySendBuff;
+    batchOpsParams.dsts[batchOpsParams.numOps] = (void*)peerRecvBuff;
+    batchOpsParams.sizes[batchOpsParams.numOps] = chunkBytes;
+    batchOpsParams.numOps++;
+  }
+
+  // Launch the batch operations
+  NCCLCHECKGOTO(ncclCeLaunchBatchOps(comm, &batchOpsParams, stream), ret, fail);
+
+  // Ensure all transfers are complete across all ranks
+  NCCLCHECKGOTO(ncclMemOpSync(comm, stream), ret, fail);
+
+exit:
+  ncclCeFreeBatchOpsParams(&batchOpsParams);
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclLaunchCeColl(struct ncclComm* comm, struct ncclKernelPlan* plan) {
+  ncclResult_t ret = ncclSuccess;
+  cudaStream_t stream = comm->planner.streams->stream;
+  struct ncclCeCollArgs* args = plan->ceCollArgs;
+
+  switch (args->func) {
+    case ncclFuncAllGather:
+      NCCLCHECKGOTO(ncclCeAllGather(comm, args, stream), ret, fail);
+      break;
+    case ncclFuncAlltoAll:
+      NCCLCHECKGOTO(ncclCeAlltoAll(comm, args, stream), ret, fail);
+      break;
+    case ncclFuncScatter:
+      NCCLCHECKGOTO(ncclCeScatter(comm, args, stream), ret, fail);
+      break;
+    case ncclFuncGather:
+      NCCLCHECKGOTO(ncclCeGather(comm, args, stream), ret, fail);
+      break;
+    default:
+      ret = ncclInvalidUsage;
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
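
Note on the copy plan in ncclCeAlltoAll above: each rank walks its peers
starting from itself, reads chunk dstRank from its own send buffer, and writes
it to slot comm->rank of dstRank's receive buffer. The host-only sketch below
models that indexing with plain vectors to show that it produces the expected
transpose. It is an illustration under stated assumptions, not NCCL code:
simulateAlltoAll and the one-int-per-chunk layout are hypothetical, and no CUDA
or symmetric-window calls are involved.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-only model of the AlltoAll copy plan.
// send[s][d] is the chunk that rank s holds for rank d (one int per chunk).
// The returned recv[d][s] is the chunk rank d received from rank s.
static std::vector<std::vector<int>> simulateAlltoAll(
    const std::vector<std::vector<int>>& send) {
  const int nRanks = static_cast<int>(send.size());
  std::vector<std::vector<int>> recv(nRanks, std::vector<int>(nRanks, -1));
  for (int me = 0; me < nRanks; me++) {
    // Same rotation as the patch: start with the local copy (r == 0),
    // then visit peers me+1, me+2, ... modulo nRanks.
    for (int r = 0; r < nRanks; r++) {
      const int dstRank = (me + r) % nRanks;
      // Source: chunk dstRank of my send buffer.
      // Destination: slot `me` of dstRank's receive buffer.
      recv[dstRank][me] = send[me][dstRank];
    }
  }
  return recv;
}
```

Starting each rank's walk at a different peer spreads the writes across
destinations instead of having every rank target rank 0 first.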
diff --git a/projects/rccl/src/collectives.cc b/projects/rccl/src/collectives.cc
index 03122f8a73..ca69c9a785 100644
--- a/projects/rccl/src/collectives.cc
+++ b/projects/rccl/src/collectives.cc
@@ -14,10 +14,13 @@ const char* ncclFuncToString(ncclFunc_t fn) {
   switch (fn) {
   case ncclFuncAllGather: return "AllGather";
   case ncclFuncAllReduce: return "AllReduce";
+  case ncclFuncAlltoAll: return "AlltoAll";
   case ncclFuncBroadcast: return "Broadcast";
+  case ncclFuncGather: return "Gather";
   case ncclFuncRecv: return "Recv";
   case ncclFuncReduce: return "Reduce";
   case ncclFuncReduceScatter: return "ReduceScatter";
+  case ncclFuncScatter: return "Scatter";
   case ncclFuncSendRecv: return "SendRecv";
   case ncclFuncSend: return "Send";
   default: return "Invalid";
@@ -88,6 +91,19 @@ ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcoun
   return ncclEnqueueCheck(&info);
 }
 
+NCCL_API(ncclResult_t, ncclAlltoAll, const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, ncclComm* comm, cudaStream_t stream);
+ncclResult_t ncclAlltoAll(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, ncclComm* comm, cudaStream_t stream) {
+  NVTX3_FUNC_WITH_PARAMS(AlltoAll, NcclNvtxParamsAlltoAll,
+    NVTX3_PAYLOAD(comm ? comm->commHash : 0, count * ncclTypeSize(datatype)));
+
+  struct ncclInfo info = { ncclFuncAlltoAll, "AlltoAll",
+    sendbuff, recvbuff, count, datatype, ncclSum, 0, comm, stream, /* Args */
+    ALLTOALL_CHUNKSTEPS, ALLTOALL_SLICESTEPS };
+  return ncclEnqueueCheck(&info);
+}
+
 NCCL_API(ncclResult_t, ncclAllReduce, const void* sendbuff, void* recvbuff, size_t count,
     ncclDataType_t datatype, ncclRedOp_t op, ncclComm* comm, cudaStream_t stream);
 ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
@@ -121,6 +137,19 @@ ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int ro
   return ncclBroadcast(buff, buff, count, datatype, root, comm, stream);
 }
 
+NCCL_API(ncclResult_t, ncclGather, const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root,
+    ncclComm* comm, cudaStream_t stream);
+ncclResult_t ncclGather(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root,
+    ncclComm* comm, cudaStream_t stream) {
+  NVTX3_FUNC_WITH_PARAMS(Gather, NcclNvtxParamsGather,
+    NVTX3_PAYLOAD(comm ? comm->commHash : 0, count * ncclTypeSize(datatype), root));
+
+  struct ncclInfo info = { ncclFuncGather, "Gather",
+    sendbuff, recvbuff, count, datatype, ncclSum, root, comm, stream, /* Args */
+    GATHER_CHUNKSTEPS, GATHER_SLICESTEPS };
+  return ncclEnqueueCheck(&info);
+}
+
 NCCL_API(ncclResult_t, ncclReduce, const void* sendbuff, void* recvbuff, size_t count,
     ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream);
 ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count,
@@ -147,6 +176,19 @@ ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recv
   return ncclEnqueueCheck(&info);
 }
 
+NCCL_API(ncclResult_t, ncclScatter, const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, int root, ncclComm* comm, cudaStream_t stream);
+ncclResult_t ncclScatter(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, int root, ncclComm* comm, cudaStream_t stream) {
+  NVTX3_FUNC_WITH_PARAMS(Scatter, NcclNvtxParamsScatter,
+    NVTX3_PAYLOAD(comm ? comm->commHash : 0, count * ncclTypeSize(datatype), root));
+
+  struct ncclInfo info = { ncclFuncScatter, "Scatter",
+    sendbuff, recvbuff, count, datatype, ncclSum, root, comm, stream, /* Args */
+    SCATTER_CHUNKSTEPS, SCATTER_SLICESTEPS };
+  return ncclEnqueueCheck(&info);
+}
+
 NCCL_API(ncclResult_t, ncclSend, const void* sendbuff, size_t count, ncclDataType_t datatype, int peer,
     ncclComm_t comm, cudaStream_t stream);
 ncclResult_t ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype, int peer,
diff --git a/projects/rccl/src/debug.cc b/projects/rccl/src/debug.cc
index f034bc7e05..0d6ed84008 100644
--- a/projects/rccl/src/debug.cc
+++ b/projects/rccl/src/debug.cc
@@ -15,6 +15,7 @@
 #include <sys/syscall.h>
 #include <chrono>
 #include "param.h"
+#include <mutex>
 
 #define NCCL_DEBUG_RESET_TRIGGERED (-2)
 
@@ -28,9 +29,9 @@ static int pid = -1;
 static char hostname[1024];
 thread_local int ncclDebugNoWarn = 0;
 char ncclLastError[1024] = ""; // Global string for the last error in human readable form
-static uint64_t ncclDebugMask = 0;
+uint64_t ncclDebugMask = 0;
 FILE *ncclDebugFile = stdout;
-static pthread_mutex_t ncclDebugLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex ncclDebugMutex;
 static std::chrono::steady_clock::time_point ncclEpoch;
 static bool ncclWarnSetDebugInfo = false;
 
@@ -269,15 +270,13 @@ static void ncclDebugInit() {
  * they can share the debugging mechanisms and output files
  */
 void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *filefunc, int line, const char *fmt, ...) {
-  bool locked = false; // Keeps track of the ncclDebugLock state.
   int gotLevel = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
 
   if (ncclDebugNoWarn != 0 && level == NCCL_LOG_WARN) { level = NCCL_LOG_INFO; flags = ncclDebugNoWarn; }
 
   // Save the last error (WARN) as a human readable string
   if (level == NCCL_LOG_WARN) {
-    pthread_mutex_lock(&ncclDebugLock);
-    locked = true;
+    std::lock_guard<std::mutex> lock(ncclDebugMutex);
     va_list vargs;
     va_start(vargs, fmt);
     (void) vsnprintf(ncclLastError, sizeof(ncclLastError), fmt, vargs);
@@ -285,20 +284,13 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
   }
 
   if (gotLevel >= 0 && (gotLevel < level || (flags & ncclDebugMask) == 0)) {
-    if (locked)
-      pthread_mutex_unlock(&ncclDebugLock);
     return;
   }
 
-  if (!locked) {
-    pthread_mutex_lock(&ncclDebugLock);
-    locked = true;
-  }
-  // From this point on ncclDebugLock is always locked so we don't need to check "locked" anymore.
+  std::lock_guard<std::mutex> lock(ncclDebugMutex);
   if (ncclDebugLevel < 0)
     ncclDebugInit();
   if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) {
-    pthread_mutex_unlock(&ncclDebugLock);
     return;
   }
 
@@ -386,17 +378,35 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
   // necessary since we write bytes instead of the string.
   buffer[len++] = '\n';
   fwrite(buffer, 1, len, ncclDebugFile);
-  pthread_mutex_unlock(&ncclDebugLock);
 }
 
-NCCL_API(void, ncclResetDebugInit);
-void ncclResetDebugInit() {
+// Non-deprecated version for internal use.
+extern "C"
+__attribute__ ((visibility("default")))
+void ncclResetDebugInitInternal() {
   // Cleans up from a previous ncclDebugInit() and reruns.
   // Use this after changing NCCL_DEBUG and related parameters in the environment.
-  pthread_mutex_lock(&ncclDebugLock);
+  std::lock_guard<std::mutex> lock(ncclDebugMutex);
   // Let ncclDebugInit() know to complete the reset.
   __atomic_store_n(&ncclDebugLevel, NCCL_DEBUG_RESET_TRIGGERED, __ATOMIC_RELEASE);
-  pthread_mutex_unlock(&ncclDebugLock);
+}
+
+// In place of: NCCL_API(void, ncclResetDebugInit);
+__attribute__ ((visibility("default")))
+__attribute__ ((alias("ncclResetDebugInit")))
+void pncclResetDebugInit();
+extern "C"
+__attribute__ ((visibility("default")))
+__attribute__ ((weak))
+__attribute__ ((deprecated("ncclResetDebugInit is not supported as part of the NCCL API and will be removed in the future")))
+void ncclResetDebugInit();
+
+
+void ncclResetDebugInit() {
+  // This is now deprecated as part of the NCCL API. It will be removed
+  // from the API in the future. It is still available as an
+  // exported symbol.
+  ncclResetDebugInitInternal();
 }
 
 NCCL_PARAM(SetThreadName, "SET_THREAD_NAME", 0);
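
Note on the locking change in debug.cc above: the manual
pthread_mutex_lock/unlock pairs are replaced with scoped std::lock_guard
objects, so every early return releases the mutex without an explicit unlock.
A minimal standalone sketch of that pattern follows. The names gMutex, gLogged,
and logIfEnabled are hypothetical and do not appear in NCCL.

```cpp
#include <cassert>
#include <mutex>

static std::mutex gMutex;  // hypothetical stand-in for ncclDebugMutex
static int gLogged = 0;    // counts messages that passed the level filter

// Returns true if the message would be emitted. The guard's destructor
// unlocks gMutex on every return path, including the early one.
static bool logIfEnabled(int level, int threshold) {
  std::lock_guard<std::mutex> lock(gMutex);
  if (level > threshold) return false;  // early return, mutex released here
  gLogged++;
  return true;
}
```

Because the guard is tied to scope, a guard declared inside an inner block (as
in the WARN branch) is released at that block's closing brace, which lets a
later guard in the same function take the mutex again without deadlocking.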
diff --git a/projects/rccl/src/dev_runtime.cc b/projects/rccl/src/dev_runtime.cc
new file mode 100644
index 0000000000..54e6e01bfc
--- /dev/null
+++ b/projects/rccl/src/dev_runtime.cc
@@ -0,0 +1,995 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "dev_runtime.h"
+#include "comm.h"
+#include "device.h"
+#include "transport.h"
+#include "group.h"
+#include "nccl_device.h"
+
+NCCL_PARAM(WinStride, "WIN_STRIDE", -1);
+
+// Complete types from src/include/dev_runtime.h
+struct ncclDevrMemory {
+  int refCount;
+  struct ncclDevrMemory* next;
+  CUmemGenericAllocationHandle memHandle;
+  size_t size;
+  size_t bigOffset; // offset in big VA space
+};
+
+struct ncclDevrWindowSorted {
+  uintptr_t userAddr;
+  size_t size;
+  struct ncclDevrWindow* win;
+};
+
+struct ncclDevrTeam {
+  struct ncclDevrTeam* next;
+  struct ncclTeam team;
+  CUmemGenericAllocationHandle mcHandle;
+  void* mcBasePtr;
+  int worldRankList[];
+};
+
+////////////////////////////////////////////////////////////////////////////////
+// Helpers at the bottom:
+
+// Find least index such that `arg < sorted[i].key` (least upper bound)
+template<typename Obj, typename Key>
+static int listFindSortedLub(Key Obj::*key, Obj* sorted, int count, Key arg);
+
+template<typename Obj>
+static void listInsert(Obj** list, int* capacity, int* count, int index, Obj val);
+
+template<typename Obj>
+static void listRemove(Obj* list, int* count, int index);
+
+////////////////////////////////////////////////////////////////////////////////
+
+ncclResult_t ncclDevrInitOnce(struct ncclComm* comm) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevrState* devr = &comm->devrState;
+  if (devr->bigSize != 0) return ncclSuccess;
+
+  bool lsaIsLocal = true;
+  for (int i=0; i < comm->localRanks; i++) {
+    lsaIsLocal &= comm->localRankToRank[i] == comm->localRankToRank[0] + i;
+  }
+  devr->lsaSelf = lsaIsLocal ? comm->localRank : 0;
+  devr->lsaSize = lsaIsLocal ? comm->localRanks : 1;
+  devr->lsaRankList = (int*)malloc(devr->lsaSize*sizeof(int));
+  for (int i=0; i < devr->lsaSize; i++) {
+    devr->lsaRankList[i] = comm->rank + (i - devr->lsaSelf);
+  }
+
+  CUmemAllocationProp memProp = {};
+  memProp.type = CU_MEM_ALLOCATION_TYPE_PINNED;
+  memProp.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+  memProp.requestedHandleTypes = ncclCuMemHandleType;
+  memProp.location.id = comm->cudaDev;
+  CUCHECKGOTO(cuMemGetAllocationGranularity(&devr->granularity, &memProp, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED), ret, fail_lsaRankList);
+
+  devr->bigSize = ncclParamWinStride();
+  if (-devr->bigSize <= 1) { // True only for WIN_STRIDE=-1 (unset) or 0, relying on unsigned negation
+    devr->bigSize = 1;
+    for (int r=0; r < comm->nRanks; ++r) {
+      devr->bigSize = std::max<size_t>(devr->bigSize, comm->peerInfo[r].totalGlobalMem);
+    }
+  }
+  devr->bigSize = alignUp(devr->bigSize, size_t(1)<<32);
+  INFO(NCCL_INIT, "Symmetric VA size=%ldGB", (long)devr->bigSize>>30);
+
+  ncclSpaceConstruct(&devr->bigSpace);
+  ncclShadowPoolConstruct(&devr->shadows);
+  return ncclSuccess;
+
+fail_lsaRankList:
+  free(devr->lsaRankList);
+  return ret;
+}
+
+static void symTeamDestroyAll(struct ncclComm* comm); // Further down
+
+ncclResult_t ncclDevrFinalize(struct ncclComm* comm) {
+  struct ncclDevrState* devr = &comm->devrState;
+  if (devr->bigSize == 0) return ncclSuccess;
+
+  while (!ncclIntruQueueEmpty(&devr->regTaskQueue)) {
+    struct ncclDevrRegTask* task = ncclIntruQueueDequeue(&devr->regTaskQueue);
+    free(task);
+  }
+
+  symTeamDestroyAll(comm);
+  { // delete windowTable
+    cudaStream_t stream;
+    if (cudaSuccess == cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)) {
+      struct ncclDevCommWindowTable* tableDev = devr->windowTable;
+      while (tableDev != nullptr) {
+        struct ncclDevCommWindowTable* tableHost;
+        if (ncclSuccess != ncclShadowPoolToHost(&devr->shadows, tableDev, &tableHost)) break;
+        struct ncclDevCommWindowTable* next = tableHost->next;
+        ncclShadowPoolFree(&devr->shadows, tableDev, stream);
+        tableDev = next;
+      }
+      cudaStreamSynchronize(stream);
+      cudaStreamDestroy(stream);
+    }
+  }
+  CUdeviceptr flatAddr = reinterpret_cast<CUdeviceptr>(devr->lsaFlatBase);
+  CUCHECKIGNORE(cuMemUnmap(flatAddr, devr->lsaSize*devr->bigSize));
+  CUCHECKIGNORE(cuMemAddressFree(flatAddr, devr->lsaSize*devr->bigSize));
+  ncclShadowPoolDestruct(&devr->shadows);
+  ncclSpaceDestruct(&devr->bigSpace);
+  free(devr->lsaRankList);
+  free(devr->winSorted);
+  return ncclSuccess;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+
+static ncclResult_t symMemoryMapLsaTeam(
+    struct ncclComm* comm, CUmemGenericAllocationHandle memHandle, size_t size, size_t bigOffset
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevrState* devr = &comm->devrState;
+  CUmemAccessDesc accessDesc = {};
+  union Message {
+    CUmemGenericAllocationHandle memHandle;
+    CUmemFabricHandle fabricHandle;
+  };
+
+  Message* messages = (Message*)calloc(devr->lsaSize, sizeof(Message));
+  if (ncclCuMemHandleType == CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) {
+    messages[devr->lsaSelf].memHandle = memHandle;
+  } else {
+    CUCHECKGOTO(cuMemExportToShareableHandle(&messages[devr->lsaSelf].fabricHandle, memHandle, ncclCuMemHandleType, 0), ret, fail);
+  }
+
+  NCCLCHECKGOTO(bootstrapIntraNodeAllGather(comm->bootstrap, devr->lsaRankList, devr->lsaSelf, devr->lsaSize, messages, sizeof(Message)), ret, fail);
+
+  if (devr->lsaFlatBase == nullptr) { // Create on first need.
+    CUdeviceptr addr;
+    CUCHECKGOTO(cuMemAddressReserve(&addr, devr->lsaSize*devr->bigSize, NCCL_MAX_PAGE_SIZE, 0, 0), ret, fail);
+    devr->lsaFlatBase = reinterpret_cast<void*>(addr);
+  }
+  accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+  accessDesc.location.id = comm->cudaDev;
+  accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
+  for (int r = 0; r < devr->lsaSize; r++) {
+    CUmemGenericAllocationHandle impHandle;
+    if (r == devr->lsaSelf) {
+      impHandle = memHandle;
+    } else {
+      if (ncclCuMemHandleType == CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) {
+        int fd = -1;
+        NCCLCHECKGOTO(ncclProxyClientGetFdBlocking(comm, devr->lsaRankList[r], &messages[r], &fd), ret, fail);
+        CUCHECKGOTO(cuMemImportFromShareableHandle(&impHandle, reinterpret_cast<void*>((uintptr_t)fd), ncclCuMemHandleType), ret, fail);
+        SYSCHECKGOTO(close(fd), "close", ret, fail);
+      } else {
+        CUCHECKGOTO(cuMemImportFromShareableHandle(&impHandle, (void*)&messages[r].fabricHandle, ncclCuMemHandleType), ret, fail);
+      }
+    }
+    CUdeviceptr addr = reinterpret_cast<uintptr_t>((char*)devr->lsaFlatBase + r*devr->bigSize + bigOffset);
+    CUCHECKGOTO(cuMemMap(addr, size, 0, impHandle, 0), ret, fail);
+    CUCHECKGOTO(cuMemSetAccess(addr, size, &accessDesc, 1), ret, fail);
+    if (r != devr->lsaSelf) {
+      CUCHECKGOTO(cuMemRelease(impHandle), ret, fail);
+    }
+  }
+  // Ensure everyone has imported my mem handle.
+  NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, devr->lsaRankList, devr->lsaSelf, devr->lsaSize, 0xbeef), ret, fail);
+leave:
+  free(messages);
+  return ret;
+fail:
+  goto leave;
+}
+
+static ncclResult_t symBindTeamMemory(
+    struct ncclComm* comm, struct ncclDevrTeam* tm, struct ncclDevrMemory* mem
+  ) {
+  if (comm->nvlsSupport && tm->mcBasePtr != nullptr) {
+  #if CUDART_VERSION >= 12010
+    INFO(NCCL_NVLS, "Binding multicast memory at big=%lx to team {%d x %d}", mem->bigOffset, tm->team.nRanks, tm->team.stride);
+    CUCHECK(cuMulticastBindMem(tm->mcHandle, mem->bigOffset, mem->memHandle, 0, mem->size, 0));
+  #endif
+  }
+  return ncclSuccess;
+}
+
+static ncclResult_t symUnbindTeamMemory(
+    struct ncclComm* comm, struct ncclDevrTeam* tm, struct ncclDevrMemory* mem
+  ) {
+  if (comm->nvlsSupport && tm->mcBasePtr != nullptr) {
+  #if CUDART_VERSION >= 12010
+    CUCHECK(cuMulticastUnbind(tm->mcHandle, comm->cudaDev, mem->bigOffset, mem->size));
+  #endif
+  }
+  return ncclSuccess;
+}
+
+// Caller must barrier the team afterward.
+static ncclResult_t symTeamObtain(
+    struct ncclComm* comm, struct ncclTeam team, bool multimem,
+    struct ncclDevrTeam** outTeam
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevrState* devr = &comm->devrState;
+  struct ncclDevrTeam* t = devr->teamHead;
+  bool teamIsNew = false;
+  while (true) {
+    if (t == nullptr) {
+      teamIsNew = true;
+      t = (struct ncclDevrTeam*)malloc(sizeof(struct ncclDevrTeam) + team.nRanks*sizeof(int));
+      t->team = team;
+      t->mcHandle = 0x0;
+      t->mcBasePtr = nullptr;
+      for (int i=0; i < team.nRanks; i++) {
+        t->worldRankList[i] = comm->rank + (i - team.rank)*team.stride;
+      }
+      break;
+    } else if (t->team.rank == team.rank && t->team.nRanks == team.nRanks && t->team.stride == team.stride) {
+      if (!multimem || t->mcBasePtr != nullptr) {
+        // Matching team is sufficient
+        if (outTeam) *outTeam = t;
+        return ncclSuccess;
+      }
+      break; // Need to enable multimem
+    } else { t = t->next; } // No match: advance, otherwise this loop never terminates.
+  }
+
+  if (multimem) {
+    if (!comm->nvlsSupport) {
+      WARN("Multicast support requested for team but none available on system.");
+      ret = ncclInvalidArgument;
+      goto fail;
+    } else {
+    #if CUDART_VERSION >= 12010
+      CUmemGenericAllocationHandle mcHandle = 0;
+      CUdeviceptr mcAddr = 0;
+      CUmulticastObjectProp mcProp = {};
+      char shareableHandle[NVLS_HANDLE_SIZE];
+
+      mcProp.numDevices = team.nRanks;
+      mcProp.handleTypes = ncclCuMemHandleType;
+      mcProp.flags = 0;
+      mcProp.size = devr->bigSize;
+      if (team.rank == 0) {
+        NCCLCHECKGOTO(ncclNvlsGroupCreate(comm, &mcProp, team.rank, team.nRanks, &mcHandle, shareableHandle), ret, fail);
+        NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, t->worldRankList, team.rank, team.nRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail_mcHandle);
+      } else {
+        NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, t->worldRankList, team.rank, team.nRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
+        NCCLCHECKGOTO(ncclNvlsGroupConnect(comm, shareableHandle, t->worldRankList[0], &mcHandle), ret, fail);
+      }
+
+      CUCHECKGOTO(cuMulticastAddDevice(mcHandle, comm->cudaDev), ret, fail_mcHandle);
+      CUCHECKGOTO(cuMemAddressReserve(&mcAddr, devr->bigSize, NCCL_MAX_PAGE_SIZE, 0, 0), ret, fail_mcHandle);
+      CUCHECKGOTO(cuMemMap(mcAddr, devr->bigSize, 0, mcHandle, 0), ret, fail_mcHandle_mcAddr);
+      { CUmemAccessDesc accessDesc = {};
+        accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+        accessDesc.location.id = comm->cudaDev;
+        accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
+        CUCHECKGOTO(cuMemSetAccess(mcAddr, devr->bigSize, &accessDesc, 1), ret, fail_mcHandle_mcAddr_unmap);
+      }
+      t->mcHandle = mcHandle;
+      t->mcBasePtr = reinterpret_cast<void*>(mcAddr);
+
+      // Bind new team with all existing memories.
+      for (struct ncclDevrMemory* mem = devr->memHead; mem != nullptr; mem = mem->next) {
+        NCCLCHECKGOTO(symBindTeamMemory(comm, t, mem), ret, fail_mcHandle_mcAddr_unmap_mems);
+      }
+
+      if (false) { // Error labels:
+      fail_mcHandle_mcAddr_unmap_mems:
+        for (struct ncclDevrMemory* mem = devr->memHead; mem != nullptr; mem = mem->next) {
+          symUnbindTeamMemory(comm, t, mem);
+        }
+      fail_mcHandle_mcAddr_unmap:
+        CUCHECKIGNORE(cuMemUnmap(mcAddr, devr->bigSize));
+        goto fail_mcHandle_mcAddr; // silence unused label warning
+      fail_mcHandle_mcAddr:
+        CUCHECKIGNORE(cuMemAddressFree(mcAddr, devr->bigSize));
+        goto fail_mcHandle; // silence unused label warning
+      fail_mcHandle:
+        CUCHECKIGNORE(cuMemRelease(mcHandle));
+        goto fail; // silence unused label warning
+      }
+    #else
+      goto fail; // silence unused label warning
+    #endif
+    }
+  }
+
+  if (teamIsNew) {
+    // Add to list
+    t->next = devr->teamHead;
+    devr->teamHead = t;
+  }
+  if (outTeam) *outTeam = t;
+  return ret;
+
+fail:
+  if (teamIsNew) free(t);
+  return ret;
+}
+
+static void symTeamDestroyAll(struct ncclComm* comm) {
+  struct ncclDevrState* devr = &comm->devrState;
+  while (devr->teamHead != nullptr) {
+    struct ncclDevrTeam* t = devr->teamHead;
+    devr->teamHead = t->next;
+    if (t->mcBasePtr != nullptr) {
+      for (struct ncclDevrMemory* m = devr->memHead; m != nullptr; m = m->next) {
+        symUnbindTeamMemory(comm, t, m);
+      }
+      CUdeviceptr mcAddr = reinterpret_cast<CUdeviceptr>(t->mcBasePtr);
+      CUCHECKIGNORE(cuMemUnmap(mcAddr, devr->bigSize));
+      CUCHECKIGNORE(cuMemAddressFree(mcAddr, devr->bigSize));
+      CUCHECKIGNORE(cuMemRelease(t->mcHandle));
+    }
+    free(t);
+  }
+}
+
+// On success we take caller's reference on memHandle.
+// Due to multicast binds for each pre-existing team, this function requires
+// the caller to do a world barrier before returning to the user.
+static ncclResult_t symMemoryObtain(
+    struct ncclComm* comm, CUmemGenericAllocationHandle memHandle, size_t size,
+    struct ncclDevrMemory** outMem
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevrState* devr = &comm->devrState;
+  int64_t bigOffset = 0;
+
+  struct ncclDevrMemory* mem = devr->memHead;
+  while (mem != nullptr) {
+    if (mem->memHandle == memHandle) {
+      CUCHECKIGNORE(cuMemRelease(memHandle));
+      goto leave;
+    }
+    mem = mem->next;
+  }
+  // New memory.
+  mem = (struct ncclDevrMemory*)malloc(sizeof(struct ncclDevrMemory));
+  mem->refCount = 0;
+  mem->memHandle = memHandle;
+  mem->size = size;
+
+  // Grab offset in the big space.
+  NCCLCHECKGOTO(ncclSpaceAlloc(&devr->bigSpace, devr->bigSize, size, devr->granularity, &bigOffset), ret, fail_mem);
+  mem->bigOffset = bigOffset;
+
+  // Map unicast addresses into flat VA space for lsa team.
+  NCCLCHECKGOTO(symMemoryMapLsaTeam(comm, memHandle, size, bigOffset), ret, fail_mem_space);
+
+  // Bind new memory with each existing team.
+  for (struct ncclDevrTeam* t = devr->teamHead; t != nullptr; t = t->next) {
+    NCCLCHECKGOTO(symBindTeamMemory(comm, t, mem), ret, fail_mem_space_teams);
+  }
+  // Add to list of mems.
+  mem->next = devr->memHead;
+  devr->memHead = mem;
+
+leave:
+  mem->refCount += 1;
+  *outMem = mem;
+  return ret;
+
+fail_mem_space_teams:
+  for (struct ncclDevrTeam* t = devr->teamHead; t != nullptr; t = t->next) {
+    symUnbindTeamMemory(comm, t, mem);
+  }
+fail_mem_space:
+  ncclSpaceFree(&devr->bigSpace, bigOffset, size);
+fail_mem:
+  free(mem);
+//fail:
+  return ret;
+}
+
+static void symMemoryDropRef(
+    struct ncclComm* comm, struct ncclDevrMemory* mem
+  ) {
+  if (mem != nullptr && 0 == --mem->refCount) {
+    struct ncclDevrState* devr = &comm->devrState;
+    for (struct ncclDevrTeam* t = devr->teamHead; t != nullptr; t = t->next) {
+      symUnbindTeamMemory(comm, t, mem);
+    }
+    for (int r = 0; r < devr->lsaSize; r++) {
+      CUdeviceptr addr = reinterpret_cast<uintptr_t>((char*)devr->lsaFlatBase + r*devr->bigSize + mem->bigOffset);
+      CUCHECKIGNORE(cuMemUnmap(addr, mem->size));
+    }
+    ncclSpaceFree(&devr->bigSpace, mem->bigOffset, mem->size);
+    CUCHECKIGNORE(cuMemRelease(mem->memHandle));
+
+    struct ncclDevrMemory** ptr = &devr->memHead;
+    while (*ptr != mem) ptr = &(*ptr)->next;
+    *ptr = mem->next; // Remove from list.
+
+    free(mem);
+  }
+}
+
+static ncclResult_t symWindowTableInitOnce(struct ncclComm* comm, cudaStream_t stream) {
+  struct ncclDevrState* devr = &comm->devrState;
+  struct ncclDevCommWindowTable* tableDev = devr->windowTable;
+  if (tableDev == nullptr) { // Create on first need.
+    NCCLCHECK(ncclShadowPoolAlloc<ncclDevCommWindowTable>(&devr->shadows, &tableDev, nullptr, stream));
+    devr->windowTable = tableDev;
+  }
+  return ncclSuccess;
+}
+
+// On success we take the caller's reference on `mem`.
+static ncclResult_t symWindowCreate(
+    struct ncclComm* comm, struct ncclDevrMemory* mem,
+    size_t memOffset, void* userPtr, size_t userSize, int winFlags, void* localReg,
+    struct ncclWindow_vidmem** outWinDev, struct ncclDevrWindow** outWin,
+    cudaStream_t stream
+  ) {
+  uintptr_t userAddr = reinterpret_cast<uintptr_t>(userPtr);
+  struct ncclDevrState* devr = &comm->devrState;
+  struct ncclDevrWindow* win;
+
+  win = (struct ncclDevrWindow*)malloc(sizeof(struct ncclDevrWindow));
+  memset(win, 0, sizeof(*win));
+  win->memory = mem;
+  win->size = userSize;
+  win->bigOffset = mem->bigOffset + memOffset;
+  win->winFlags = winFlags;
+  win->localRegHandle = localReg;
+  if (userPtr == nullptr) {
+    // Null means caller has no VA and will use the lsa team flat VA address.
+    win->userPtr = (char*)devr->lsaFlatBase + (devr->lsaSelf*devr->bigSize) + mem->bigOffset;
+  } else {
+    win->userPtr = userPtr;
+  }
+
+  struct ncclWindow_vidmem* winDev;
+  struct ncclWindow_vidmem* winDevHost;
+  NCCLCHECK(ncclShadowPoolAlloc(&devr->shadows, &winDev, &winDevHost, stream));
+  win->vidmem = winDev;
+  winDevHost->lsaFlatBase = (char*)devr->lsaFlatBase + win->bigOffset;
+  winDevHost->mcOffset4K = win->bigOffset>>12;
+  winDevHost->stride4G = devr->bigSize>>32;
+  winDevHost->lsaRank = devr->lsaSelf;
+  winDevHost->worldRank = comm->rank;
+  winDevHost->winHost = (void*)win;
+  CUDACHECK(cudaMemcpyAsync(winDev, winDevHost, sizeof(struct ncclWindow_vidmem), cudaMemcpyHostToDevice, stream));
+
+  NCCLCHECK(symWindowTableInitOnce(comm, stream)); // ensure devr->windowTable exists
+  struct ncclDevCommWindowTable* tableDev = devr->windowTable;
+  struct ncclDevCommWindowTable* tableHost;
+  NCCLCHECK(ncclShadowPoolToHost(&devr->shadows, tableDev, &tableHost));
+  while (true) {
+    int i = 0;
+    while (i < 32 && tableHost->entries[i].window != nullptr) i += 1;
+    if (i < 32) {
+      tableHost->entries[i].base = userAddr;
+      tableHost->entries[i].size = userAddr + userSize;
+      tableHost->entries[i].window = winDev;
+      CUDACHECK(cudaMemcpyAsync(&tableDev->entries[i], &tableHost->entries[i], sizeof(tableHost->entries[i]), cudaMemcpyHostToDevice, stream));
+      break;
+    }
+    if (tableHost->next == nullptr) {
+      NCCLCHECK(ncclShadowPoolAlloc<ncclDevCommWindowTable>(&devr->shadows, &tableHost->next, nullptr, stream));
+      CUDACHECK(cudaMemcpyAsync(&tableDev->next, &tableHost->next, sizeof(tableHost->next), cudaMemcpyHostToDevice, stream));
+    }
+    tableDev = tableHost->next;
+    NCCLCHECK(ncclShadowPoolToHost(&devr->shadows, tableHost->next, &tableHost));
+  }
+
+  { // insert into winSorted[]
+    int i = listFindSortedLub(&ncclDevrWindowSorted::userAddr, devr->winSorted, devr->winSortedCount, userAddr);
+    struct ncclDevrWindowSorted winSort;
+    winSort.userAddr = userAddr;
+    winSort.size = userSize;
+    winSort.win = win;
+    listInsert(&devr->winSorted, &devr->winSortedCapacity, &devr->winSortedCount, i, winSort);
+  }
+
+  if (outWinDev) *outWinDev = winDev;
+  if (outWin) *outWin = win;
+  return ncclSuccess;
+}
+
+static ncclResult_t symWindowDestroy(struct ncclComm* comm, struct ncclWindow_vidmem* winDev, cudaStream_t stream) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevrState* devr = &comm->devrState;
+  struct ncclWindow_vidmem* winDevHost;
+  struct ncclDevrWindow* winHost;
+
+  NCCLCHECKGOTO(ncclShadowPoolToHost(&devr->shadows, winDev, &winDevHost), ret, fail);
+  winHost = (struct ncclDevrWindow*)winDevHost->winHost;
+
+  symMemoryDropRef(comm, winHost->memory);
+
+  { struct ncclDevCommWindowTable* tableDev = devr->windowTable;
+    struct ncclDevCommWindowTable* tableHost;
+    NCCLCHECKGOTO(ncclShadowPoolToHost(&devr->shadows, tableDev, &tableHost), ret, remove_winSorted);
+    while (true) {
+      int i = 0;
+      while (i < 32 && tableHost->entries[i].window != winDev) i += 1;
+      if (i < 32) {
+        memset(&tableHost->entries[i], 0, sizeof(tableHost->entries[i]));
+        CUDACHECKGOTO(cudaMemsetAsync(&tableDev->entries[i], 0, sizeof(tableDev->entries[i]), stream), ret, remove_winSorted);
+        break;
+      }
+      if (tableHost->next == nullptr) break; // Error: window not found in any table
+      tableDev = tableHost->next;
+      NCCLCHECKGOTO(ncclShadowPoolToHost(&devr->shadows, tableHost->next, &tableHost), ret, remove_winSorted);
+    }
+  }
+  NCCLCHECKGOTO(ncclShadowPoolFree(&devr->shadows, winDev, stream), ret, remove_winSorted);
+
+  NCCLCHECKGOTO(ncclCommDeregister(comm, winHost->localRegHandle), ret, remove_winSorted);
+
+remove_winSorted:
+  { int i = listFindSortedLub(&ncclDevrWindowSorted::userAddr, devr->winSorted, devr->winSortedCount, reinterpret_cast<uintptr_t>(winHost->userPtr));
+    i -= 1; // least upper bound is just after ours.
+    listRemove(devr->winSorted, &devr->winSortedCount, i);
+  }
+  free(winHost);
+fail:
+  return ret;
+}
+
+ncclResult_t ncclDevrWindowRegisterInGroup(
+    struct ncclComm* comm,
+    void* userPtr, size_t userSize, int winFlags, ncclWindow_t* outWinDev
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  CUdeviceptr memAddr = 0;
+  size_t memSize = 0;
+  CUmemGenericAllocationHandle memHandle = 0x0;
+  size_t memOffset;
+  struct ncclDevrMemory* mem = nullptr;
+  cudaStream_t stream = nullptr;
+  void* localRegHandle = nullptr;
+
+  NCCLCHECKGOTO(ncclCommRegister(comm, userPtr, userSize, &localRegHandle), ret, fail);
+
+  if (!comm->symmetricSupport) {
+    // We just return the local registration handle directly in this case, as there's no reason to allocate the
+    // ncclWindow_vidmem structure on the device, etc.
+    *outWinDev = reinterpret_cast<struct ncclWindow_vidmem*>(localRegHandle);
+    return ncclSuccess;
+  }
+  if (winFlags & NCCL_WIN_COLL_SYMMETRIC) {
+    // Defer symmetric kernel init until at least one window with that flag exists.
+    NCCLCHECKGOTO(ncclSymkInitOnce(comm), ret, fail_locReg);
+  }
+
+  // Get underlying cumem handle:
+  CUCHECKGOTO(cuMemGetAddressRange(&memAddr, &memSize, reinterpret_cast<CUdeviceptr>(userPtr)), ret, fail_locReg);
+  memOffset = reinterpret_cast<CUdeviceptr>(userPtr) - memAddr;
+  if (memOffset%NCCL_WIN_REQUIRED_ALIGNMENT != 0) {
+    WARN("Window address must be suitably aligned.");
+    ret = ncclInvalidArgument;
+    goto fail_locReg; // userPtr was registered above, so deregister it on the way out
+  }
+
+  CUCHECKGOTO(cuMemRetainAllocationHandle(&memHandle, reinterpret_cast<void*>(memAddr)), ret, fail_locReg);
+
+  // Trade cumem handle for ncclDevrMemory*
+  NCCLCHECKGOTO(symMemoryObtain(comm, memHandle, memSize, &mem), ret, fail_locReg_memHandle);
+  memHandle = 0x0; // symMemoryObtain took our reference
+
+  CUDACHECKGOTO(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking), ret, fail);
+
+  NCCLCHECKGOTO(symWindowCreate(
+      comm, mem, memOffset, userPtr, userSize, winFlags, localRegHandle, outWinDev, nullptr, stream
+    ), ret, fail_locReg_memHandle_mem_stream);
+  mem = nullptr; // symWindowCreate took our reference
+
+  CUDACHECKGOTO(cudaStreamSynchronize(stream), ret, fail_locReg_memHandle_mem_stream_win);
+
+  // symWindowCreate needs barrier.
+  NCCLCHECKGOTO(bootstrapBarrier(comm->bootstrap, comm->rank, comm->nRanks, 0xbeef), ret, fail_locReg_memHandle_mem_stream_win);
+
+  cudaStreamDestroy(stream);
+  return ret;
+
+fail_locReg_memHandle_mem_stream_win:
+  symWindowDestroy(comm, *outWinDev, stream);
+  *outWinDev = nullptr;
+  cudaStreamSynchronize(stream);
+fail_locReg_memHandle_mem_stream:
+  cudaStreamDestroy(stream);
+  symMemoryDropRef(comm, mem);
+fail_locReg_memHandle:
+  if (memHandle != 0x0) { CUCHECKIGNORE(cuMemRelease(memHandle)); }
+fail_locReg:
+  ncclCommDeregister(comm, localRegHandle);
+fail:
+  *outWinDev = nullptr;
+  return ret;
+}
+
+static ncclResult_t deepCopyDevCommRequirements(
+    struct ncclDevCommRequirements const* src,
+    struct ncclDevCommRequirements** dst
+) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevResourceRequirements **dstRes;
+  struct ncclTeamRequirements **dstTeam;
+
+  NCCLCHECK(ncclCalloc(dst, 1));
+
+  /* copy the entire struct now and update linked lists later */
+  **dst = *src;
+
+  dstRes = &(*dst)->resourceRequirementsList;
+  for (struct ncclDevResourceRequirements* rr = src->resourceRequirementsList; rr != nullptr; rr = rr->next) {
+    NCCLCHECKGOTO(ncclCalloc(dstRes, 1), ret, fail);
+    (*dstRes)->bufferSize = rr->bufferSize;
+    (*dstRes)->bufferAlign = rr->bufferAlign;
+    (*dstRes)->outBufferHandle = rr->outBufferHandle;
+    dstRes = &(*dstRes)->next;
+  }
+
+  dstTeam = &(*dst)->teamRequirementsList;
+  for (struct ncclTeamRequirements* tr = src->teamRequirementsList; tr != nullptr; tr = tr->next) {
+    NCCLCHECKGOTO(ncclCalloc(dstTeam, 1), ret, fail);
+    (*dstTeam)->team = tr->team;
+    (*dstTeam)->multimem = tr->multimem;
+    (*dstTeam)->outMultimemHandle = tr->outMultimemHandle;
+    dstTeam = &(*dstTeam)->next;
+  }
+
+exit:
+  return ret;
+fail:
+  freeDevCommRequirements(*dst);
+  *dst = nullptr;
+  goto exit;
+}
+
+void freeDevCommRequirements(
+    struct ncclDevCommRequirements* reqs
+) {
+  if (reqs) {
+    while (reqs->resourceRequirementsList) {
+      struct ncclDevResourceRequirements* rr_next = reqs->resourceRequirementsList->next;
+      free(reqs->resourceRequirementsList);
+      reqs->resourceRequirementsList = rr_next;
+    }
+
+    while (reqs->teamRequirementsList) {
+      struct ncclTeamRequirements* tr_next = reqs->teamRequirementsList->next;
+      free(reqs->teamRequirementsList);
+      reqs->teamRequirementsList = tr_next;
+    }
+
+    free(reqs);
+  }
+}
+
+ncclResult_t ncclDevrCommCreateInternal(
+    struct ncclComm* comm,
+    struct ncclDevCommRequirements const* reqs, struct ncclDevComm* outDevComm
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  struct ncclDevrState* devr = &comm->devrState;
+  struct ncclTeam world = ncclTeamWorld(comm);
+  struct ncclTeam lsa = ncclTeamInnerFactor(world, devr->lsaSize);
+  struct ncclDevrTeam* tmLsa;
+  size_t bufSizeTotal;
+  struct ncclDevResourceRequirements* resReqsHead;
+  struct ncclDevResourceRequirements lsaBarReq;
+  cudaStream_t stream = nullptr;
+  CUmemGenericAllocationHandle memHandle = 0x0;
+  struct ncclDevrMemory* mem = nullptr;
+  struct ncclDevrWindow* win = nullptr;
+  struct ncclWindow_vidmem* winHost = nullptr;
+
+  memset(outDevComm, 0, sizeof(*outDevComm));
+  outDevComm->rank = comm->rank;
+  outDevComm->nRanks = comm->nRanks;
+  outDevComm->nRanks_rcp32 = idivRcp32(comm->nRanks);
+  outDevComm->lsaRank = devr->lsaSelf;
+  outDevComm->lsaSize = devr->lsaSize;
+  outDevComm->lsaSize_rcp32 = idivRcp32(devr->lsaSize);
+
+  NCCLCHECKGOTO(symTeamObtain(comm, lsa, /*multimem=*/reqs->lsaMultimem, &tmLsa), ret, fail);
+  outDevComm->lsaMultimem.mcBasePtr = tmLsa->mcBasePtr;
+
+  { struct ncclTeamRequirements* tr = reqs->teamRequirementsList;
+    while (tr != nullptr) {
+      if (tr->multimem) {
+        struct ncclDevrTeam* tm;
+        NCCLCHECKGOTO(symTeamObtain(comm, tr->team, tr->multimem, &tm), ret, fail);
+        if (tr->outMultimemHandle != nullptr) tr->outMultimemHandle->mcBasePtr = tm->mcBasePtr;
+      }
+      tr = tr->next;
+    }
+  }
+
+  resReqsHead = reqs->resourceRequirementsList;
+
+  ncclLsaBarrierCreateRequirement(lsa, reqs->lsaBarrierCount, &outDevComm->lsaBarrier, &lsaBarReq);
+  lsaBarReq.next = resReqsHead;
+  resReqsHead = &lsaBarReq;
+
+  { struct ncclDevResourceRequirements* rr = resReqsHead;
+    bufSizeTotal = 0;
+    while (rr != nullptr) {
+      bufSizeTotal = alignUp(bufSizeTotal, std::max<size_t>(128, rr->bufferAlign));
+      if (rr->outBufferHandle != nullptr) *rr->outBufferHandle = bufSizeTotal/128;
+      bufSizeTotal += rr->bufferSize;
+      rr = rr->next;
+    }
+    bufSizeTotal = alignUp(bufSizeTotal, devr->granularity);
+  }
+
+  CUDACHECKGOTO(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking), ret, fail);
+
+  NCCLCHECKGOTO(symWindowTableInitOnce(comm, stream), ret, fail); // ensure devr->windowTable exists
+  outDevComm->windowTable = comm->devrState.windowTable;
+
+  if (bufSizeTotal == 0) {
+    outDevComm->resourceWindow = nullptr;
+    outDevComm->resourceWindow_inlined = {};
+  } else {
+    CUmemAllocationProp memProp = {};
+    memProp.type = CU_MEM_ALLOCATION_TYPE_PINNED;
+    memProp.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+    memProp.requestedHandleTypes = ncclCuMemHandleType;
+    memProp.location.id = comm->cudaDev;
+
+    CUCHECKGOTO(cuMemCreate(&memHandle, bufSizeTotal, &memProp, 0), ret, fail);
+
+    NCCLCHECKGOTO(symMemoryObtain(comm, memHandle, bufSizeTotal, &mem), ret, fail);
+    memHandle = 0x0; // Reference given to symMemoryObtain
+
+    NCCLCHECKGOTO(symWindowCreate( // Requires world barrier afterward.
+      comm, mem, /*memOffset=*/0, nullptr, bufSizeTotal, /*winFlags=*/0,
+      /*localReg=*/nullptr, &outDevComm->resourceWindow, &win,
+      stream), ret, fail);
+    mem = nullptr; // Reference given to symWindowCreate
+    NCCLCHECKGOTO(ncclShadowPoolToHost(&comm->devrState.shadows, win->vidmem, &winHost), ret, fail);
+    outDevComm->resourceWindow_inlined = *winHost;
+
+    CUDACHECKGOTO(cudaMemsetAsync(win->userPtr, 0, bufSizeTotal, stream), ret, fail);
+  }
+
+  CUDACHECKGOTO(cudaStreamSynchronize(stream), ret, fail);
+
+  NCCLCHECKGOTO(bootstrapBarrier(comm->bootstrap, comm->rank, comm->nRanks, 0xbeef), ret, fail);
+
+  cudaStreamDestroy(stream);
+  return ret;
+
+fail:
+  if (win != nullptr) {
+    symWindowDestroy(comm, win->vidmem, stream);
+    cudaStreamSynchronize(stream);
+  }
+  if (mem != nullptr) {
+    symMemoryDropRef(comm, mem);
+  }
+  if (memHandle != 0x0) {
+    CUCHECKIGNORE(cuMemRelease(memHandle));
+  }
+  if (stream != nullptr) {
+    cudaStreamDestroy(stream);
+  }
+  return ret;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+
+NCCL_API(ncclResult_t, ncclCommWindowRegister, ncclComm_t comm, void* ptr, size_t size, ncclWindow_t* win, int winFlags);
+ncclResult_t ncclCommWindowRegister(
+    struct ncclComm* comm, void* userPtr, size_t userSize,
+    struct ncclWindow_vidmem** outWinDev, int winFlags
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  int saveDev;
+  struct ncclDevrRegTask* task;
+
+  CUDACHECK(cudaGetDevice(&saveDev));
+  NCCLCHECK(ncclGroupStartInternal());
+
+  if (userPtr == nullptr || userSize == 0 || !(comm->symmetricSupport || ncclParamLocalRegister())) goto exit;
+
+  NCCLCHECKGOTO(ncclCommEnsureReady(comm), ret, fail);
+  CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
+
+  NCCLCHECKGOTO(ncclDevrInitOnce(comm), ret, fail);
+
+  NCCLCHECKGOTO(ncclCalloc(&task, 1), ret, fail);
+  task->userPtr = userPtr;
+  task->userSize = userSize;
+  task->winFlags = winFlags;
+  task->outWinDev = outWinDev;
+  ncclIntruQueueEnqueue(&comm->devrState.regTaskQueue, task);
+  ncclGroupCommJoin(comm, ncclGroupTaskTypeSymRegister);
+
+exit:
+  ncclGroupErrCheck(ret);
+  NCCLCHECK(ncclGroupEndInternal());
+  cudaSetDevice(saveDev);
+  return ret;
+fail:
+  goto exit;
+}
+
+NCCL_API(ncclResult_t, ncclCommWindowDeregister, ncclComm_t comm, ncclWindow_t win);
+ncclResult_t ncclCommWindowDeregister(struct ncclComm* comm, struct ncclWindow_vidmem* winDev) {
+  ncclResult_t ret = ncclSuccess;
+  int saveDev;
+  cudaStream_t stream;
+
+  if (winDev == nullptr) goto exit;
+
+  if (!comm->symmetricSupport) {
+    NCCLCHECKGOTO(ncclCommDeregister(comm, winDev), ret, fail);
+    goto exit;
+  }
+  CUDACHECKGOTO(cudaGetDevice(&saveDev), ret, fail);
+  CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
+  CUDACHECKGOTO(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking), ret, fail_dev);
+  NCCLCHECKGOTO(symWindowDestroy(comm, winDev, stream), ret, fail_dev_stream);
+fail_dev_stream:
+  cudaStreamSynchronize(stream);
+  cudaStreamDestroy(stream);
+fail_dev:
+  cudaSetDevice(saveDev);
+fail:
+exit:
+  return ret;
+}
+
+ncclResult_t ncclDevrFindWindow(
+    struct ncclComm* comm, void const* userPtr, struct ncclDevrWindow** outWin
+  ) {
+  struct ncclDevrState* devr = &comm->devrState;
+  uintptr_t userAddr = reinterpret_cast<uintptr_t>(userPtr);
+  int i = listFindSortedLub(&ncclDevrWindowSorted::userAddr, devr->winSorted, devr->winSortedCount, userAddr);
+  if (0 < i && (userAddr - devr->winSorted[i-1].userAddr < devr->winSorted[i-1].size)) {
+    *outWin = devr->winSorted[i-1].win;
+  } else {
+    *outWin = nullptr;
+  }
+  return ncclSuccess;
+}
+
+NCCL_API(ncclResult_t, ncclDevCommCreate, ncclComm_t comm, ncclDevCommRequirements_t const* reqs, ncclDevComm_t* outDevComm);
+ncclResult_t ncclDevCommCreate(
+    ncclComm_t comm, struct ncclDevCommRequirements const* reqs,
+    struct ncclDevComm* outDevComm
+  ) {
+  ncclResult_t ret = ncclSuccess;
+  int saveDev;
+  struct ncclDevrCommCreateTask* task = nullptr;
+
+  CUDACHECK(cudaGetDevice(&saveDev));
+  NCCLCHECK(ncclGroupStartInternal());
+
+  if (!comm->symmetricSupport) {
+    WARN("Communicator does not support symmetric memory!");
+    ret = ncclInvalidUsage;
+    goto fail;
+  }
+
+  NCCLCHECKGOTO(ncclCommEnsureReady(comm), ret, fail);
+  CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
+
+  NCCLCHECKGOTO(ncclDevrInitOnce(comm), ret, fail);
+
+  NCCLCHECKGOTO(ncclCalloc(&task, 1), ret, fail);
+  // reqs must be deep copied to the task so background threads can safely access it
+  NCCLCHECKGOTO(deepCopyDevCommRequirements(reqs, &task->reqs), ret, fail);
+  task->outDevComm = outDevComm;
+  ncclIntruQueueEnqueue(&comm->devrState.commCreateTaskQueue, task);
+  ncclGroupCommJoin(comm, ncclGroupTaskTypeSymRegister);
+
+exit:
+  ncclGroupErrCheck(ret);
+  NCCLCHECK(ncclGroupEndInternal());
+  cudaSetDevice(saveDev);
+  return ret;
+fail:
+  free(task);
+  goto exit;
+}
+
+NCCL_API(ncclResult_t, ncclDevCommDestroy, ncclComm_t comm, ncclDevComm_t const* devComm);
+ncclResult_t ncclDevCommDestroy(
+    struct ncclComm* comm, struct ncclDevComm const* devComm
+  ) {
+  //struct ncclDevrState* devr = &comm->devrState;
+  if (devComm->resourceWindow != nullptr) {
+    NCCLCHECK(ncclCommWindowDeregister(comm, devComm->resourceWindow));
+  }
+  return ncclSuccess;
+}
+
+
+// Get the corresponding pointer in another lsa rank's symmetric memory window
+ncclResult_t ncclDevrGetLsaRankPtr(struct ncclComm* comm, struct ncclDevrWindow* winHost, size_t offset, int lsaRank, void** outPtr) {
+  if (winHost == nullptr || outPtr == nullptr) {
+    return ncclInvalidArgument;
+  }
+
+  struct ncclDevrState* devr = &comm->devrState;
+
+  // Validate lsaRank is within bounds
+  if (lsaRank < 0 || lsaRank >= devr->lsaSize) {
+    return ncclInvalidArgument;
+  }
+
+  // Validate offset is within bounds (offset is unsigned, so only the upper bound matters)
+  if (offset >= winHost->size) {
+    return ncclInvalidArgument;
+  }
+
+  // Calculate the address with offset for the specified lsa rank
+  *outPtr = (void*)((uintptr_t)devr->lsaFlatBase + lsaRank * devr->bigSize + winHost->bigOffset + offset);
+  return ncclSuccess;
+}
+
+// Get the multicast address for a given team
+ncclResult_t ncclDevrGetLsaTeamPtrMC(struct ncclComm* comm, struct ncclDevrWindow* winHost, size_t offset, struct ncclTeam lsaTeam, void** outPtr){
+  if (winHost == nullptr || outPtr == nullptr) {
+    return ncclInvalidArgument;
+  }
+
+  if (!comm->nvlsSupport) {
+    return ncclInvalidUsage;
+  }
+
+  bool multimem = true;
+  struct ncclDevrTeam* tm;
+  NCCLCHECK(symTeamObtain(comm, lsaTeam, multimem, &tm));
+
+  // Return the base multicast address for this team with offset
+  *outPtr = (void*)((uintptr_t)tm->mcBasePtr + winHost->bigOffset + offset);
+  return ncclSuccess;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+
+// Find the least index whose key is strictly greater than arg (returns count if none).
+template<typename Obj, typename Key>
+static int listFindSortedLub(Key Obj::*key, Obj* sorted, int count, Key arg) {
+  int lo = 0, hi = count;
+  while (lo + 16 < hi) {
+    int i = (lo + hi)/2;
+    if (sorted[i].*key <= arg) lo = i+1;
+    else hi = i;
+  }
+  int i = lo;
+  while (i < hi && sorted[i].*key <= arg) i++;
+  return i;
+}
+
+template<typename Obj>
+static void listInsert(Obj** list, int* capacity, int* count, int index, Obj val) {
+  if (*capacity < *count + 1) {
+    *capacity *= 2;
+    if (*capacity == 0) *capacity = 16;
+    *list = (Obj*)realloc(*list, (*capacity)*sizeof(Obj));
+  }
+  for (int j = *count; j != index; j--) {
+    (*list)[j] = (*list)[j-1];
+  }
+  (*list)[index] = val;
+  *count += 1;
+}
+
+template<typename Obj>
+static void listRemove(Obj* list, int* count, int index) {
+  for (int i = index; i+1 < *count; i++) {
+    list[i] = list[i+1];
+  }
+  *count -= 1;
+}
+
diff --git a/projects/rccl/src/device/CMakeLists.txt b/projects/rccl/src/device/CMakeLists.txt
new file mode 100644
index 0000000000..98447428df
--- /dev/null
+++ b/projects/rccl/src/device/CMakeLists.txt
@@ -0,0 +1,60 @@
+# Run the scripts once during configuration to get the file lists
+execute_process(
+    COMMAND ${Python3_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/generate.py ${CMAKE_CURRENT_BINARY_DIR}/gensrc "${ONLY_FUNCS}"
+    OUTPUT_VARIABLE files
+    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
+)
+string(STRIP "${files}" files)
+list(TRANSFORM files PREPEND ${CMAKE_CURRENT_BINARY_DIR}/gensrc/)
+
+execute_process(
+    COMMAND ${Python3_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/symmetric/generate.py ${CMAKE_CURRENT_BINARY_DIR}/gensrc/symmetric "${ONLY_FUNCS}"
+    OUTPUT_VARIABLE symmetric_files
+    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
+)
+string(STRIP "${symmetric_files}" symmetric_files)
+list(TRANSFORM symmetric_files PREPEND ${CMAKE_CURRENT_BINARY_DIR}/gensrc/symmetric/)
+
+# Create custom commands to generate source files with proper dependencies
+add_custom_command(
+    OUTPUT  ${files}
+    BYPRODUCTS ${files}
+    COMMAND ${Python3_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/generate.py ${CMAKE_CURRENT_BINARY_DIR}/gensrc "${ONLY_FUNCS}"
+    DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/generate.py
+    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
+    COMMENT "Generating device source files"
+)
+
+add_custom_command(
+    OUTPUT  ${symmetric_files}
+    BYPRODUCTS ${symmetric_files}
+    COMMAND ${Python3_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/symmetric/generate.py ${CMAKE_CURRENT_BINARY_DIR}/gensrc/symmetric "${ONLY_FUNCS}"
+    DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/symmetric/generate.py
+    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
+    COMMENT "Generating symmetric device source files"
+)
+
+# Add library target
+add_library(nccl_device OBJECT
+            ${files}
+            ${symmetric_files}
+            ${CMAKE_CURRENT_SOURCE_DIR}/common.cu
+            ${CMAKE_CURRENT_SOURCE_DIR}/onerank.cu
+)
+
+set_target_properties(nccl_device PROPERTIES
+    CUDA_SEPARABLE_COMPILATION ON
+    CUDA_RESOLVE_DEVICE_SYMBOLS ON
+)
+
+# Set include directories for the target
+target_include_directories(nccl_device PUBLIC
+    ${CMAKE_CURRENT_SOURCE_DIR}
+    ${CMAKE_SOURCE_DIR}/src/include
+    ${CMAKE_SOURCE_DIR}/src/include/plugin
+    ${CMAKE_BINARY_DIR}/include
+    ${CUDAToolkit_INCLUDE_DIRS}
+    ${CUDAToolkit_INCLUDE_DIRS}/cccl
+)
+
+add_dependencies(nccl_device nccl_header)
diff --git a/projects/rccl/src/device/Makefile b/projects/rccl/src/device/Makefile
index 67ab176cab..fd8f2759d4 100644
--- a/projects/rccl/src/device/Makefile
+++ b/projects/rccl/src/device/Makefile
@@ -19,7 +19,7 @@ OBJDIR := $(BUILDDIR)/obj/device
 MANIFEST := $(OBJDIR)/manifest
 DEVGLUE_OBJ  := $(OBJDIR)/device_glue.o
 
-INCFLAGS  = -I. -I.. -I$(BUILDDIR)/include -I../include
+INCFLAGS  = -I. -I.. -I$(BUILDDIR)/include -I../include -I../include/plugin
 NVCUFLAGS += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
 CXXFLAGS  += $(INCFLAGS)
 
@@ -47,7 +47,11 @@ endif
 define COMPILE_SYM
 @$(SAY) "Compiling" $2;\
  mkdir -p $(dir $1);\
- $(NVCC) $(NVCUFLAGS_SYM) $3 -dw $2 -o $1
+ if [[ -n "$3" ]]; then\
+ $(NVCC) $(NVCUFLAGS_SYM) $3 -dw $2 -o $1;\
+ else\
+ touch $2.empty.cu; $(NVCC) $(NVCUFLAGS_SYM) -dw $2.empty.cu -o $1; rm $2.empty.cu;\
+ fi
 endef
 
 DEPENDS.cu = $(NVCC) $(NVCUFLAGS) -M -dc $1
diff --git a/projects/rccl/src/device/common.h b/projects/rccl/src/device/common.h
index a2884b50c1..a31cf5f8ee 100644
--- a/projects/rccl/src/device/common.h
+++ b/projects/rccl/src/device/common.h
@@ -43,7 +43,7 @@ struct ncclShmemData {
   struct ncclDevKernelArgs args;
   int channelId;
   int aborted;
-  alignas(16) struct ncclDevComm comm;
+  alignas(16) struct ncclKernelComm comm;
   alignas(16) struct ncclDevChannel channel;
 
   int batchIx, nextBatchIx;
@@ -323,7 +323,7 @@ __device__ __forceinline__ void profiler(int action) {
         ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
       }
       ncclShmem.channel.workCounter += ncclShmem.nWorks;
-      if (action == FINI) ((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
+      if (action == FINI) ((ncclKernelCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
     }
   }
 }
@@ -351,7 +351,7 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
   /* set abort flag to 0 */
   if (tid == 0) {
     ncclShmem.aborted = 0;
-    ncclShmem.channel.workCounter = ((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter;
+    ncclShmem.channel.workCounter = ((ncclKernelCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter;
   }
 
   // Use first 2 warps to load comm and channel, and remaining load work batch.
@@ -359,14 +359,14 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
   case 0:
     { void* dst = &ncclShmem.comm;
       void* src = ncclShmem.args.comm;
-      int bytes = sizeof(ncclDevComm);
-      static_assert(sizeof(ncclDevComm) <= 16*WARP_SIZE, "ncclDevComm cannot be loaded by a single warp in one insn.");
+      int bytes = sizeof(ncclKernelComm);
+      static_assert(sizeof(ncclKernelComm) <= 16*WARP_SIZE, "ncclKernelComm cannot be loaded by a single warp in one insn.");
       copyToShmem16(tid, dst, src, bytes);
     } break;
   case 1:
-    { // Get address of channel without incurring indirect load from ncclDevComm::channels
+    { // Get address of channel without incurring indirect load from ncclKernelComm::channels
       void* dst = &ncclShmem.channel;
-      void* src = &((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId];
+      void* src = &((ncclKernelCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId];
       int bytes = sizeof(ncclDevChannel);
       static_assert(sizeof(ncclDevChannel) <= 16*WARP_SIZE, "ncclDevChannel cannot be loaded by a single warp in one insn.");
       copyToShmem16(tid-WARP_SIZE, dst, src, bytes);
diff --git a/projects/rccl/src/device/generate.py b/projects/rccl/src/device/generate.py
index f9c3a0e79a..aefba94229 100755
--- a/projects/rccl/src/device/generate.py
+++ b/projects/rccl/src/device/generate.py
@@ -1,6 +1,7 @@
 #!/usr/bin/env python3
 import os
 import sys
+import shutil
 
 # Order of redops, tys, protos, algos must match src/include/device.h
 all_colls =  ["Broadcast","Reduce","AllGather","ReduceScatter","AllReduce","SendRecv"]
@@ -17,8 +18,11 @@ gensrc = sys.argv[1]
 
 if os.path.exists(gensrc):
   for name in os.listdir(gensrc):
-    os.remove(os.path.join(gensrc, name))
-    #os.truncate(os.path.join(gensrc, name), 0)
+    path = os.path.join(gensrc, name)
+    if os.path.isfile(path):
+      os.remove(path)
+    elif os.path.isdir(path):
+      shutil.rmtree(path)
 else:
   os.mkdir(gensrc)
 
@@ -322,6 +326,16 @@ def partition_by_name(fns):
 name_to_funcs = partition_by_name(fn for fn in primary_funcs if fn[0]!="Nop")
 name_to_kernels = partition_by_name(kfn for kfn in kernel_funcs if kfn[0]!="Generic")
 
+files = ""
+for name in sorted(name_to_funcs.keys()):
+    files += name + ";"
+files += "device_table.cu;"
+files += "host_table.cc"
+
+# Print the file list only for CMake builds (NCCL_USE_CMAKE=1), not when running make
+if os.environ.get("NCCL_USE_CMAKE", "0") == "1":
+    print(files)
+
 # Generate <gensrc>/rules.mk
 with open(os.path.join(gensrc, "rules.mk"), "w") as f:
   out = f.write
diff --git a/projects/rccl/src/device/symmetric/all_gather.cuh b/projects/rccl/src/device/symmetric/all_gather.cuh
index 8f81347ecd..9f050836c5 100644
--- a/projects/rccl/src/device/symmetric/all_gather.cuh
+++ b/projects/rccl/src/device/symmetric/all_gather.cuh
@@ -1,32 +1,33 @@
-#include "symmetric.h"
-#include "symmetric/kernel.cuh"
-#include "symmetric/primitives.cuh"
+#include "sym_kernels.h"
+#include "kernel.cuh"
+#include "primitives.cuh"
 
 template<int BytePerPack, int UnrollPacks, int UnrollPeers>
 static __device__ void bcastDeep(
-    ncclSymPrims& prim, int tn, int t, bool waitNeeded,
-    char* inputHere, char* outputRank0, bool inPlace, int nIters
+    ncclSymkArgsHandler const& handler, int tn, int t,
+    bool waitNeeded, ncclLsaBarrierSession<ncclCoopCta>& bar,
+    ncclSymPtr<char> input, ncclSymPtr<char> output, bool inPlace, int nIters
   ) {
   using Pack = BytePack<BytePerPack>;
   int wn = tn/WARP_SIZE;
   int w = t/WARP_SIZE;
   int lane = t%WARP_SIZE;
-  int const& rank = prim.rank;
-  int const& nRanks = prim.nRanks;
-  uint32_t const& stride4G = prim.stride4G;
-  Pack* inpHere = (Pack*)inputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
-  Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+
+  Pack* inpPacks = (Pack*)input.localPtr() + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
+  ncclSymPtr<Pack> outPacks = (ncclSymPtr<Pack>)output + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
   Pack tmp[UnrollPacks];
 
   nIters -= w;
   if (0 < nIters) {
     #pragma unroll
     for (int u=0; u < UnrollPacks; u++) {
-      tmp[u] = inpHere[u*WARP_SIZE];
+      tmp[u] = inpPacks[u*WARP_SIZE];
     }
   }
 
-  if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  if (waitNeeded) bar.wait(ncclCoopCta(), cuda::memory_order_relaxed);
 
   if (0 < nIters) {
     while (true) {
@@ -44,21 +45,21 @@ static __device__ void bcastDeep(
             if (partial && dr == nRanks) break;
             #pragma unroll UnrollPacks
             for (int u=0; u < UnrollPacks; u++) {
-              add4G(outRank0, r*stride4G)[u*WARP_SIZE] = tmp[u];
+              outPacks.lsaPtr(r)[u*WARP_SIZE] = tmp[u];
             }
             if (++r == nRanks) r = 0;
           }
         }
       }
-      inpHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
-      outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
+      inpPacks += intptr_t(wn)*UnrollPacks*WARP_SIZE;
+      outPacks += intptr_t(wn)*UnrollPacks*WARP_SIZE;
       nIters -= wn;
       if (nIters <= 0) break;
 
       // Load data for next iteration.
       #pragma unroll
       for (int u=0; u < UnrollPacks; u++) {
-        tmp[u] = inpHere[u*WARP_SIZE];
+        tmp[u] = inpPacks[u*WARP_SIZE];
       }
     }
   }
@@ -66,18 +67,17 @@ static __device__ void bcastDeep(
 
 template<int UnrollPeers, typename T>
 static __device__ void bcastEnds(
-    ncclSymPrims& prim, int tn, int t,
-    T* inputHere, T* outputRank0, bool inPlace, size_t nElts, uint32_t nPreElts, size_t nSufElts
+    ncclSymkArgsHandler const& handler, int tn, int t,
+    ncclSymPtr<T> input, ncclSymPtr<T> output, bool inPlace, size_t nElts, uint32_t nPreElts, size_t nSufElts
   ) {
-  int const& rank = prim.rank;
-  int const& nRanks = prim.nRanks;
-  uint32_t const& stride4G = prim.stride4G;
-  BytePack<sizeof(T)>* inpHere = (BytePack<sizeof(T)>*)inputHere;
-  BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+  BytePack<sizeof(T)>* inpPacks = (BytePack<sizeof(T)>*)input.localPtr();
+  ncclSymPtr<BytePack<sizeof(T)>> outPacks = (ncclSymPtr<BytePack<sizeof(T)>>)output;
   #pragma unroll 1
   for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
     size_t elt = i < nPreElts ? i : nElts-nPreElts-nSufElts+i;
-    BytePack<sizeof(T)> tmp = inpHere[elt];
+    BytePack<sizeof(T)> tmp = inpPacks[elt];
     int dr = inPlace ? 1 : 0;
     int r = rank + dr;
     if (r == nRanks) r = 0;
@@ -85,14 +85,14 @@ static __device__ void bcastEnds(
     for (; dr + UnrollPeers <= nRanks; dr += UnrollPeers) {
       #pragma unroll UnrollPeers
       for (int u=0; u < UnrollPeers; u++) {
-        *add4G(outRank0+elt, r*stride4G) = tmp;
+        outPacks.lsaPtr(r)[elt] = tmp;
         if (++r == nRanks) r = 0;
       }
     }
     #pragma unroll UnrollPeers
     for (int u=0; u < UnrollPeers; u++) {
       if (dr+u == nRanks) break;
-      *add4G(outRank0+elt, r*stride4G) = tmp;
+      outPacks.lsaPtr(r)[elt] = tmp;
       if (++r == nRanks) r = 0;
     }
   }
@@ -100,95 +100,99 @@ static __device__ void bcastEnds(
 
 template<typename T>
 static __device__ void bcast(
-    ncclSymPrims& prim, int tn, int t, bool waitNeeded, T* input, T* output, size_t nElts
+    ncclSymkArgsHandler const& handler, int tn, int t, int nBlocks,
+    bool waitNeeded, ncclLsaBarrierSession<ncclCoopCta>& bar,
+    ncclSymPtr<T> input, ncclSymPtr<T> output, size_t nElts
   ) {
   bool inPlace = (input == output);
-  // Mpve to rank=0
-  output = prim.peerPtr(0, output);
-
-  uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
-  uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
   size_t nBytes = nElts*sizeof(T);
+  uint32_t nBlocks_rcp32 = nccl::utility::idivRcp32_upto64(nBlocks);
 
-  uint32_t nPreBytes = (128u - inputUptr)%128u;
+  uint32_t nPreBytes = (16 - input.offset)%16;
   nPreBytes = min((size_t)nPreBytes, nBytes);
   uintptr_t cursor = nPreBytes;
 
   constexpr int MinWarpPerBlock = 4;
 
-  if ((inputUptr-outputUptr)%16 == 0) {
+  if ((input.offset - output.offset)%16 == 0) {
     constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
     constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
     uint32_t chunks = (nBytes-cursor)/BytePerChunk;
-    chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
+    chunks -= imodFast32(chunks, nBlocks, nBlocks_rcp32);
     if (chunks != 0) {
       uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
       bcastDeep<BytePerPack, UnrollPacks, UnrollPeers>(
-        prim, tn, t, waitNeeded,
-        (char*)input + cursor, (char*)output + cursor, inPlace,
-        chunks*MinWarpPerBlock
+        handler, tn, t, waitNeeded, bar,
+        (ncclSymPtr<char>)input + cursor,
+        (ncclSymPtr<char>)output + cursor,
+        inPlace, chunks*MinWarpPerBlock
       );
       cursor = cursorAfter;
       waitNeeded = false;
     }
   }
 
-  if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
+  if (sizeof(T) == 4 || (sizeof(T) < 4 && (input.offset - output.offset)%4 == 0)) {
     constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
     constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
     uint32_t chunks = (nBytes-cursor)/BytePerChunk;
-    chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
+    chunks -= imodFast32(chunks, nBlocks, nBlocks_rcp32);
     if (chunks != 0) {
       uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
       bcastDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers>(
-        prim, tn, t, waitNeeded,
-        (char*)input + cursor, (char*)output + cursor, inPlace,
-        chunks*MinWarpPerBlock
+        handler, tn, t, waitNeeded, bar,
+        (ncclSymPtr<char>)input + cursor,
+        (ncclSymPtr<char>)output + cursor,
+        inPlace, chunks*MinWarpPerBlock
       );
       cursor = cursorAfter;
       waitNeeded = false;
     }
   }
 
-  if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  if (waitNeeded) bar.wait(ncclCoopCta(), cuda::memory_order_relaxed);
 
   constexpr int UnrollPeers = 8;
   size_t nSufElts = (nBytes-cursor)/sizeof(T);
-  bcastEnds<UnrollPeers>(prim, tn, t, input, output, inPlace, nElts, nPreBytes/sizeof(T), nSufElts);
+  bcastEnds<UnrollPeers>(handler, tn, t, input, output, inPlace, nElts, nPreBytes/sizeof(T), nSufElts);
 }
 
-__device__ __forceinline__ void ncclSymRun_AllGather_ST(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
-  int const& rank = prim.rank;
+__device__ __forceinline__ void ncclSymkRun_AllGather_ST(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLsaBarrierSession<ncclCoopCta> bar{
+    ncclCoopCta(), handler.comm, ncclTeamTagLsa(), blockIdx.x
+  };
+  int const& rank = handler.comm.rank;
 
-  // Threads numbered over rank.
-  int bt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
-                     prim.block, prim.nBlocks,
-                     threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
-  int btn = prim.nBlocks*blockDim.x;
+  bar.arrive(ncclCoopCta(), cuda::memory_order_relaxed);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  //prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  bool waitNeeded = true;
+  handler.forEachWork<char>(
+      [&]__device__(int block, int nBlocks, size_t nElts, size_t nAllElts,
+                    ncclSymPtr<char> input, ncclSymPtr<char> output) {
+        // Threads numbered over rank.
+        int bt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
+                           block, nBlocks,
+                           threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
+        int btn = nBlocks*blockDim.x;
 
-  bcast(prim, btn, bt, /*waitNeeded=*/true, (char*)args->input, (char*)args->output + rank*args->nElts, args->nElts);
+        bcast(handler, btn, bt, nBlocks, waitNeeded, bar, input, output + rank*nAllElts, nElts);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/true);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+        waitNeeded = false;
+      }
+    );
+
+  bar.sync(ncclCoopCta(), cuda::memory_order_release);
 }
 
-
 template<typename T>
 static __device__ void bcastMultimem(
-    ncclSymPrims& prim, int tn, int t, T* input, T* output, size_t nElts
+    ncclSymkArgsHandler& handler, int tn, int t, ncclSymPtr<T> input, ncclSymPtr<T> output, size_t nElts
   ) {
-  // Move output to multimem
-  output = prim.multimemPtr(output);
-
-  uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
-  uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
   size_t nBytes = nElts*sizeof(T);
-
-  uint32_t nPreBytes = (16-inputUptr)%16;
+  uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input.localPtr());
+  uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output.multimemPtr(handler.comm.lsaMultimem));
+  uint32_t nPreBytes = (16 - input.offset)%16;
   nPreBytes = min((size_t)nPreBytes, nBytes);
   uintptr_t nSufBytes;
 
@@ -227,51 +231,52 @@ static __device__ void bcastMultimem(
     uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
     BytePack<sizeof(T)> val = *reinterpret_cast<BytePack<sizeof(T)>*>(inputUptr + cursor);
     multimem_st_global(outputUptr + cursor, val);
-    cursor += tn*sizeof(T);
   }
 }
 
-__device__ __forceinline__ void ncclSymRun_AllGather_STMC(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
-  int const& rank = prim.rank;
+__device__ __forceinline__ void ncclSymkRun_AllGather_STMC(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLsaBarrierSession<ncclCoopCta> bar(
+    ncclCoopCta(), handler.comm, ncclTeamTagLsa(), blockIdx.x, /*multimem=*/true
+  );
+  int const& rank = handler.comm.rank;
 
-  char* input = args->input;
-  char* output = args->output;
-  size_t bytes = args->nElts;
-  // Round robin memory to blocks.
-  int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
-                    prim.block, prim.nBlocks,
-                    threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
-  int tn = prim.nBlocks*blockDim.x;
+  bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  handler.forEachWork<char>(
+      [&]__device__(int block, int nBlocks, size_t nElts, size_t nAllElts,
+                    ncclSymPtr<char> input, ncclSymPtr<char> output) {
+        // Round robin memory to blocks.
+        int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
+                          block, nBlocks,
+                          threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
+        int tn = nBlocks*blockDim.x;
 
-  bcastMultimem(prim, tn, t, input, output + rank*bytes, bytes);
+        bcastMultimem(handler, tn, t, input, output + rank*nAllElts, nElts);
+      }
+    );
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/true);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  bar.sync(ncclCoopCta(), cuda::memory_order_release);
 }
 
 template<typename EltType>
 static __device__ void allgather_LL_body(
-    ncclSymPrims &prim, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts
+    ncclSymkArgsHandler& handler, ncclLLA2ASession<ncclCoopCta>& lla2a,
+    EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts
   ) {
   using Pack = BytePack<8>;
   constexpr int EltPerPack = 8/sizeof(EltType);
-
-  ncclCoopCta cta;
-  int rank = prim.rank;
-  int nRanks = prim.nRanks;
-  constexpr int tn = ncclSymMaxThreads;
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
   int t = threadIdx.x;
+  constexpr int tn = ncclSymkMaxThreads;
 
   #pragma unroll 1
   while (0 < nElts) {
     int nIterPacks = min(nPacks, tn);
     if (t < nIterPacks) {
       Pack x = loadPack<Pack>(input, t*EltPerPack, nElts);
-      prim.bcastLL(/*slot=*/nIterPacks*rank + t, x);
+      lla2a.bcast(/*slot=*/nIterPacks*rank + t, x);
     }
 
     int tn_div_nPacks = tn/nIterPacks;
@@ -284,7 +289,7 @@ static __device__ void allgather_LL_body(
       #pragma unroll 1
       for (int i = t; i < (nRanks*nIterPacks & -(Unroll*tn)); i += Unroll*tn) {
         Pack got[Unroll];
-        prim.template recvLL<Unroll, Unroll>(i, Unroll, tn, /*&*/got);
+        lla2a.template recvUnrolled<Unroll, Unroll>(i, Unroll, tn, /*&*/got);
         #pragma unroll
         for (int u=0; u < Unroll; u++) {
           storePack<Pack>(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
@@ -299,7 +304,7 @@ static __device__ void allgather_LL_body(
       if (i + n*tn < nRanks*nIterPacks) n += 1;
       if (n != 0) {
         Pack got[Unroll];
-        prim.template recvLL<1, Unroll>(i, n, tn, /*&*/got);
+        lla2a.template recvUnrolled<1, Unroll>(i, n, tn, /*&*/got);
         #pragma unroll
         for (int u=0; u < Unroll; u++) {
           if (u != 0 && u == n) break;
@@ -313,7 +318,7 @@ static __device__ void allgather_LL_body(
       // The non-unrolled but "obviously correct" implementation for reference.
       #pragma unroll 1
       for (int i = t; i < nRanks*nIterPacks; i += tn) {
-        Pack got = prim.template recvLL<Pack>(i);
+        Pack got = lla2a.template recv<Pack>(i);
         storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got);
         peer += tn_div_nPacks;
         pack += tn_mod_nPacks;
@@ -321,7 +326,7 @@ static __device__ void allgather_LL_body(
       }
     #endif
 
-    prim.endLL(cta);
+    lla2a.endEpoch(ncclCoopCta());
 
     input += tn*EltPerPack;
     output += tn*EltPerPack;
@@ -330,38 +335,41 @@ static __device__ void allgather_LL_body(
   }
 }
 
-static __device__ void ncclSymRun_AllGather_LL_impl(ncclSymDevArgs const* args, bool multimem) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
+static __device__ void ncclSymkRun_AllGather_LL_impl(ncclSymkDevWorkArgs const* args, bool multimem) {
+  ncclSymkArgsHandler handler{args};
+  ncclLLA2ASession<ncclCoopCta> lla2a(
+    ncclCoopCta(), handler.comm, ncclTeamLsa(handler.comm), handler.lsaLLA2A, blockIdx.x, /*maxElts=*/ncclSymkMaxThreads, multimem, handler.comm.lsaMultimem
+  );
+
   using Pack = BytePack<8>;
   constexpr int BytePerPack = 8;
-  int nElts = args->nElts;
-  int nPacks = divUp(nElts, BytePerPack);
 
-  uint32_t nPackPerBlock, nPackModBlock;
-  idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
-  int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
-  int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
-  int nBlockPacks = blockPackEnd - blockPackBegin;
-  int nBlockElts = nElts - blockPackBegin*BytePerPack;
-  nBlockElts = min(nBlockElts, nBlockPacks*BytePerPack);
-  char* blockInput = args->input + blockPackBegin*BytePerPack;
-  char* blockOutput = args->output + blockPackBegin*BytePerPack;
+  handler.singleWork<char>(
+      [&]__device__(int nElts, int nAllElts,
+                    ncclSymPtr<char> input, ncclSymPtr<char> output) {
+        int nPacks = divUp(nElts, BytePerPack);
 
-  uint32_t lowBits = args->nElts;
-  lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
-  lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
-  if (__builtin_expect(lowBits%8 == 0, true)) {
-    // NOTE: Specializing for 8-byte alignment in one case help at size=65K: 8.9us vs 5.6us
-    allgather_LL_body(prim, (BytePack<8>*)blockInput, (BytePack<8>*)blockOutput, nBlockElts/8, nBlockPacks, nElts/8);
-  } else {
-    allgather_LL_body(prim, blockInput, blockOutput, nBlockElts, nBlockPacks, nElts);
-  }
+        char* blockInput = input.localPtr();
+        char* blockOutput = output.localPtr();
+
+        uint32_t lowBits = nElts;
+        lowBits |= (uintptr_t)blockInput;
+        lowBits |= (uintptr_t)blockOutput;
+        if (__builtin_expect(lowBits%8 == 0, true)) {
+          // NOTE: Specializing for 8-byte alignment in one case helps at size=65K: 8.9us vs 5.6us
+          allgather_LL_body(handler, lla2a, (BytePack<8>*)blockInput, (BytePack<8>*)blockOutput,
+                            nElts/8, nPacks, nAllElts/8);
+        } else {
+          allgather_LL_body(handler, lla2a, blockInput, blockOutput, nElts, nPacks, nAllElts);
+        }
+      }
+    );
 }
 
-__device__ __forceinline__ void ncclSymRun_AllGather_LL(ncclSymDevArgs const* args) {
-  ncclSymRun_AllGather_LL_impl(args, /*multimem=*/false);
+__device__ __forceinline__ void ncclSymkRun_AllGather_LL(ncclSymkDevWorkArgs const* args) {
+  ncclSymkRun_AllGather_LL_impl(args, /*multimem=*/false);
 }
 
-__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(ncclSymDevArgs const* args) {
-  ncclSymRun_AllGather_LL_impl(args, /*multimem=*/true);
+__device__ __forceinline__ void ncclSymkRun_AllGather_LLMC(ncclSymkDevWorkArgs const* args) {
+  ncclSymkRun_AllGather_LL_impl(args, /*multimem=*/true);
 }
diff --git a/projects/rccl/src/device/symmetric/all_reduce.cuh b/projects/rccl/src/device/symmetric/all_reduce.cuh
index 6c5219784d..94e40babbf 100644
--- a/projects/rccl/src/device/symmetric/all_reduce.cuh
+++ b/projects/rccl/src/device/symmetric/all_reduce.cuh
@@ -1,35 +1,38 @@
-#include "symmetric.h"
-#include "symmetric/kernel.cuh"
-#include "symmetric/primitives.cuh"
+#include "sym_kernels.h"
+#include "nccl_device.h"
+#include "kernel.cuh"
+#include "primitives.cuh"
 
 template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
 static __device__ __forceinline__ void allreduceDeep(
-    ncclSymPrims& prim, int tn, int t, bool waitNeeded,
-    Red red, char* inputRank0, char* outputRank0, int32_t nIters
+    ncclSymkArgsHandler const& handler, int tn, int t,
+    bool waitNeeded, ncclLsaBarrierSession<ncclCoopCta>& bar,
+    Red red, ncclSymPtr<char> input, ncclSymPtr<char> output, int32_t nIters
   ) {
   using Pack = BytePack<BytePerPack>;
   using Acc = typename Red::EltType;
   using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
 
+  ncclTeam world = ncclTeamWorld(handler.comm);
   int wn = tn/WARP_SIZE;
   int w = t/WARP_SIZE;
   int lane = t%WARP_SIZE;
-  int const& rank = prim.rank;
-  int const& nRanks = prim.nRanks;
-  uint32_t const& stride4G = prim.stride4G;
-  Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
-  Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+
+  ncclSymPtr<Pack> inpPacks = (ncclSymPtr<Pack>)input + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
+  ncclSymPtr<Pack> outPacks = (ncclSymPtr<Pack>)output + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
   Pack acc0[UnrollPacks];
 
   nIters -= w;
   if (0 < nIters) {
     #pragma unroll
     for (int u=0; u < UnrollPacks; u++) {
-      acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
+      acc0[u] = inpPacks.peerPtr(world, rank)[u*WARP_SIZE];
     }
   }
 
-  if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  if (waitNeeded) bar.wait(ncclCoopCta(), cuda::memory_order_relaxed);
 
   if (0 < nIters) {
     while (true) {
@@ -39,7 +42,7 @@ static __device__ __forceinline__ void allreduceDeep(
       { Pack tmp1[UnrollPacks];
         #pragma unroll
         for (int u=0; u < UnrollPacks; u++) {
-          tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
+          tmp1[u] = inpPacks.peerPtr(world, r)[u*WARP_SIZE];
         }
         #pragma unroll
         for (int u=0; u < UnrollPacks; u++) {
@@ -64,7 +67,7 @@ static __device__ __forceinline__ void allreduceDeep(
             if (partial && ur!=0 && dr+ur == nRanks) break;
             #pragma unroll UnrollPacks
             for (int u=0; u < UnrollPacks; u++) {
-              tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
+              tmp1[ur][u] = inpPacks.peerPtr(world, r)[u*WARP_SIZE];
             }
             if (++r == nRanks) r = 0;
           }
@@ -95,22 +98,22 @@ static __device__ __forceinline__ void allreduceDeep(
             if (partial && dr == nRanks) break;
             #pragma unroll UnrollPacks
             for (int u=0; u < UnrollPacks; u++) {
-              add4G(outRank0, r*stride4G)[u*WARP_SIZE] = acc0[u];
+              outPacks.peerPtr(world, r)[u*WARP_SIZE] = acc0[u];
             }
             if (++r == nRanks) r = 0;
           }
         }
       }
 
-      inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
-      outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
+      inpPacks += intptr_t(wn)*UnrollPacks*WARP_SIZE;
+      outPacks += intptr_t(wn)*UnrollPacks*WARP_SIZE;
       nIters -= wn;
       if (nIters <= 0) break;
 
       // Load data for next iteration.
       #pragma unroll
       for (int u=0; u < UnrollPacks; u++) {
-        acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
+        acc0[u] = inpPacks.peerPtr(world, rank)[u*WARP_SIZE];
       }
     }
   }
@@ -118,21 +121,23 @@ static __device__ __forceinline__ void allreduceDeep(
 
 template<int UnrollPeers, typename Red, typename T>
 static __device__ __forceinline__ void allreduceEnds(
-    ncclSymPrims& prim, int tn, int t, Red red,
-    T* inputRank0, T* outputRank0, size_t nElts, uint32_t nPreElts, size_t nSufElts
+    ncclSymkArgsHandler const& handler, int tn, int t, Red red,
+    ncclSymPtr<T> input, ncclSymPtr<T> output,
+    size_t nElts, uint32_t nPreElts, size_t nSufElts
   ) {
   using Acc = typename Red::EltType;
 
-  int const& rank = prim.rank;
-  int const& nRanks = prim.nRanks;
-  uint32_t const& stride4G = prim.stride4G;
-  BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
-  BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
+  ncclTeam world = ncclTeamWorld(handler.comm);
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+
+  ncclSymPtr<BytePack<sizeof(T)>> inpPacks = (ncclSymPtr<BytePack<sizeof(T)>>)input;
+  ncclSymPtr<BytePack<sizeof(T)>> outPacks = (ncclSymPtr<BytePack<sizeof(T)>>)output;
 
   #pragma unroll 1
   for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
     size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
-    BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
+    BytePack<sizeof(T)> acc0 = inpPacks.peerPtr(world, rank)[elt];
     BytePack<sizeof(Acc)> acc1;
     BytePack<sizeof(T)> tmp[UnrollPeers];
     int dr = 1;
@@ -151,7 +156,7 @@ static __device__ __forceinline__ void allreduceEnds(
         #pragma unroll
         for (int u=0; u < UnrollPeers-partial; u++) {
           if (partial && u!=0 && dr+u == nRanks) break;
-          tmp[u] = *add4G(inpRank0+elt, r*stride4G);
+          tmp[u] = inpPacks.peerPtr(world, r)[elt];
           r += 1;
           if (r == nRanks) r = 0;
         }
@@ -179,7 +184,7 @@ static __device__ __forceinline__ void allreduceEnds(
         #pragma unroll
         for (int u=0; u < UnrollPeers-partial; u++) {
           if (partial && dr+u == nRanks) break;
-          *add4G(outRank0+elt, r*stride4G) = acc0;
+          outPacks.peerPtr(world, r)[elt] = acc0;
           r += 1;
           if (r == nRanks) r = 0;
         }
@@ -190,35 +195,33 @@ static __device__ __forceinline__ void allreduceEnds(
 
 template<typename Red, typename T>
 static __device__ void allreduce(
-    ncclSymPrims& prim, int tn, int t, bool waitNeeded,
-    Red red, T* input, T* output, size_t nElts
+    ncclSymkArgsHandler const& handler, int tn, int t, int nBlocks,
+    bool waitNeeded, ncclLsaBarrierSession<ncclCoopCta>& bar,
+    Red red, ncclSymPtr<T> input, ncclSymPtr<T> output, size_t nElts
   ) {
-  int nRanks = prim.nRanks;
-  int nBlocks = prim.nBlocks;
-  // Mpve to rank=0
-  input = prim.peerPtr(0, input);
-  output = prim.peerPtr(0, output);
-
-  uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
-  uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
+  int const& nRanks = handler.comm.nRanks;
+  int const& nRanks_rcp32 = handler.nRanks_rcp32;
   size_t nBytes = nElts*sizeof(T);
+  uint32_t nBlocks_rcp32 = nccl::utility::idivRcp32_upto64(nBlocks);
+  uint32_t nRanks_nBlocks_rcp32 = nccl::utility::imulRcp32(nRanks, nRanks_rcp32, nBlocks, nBlocks_rcp32);
 
-  uint32_t nPreBytes = (16u - inputUptr)%16u;
+  uint32_t nPreBytes = (16u - input.offset)%16u;
   nPreBytes = min((size_t)nPreBytes, nBytes);
   uintptr_t cursor = nPreBytes;
 
   constexpr int MinWarpPerBlock = 4;
 
-  if ((inputUptr-outputUptr)%16 == 0) {
+  if ((input.offset - output.offset)%16 == 0) {
     constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
     constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
     uint32_t chunks = (nBytes-cursor)/BytePerChunk;
-    chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
+    chunks -= imodFast32(chunks, nRanks*nBlocks, nRanks_nBlocks_rcp32);
     if (chunks != 0) {
       uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
       allreduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
-        prim, tn, t, waitNeeded, red,
-        (char*)input + cursor, (char*)output + cursor,
+        handler, tn, t, waitNeeded, bar, red,
+        (ncclSymPtr<char>)input + cursor,
+        (ncclSymPtr<char>)output + cursor,
         chunks*MinWarpPerBlock
       );
       cursor = cursorAfter;
@@ -226,16 +229,17 @@ static __device__ void allreduce(
     }
   }
 
-  if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
+  if (sizeof(T) == 4 || (sizeof(T) < 4 && (input.offset - output.offset)%4 == 0)) {
     constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
     constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
     uint32_t chunks = (nBytes-cursor)/BytePerChunk;
-    chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
+    chunks -= imodFast32(chunks, nRanks*nBlocks, nRanks_nBlocks_rcp32);
     if (chunks != 0) {
       uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
       allreduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
-        prim, tn, t, waitNeeded, red,
-        (char*)input + cursor, (char*)output + cursor,
+        handler, tn, t, waitNeeded, bar, red,
+        (ncclSymPtr<char>)input + cursor,
+        (ncclSymPtr<char>)output + cursor,
         chunks*MinWarpPerBlock
       );
       cursor = cursorAfter;
@@ -243,46 +247,51 @@ static __device__ void allreduce(
     }
   }
 
-  if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  if (waitNeeded) bar.wait(ncclCoopCta(), cuda::memory_order_relaxed);
 
   constexpr int UnrollPeers = 8;
   size_t nSufElts = (nBytes-cursor)/sizeof(T);
-  allreduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
+  allreduceEnds<UnrollPeers>(handler, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
 }
 
-
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
-  int /*const&*/ rank = prim.rank;
-  int /*const&*/ nRanks = prim.nRanks;
-  Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_RSxLD_AGxST(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLsaBarrierSession<ncclCoopCta> bar{
+    ncclCoopCta(), handler.comm, ncclTeamTagLsa(), blockIdx.x
+  };
 
-  // Threads numbered globally such that we round robin warps by rank then block.
-  int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
-                     rank, nRanks,
-                     prim.block, prim.nBlocks,
-                     threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
-  int gtn = nRanks*prim.nBlocks*blockDim.x;
+  Red<typename ncclSymkAccumType<Red, T, /*nvls=*/false>::Type> red(handler.devWork->redOpArg);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  //prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
 
-  allreduce(prim, gtn, gt, /*waitNeeded=*/true, red, (T*)args->input, (T*)args->output, args->nElts);
+  bar.arrive(ncclCoopCta(), cuda::memory_order_relaxed);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/true);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  bool waitNeeded = true;
+  handler.forEachWork<T>(
+      [&]__device__(int block, int nBlocks, size_t nElts, size_t nAllElts,
+                    ncclSymPtr<T> input, ncclSymPtr<T> output) {
+        // Threads numbered globally such that we round robin warps by rank then block.
+        int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
+                           rank, nRanks,
+                           block, nBlocks,
+                           threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
+        int gtn = nRanks*nBlocks*blockDim.x;
+
+        allreduce(handler, gtn, gt, nBlocks, waitNeeded, bar, red, input, output, nElts);
+
+        waitNeeded = false;
+      }
+    );
+
+  bar.sync(ncclCoopCta(), cuda::memory_order_release);
 }
 
-
 template<typename Red, typename T>
 static __device__ void allreduceMultimem(
-    ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
+    int tn, int t, Red red, T* input, T* output, size_t nElts
   ) {
-  // Mpve to multimem
-  input = prim.multimemPtr(input);
-  output = prim.multimemPtr(output);
-
   uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
   uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
   size_t nBytes = nElts*sizeof(T);
@@ -327,106 +336,132 @@ static __device__ void allreduceMultimem(
     uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
     BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
     multimem_st_global(outputUptr + cursor, val);
-    cursor += tn*sizeof(T);
   }
 }
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
-  Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_RSxLDMC_AGxSTMC(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLsaBarrierSession<ncclCoopCta> bar{
+    ncclCoopCta(), handler.comm, ncclTeamTagLsa(), blockIdx.x, /*multimem=*/true
+  };
 
-  // Threads numbered globally such that we round robin warps by rank then block.
-  int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
-                     prim.rank, prim.nRanks,
-                     prim.block, prim.nBlocks,
-                     threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
-  int gtn = prim.nRanks*prim.nBlocks*blockDim.x;
+  Red<typename ncclSymkAccumType<Red, T, /*nvls=*/true>::Type> red(handler.devWork->redOpArg);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+  auto const& multimem = handler.comm.lsaMultimem;
 
-  allreduceMultimem(prim, gtn, gt, red, (T*)args->input, (T*)args->output, args->nElts);
+  bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/true);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  handler.forEachWork<T>(
+      [&]__device__(int block, int nBlocks, size_t nElts, size_t nAllElts,
+                    ncclSymPtr<T> input, ncclSymPtr<T> output) {
+        // Threads numbered globally such that we round robin warps by rank then block.
+        int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
+                           rank, nRanks,
+                           block, nBlocks,
+                           threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
+        int gtn = nRanks*nBlocks*blockDim.x;
+
+        allreduceMultimem(gtn, gt, red, input.multimemPtr(multimem), output.multimemPtr(multimem), nElts);
+      }
+    );
+
+  bar.sync(ncclCoopCta(), cuda::memory_order_release);
 }
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R_impl(ncclSymDevArgs const* args, bool multimem) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
-  int /*const&*/ rank = prim.rank;
-  using Acc = typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type;
-  Red<Acc> red(args->redOpArg);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_AGxLL_R_impl(ncclSymkDevWorkArgs const* args, bool multimem) {
+  ncclSymkArgsHandler handler{args};
+  ncclLLA2ASession<ncclCoopCta> lla2a(
+    ncclCoopCta(), handler.comm, ncclTeamLsa(handler.comm), handler.lsaLLA2A,
+    blockIdx.x, ncclSymkMaxThreads, multimem, handler.comm.lsaMultimem
+  );
+
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+  using Acc = typename ncclSymkAccumType<Red, T, /*nvls=*/false>::Type;
+  Red<Acc> red(handler.devWork->redOpArg);
 
   using Pack = BytePack<8>;
   using AccPack = BytePack<8*sizeof(Acc)/sizeof(T)>;
   constexpr int EltPerPack = 8/sizeof(T);
-  int nElts = args->nElts;
-  int nPacks = divUp(nElts, EltPerPack);
 
-  bool packAligned = 8 <= alignof(T) || (
-      args->nElts*sizeof(T) |
-      (uint32_t)reinterpret_cast<uintptr_t>(args->input) |
-      (uint32_t)reinterpret_cast<uintptr_t>(args->output)
-    )%8 == 0;
+  handler.singleWork<T>(
+      [&]__device__(int nElts, int nAllElts,
+                    ncclSymPtr<T> inputPtr, ncclSymPtr<T> outputPtr) {
+        int nPacks = divUp(nElts, EltPerPack);
 
-  uint32_t nPackPerBlock, nPackModBlock;
-  idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
-  int begin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
-  int end = begin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
+        T* input = (T*)inputPtr.localPtr();
+        T* output = (T*)outputPtr.localPtr();
 
-  nPacks = end - begin;
-  nElts -= begin*EltPerPack;
-  nElts = min(nElts, nPacks*EltPerPack);
-  T* input = (T*)args->input + begin*EltPerPack;
-  T* output = (T*)args->output + begin*EltPerPack;
+        bool packAligned = 8 <= alignof(T) || (nElts*sizeof(T) | (uintptr_t)input | (uintptr_t)output)%8 == 0;
 
-  ncclCoopCta cta;
-  int t = threadIdx.x;
-  int tn = ncclSymMaxThreads;
+        ncclCoopCta cta;
+        int t = threadIdx.x;
+        int tn = ncclSymkMaxThreads;
 
-  if (__builtin_expect(packAligned, true)) {
-    #pragma unroll 1
-    while (0 < nPacks) {
-      if (t < nPacks) {
-        int nIterPacks = min(nPacks, tn);
-        Pack inp = loadPack<Pack>((Pack*)input, t, nPacks);
-        prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
-        Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
-        storePack((Pack*)output, t, nPacks, out);
+        if (__builtin_expect(packAligned, true)) {
+          #pragma unroll 1
+          while (0 < nPacks) {
+            if (t < nPacks) {
+              int nIterPacks = min(nPacks, tn);
+              Pack inp = loadPack<Pack>((Pack*)input, t, nPacks);
+              lla2a.bcast(/*slot=*/nIterPacks*rank + t, inp);
+              AccPack out = lla2a.template recvReduce</*Unroll=*/8, Pack>(
+                /*slotStart=*/t, /*slotCount=*/nRanks, /*slotStride=*/nIterPacks,
+                /*eltToAcc=*/[&] __device__ (Pack x)->AccPack {
+                  return applyCast<T, Acc>(x);
+                },
+                /*reduce=*/[&] __device__ (AccPack a, AccPack b)->AccPack {
+                  return applyReduce(red, a, b);
+                }
+              );
+              storePack((Pack*)output, t, nPacks, applyCast<Acc, T>(out));
+            }
+            lla2a.endEpoch(cta);
+
+            input += tn*EltPerPack;
+            output += tn*EltPerPack;
+            nPacks -= tn;
+          }
+        } else {
+          #pragma unroll 1
+          while (0 < nElts) {
+            if (t*EltPerPack < nElts) {
+              int nIterPacks = min(nPacks, tn);
+              Pack inp = loadPack<Pack>(input, t*EltPerPack, nElts);
+              lla2a.bcast(/*slot=*/nIterPacks*rank + t, inp);
+              AccPack out = lla2a.template recvReduce</*Unroll=*/8, Pack>(
+                /*slotStart=*/t, /*slotCount=*/nRanks, /*slotStride=*/nIterPacks,
+                /*eltToAcc=*/[&] __device__ (Pack x)->AccPack {
+                  return applyCast<T, Acc>(x);
+                },
+                /*reduce=*/[&] __device__ (AccPack a, AccPack b)->AccPack {
+                  return applyReduce(red, a, b);
+                }
+              );
+              storePack(output, t*EltPerPack, nElts, applyCast<Acc, T>(out));
+            }
+            lla2a.endEpoch(cta);
+
+            input += tn*EltPerPack;
+            output += tn*EltPerPack;
+            nElts -= tn*EltPerPack;
+            nPacks -= tn;
+          }
+        }
       }
-      prim.endLL(cta);
-
-      input += tn*EltPerPack;
-      output += tn*EltPerPack;
-      nPacks -= tn;
-    }
-  } else {
-    #pragma unroll 1
-    while (0 < nElts) {
-      if (t*EltPerPack < nElts) {
-        int nIterPacks = min(nPacks, tn);
-        Pack inp = loadPack<Pack>(input, t*EltPerPack, nElts);
-        prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
-        Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
-        storePack(output, t*EltPerPack, nElts, out);
-      }
-      prim.endLL(cta);
-
-      input += tn*EltPerPack;
-      output += tn*EltPerPack;
-      nElts -= tn*EltPerPack;
-      nPacks -= tn;
-    }
-  }
+    );
 }
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(ncclSymDevArgs const* args) {
-  ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/false);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_AGxLL_R(ncclSymkDevWorkArgs const* args) {
+  ncclSymkRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/false);
 }
+
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(ncclSymDevArgs const* args) {
-  ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/true);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_AGxLLMC_R(ncclSymkDevWorkArgs const* args) {
+  ncclSymkRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/true);
 }
diff --git a/projects/rccl/src/device/symmetric/generate.py b/projects/rccl/src/device/symmetric/generate.py
index 8fcb9a425c..8e62bda5bb 100755
--- a/projects/rccl/src/device/symmetric/generate.py
+++ b/projects/rccl/src/device/symmetric/generate.py
@@ -1,6 +1,7 @@
 #!/usr/bin/env python3
 import os
 import sys
+import shutil
 
 ################################################################################
 # The first command line argument is the path to the directory to generate and
@@ -10,8 +11,11 @@ gensrc = sys.argv[1]
 
 if os.path.exists(gensrc):
   for name in os.listdir(gensrc):
-    os.remove(os.path.join(gensrc, name))
-    #os.truncate(os.path.join(gensrc, name), 0)
+    path = os.path.join(gensrc, name)
+    if os.path.isfile(path):
+      os.remove(path)
+    elif os.path.isdir(path):
+      shutil.rmtree(path)
 else:
   os.mkdir(gensrc)
 
@@ -94,7 +98,7 @@ def enumerate_kernels():
         yield Rec(coll="ReduceScatter", algo=algo, red=red, ty=ty)
 
 def required_cuda(k):
-  cudart, arch, specific_sms  = 0, 0, None
+  cudart, arch, specific_sms  = 0, 600, None
   is_nvls = k.algo in nvls_algos_by_coll.get(k.coll, [])
   if is_nvls:
     cudart = max(cudart, 12010)
@@ -133,13 +137,13 @@ def kernel_gencode(k):
 
 def kernel_cname(k):
   if k.coll in reductions:
-    return paste("_", "ncclSymDevKernel", k.coll, k.algo, k.red, k.ty)
+    return paste("_", "ncclSymkDevKernel", k.coll, k.algo, k.red, k.ty)
   else:
-    return paste("_", "ncclSymDevKernel", k.coll, k.algo)
+    return paste("_", "ncclSymkDevKernel", k.coll, k.algo)
 
 def kernel_conds(k):
   cudart, arch, specific_sms = required_cuda(k)
-  if cudart == 0: return (None, None)
+  if cudart == 0 and arch == 0: return (None, None)
 
   cudart_cond = "CUDART_VERSION >= %d"%cudart
   if not specific_sms:
@@ -152,30 +156,30 @@ def instantiate(k):
   cudart_cond, arch_cond = kernel_conds(k)
   if (cudart_cond, arch_cond) == (None, None):
     form_red_ty = (
-      "__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
-      "  ncclSymRun_{id}<{red}, {ty}>(&args);\n"
+      "__global__ void {cname}(ncclSymkDevWorkArgs4K NCCL_GRID_CONSTANT const args4K) {{\n"
+      "  ncclSymkRun_{id}<{red}, {ty}>(&args4K.args);\n"
       "}}"
     )
     form = (
-      "__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
-      "  ncclSymRun_{id}(&args);\n"
+      "__global__ void {cname}(ncclSymkDevWorkArgs4K NCCL_GRID_CONSTANT const args4K) {{\n"
+      "  ncclSymkRun_{id}(&args4K.args);\n"
       "}}"
     )
   else:
     form_red_ty = (
       "#if {cudart_cond}\n"
-      "  __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
+      "  __global__ void {cname}(ncclSymkDevWorkArgs4K NCCL_GRID_CONSTANT const args4K) {{\n"
       "    #if {arch_cond}\n"
-      "      ncclSymRun_{id}<{red}, {ty}>(&args);\n"
+      "      ncclSymkRun_{id}<{red}, {ty}>(&args4K.args);\n"
       "    #endif\n"
       "  }}\n"
       "#endif"
     )
     form = (
       "#if {cudart_cond}\n"
-      "  __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
+      "  __global__ void {cname}(ncclSymkDevWorkArgs4K NCCL_GRID_CONSTANT const args4K) {{\n"
       "    #if {arch_cond}\n"
-      "      ncclSymRun_{id}(&args);\n"
+      "      ncclSymkRun_{id}(&args4K.args);\n"
       "    #endif\n"
       "  }}\n"
       "#endif"
@@ -192,11 +196,11 @@ def instantiate(k):
 def prototype(k):
   cudart_cond, arch_cond = kernel_conds(k)
   if cudart_cond is None:
-    form = "__global__ void {cname}(ncclSymDevArgs const);"
+    form = "__global__ void {cname}(ncclSymkDevWorkArgs4K const);"
   else:
     form = (
       "#if {cudart_cond}\n"
-      "  __global__ void {cname}(ncclSymDevArgs const);\n"
+      "  __global__ void {cname}(ncclSymkDevWorkArgs4K const);\n"
       "#else\n"
       "  constexpr void* {cname} = nullptr;\n"
       "#endif"
@@ -223,18 +227,20 @@ for coll in set(k.coll for k in enumerate_kernels()):
   if (fname, coll) not in kernels_by_file:
     kernels_by_file[fname, coll] = []
 
+files_to_print = ""
 # Generate each kernel instantiation file
 for (fname, coll), ks in kernels_by_file.items():
+  files_to_print += fname + ";"
   with open(os.path.join(gensrc, fname), "w") as f:
-    emitln(f, '#include "symmetric.h"')
+    emitln(f, '#include "sym_kernels.h"')
     emitln(f, '#include "symmetric/kernel.cuh"')
     emitln(f, '#include "symmetric/{coll}.cuh"'.format(coll=coll_to_lower[coll]))
     for k in ks:
       emitln(f, instantiate(k))
 
-# Generate <gensrc>/symmetric_host.cc
-with open(os.path.join(gensrc, "symmetric_kernels.cc"), "w") as f:
-  emitln(f, '#include "symmetric.h"')
+# Generate <gensrc>/sym_kernels_host.cc
+with open(os.path.join(gensrc, "sym_kernels_host.cc"), "w") as f:
+  emitln(f, '#include "sym_kernels.h"')
   emitln(f, '#include "device.h"')
   emitln(f, '')
 
@@ -242,19 +248,19 @@ with open(os.path.join(gensrc, "symmetric_kernels.cc"), "w") as f:
     emitln(f, prototype(k))
   emitln(f, '')
 
-  emitln(f, 'extern int const ncclSymKernelCount = %d;' % len(list(enumerate_kernels())))
-  emitln(f, 'extern void* const ncclSymKernelList[] = {')
+  emitln(f, 'extern int const ncclSymkKernelCount = %d;' % len(list(enumerate_kernels())))
+  emitln(f, 'extern void* const ncclSymkKernelList[] = {')
   for k in enumerate_kernels():
     emitln(f, '(void*){cname},'.format(cname=kernel_cname(k)))
   emitln(f, 'nullptr};')
   emitln(f, '')
 
-  emitln(f, 'void* ncclSymGetKernelPtr(ncclSymKernelId id, int red, ncclDataType_t ty) {')
+  emitln(f, 'void* ncclSymkGetKernelPtr(ncclSymkKernelId id, int red, ncclDataType_t ty) {')
   indents += 1
   emitln(f, 'switch (id) {')
   emitln(f, 'default: return nullptr;')
   for (coll, algo), coll_algo_ks in partition(enumerate_kernels(), lambda k: (k.coll, k.algo)).items():
-    emitln(f, 'case ncclSymKernelId_'+coll+'_'+algo+':')
+    emitln(f, 'case ncclSymkKernelId_'+coll+'_'+algo+':')
     indents += 1
     if len(coll_algo_ks) == 1:
       emitln(f, 'return (void*)&'+kernel_cname(coll_algo_ks[0])+';')
@@ -277,9 +283,15 @@ with open(os.path.join(gensrc, "symmetric_kernels.cc"), "w") as f:
   emitln(f, '}')
 
 # Generate <gensrc>/rules.mk
+files_to_print += "rules.mk;"
+files_to_print += "sym_kernels_host.cc;"
+
+if os.environ.get("NCCL_USE_CMAKE", "0") == "1":
+  print(files_to_print)
+
 with open(os.path.join(gensrc, "rules.mk"), "w") as f:
   inst_names = sorted(set(kernel_fname(k) for k in enumerate_kernels()))
-  names = inst_names + ["symmetric_kernels.cc"]
+  names = inst_names + ["sym_kernels_host.cc"]
   f.write("LIB_OBJS_SYM_GEN = $(patsubst %,$(OBJDIR)/genobj/symmetric/%.o,{names})\n"
           .format(names=" ".join(names)))
   f.write("\n")
diff --git a/projects/rccl/src/device/symmetric/kernel.cuh b/projects/rccl/src/device/symmetric/kernel.cuh
index f631d51d9f..bff67e4604 100644
--- a/projects/rccl/src/device/symmetric/kernel.cuh
+++ b/projects/rccl/src/device/symmetric/kernel.cuh
@@ -1,27 +1,27 @@
 #ifndef NCCL_DEVICE_SYMMETRIC_KERNEL_H_
 #define NCCL_DEVICE_SYMMETRIC_KERNEL_H_
 
-#include "symmetric.h"
+#include "sym_kernels.h"
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_AGxLL_R(struct ncclSymkDevWorkArgs const* args);
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_AGxLLMC_R(struct ncclSymkDevWorkArgs const* args);
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_RSxLD_AGxST(struct ncclSymkDevWorkArgs const* args);
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllReduce_RSxLDMC_AGxSTMC(struct ncclSymkDevWorkArgs const* args);
 
-__device__ __forceinline__ void ncclSymRun_AllGather_LL(struct ncclSymDevArgs const* args);
-__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(struct ncclSymDevArgs const* args);
-__device__ __forceinline__ void ncclSymRun_AllGather_ST(struct ncclSymDevArgs const* args);
-__device__ __forceinline__ void ncclSymRun_AllGather_STMC(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllGather_LL(struct ncclSymkDevWorkArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllGather_LLMC(struct ncclSymkDevWorkArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllGather_ST(struct ncclSymkDevWorkArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_AllGather_STMC(struct ncclSymkDevWorkArgs const* args);
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LL(struct ncclSymkDevWorkArgs const* args);
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LD(struct ncclSymkDevWorkArgs const* args);
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(struct ncclSymDevArgs const* args);
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LDMC(struct ncclSymkDevWorkArgs const* args);
 #endif
diff --git a/projects/rccl/src/device/symmetric/primitives.cuh b/projects/rccl/src/device/symmetric/primitives.cuh
index 1670244007..73305d54cd 100644
--- a/projects/rccl/src/device/symmetric/primitives.cuh
+++ b/projects/rccl/src/device/symmetric/primitives.cuh
@@ -1,11 +1,11 @@
 #ifndef NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
 #define NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
 
-#include "symmetric.h"
+#include "sym_kernels.h"
 #include "bitops.h"
 #include "collectives.h"
-#include "op128.h"
-#include "reduce_kernel.h"
+#include "../op128.h"
+#include "../reduce_kernel.h"
 
 #if __CUDA_ARCH__ >= 700
 // __grid_constant__ appears to break cuda-gdb
@@ -24,397 +24,124 @@ static __device__ Int0 flattenIx(Int0 pos, Int1 size, Ints ...more) {
   return pos + size*flattenIx(more...);
 }
 
-// Precomputed integer reciprocoals for denominator values 1..64 inclusive.
-// Pass these to idivFast64() for fast division on the GPU.
-static __device__ uint64_t idivRcp64_upto64(int x) {
-  static constexpr uint64_t table[65] = {
-    idivRcp64(0x01), idivRcp64(0x01), idivRcp64(0x02), idivRcp64(0x03),
-    idivRcp64(0x04), idivRcp64(0x05), idivRcp64(0x06), idivRcp64(0x07),
-    idivRcp64(0x08), idivRcp64(0x09), idivRcp64(0x0a), idivRcp64(0x0b),
-    idivRcp64(0x0c), idivRcp64(0x0d), idivRcp64(0x0e), idivRcp64(0x0f),
-    idivRcp64(0x10), idivRcp64(0x11), idivRcp64(0x12), idivRcp64(0x13),
-    idivRcp64(0x14), idivRcp64(0x15), idivRcp64(0x16), idivRcp64(0x17),
-    idivRcp64(0x18), idivRcp64(0x19), idivRcp64(0x1a), idivRcp64(0x1b),
-    idivRcp64(0x1c), idivRcp64(0x1d), idivRcp64(0x1e), idivRcp64(0x1f),
-    idivRcp64(0x20), idivRcp64(0x21), idivRcp64(0x22), idivRcp64(0x23),
-    idivRcp64(0x24), idivRcp64(0x25), idivRcp64(0x26), idivRcp64(0x27),
-    idivRcp64(0x28), idivRcp64(0x29), idivRcp64(0x2a), idivRcp64(0x2b),
-    idivRcp64(0x2c), idivRcp64(0x2d), idivRcp64(0x2e), idivRcp64(0x2f),
-    idivRcp64(0x30), idivRcp64(0x31), idivRcp64(0x32), idivRcp64(0x33),
-    idivRcp64(0x34), idivRcp64(0x35), idivRcp64(0x36), idivRcp64(0x37),
-    idivRcp64(0x38), idivRcp64(0x39), idivRcp64(0x3a), idivRcp64(0x3b),
-    idivRcp64(0x3c), idivRcp64(0x3d), idivRcp64(0x3e), idivRcp64(0x3f),
-    idivRcp64(0x40)
-  };
-  return table[x];
-}
-
-static __device__ uint32_t idivRcp32_upto64(int x) {
-  return idivRcp64_upto64(x)>>32;
-}
-
 namespace {
-struct ncclCoopCta {
-  __device__ void sync() { __syncthreads(); }
-  __device__ int self() { return threadIdx.x; }
-  __device__ int count() { return blockDim.x; }
-};
-struct ncclCoopWarps {
-  int log2_nWarps;
-  __device__ void sync() {
-    asm volatile("barrier.sync %0, %1;" :: "r"(1 + (threadIdx.x>>(5+log2_nWarps))), "r"(32<<log2_nWarps) : "memory");
-  }
-  __device__ int self() { return threadIdx.x & ((32<<log2_nWarps)-1); }
-  __device__ int count() { return 32<<log2_nWarps; }
-};
-struct ncclCoopWarp {
-  __device__ void sync() { __syncwarp(); }
-  __device__ int self() { return threadIdx.x%32; }
-  __device__ int count() { return 32; }
-};
-}
+struct ncclSymkArgsHandler {
+  ncclDevComm const& comm;
+  ncclLLA2AHandle const& lsaLLA2A;
+  struct ncclSymkChannelWorkRange* channelWorkRange;
+  struct ncclSymkDevWork* devWork;
+  uint32_t nRanks_rcp32;
 
-namespace {
-static constexpr int ncclSymPrims_UseBarrier = 1;
-static constexpr int ncclSymPrims_UseLL = 2;
-static constexpr int ncclSymPrims_UseMultimem = 4;
-struct ncclSymPrims {
-  int flags;
-  int const &rank;
-  int const &nRanks;
-  uint32_t const &nRanks_rcp32;
-  int block, nBlocks;
-  uint32_t nBlocks_rcp32;
-  uint32_t nBlocks_nWarps_rcp32;
-  uint32_t nRanks_nBlocks_rcp32;
-  uint32_t nWarpPerRank, nWarpPerRank_rcp32;
-  struct ncclSymDevBase* const &base;
-  uintptr_t offsetMc;
+  __device__ ncclSymkArgsHandler(ncclSymkDevWorkArgs const* args):
+    comm(args->kcomm.devComm),
+    lsaLLA2A(args->kcomm.lsaLLA2A) {
+    channelWorkRange = args->getWorkRange();
 
-  uint32_t const &stride4G;
-  uint32_t barEpoch;
-  uint32_t llEpoch;
-
-  __device__ ncclSymPrims(ncclSymDevComm const &comm, int flags):
-    flags(flags),
-    rank(comm.rank),
-    nRanks(comm.nRanks),
-    nRanks_rcp32(comm.nRanks_rcp32),
-    block(blockIdx.x),
-    nBlocks(gridDim.x),
-    nBlocks_rcp32(idivRcp32_upto64(nBlocks)),
-    nBlocks_nWarps_rcp32(imulRcp32(nBlocks, nBlocks_rcp32, blockDim.x/32, idivRcp32_upto64(blockDim.x/32))),
-    nRanks_nBlocks_rcp32(imulRcp32(nRanks, nRanks_rcp32, gridDim.x, nBlocks_rcp32)),
-    nWarpPerRank(idivFast32(nBlocks*blockDim.x/32, nRanks, nRanks_rcp32)),
-    nWarpPerRank_rcp32(idivRcp32_upto64(nWarpPerRank)),
-    base(comm.base),
-    offsetMc((flags & ncclSymPrims_UseMultimem) ? (char*)comm.baseMc - (char*)base : 0x0),
-    stride4G(comm.stride4G) {
-
-    #if CUDART_VERSION >= 12030 && __CUDA_ARCH__ >= 900
-      cudaGridDependencySynchronize();
-    #endif
-
-    if ((flags & ncclSymPrims_UseBarrier) && threadIdx.x < nRanks) {
-      barEpoch = (flags & ncclSymPrims_UseMultimem) ? base->barEpochMc[block] : base->barEpochUc[block];
-    }
-    if (flags & ncclSymPrims_UseLL) llEpoch = base->llEpoch[block] + 2;
-  }
-  __device__  ~ncclSymPrims() {
-    if (threadIdx.x == 0) {
-      if (flags & ncclSymPrims_UseBarrier) {
-        ((flags & ncclSymPrims_UseMultimem) ? base->barEpochMc : base->barEpochUc)[block] = barEpoch;
-      }
-      if (flags & ncclSymPrims_UseLL) base->llEpoch[block] = llEpoch - 2;
-    }
+    devWork = args->getWorks(args->nMaxChannels);
+    nRanks_rcp32 = comm.nRanks_rcp32;
   }
 
   template<typename T>
-  __device__ T* peerPtr(int peer, T* selfPtr) {
-    return add4G(selfPtr, (peer-rank)*stride4G);
+    __device__ void getWorkRange(int block,
+                                 uint16_t& workLo, size_t& indexLo, uint16_t& workHi, size_t& indexHi) {
+    constexpr int EltPerCell = NCCL_SYM_KERNEL_CELL_SIZE / sizeof(T);
+    uint32_t fracLo, fracHi;
+
+    // Where the work begins
+    workLo = (block==0) ? 0 : channelWorkRange[block-1].workHi; // start where predecessor ends
+    fracLo = (block==0) ? 0 : channelWorkRange[block-1].fracHi + 1;
+    // If the predecessor ended on the work boundary, then we step to the beginning of the next work.
+    // This ensures we never have empty parts.
+    if (fracLo == 0x10000) {
+      workLo++;
+      fracLo = 0;
+    }
+    struct ncclSymkDevWork const& dw = devWork[workLo];
+    indexLo = ((fracLo * divUp(dw.nElts, EltPerCell)) >> 16) * EltPerCell;
+
+    // Where the work ends
+    workHi = channelWorkRange[block].workHi;
+    fracHi = channelWorkRange[block].fracHi + 1;
+    indexHi = min(((fracHi * divUp(dw.nElts, EltPerCell)) >> 16) * EltPerCell, dw.nElts);
   }
 
   template<typename T>
-  __device__ T* multimemPtr(T* selfPtr) {
-    return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(selfPtr) + offsetMc);
+    __device__ void getWorkRangeFused(int blockIdx, int w,
+                                      int& block, int& nBlocks, size_t& indexLo, size_t& indexHi) {
+    constexpr int EltPerCell = NCCL_SYM_KERNEL_CELL_SIZE / sizeof(T);
+    struct ncclSymkDevWork const& dw = devWork[w];
+    uint32_t fracLo, fracHi;
+    int lastBlock;
+
+    block = blockIdx - dw.sChannelId;
+    nBlocks = dw.nChannels;
+    lastBlock = dw.sChannelId+dw.nChannels-1;
+
+    // Where the work begins
+    fracLo = (dw.sChannelId==0) ? 0 : ((channelWorkRange[dw.sChannelId-1].fracHi + 1) & 0xFFFF);
+    indexLo = ((fracLo * divUp(dw.nElts, EltPerCell)) >> 16) * EltPerCell;
+    fracHi = (channelWorkRange[lastBlock].workHi == w) ? channelWorkRange[lastBlock].fracHi + 1 : 0x10000;
+    indexHi = min(((fracHi * divUp(dw.nElts, EltPerCell)) >> 16) * EltPerCell, dw.nElts);
   }
 
-  __device__  void barrierArrive(ncclCoopCta cta, bool release) {
-    cta.sync();
-    #if __CUDA_ARCH__ < 700
-      if (release) {
-        if (cta.self() == 0) __threadfence_system();
-        cta.sync();
-      }
-    #endif
-    if (flags & ncclSymPrims_UseMultimem) {
-    #if __CUDA_ARCH__ >= 900 && CUDART_VERSION >= 12010
-      if (cta.self() == 0) {
-        uint32_t* inbox = &multimemPtr(base)->barInboxMc[block];
-        if (release) {
-          asm volatile("multimem.red.release.sys.add.u32 [%0],1;" :: "l"(inbox));
+  template<typename T, typename Fn>
+    __device__ void forEachWork(Fn const& fn) {
+      uint16_t workLo, workHi;
+      size_t indexLo, indexHi;
+
+      getWorkRange<T>(blockIdx.x, workLo, indexLo, workHi, indexHi);
+
+      size_t currentIndexLo = indexLo;
+      #pragma unroll 1
+      for (int w = workLo; w <= workHi; w++) {
+        struct ncclSymkDevWork const& dw = devWork[w];
+        size_t const& nAllElts = dw.nElts;
+        size_t currentIndexHi;
+        int block, nBlocks;
+        if (blockIdx.x >= dw.sChannelId && blockIdx.x < dw.sChannelId + dw.nChannels) {
+          getWorkRangeFused<T>(blockIdx.x, w, block, nBlocks, currentIndexLo, currentIndexHi);
         } else {
-          asm volatile("multimem.red.relaxed.sys.add.u32 [%0],1;" :: "l"(inbox));
+          currentIndexHi = (w < workHi) ? nAllElts : indexHi;
+          block = 0;
+          nBlocks = 1;
         }
+
+        fn(block, nBlocks, currentIndexHi - currentIndexLo, nAllElts,
+           ncclSymPtr<T>(dw.inputWin, dw.inputOff) + currentIndexLo,
+           ncclSymPtr<T>(dw.outputWin, dw.outputOff) + currentIndexLo);
+
+        currentIndexLo = 0;
       }
-    #endif
-    } else {
-      int r = cta.self();
-      if (r != rank && r < nRanks) {
-        uint32_t* inbox = &peerPtr(r, base)->barInboxPerPeer[block*nRanks + rank];
-        #if __CUDA_ARCH__ >= 700
-          if (release) {
-            asm volatile("st.release.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
-          } else {
-            asm volatile("st.relaxed.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
-          }
-        #else
-          asm volatile("st.volatile.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
-        #endif
-      }
-    }
   }
 
-  __device__  void barrierWait(ncclCoopCta cta, bool acquire) {
-    if (flags & ncclSymPrims_UseMultimem) {
-    #if __CUDA_ARCH__ >= 900
-      if (cta.self() == 0) {
-        uint32_t* inbox = &base->barInboxMc[block];
-        while (true) {
-          uint32_t got;
-          if (acquire) {
-            asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
-          } else {
-            asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
-          }
-          if (got-(barEpoch+nRanks) <= uint32_t(-1)>>1) break;
-        }
-        barEpoch += nRanks;
-      }
-    #endif
-    } else {
-      int r = cta.self();
-      if (r != rank && r < nRanks) {
-        uint32_t* inbox = &base->barInboxPerPeer[block*nRanks + r];
-        while (true) {
-          uint32_t got;
-          #if __CUDA_ARCH__ >= 700
-            if (acquire) {
-              asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
-            } else {
-              asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
-            }
-          #else
-            asm volatile("ld.volatile.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
-          #endif
-          if (got-(barEpoch+1) <= uint32_t(-1)>>1) break;
-        }
-      }
-      #if __CUDA_ARCH__ < 700
-        if (acquire) {
-          cta.sync();
-          if (cta.self() == 0) __threadfence();
-        }
-      #endif
-      barEpoch += 1;
-    }
-    cta.sync();
-  }
+  template<typename T, typename Fn>
+    __device__ void singleWork(Fn const& fn) {
+      uint16_t w;
+      size_t indexLo, indexHi;
 
-  __device__ void endLL(ncclCoopCta cta) {
-    if (__builtin_expect(llEpoch >= -2u, false)) {
-      cta.sync();
-      uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch);
-      int epochSize = ncclSymLLEpochSize(nRanks);
-      #pragma unroll 4
-      for (int i=cta.self(); i*16 < epochSize; i += cta.count()) {
-        buf[i] = uint4{0, 0, 0, 0};
-      }
-    }
-    cta.sync();
-    llEpoch += (llEpoch == -1u) ? 3 : 1;
-  }
+      getWorkRange<T>(blockIdx.x, w, indexLo, w, indexHi);
 
-  template<typename T>
-  __device__ void sendLL(int peer, int slot, T val) {
-    union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
-    tmp = val;
-    uint4* buf = ncclSymDevBase_getLLBuf(peerPtr(peer, base), nRanks, block, llEpoch) + slot;
-    #pragma unroll
-    for (int u=0; u < divUp(sizeof(T),8); u++) {
-      asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
-    }
-  }
+      struct ncclSymkDevWork const& dw = devWork[w];
 
-  template<typename T>
-  __device__ void bcastLL(int slot, T val) {
-    if (flags & ncclSymPrims_UseMultimem) {
-      union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
-      tmp = val;
-      uint4* bufmc = ncclSymDevBase_getLLBuf(multimemPtr(base), nRanks, block, llEpoch) + slot;
-      #pragma unroll
-      for (int u=0; u < divUp(sizeof(T),8); u++) {
-        asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(bufmc + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
-      }
-    } else {
-      union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
-      tmp = val;
-      uint4* buf0 = ncclSymDevBase_getLLBuf(peerPtr(0, base), nRanks, block, llEpoch) + slot;
-      int dr = 0;
-      int r = rank;
-      #pragma unroll 1
-      for (; dr+8 <= nRanks; dr += 8) {
-        #pragma unroll
-        for (int ur=0; ur < 8; ur++) {
-          uint4* buf = add4G(buf0, r*stride4G);
-          #pragma unroll
-          for (int u=0; u < divUp(sizeof(T),8); u++) {
-            asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
-          }
-          r += 1;
-          if (r == nRanks) r = 0;
-        }
-      }
-      #pragma unroll
-      for (int ur=0; ur < 8; ur++, dr++) {
-        if (dr == nRanks) break;
-        uint4* buf = add4G(buf0, r*stride4G);
-        #pragma unroll
-        for (int u=0; u < divUp(sizeof(T),8); u++) {
-          asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
-        }
-        r += 1;
-        if (r == nRanks) r = 0;
-      }
-    }
-  }
-
-  template<int nSlotsMin, int nSlotsMax, typename T>
-  __device__ void recvLL(int slot0, int nSlots, int stride, T(&elts)[nSlotsMax]) {
-    uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0;
-    uint4 tmp[nSlotsMax][divUp(sizeof(T),8)];
-    //int spins=0;
-    while (true) {
-      #pragma unroll
-      for (int u=0; u < nSlotsMax; u++) {
-        if (u < nSlotsMin || u < nSlots) {
-          #pragma unroll
-          for (int v=0; v < divUp(sizeof(T),8); v++) {
-            asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(tmp[u][v].x), "=r"(tmp[u][v].y), "=r"(tmp[u][v].z), "=r"(tmp[u][v].w) : "l"(buf + u*stride + v*ncclSymLLMaxSlots(sizeof(T))));
-          }
-        }
-      }
-      bool okAll = true;
-      #pragma unroll
-      for (int u=0; u < nSlotsMax; u++) {
-        #pragma unroll
-        for (int v=0; v < divUp(sizeof(T),8); v++) {
-          if (u < nSlotsMin || u < nSlots) {
-            bool ok = tmp[u][v].y == llEpoch &&
-                      tmp[u][v].w == llEpoch;
-            okAll &= ok;
-          }
-        }
-      }
-      if (__builtin_expect(okAll, true)) break;
-      //if (spins++ == 10<<20) spins=0;
-    }
-    #pragma unroll
-    for (int u=0; u < nSlotsMax; u++) {
-      if (nSlotsMin <= u && u == nSlots) break;
-      union { T val; uint32_t u32[divUp(sizeof(T),8)][2]; };
-      #pragma unroll
-      for (int v=0; v < divUp(sizeof(T),8); v++) {
-        u32[v][0] = tmp[u][v].x;
-        u32[v][1] = tmp[u][v].z;
-      }
-      elts[u] = val;
-    }
-  }
-
-  template<typename Pack, typename T, typename Red, int Unroll=8>
-  __device__ Pack recvReduceLL(int slot, int stride, Red red) {
-    using Acc = typename Red::EltType;
-    using AccPack = BytePack<sizeof(Pack)*sizeof(Acc)/sizeof(T)>;
-    AccPack acc;
-    bool first = true;
-    int r = 0;
-    #pragma unroll 1
-    for (; r+Unroll <= nRanks; r += Unroll) {
-      Pack got[Unroll];
-      this->template recvLL</*Min=*/Unroll>(slot + r*stride, Unroll, stride, got);
-      AccPack acc0 = applyCast<T, Acc>(got[0]);
-      acc = first ? acc0 : applyReduce(red, acc, acc0);
-      first = false;
-      #pragma unroll
-      for (int i=1; i < Unroll; i++) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
-    }
-    if (r < nRanks) {
-      Pack got[Unroll];
-      this->template recvLL</*Min=*/1>(slot + r*stride, nRanks-r, stride, got);
-      AccPack acc0 = applyCast<T, Acc>(got[0]);
-      acc = first ? acc0 : applyReduce(red, acc, acc0);
-      #pragma unroll
-      for (int i=1; i < Unroll-1; i++) {
-        if (r+i < nRanks) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
-      }
-    }
-    return applyCast<Acc, T>(acc);
-  }
-
-  template<typename T>
-  __device__ T recvLL(int slot) {
-    T one[1];
-    this->template recvLL<1, 1, T>(slot, 1, 0, one);
-    return one[0];
-  }
-
-  template<typename Coop, typename T>
-  __device__ void coopRecvLL(Coop coop, int slot0, int nSlots, T* dst) {
-    int me = coop.self();
-    if (me < nSlots) {
-      uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0 + me;
-      uint4 got[divUp(sizeof(T), 8)];
-      //int spins=0;
-      #pragma unroll 1
-      while (true) {
-        #pragma unroll
-        for (int u=0; u < divUp(sizeof(T), 8); u++) {
-          asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(got[u].x), "=r"(got[u].y), "=r"(got[u].z), "=r"(got[u].w) : "l"(buf + u*ncclSymLLMaxSlots(sizeof(T))));
-        }
-        bool ok = true;
-        #pragma unroll
-        for (int u=0; u < divUp(sizeof(T), 8); u++) {
-          ok &= got[u].y == llEpoch;
-          ok &= got[u].w == llEpoch;
-        }
-        if (__builtin_expect(ok, true)) break;
-        //if (++spins == 10<<20) { spins=0; printf("r=%d LL spin @ ix=%d got=%d want=%d\n", rank, slot0+me, got[0].y, llEpoch); }
-      }
-      union { T val; uint32_t u32[divUp(sizeof(T), 8)][2]; };
-      #pragma unroll
-      for (int u=0; u < divUp(sizeof(T), 8); u++) {
-        u32[u][0] = got[u].x;
-        u32[u][1] = got[u].z;
-      }
-      dst[slot0 + me] = val;
-    }
+      fn(indexHi - indexLo, dw.nElts,
+         ncclSymPtr<T>(dw.inputWin, dw.inputOff) + indexLo,
+         ncclSymPtr<T>(dw.outputWin, dw.outputOff) + indexLo);
   }
 };
 }
 
 template<template<typename> typename Red, typename T, bool nvls>
-struct ncclSymAccumType { using Type = T; };
+struct ncclSymkAccumType { using Type = T; };
 
 // Only Red's whose opArg is invariant w.r.t. the datatype can have a different
 // accumulator type. At the moment this excludes integer min/max, sumpostdiv,
 // and premulsum.
-template<> struct ncclSymAccumType<FuncSum, __half, false> { using Type = float; };
+template<> struct ncclSymkAccumType<FuncSum, __half, false> { using Type = float; };
 #if defined(__CUDA_BF16_TYPES_EXIST__)
-template<> struct ncclSymAccumType<FuncSum, __nv_bfloat16, false> { using Type = float; };
+template<> struct ncclSymkAccumType<FuncSum, __nv_bfloat16, false> { using Type = float; };
 #endif
 #if defined(__CUDA_FP8_TYPES_EXIST__)
-template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e4m3, false> { using Type = float; };
-template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e5m2, false> { using Type = float; };
+template<> struct ncclSymkAccumType<FuncSum, __nv_fp8_e4m3, false> { using Type = float; };
+template<> struct ncclSymkAccumType<FuncSum, __nv_fp8_e5m2, false> { using Type = float; };
 #endif
 #endif
diff --git a/projects/rccl/src/device/symmetric/reduce_scatter.cuh b/projects/rccl/src/device/symmetric/reduce_scatter.cuh
index 4fd96093e6..8f79b3990b 100644
--- a/projects/rccl/src/device/symmetric/reduce_scatter.cuh
+++ b/projects/rccl/src/device/symmetric/reduce_scatter.cuh
@@ -1,35 +1,36 @@
-#include "symmetric.h"
-#include "symmetric/kernel.cuh"
-#include "symmetric/primitives.cuh"
+#include "sym_kernels.h"
+#include "kernel.cuh"
+#include "primitives.cuh"
 
 template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
 static __device__ void reduceDeep(
-    ncclSymPrims& prim, int tn, int t, bool waitNeeded,
-    Red red, char* inputRank0, char* outputHere, int32_t nIters
+    ncclSymkArgsHandler const& handler, int tn, int t,
+    bool waitNeeded, ncclLsaBarrierSession<ncclCoopCta>& bar,
+    Red red, ncclSymPtr<char> input, ncclSymPtr<char> output, int32_t nIters
   ) {
   using Pack = BytePack<BytePerPack>;
   using Acc = typename Red::EltType;
   using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
 
+  ncclTeam world = ncclTeamWorld(handler.comm);
   int wn = tn/WARP_SIZE;
   int w = t/WARP_SIZE;
   int lane = t%WARP_SIZE;
-  int const& rank = prim.rank;
-  int const& nRanks = prim.nRanks;
-  uint32_t const& stride4G = prim.stride4G;
-  Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
-  Pack* outHere = (Pack*)outputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+  ncclSymPtr<Pack> inpPacks = (ncclSymPtr<Pack>)input + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
+  ncclSymPtr<Pack> outPacks = (ncclSymPtr<Pack>)output + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
   Pack acc0[UnrollPacks];
 
   nIters -= w;
   if (0 < nIters) {
     #pragma unroll
     for (int u=0; u < UnrollPacks; u++) {
-      acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
+      acc0[u] = inpPacks.peerPtr(world, rank)[u*WARP_SIZE];
     }
   }
 
-  if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  if (waitNeeded) bar.wait(ncclCoopCta(), cuda::memory_order_relaxed);
 
   if (0 < nIters) {
     while (true) {
@@ -39,7 +40,7 @@ static __device__ void reduceDeep(
       { Pack tmp1[UnrollPacks];
         #pragma unroll
         for (int u=0; u < UnrollPacks; u++) {
-          tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
+          tmp1[u] = inpPacks.peerPtr(world, r)[u*WARP_SIZE];
         }
         #pragma unroll
         for (int u=0; u < UnrollPacks; u++) {
@@ -65,7 +66,7 @@ static __device__ void reduceDeep(
             if (partial && ur!=0 && dr+ur == nRanks) break;
             #pragma unroll UnrollPacks
             for (int u=0; u < UnrollPacks; u++) {
-              tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
+              tmp1[ur][u] = inpPacks.peerPtr(world, r)[u*WARP_SIZE];
             }
             r += 1;
             if (r == nRanks) r = 0;
@@ -85,17 +86,17 @@ static __device__ void reduceDeep(
       for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
 
       #pragma unroll UnrollPacks
-      for (int u=0; u < UnrollPacks; u++) outHere[u*WARP_SIZE] = acc0[u];
+      for (int u=0; u < UnrollPacks; u++) outPacks.localPtr()[u*WARP_SIZE] = acc0[u];
 
-      inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
-      outHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
+      inpPacks += intptr_t(wn)*UnrollPacks*WARP_SIZE;
+      outPacks += intptr_t(wn)*UnrollPacks*WARP_SIZE;
       nIters -= wn;
       if (nIters <= 0) break;
 
       // Load data for next iteration.
       #pragma unroll
       for (int u=0; u < UnrollPacks; u++) {
-        acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
+        acc0[u] = inpPacks.peerPtr(world, rank)[u*WARP_SIZE];
       }
     }
   }
@@ -103,20 +104,22 @@ static __device__ void reduceDeep(
 
 template<int UnrollPeers, typename Red, typename T>
 static __device__ void reduceEnds(
-    ncclSymPrims& prim, int tn, int t, Red red,
-    T* inputRank0, T* outputHere, size_t nElts, uint32_t nPreElts, size_t nSufElts
+    ncclSymkArgsHandler const& handler, int tn, int t, Red red,
+    ncclSymPtr<T> input, ncclSymPtr<T> output,
+    size_t nElts, uint32_t nPreElts, size_t nSufElts
   ) {
   using Acc = typename Red::EltType;
 
-  int const& rank = prim.rank;
-  int const& nRanks = prim.nRanks;
-  uint32_t const& stride4G = prim.stride4G;
-  BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
-  BytePack<sizeof(T)>* outHere = (BytePack<sizeof(T)>*)outputHere;
+  ncclTeam world = ncclTeamWorld(handler.comm);
+  int const& rank = handler.comm.rank;
+  int const& nRanks = handler.comm.nRanks;
+
+  ncclSymPtr<BytePack<sizeof(T)>> inpPacks = (ncclSymPtr<BytePack<sizeof(T)>>)input;
+  ncclSymPtr<BytePack<sizeof(T)>> outPacks = (ncclSymPtr<BytePack<sizeof(T)>>)output;
   #pragma unroll 1
   for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
     size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
-    BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
+    BytePack<sizeof(T)> acc0 = inpPacks.peerPtr(world, rank)[elt];
     BytePack<sizeof(Acc)> acc1;
     BytePack<sizeof(T)> tmp[UnrollPeers];
     int dr = 1;
@@ -135,7 +138,7 @@ static __device__ void reduceEnds(
         #pragma unroll
         for (int u=0; u < UnrollPeers-partial; u++) {
           if (partial && u!=0 && dr+u == nRanks) break;
-          tmp[u] = *add4G(inpRank0+elt, r*stride4G);
+          tmp[u] = inpPacks.peerPtr(world, r)[elt];
           r += 1;
           if (r == nRanks) r = 0;
         }
@@ -152,26 +155,25 @@ static __device__ void reduceEnds(
     }
 
     acc0 = applyCast<Acc, T>(acc1);
-    outHere[elt] = acc0;
+    outPacks.localPtr()[elt] = acc0;
   }
 }
 
 template<typename Red, typename T>
 static __device__ void reduce(
-    ncclSymPrims& prim, int tn, int t, bool waitNeeded,
-    Red red, T* input, T* output, size_t nElts
+    ncclSymkArgsHandler const& handler, int tn, int t, int nBlocks,
+    bool waitNeeded, ncclLsaBarrierSession<ncclCoopCta>& bar,
+    Red red, ncclSymPtr<T> input, ncclSymPtr<T> output, size_t nElts
   ) {
-  int nRanks = prim.nRanks;
-  int nBlocks = prim.nBlocks;
-  // Mpve input to rank=0
-  input = prim.peerPtr(0, input);
+  int const& nRanks = handler.comm.nRanks;
+  int const& nRanks_rcp32 = handler.nRanks_rcp32;
+  uint32_t nBlocks_rcp32 = nccl::utility::idivRcp32_upto64(nBlocks);
+  uint32_t nRanks_nBlocks_rcp32 = nccl::utility::imulRcp32(nRanks, nRanks_rcp32, nBlocks, nBlocks_rcp32);
 
-  uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
-  uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
-  uint32_t alignment = uint32_t(inputUptr - outputUptr);
+  uint32_t alignment = uint32_t(input.offset - output.offset);
   size_t nBytes = nElts*sizeof(T);
 
-  uint32_t nPreBytes = (16u - inputUptr)%16u;
+  uint32_t nPreBytes = (16u - input.offset)%16u;
   nPreBytes = min((size_t)nPreBytes, nBytes);
   uintptr_t cursor = nPreBytes;
 
@@ -181,12 +183,12 @@ static __device__ void reduce(
     constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
     constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
     uint32_t chunks = (nBytes-cursor)/BytePerChunk;
-    chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
+    chunks -= imodFast32(chunks, nRanks*nBlocks, nRanks_nBlocks_rcp32);
     if (chunks != 0) {
       uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
       reduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
-        prim, tn, t, waitNeeded, red,
-        (char*)input + cursor, (char*)output + cursor,
+        handler, tn, t, waitNeeded, bar, red,
+        (ncclSymPtr<char>)input + cursor, (ncclSymPtr<char>)output + cursor,
         chunks*MinWarpPerBlock
       );
       cursor = cursorAfter;
@@ -198,12 +200,12 @@ static __device__ void reduce(
     constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
     constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
     uint32_t chunks = (nBytes-cursor)/BytePerChunk;
-    chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
+    chunks -= imodFast32(chunks, nRanks*nBlocks, nRanks_nBlocks_rcp32);
     if (chunks != 0) {
       uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
       reduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
-        prim, tn, t, waitNeeded, red,
-        (char*)input + cursor, (char*)output + cursor,
+        handler, tn, t, waitNeeded, bar, red,
+        (ncclSymPtr<char>)input + cursor, (ncclSymPtr<char>)output + cursor,
         chunks*MinWarpPerBlock
       );
       cursor = cursorAfter;
@@ -211,42 +213,47 @@ static __device__ void reduce(
     }
   }
 
-  if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  if (waitNeeded) bar.wait(ncclCoopCta(), cuda::memory_order_relaxed);
 
   constexpr int UnrollPeers = 8;
   size_t nSufElts = (nBytes-cursor)/sizeof(T);
-  reduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
+  reduceEnds<UnrollPeers>(handler, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
 }
 
-
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
-  Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LD(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLsaBarrierSession<ncclCoopCta> bar{
+    ncclCoopCta(), handler.comm, ncclTeamTagLsa(), blockIdx.x
+  };
+  Red<typename ncclSymkAccumType<Red, T, /*nvls=*/false>::Type> red(handler.devWork->redOpArg);
+  int const& rank = handler.comm.rank;
 
-  // Round robin warps over blocks.
-  int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
-                    prim.block, prim.nBlocks,
-                    threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
-  int tn = prim.nBlocks*blockDim.x;
+  bar.arrive(ncclCoopCta(), cuda::memory_order_relaxed);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  //prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  bool waitNeeded = true;
+  handler.forEachWork<T>(
+      [&]__device__(int block, int nBlocks, size_t nElts, size_t nAllElts,
+                    ncclSymPtr<T> input, ncclSymPtr<T> output) {
+        // Round robin warps over blocks.
+        int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
+                          block, nBlocks,
+                          threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
+        int tn = nBlocks*blockDim.x;
 
-  reduce(prim, tn, t, /*waitNeeded=*/true, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
+        reduce(handler, tn, t, nBlocks, waitNeeded, bar, red, input + rank*nElts, output, nElts);
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+        waitNeeded = false;
+      }
+    );
+
+  bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
 }
 
-
 template<typename Red, typename T>
 static __device__ void reduceMultimem(
-    ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
+    int tn, int t, Red red, T* input, T* output, size_t nElts
   ) {
-  // Mpve input to multimem
-  input = prim.multimemPtr(input);
-
   uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
   uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
   size_t nBytes = nElts*sizeof(T);
@@ -291,41 +298,52 @@ static __device__ void reduceMultimem(
     uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
     BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
     *reinterpret_cast<BytePack<sizeof(T)>*>(outputUptr + cursor) = val;
-    cursor += tn*sizeof(T);
   }
 }
 
 template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
-  Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LDMC(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLsaBarrierSession<ncclCoopCta> bar{
+    ncclCoopCta(), handler.comm, ncclTeamTagLsa(), blockIdx.x, /*multimem=*/true
+  };
+  Red<typename ncclSymkAccumType<Red, T, /*nvls=*/true>::Type> red(handler.devWork->redOpArg);
 
-  // Round robin warps over blocks.
-  int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
-                    prim.block, prim.nBlocks,
-                    threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
-  int tn = prim.nBlocks*blockDim.x;
+  int const& rank = handler.comm.rank;
+  auto const& multimem = handler.comm.lsaMultimem;
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+  bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
 
-  reduceMultimem(prim, tn, t, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
+  handler.forEachWork<T>(
+      [&]__device__(int block, int nBlocks, size_t nElts, size_t nAllElts,
+                    ncclSymPtr<T> input, ncclSymPtr<T> output) {
+        // Round robin warps over blocks.
+        int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
+                          block, nBlocks,
+                          threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
+        int tn = nBlocks*blockDim.x;
 
-  prim.barrierArrive(ncclCoopCta(), /*release=*/false);
-  prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
+        reduceMultimem(tn, t, red, input.multimemPtr(multimem) + rank*nElts, output.localPtr(), nElts);
+      }
+    );
+
+  bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
 }
 
 // T is user type, EltType is the most aligned type
 template<typename T, typename Red, typename EltType>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL_body(
-    ncclSymPrims &prim, Red red, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts) {
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LL_body(
+    ncclSymkArgsHandler& handler, ncclLLA2ASession<ncclCoopCta>& lla2a,
+    Red red, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts) {
   using Pack = BytePack<8>;
+  using Acc = typename Red::EltType;
+  using AccPack = BytePack<8*sizeof(Acc)/sizeof(T)>;
   constexpr int EltPerPack = 8/sizeof(EltType);
 
-  int nRanks = prim.nRanks;
-  int rank = prim.rank;
+  int const& nRanks = handler.comm.nRanks;
+  int const& rank = handler.comm.rank;
   int t = threadIdx.x;
-  int tn = ncclSymMaxThreads;
+  constexpr int tn = ncclSymkMaxThreads;
   ncclCoopCta cta;
 
   #pragma unroll 1
@@ -339,17 +357,25 @@ __device__ __forceinline__ void ncclSymRun_ReduceScatter_LL_body(
     #pragma unroll 1
     for (int i = t; i < nRanks*nIterPacks; i += tn) {
       Pack got = loadPack<Pack>(input + peer*nStrideElts, pack*EltPerPack, nElts);
-      prim.sendLL(peer, rank*nIterPacks + pack, got);
+      lla2a.send(peer, rank*nIterPacks + pack, got);
       peer += tn_div_nPacks;
       pack += tn_mod_nPacks;
       if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
     }
 
     if (t < nIterPacks) {
-      Pack got = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
-      storePack(output, t*EltPerPack, nElts, got);
+      AccPack got = lla2a.template recvReduce</*Unroll=*/8, Pack>(
+        /*slotStart=*/t, /*slotCount=*/nRanks, /*slotStride=*/nIterPacks,
+        /*eltToAcc=*/[&] __device__ (Pack x)->AccPack {
+          return applyCast<T, Acc>(x);
+        },
+        /*reduce=*/[&] __device__ (AccPack a, AccPack b)->AccPack {
+          return applyReduce(red, a, b);
+        }
+      );
+      storePack(output, t*EltPerPack, nElts, applyCast<Acc, T>(got));
     }
-    prim.endLL(cta);
+    lla2a.endEpoch(cta);
 
     input += tn*EltPerPack;
     output += tn*EltPerPack;
@@ -357,31 +383,34 @@ __device__ __forceinline__ void ncclSymRun_ReduceScatter_LL_body(
     nPacks -= tn;
   }
 }
-template<template<typename> typename Red, typename T>
-__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(ncclSymDevArgs const* args) {
-  ncclSymPrims prim(args->comm, ncclSymPrims_UseLL);
-  Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
 
+template<template<typename> typename Red, typename T>
+__device__ __forceinline__ void ncclSymkRun_ReduceScatter_LL(ncclSymkDevWorkArgs const* args) {
+  ncclSymkArgsHandler handler{args};
+  ncclLLA2ASession<ncclCoopCta> lla2a(
+    ncclCoopCta(), handler.comm, ncclTeamLsa(handler.comm), handler.lsaLLA2A, blockIdx.x, ncclSymkMaxThreads
+  );
+  Red<typename ncclSymkAccumType<Red, T, /*nvls=*/false>::Type> red(handler.devWork->redOpArg);
   using Pack = BytePack<8>;
   constexpr int EltPerPack = 8/sizeof(T);
-  int nAllElts = args->nElts;
-  int nAllPacks = divUp(nAllElts, EltPerPack);
-  uint32_t nPackPerBlock, nPackModBlock;
-  idivmodFast32(&nPackPerBlock, &nPackModBlock, nAllPacks, prim.nBlocks, prim.nBlocks_rcp32);
-  int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
-  int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
-  int nPacks = blockPackEnd - blockPackBegin;
-  int nElts = nAllElts - blockPackBegin*EltPerPack;
-  nElts = min(nElts, nPacks*EltPerPack);
-  T* input = (T*)args->input + blockPackBegin*EltPerPack;
-  T* output = (T*)args->output + blockPackBegin*EltPerPack;
 
-  uint32_t lowBits = args->nElts*sizeof(T);
-  lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
-  lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
-  if (__builtin_expect(lowBits%8 == 0, true)) {
-    ncclSymRun_ReduceScatter_LL_body<T>(prim, red, (Pack*)input, (Pack*)output, nPacks, nPacks, nAllElts/EltPerPack);
-  } else {
-    ncclSymRun_ReduceScatter_LL_body<T>(prim, red, input, output, nElts, nPacks, nAllElts);
-  }
+  handler.singleWork<T>(
+      [&]__device__(int nElts, int nAllElts,
+                    ncclSymPtr<T> inputPtr, ncclSymPtr<T> outputPtr) {
+        int nPacks = divUp(nElts, EltPerPack);
+
+        T* input = (T*)inputPtr.localPtr();
+        T* output = (T*)outputPtr.localPtr();
+
+        uint32_t lowBits = nElts*sizeof(T);
+        lowBits |= (uintptr_t)input;
+        lowBits |= (uintptr_t)output;
+        if (__builtin_expect(lowBits%8 == 0, true)) {
+          ncclSymkRun_ReduceScatter_LL_body<T>(handler, lla2a, red, (Pack*)input, (Pack*)output,
+                                               nPacks, nPacks, divUp(nAllElts, EltPerPack));
+        } else {
+          ncclSymkRun_ReduceScatter_LL_body<T>(handler, lla2a, red, input, output, nElts, nPacks, nAllElts);
+        }
+      }
+    );
 }
diff --git a/projects/rccl/src/enqueue.cc b/projects/rccl/src/enqueue.cc
index 225a4cffc6..00a0ef8da6 100644
--- a/projects/rccl/src/enqueue.cc
+++ b/projects/rccl/src/enqueue.cc
@@ -14,6 +14,9 @@
 #include "profiler.h"
 #include "transport.h"
 #include "register_inline.h"
+#include "ce_coll.h"
+#include "nvtx.h"
+#include "scheduler.h"
 
 #include <cstring> // std::memcpy
 #include <cinttypes> // PRIx64
@@ -30,8 +33,8 @@ ncclResult_t ncclInitKernelsForDevice(int cudaArch, int maxSharedMem, size_t* ma
   int ncclMaxSharedMem = ncclShmemDynamicSize(cudaArch);
 
   for (int sym=0; sym <= 1; sym++) {
-    int kcount = sym==0 ? ncclDevKernelCount : ncclSymKernelCount;
-    void* const* kptrs = sym==0 ? ncclDevKernelList : ncclSymKernelList;
+    int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
+    void* const* kptrs = sym==0 ? ncclDevKernelList : ncclSymkKernelList;
     for (int k=0; k < kcount; k++) {
       void* fn = kptrs[k];
       cudaFuncAttributes attr = {0};
@@ -164,6 +167,7 @@ static void finishPlan(struct ncclComm* comm, struct ncclKernelPlan* plan) {
   size_t workBytes = plan->workBytes;
   size_t batchBytes = plan->nWorkBatches*sizeof(struct ncclDevWorkBatch);
 
+  if (plan->isSymColl) return;
   plan->threadPerBlock = std::max(plan->threadPerBlock, NCCL_MIN_NTHREADS);
 
   // If we can fit everything into the kernel args we do so.
@@ -263,7 +267,6 @@ static bool testBudget(
 
 ncclResult_t ncclTasksRegAndEnqueue(struct ncclComm* comm) {
   struct ncclKernelPlanner* planner = &comm->planner;
-  if (planner->isSymColl) return ncclSuccess;
   struct ncclTaskColl *task;
   task = ncclIntruQueueHead(&planner->collTaskQueue);
   while (task != nullptr) {
@@ -328,6 +331,7 @@ next:
 ncclResult_t ncclPrepareTasks(struct ncclComm* comm, bool* algoNeedConnect, bool* needConnect, ncclSimInfo_t* simInfo) {
   struct ncclKernelPlanner* planner = &comm->planner;
   planner->persistent = ncclCudaGraphValid(planner->capturingGraph);
+
   // Tasks from the sorter come out ordered size descending.
   struct ncclTaskColl* task = ncclTaskCollSorterDequeueAll(&planner->collSorter);
   // Tasks are assembled by (fn,op,ty) size ascending.
@@ -336,36 +340,8 @@ ncclResult_t ncclPrepareTasks(struct ncclComm* comm, bool* algoNeedConnect, bool
   int fnOpTyIndices[ncclNumFuncs*ncclNumDevRedOps*ncclNumTypes];
   int fnOpTyCount = 0;
 
-  if (comm->nNodes == 1 && planner->nTasksColl == 1 && planner->nTasksP2p == 0) {
-    void* sendSymPtr;
-    void* recvSymPtr;
-    struct ncclReg* sendReg;
-    struct ncclReg* recvReg;
-    size_t size = task->count*ncclTypeSize(task->datatype);
-    NCCLCHECK(ncclRegFindSymmetric(comm, task->sendbuff, size, &sendSymPtr, &sendReg));
-    NCCLCHECK(ncclRegFindSymmetric(comm, task->recvbuff, size, &recvSymPtr, &recvReg));
-    bool implemented = ncclSymImplemented(task->func, task->opDev.op, task->datatype);
-
-    if (sendReg && recvReg && (sendReg->winFlags & recvReg->winFlags & NCCL_WIN_COLL_SYMMETRIC) && implemented) {
-      enum ncclSymKernelId kernel;
-      int nChannels, nWarps;
-      float estTimeUs = 1.e18;
-      NCCLCHECK(ncclSymPickKernel(comm, task->func, task->opDev.op, task->datatype, task->count, &estTimeUs, &kernel, &nChannels, &nWarps));
-
-      // We should only use symmetric kernel if it beats the asymmetric kernel. But the
-      // perf model accuracy from asymmetric kernels is too inaccurate and reports too high
-      // of a bandwidth. For now just always use symmetric if available.
-      if (kernel != ncclSymKernelId_Count) {
-        task->sendbuff = sendSymPtr;
-        task->recvbuff = recvSymPtr;
-        task->devFuncId = (int)kernel;
-        task->nMaxChannels = nChannels;
-        task->nWarps = nWarps;
-        ncclIntruQueueEnqueue(&planner->collTaskQueue, task);
-        planner->isSymColl = true;
-        return ncclSuccess;
-      }
-    }
+  if (comm->symmetricSupport) {
+    NCCLCHECK(ncclMakeSymmetricTaskList(comm, task, &planner->collSymTaskQueue, &task));
   }
 
   // Walk the size sorted tasks, binning them by (fn,op,ty).
@@ -532,7 +508,7 @@ static ncclResult_t scheduleCollTasksToPlan(
   size_t trafficBytes[2*2] = {0, 0, 0, 0}; // [collnet][nvls]
   int nChannels[2*2] = {0, 0, 0, 0}; // [collnet][nvls]
   int const nMaxChannels[2*2] = {comm->nChannels, comm->nvlsChannels, // [collnet][nvls]
-                                 comm->nChannels, comm->nvlsChannels};
+                                 comm->nChannels, std::min(comm->nChannels, comm->nvlsChannels)};
   constexpr size_t MinTrafficPerChannel = 16 << 10; // 16K traffic as minimal
   do {
     size_t workBytes = 0;
@@ -725,6 +701,7 @@ static ncclResult_t scheduleCollTasksToPlan(
         }
         proxyOp->eActivationMask = task->eActivationMask;
         proxyOp->incWorkCounter = true;
+        proxyOp->nChannels = nChannels;
         addWorkBatchToPlan(comm, plan, c, workNode->workType, task->devFuncId, plan->workBytes);
         // Coverity reports "proxyOp->connection" as being possibly uninitialized.  It's hard to
         // determine if that's actually true but it's also not clear if that would be an issue.
@@ -740,6 +717,8 @@ static ncclResult_t scheduleCollTasksToPlan(
       plan->kernelFn = ncclDevKernelForFunc[task->devFuncId];
       plan->kernelSpecialized = ncclDevKernelForFuncIsSpecialized[task->devFuncId];
     }
+    // Profiler
+    plan->groupApiEventHandle = task->groupApiEventHandle;
 
     if (comm->rank == 0) {
       INFO(NCCL_TUNING, "%s: %ld Bytes -> Algo %s proto %s channel{Lo..Hi}={%d..%d}",
@@ -792,8 +771,9 @@ static ncclResult_t addP2pToPlan(
     int nChannelsMin, int nChannelsMax, int p2pRound,
     int sendRank, void* sendAddr, ssize_t sendBytes,
     int recvRank, void* recvAddr, ssize_t recvBytes,
-    struct ncclTaskP2p** p2pTasks
+    const int planTotalTasks[], struct ncclTaskP2p** p2pTasks
   ) {
+  ncclResult_t ret = ncclSuccess;
   constexpr int connIndex = 1;
   bool selfSend = (sendRank == comm->rank);
   // recv: dir=0, send: dir=1
@@ -804,6 +784,8 @@ static ncclResult_t addP2pToPlan(
   bool proxySameProcess[2] = {true, true};
   void** handles[2] = {NULL, NULL};
   uint8_t base = ncclP2pChannelBaseForRound(comm, p2pRound);
+  struct ncclProxyOp proxyOps[2] = {};
+  int nProxyOps = selfSend ? 0 : 2;
   if (!selfSend) {
     for (int part=0; part < nChannelsMax; part++) {
       int channelId = ncclP2pChannelForPart(comm->p2pnChannels, base, part);
@@ -857,7 +839,7 @@ static ncclResult_t addP2pToPlan(
       bool pxnUsed = !ncclPxnDisable(comm) && comm->isAllNvlink && comm->maxLocalRanks > 1;
       if (bytes[dir] > 0 && proxySameProcess[dir] && protocol[dir] == NCCL_PROTO_SIMPLE && (!pxnUsed)) {
         int regFlag = 0;
-        NCCLCHECK(ncclCalloc(&handles[dir], nChannelsMax));
+        NCCLCHECKGOTO(ncclCalloc(&handles[dir], nChannelsMax), ret, cleanup);
         for (int part = 0; part < nChannelsMax; part++) {
           int channelId = ncclP2pChannelForPart(comm->p2pnChannels, base, part);
           struct ncclChannelPeer** channelPeers = comm->channels[channelId].peers;
@@ -880,7 +862,7 @@ static ncclResult_t addP2pToPlan(
       void* regAddr = NULL;
       if (conn->conn.flags & (NCCL_P2P_WRITE | NCCL_P2P_READ)) {
         // We require users registering buffers on both sides
-        NCCLCHECK(ncclRegisterP2pIpcBuffer(comm, addrs[dir], bytes[dir], peerRank, &regFlag, &regAddr, &plan->cleanupQueue));
+        NCCLCHECKGOTO(ncclRegisterP2pIpcBuffer(comm, addrs[dir], bytes[dir], peerRank, &regFlag, &regAddr, &plan->cleanupQueue), ret, cleanup);
         if (regFlag) {
           if (dir == 0 && (conn->conn.flags & NCCL_P2P_WRITE)) recvAddr = regAddr;
           else if (dir == 1 && (conn->conn.flags & NCCL_P2P_READ)) sendAddr = regAddr;
@@ -905,14 +887,17 @@ static ncclResult_t addP2pToPlan(
     if (p2pTasks[dir]) p2pTasks[dir]->nChannels = nChannels[dir];
   }
 
-  struct ncclWorkList* workNode = ncclMemoryStackAllocInlineArray<ncclWorkList, ncclDevWorkP2p>(&comm->memScoped, 1);
+  struct ncclWorkList* workNode;
+  workNode = ncclMemoryStackAllocInlineArray<ncclWorkList, ncclDevWorkP2p>(&comm->memScoped, 1);
   workNode->workType = ncclDevWorkTypeP2p;
   workNode->size = sizeof(struct ncclDevWorkP2p);
   ncclIntruQueueEnqueue(&plan->workQueue, workNode);
-  uint32_t workOffset = plan->workBytes;
+  uint32_t workOffset;
+  workOffset = plan->workBytes;
   plan->workBytes += sizeof(struct ncclDevWorkP2p);
 
-  struct ncclDevWorkP2p* work = (struct ncclDevWorkP2p*)(workNode+1);
+  struct ncclDevWorkP2p* work;
+  work = (struct ncclDevWorkP2p*)(workNode+1);
   work->nP2pChannels = comm->p2pnChannels;
   work->channelBase = base;
   work->nSendChannels = nChannels[1];
@@ -933,8 +918,6 @@ static ncclResult_t addP2pToPlan(
   work->recvBytes = recvBytes==-1 ? 0 : recvBytes;
   work->profilerEnabled = ncclProfilerPluginLoaded() && ((p2pTasks[0] ? p2pTasks[0] : p2pTasks[1])->eActivationMask & ncclProfileKernelCh);
 
-  struct ncclProxyOp proxyOps[2] = {};
-  int nProxyOps = selfSend ? 0 : 2;
   for (int dir=0; dir < nProxyOps; dir++) {
     struct ncclProxyOp* op = &proxyOps[dir];
     op->root = dir ? sendRank : recvRank;
@@ -947,6 +930,7 @@ static ncclResult_t addP2pToPlan(
     op->chunkSize = chunkSize[dir];
     op->reg = netRegistered[dir];
     op->coll = p2pTasks[dir] ? p2pTasks[dir]->func : 0;
+    op->collAPI = p2pTasks[dir] ? p2pTasks[dir]->collAPI : 0;
     op->task.p2p = p2pTasks[dir];
     op->rank = comm->rank;
     op->eActivationMask = p2pTasks[dir] ? p2pTasks[dir]->eActivationMask : 0;
@@ -955,6 +939,15 @@ static ncclResult_t addP2pToPlan(
   }
 
   nChannelsMax = std::max(nChannels[0], nChannels[1]);
+  // Determine how many peers this plan will target concurrently. Make a
+  // simplifying assumption that each task targets a different peer.
+  // Each task is striped across 'nChannelsMax' of 'p2pnChannels' channels.
+  // Each channel runs up to NCCL_MAX_DEV_WORK_P2P_PER_BATCH tasks concurrently.
+  int maxConcurrent;
+  int concurrentTasks[2];
+  maxConcurrent = comm->p2pnChannels / nChannelsMax * NCCL_MAX_DEV_WORK_P2P_PER_BATCH;
+  concurrentTasks[0] = std::min(planTotalTasks[0], maxConcurrent);
+  concurrentTasks[1] = std::min(planTotalTasks[1], maxConcurrent);
   for (int part=0; part < nChannelsMax; part++) {
     int incWorkCounter = -1;
     int channelId = ncclP2pChannelForPart(comm->p2pnChannels, base, part);
@@ -1003,13 +996,17 @@ static ncclResult_t addP2pToPlan(
         // equal one plus the batch index this p2p settled in.
         proxyOps[dir].channelId = channelId;
         proxyOps[dir].opCount = uint64_t(comm->planner.wipPlan.channels[channelId].nWorkBatchesP2p)<<1 | 1;
-        NCCLCHECK(addProxyOpIfNeeded(comm, plan, &proxyOps[dir]));
-        NCCLCHECK(addProfilerProxyOpIfNeeded(comm, plan, &proxyOps[dir]));
+        proxyOps[dir].nChannels = nChannels[dir];
+        proxyOps[dir].nPeers = concurrentTasks[dir];
+        NCCLCHECKGOTO(addProxyOpIfNeeded(comm, plan, &proxyOps[dir]), ret, cleanup);
+        NCCLCHECKGOTO(addProfilerProxyOpIfNeeded(comm, plan, &proxyOps[dir]), ret, cleanup);
       }
     }
   }
-
-  return ncclSuccess;
+cleanup:
+  free(handles[0]);
+  free(handles[1]);
+  return ret;
 }
 
 static int calcP2pChannelCount(size_t totalSize, int minChannels, int maxChannels, size_t minSize, size_t maxSize) {
@@ -1041,6 +1038,8 @@ static ncclResult_t scheduleP2pTasksToPlan(
   // Try to use all channels, but one channel per operation.
   while (nChannelsMin*nRanks > comm->p2pnChannels && nChannelsMin > 1) nChannelsMin /= 2;
 
+  // Save the total count of send/recv tasks in the plan
+  int planTotalTasks[2] = {comm->planner.nTasksP2pRecv, comm->planner.nTasksP2pSend};
   while (comm->planner.nTasksP2p != 0) {
     for (int round=0; round < nRanks; round++) {
       int sendRank = comm->p2pSchedule[round].sendRank;
@@ -1071,22 +1070,30 @@ static ncclResult_t scheduleP2pTasksToPlan(
         ncclMemoryPoolFree(&comm->memPool_ncclTaskP2p, send);
         ncclMemoryPoolFree(&comm->memPool_ncclTaskP2p, recv);
         comm->planner.nTasksP2p -= 2;
+        comm->planner.nTasksP2pSend -= 1;
+        comm->planner.nTasksP2pRecv -= 1;
       } else {
         // Ensure room for worst case of one new batch per channel.
         if (!testBudget(budget, plan->nWorkBatches+nChannelsMax, plan->workBytes + sizeof(struct ncclDevWorkP2p))) {
           return ncclSuccess;
         }
         struct ncclTaskP2p* p2pTasks[2] = { recv, send };
-        NCCLCHECK(addP2pToPlan(comm, plan, nChannelsMin, nChannelsMax, round, sendRank, sendBuff, sendBytes, recvRank, recvBuff, recvBytes, p2pTasks));
+        NCCLCHECK(addP2pToPlan(comm, plan, nChannelsMin, nChannelsMax, round, sendRank, sendBuff, sendBytes, recvRank, recvBuff, recvBytes, planTotalTasks, p2pTasks));
         if (send != nullptr) {
           ncclIntruQueueDequeue(&peers[sendRank].sendQueue);
+          // Profiler - We can overwrite groupAPI event handles here since all operations here belong to the same group
+          plan->groupApiEventHandle = send->groupApiEventHandle;
           ncclIntruQueueEnqueue(&plan->p2pTaskQueue, send);
           comm->planner.nTasksP2p -= 1;
+          comm->planner.nTasksP2pSend -= 1;
         }
         if (recv != nullptr) {
           ncclIntruQueueDequeue(&peers[recvRank].recvQueue);
+          // Profiler - We can overwrite groupAPI event handles here since all operations here belong to the same group
+          plan->groupApiEventHandle = recv->groupApiEventHandle;
           ncclIntruQueueEnqueue(&plan->p2pTaskQueue, recv);
           comm->planner.nTasksP2p -= 1;
+          comm->planner.nTasksP2pRecv -= 1;
         }
       }
     }
@@ -1125,7 +1132,7 @@ namespace {
 }
 
 static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* plan) {
-  if (plan->isSymColl) return ncclSuccess;
+  if (plan->isSymColl || plan->isCeColl) return ncclSuccess;
 
   size_t workBytes = plan->workBytes;
   size_t batchBytes = plan->nWorkBatches*sizeof(struct ncclDevWorkBatch);
@@ -1297,7 +1304,7 @@ static ncclResult_t hostStreamPlanTask(struct ncclComm* comm, struct ncclKernelP
 }
 
 static void CUDART_CB hostStreamPlanCallback(void *plan_) {
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
   struct ncclKernelPlan* plan = (struct ncclKernelPlan*)plan_;
   ncclResult_t result = hostStreamPlanTask(plan->comm, plan);
   if (result != ncclSuccess) {
@@ -1318,6 +1325,9 @@ static ncclResult_t reclaimPlan(struct ncclComm* comm, struct ncclCommCallback*
       CUDACHECK(cudaThreadExchangeStreamCaptureMode(&mode));
     }
   }
+  if (plan->isSymColl) {
+    free(plan->kernelSymArgs);
+  }
   // Free coll tasks
   struct ncclTaskColl* ct = ncclIntruQueueHead(&plan->collTaskQueue);
   while (ct != nullptr) {
@@ -1394,7 +1404,9 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
   planner->persistent = persistent;
   int nPlans = 0;
 
-  if (planner->nTasksColl + planner->nTasksP2p != 0) {
+  if (planner->nTasksColl + planner->nTasksP2p != 0 ||
+      !ncclIntruQueueEmpty(&planner->collSymTaskQueue) ||
+      !ncclIntruQueueEmpty(&planner->collCeTaskQueue)) {
     do {
       memset(&planner->wipPlan, 0, sizeof(planner->wipPlan));
 
@@ -1406,53 +1418,55 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
       plan->workStorageType = persistent ? ncclDevWorkStorageTypePersistent
                                          : ncclDevWorkStorageTypeFifo;
 
-      if (planner->isSymColl) {
-        plan->workStorageType = ncclDevWorkStorageTypeArgs;
+      if (!ncclIntruQueueEmpty(&planner->collCeTaskQueue)) {
+        struct ncclTaskColl* task = ncclIntruQueueHead(&planner->collCeTaskQueue);
+        plan->isCeColl = true;
+        plan->ceCollArgs = ncclMemoryStackAlloc<struct ncclCeCollArgs>(&comm->memScoped);
+        plan->ceCollArgs->rootRank = task->root;
+        plan->ceCollArgs->nElts = task->count;
+        plan->ceCollArgs->eltSize = ncclTypeSize(task->datatype);
+        plan->ceCollArgs->sendBuff = (uint8_t*)task->sendbuff;
+        plan->ceCollArgs->recvBuff = (uint8_t*)task->recvbuff;
+        plan->ceCollArgs->func = task->func;
+        plan->ceCollArgs->sendWin = task->sendWin;
+        plan->ceCollArgs->recvWin = task->recvWin;
 
-        struct ncclTaskColl* task = ncclIntruQueueHead(&planner->collTaskQueue);
-        plan->isSymColl = true;
-        plan->kernelFn = ncclSymGetKernelPtr((ncclSymKernelId)task->devFuncId, task->opDev.op, task->datatype);
-        plan->threadPerBlock = task->nWarps*WARP_SIZE;
-        plan->channelMask = uint64_t(-1) >> (64-task->nMaxChannels);
-
-        plan->kernelArgsSize = sizeof(struct ncclSymDevArgs);
-        plan->kernelSymArgs = ncclMemoryStackAlloc<struct ncclSymDevArgs>(&comm->memScoped);
-        plan->kernelSymArgs->comm = comm->symDevComm;
-        plan->kernelSymArgs->rootRank = task->root;
-        plan->kernelSymArgs->redOpArg = task->opDev.scalarArg;
-        plan->kernelSymArgs->nElts = task->count;
-        plan->kernelSymArgs->input = (char*)task->sendbuff;
-        plan->kernelSymArgs->output = (char*)task->recvbuff;
-
-        planner->nTasksColl -= 1;
         ncclIntruQueueEnqueue(&planner->planQueue, plan);
-        INFO(NCCL_TUNING, "%s [Symmetric]: %ld Bytes -> Kernel %s nchannels %d nthreads %d",
-        ncclFuncToString(task->func), task->count * ncclTypeSize(task->datatype), ncclSymKernelIdToString(task->devFuncId), task->nMaxChannels, plan->threadPerBlock);
+        ncclIntruQueueDequeue(&planner->collCeTaskQueue);
+        ncclMemoryPoolFree(&comm->memPool_ncclTaskColl, task);
         nPlans += 1;
       } else {
-        struct ncclKernelPlanBudget budget;
-        budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
-        // Non-persistent kernels fill up at most half of our fifo per kernel.
-        budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
+        if (!ncclIntruQueueEmpty(&planner->collSymTaskQueue)) {
+          NCCLCHECKGOTO(ncclSymmetricTaskScheduler(comm, &planner->collSymTaskQueue, plan), result, failure);
+        }
+        else {
+          struct ncclKernelPlanBudget budget;
+          budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
+          // Non-persistent kernels fill up at most half of our fifo per kernel.
+          budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
 
-        // Drain coll tasks first. This is essential since we partition tasks based
-        // on the work budget and p2p work isn't collective. If we were to drain p2p
-        // first, the place where we cut the kernel could vary by rank which would
-        // cause the "shortest channel first" channel picker to have divergent results.
-        if (planner->nTasksColl != 0) {
-          NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
-        }
-        // And only drain p2p tasks once colls are depleted.
-        if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
-          NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
+          // Drain coll tasks first. This is essential since we partition tasks based
+          // on the work budget and p2p work isn't collective. If we were to drain p2p
+          // first, the place where we cut the kernel could vary by rank which would
+          // cause the "shortest channel first" channel picker to have divergent results.
+          if (planner->nTasksColl != 0) {
+            NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
+          }
+          // And only drain p2p tasks once colls are depleted.
+          if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
+            NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
+          }
         }
+
         finishPlan(comm, plan);
         if (plan->workBytes != 0) {
           ncclIntruQueueEnqueue(&planner->planQueue, plan);
           nPlans += 1;
         }
       }
-    } while (planner->nTasksColl + planner->nTasksP2p != 0);
+    } while (planner->nTasksColl + planner->nTasksP2p != 0 ||
+             !ncclIntruQueueEmpty(&planner->collSymTaskQueue) ||
+             !ncclIntruQueueEmpty(&planner->collCeTaskQueue));
 
     struct ncclKernelPlan* planHead = ncclIntruQueueHead(&planner->planQueue);
     planner->unlaunchedPlansHead = planHead;
@@ -1531,8 +1545,6 @@ ncclResult_t ncclLaunchKernelBefore_NoUncapturedCuda(struct ncclComm* comm, stru
 NCCL_PARAM(MemSyncDomain, "MEM_SYNC_DOMAIN", cudaLaunchMemSyncDomainRemote);
 #endif
 
-NCCL_PARAM(NvlinkUtilCentricSchedEnable, "NVLINK_UTIL_CENTRIC_SCHED_ENABLE", 0);
-
 ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan) {
   ncclResult_t ret = ncclSuccess;
   struct ncclKernelPlanner* planner = &comm->planner;
@@ -1542,6 +1554,9 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
   dim3 block = {(unsigned)plan->threadPerBlock, 1, 1};
   int smem = ncclShmemDynamicSize(comm->cudaArch);
   cudaStream_t launchStream = planner->streams->stream;
+
+  NCCLCHECK(ncclProfilerStartKernelLaunchEvent(plan, launchStream));
+
   void* extra[] = {
     CU_LAUNCH_PARAM_BUFFER_POINTER, plan->kernelArgs,
     CU_LAUNCH_PARAM_BUFFER_SIZE, &plan->kernelArgsSize,
@@ -1588,25 +1603,24 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
     }
     #endif
     #if CUDART_VERSION >= 12030
-    bool capturing = ncclCudaGraphValid(planner->capturingGraph);
     enum ncclImplicitOrder implicitOrder;
-    NCCLCHECKGOTO(getImplicitOrder(&implicitOrder, capturing, driverVersion), ret, do_return);
+    NCCLCHECKGOTO(getImplicitOrder(&implicitOrder, plan->persistent, driverVersion), ret, do_return);
     if (implicitOrder == ncclImplicitOrderLaunch) {
       launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_LAUNCH_COMPLETION_EVENT;
       launchAttrs[attrs].value.launchCompletionEvent.event = comm->sharedRes->launchEvent;
       launchAttrs[attrs].value.launchCompletionEvent.flags = 0;
       attrs++;
     }
-    if (comm->planner.isSymColl && compCap >= 90 && driverVersion >= 12030) {
+    if (plan->isSymColl && compCap >= 90 && driverVersion >= 12030) {
       launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
       launchAttrs[attrs].value.programmaticStreamSerializationAllowed = 1;
       attrs++;
     }
     #endif
     #if CUDART_VERSION >= 13000
-    if (compCap >= 90 && driverVersion >= 13000) {
+    if (compCap >= 100 && driverVersion >= 13000) {
       launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_NVLINK_UTIL_CENTRIC_SCHEDULING;
-      launchAttrs[attrs].value.nvlinkUtilCentricScheduling = ncclParamNvlinkUtilCentricSchedEnable();
+      launchAttrs[attrs].value.nvlinkUtilCentricScheduling = comm->config.nvlinkCentricSched;
       attrs++;
     }
     #endif
@@ -1628,6 +1642,7 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
   }
 
 do_return:
+  NCCLCHECK(ncclProfilerStopKernelLaunchEvent(plan));
   return ret;
 }
 
@@ -1765,6 +1780,8 @@ static ncclResult_t updateCollCostTable(
     if ((a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) && collNetSupport != 1) continue;
     // CollNetDirect is only supported for up to 8 local GPUs
     if (a == NCCL_ALGO_COLLNET_DIRECT && comm->maxLocalRanks > NCCL_MAX_DIRECT_ARITY+1) continue;
+    // Disable CollNet Chain for more than 8 local GPUs
+    if (a == NCCL_ALGO_COLLNET_CHAIN && comm->maxLocalRanks > NCCL_MAX_DIRECT_ARITY+1) continue;
     if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && (!nvlsSupport || (info->func != ncclFuncAllReduce && comm->localRanks > NCCL_MAX_NVLS_ARITY))) continue;
     if (a == NCCL_ALGO_NVLS && collNetSupport != 1 && comm->nNodes > 1) continue;
     /* Tree reduceScatter doesn't support scaling yet */
@@ -1844,7 +1861,11 @@ static ncclResult_t topoGetAlgoInfo(
     }
   } else if (info->algorithm == NCCL_ALGO_NVLS || info->algorithm == NCCL_ALGO_NVLS_TREE) {
     // NVLS should not need more than 16 channels to get peak BW.
-    nc = comm->nvlsChannels;
+    if (comm->nNodes > 1 && info->algorithm == NCCL_ALGO_NVLS) {
+      nc = std::min(comm->nvlsChannels, comm->nChannels);
+    } else {
+      nc = comm->nvlsChannels;
+    }
   } else {
     // Ring/Tree channel tuning
     while (nBytes < nc * nt * threadThreshold) {
@@ -2107,6 +2128,7 @@ static ncclResult_t calcCollChunking(
   }
   proxyOp->pattern = pattern;
   proxyOp->coll = info->func;
+  proxyOp->collAPI = info->func;
   proxyOp->root = info->root;
   proxyOp->isOneRPN = comm->isOneRPN;
   // This is used by P2P to reduce the receive buffer size. We don't use it in collectives
@@ -2170,6 +2192,35 @@ static ncclResult_t calcCollChunking(
     proxyOp->nbytes = DIVUP(nBytes, nChannels);
   }
 
+  // Set peer count hints used by network plugin
+  switch (proxyOp->pattern) {
+  case ncclPatternRing:
+  case ncclPatternRingTwice:
+  case ncclPatternPipelineFrom:
+  case ncclPatternPipelineTo:
+  case ncclPatternPatUp:
+  case ncclPatternPatDown:
+    proxyOp->nPeers = 1;
+    break;
+  case ncclPatternTreeUp:
+  case ncclPatternTreeDown:
+  case ncclPatternTreeUpDown:
+  case ncclPatternNvlsTree:
+    proxyOp->nPeers = (NCCL_MAX_TREE_ARITY - 1) * 2;
+    break;
+  case ncclPatternCollnetChain:
+  case ncclPatternCollnetDirect:
+  case ncclPatternNvls:
+  case ncclPatternProfiler:
+    // Peer count hints unused
+    break;
+  case ncclPatternSend:
+  case ncclPatternRecv:
+  default:
+    WARN("Unknown pattern %d", pattern);
+    return ncclInternalError;
+  }
+
   *outChunkSize = proxyOp->chunkSize;
   return ncclSuccess;
 }
@@ -2269,111 +2320,8 @@ static ncclResult_t hostToDevRedOp(
   return ncclSuccess;
 }
 
-// Converts `info` to a task and adds it to `comm->planner`. The exception is with
-// single rank communicators, collectives are issued as `ncclMemcpyAsync`s and
-// thus don't need a task.
-static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
+static ncclResult_t ncclPlannerSetCapturingGraph(struct ncclComm* comm, struct ncclInfo* info) {
   struct ncclKernelPlanner *planner = &comm->planner;
-
-  if (info->coll == ncclFuncSend || info->coll == ncclFuncRecv) {
-    int peer = info->root;
-    ssize_t nBytes = info->count*ncclTypeSize(info->datatype);
-    bool isSendNotRecv = info->coll == ncclFuncSend;
-
-    // Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
-    ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
-    struct ncclTaskP2p* p2p = ncclMemoryPoolAlloc<struct ncclTaskP2p>(&comm->memPool_ncclTaskP2p, &comm->memPermanent);
-    p2p->func = info->coll;
-    p2p->buff = (void*)info->recvbuff;
-    p2p->count = info->count;
-    p2p->datatype = info->datatype;
-    p2p->root = info->root;
-    p2p->bytes = nBytes;
-    p2p->eActivationMask = __atomic_load_n(&ncclProfilerEventMask, __ATOMIC_RELAXED);
-    ncclIntruQueueEnqueue(
-      isSendNotRecv ? &planner->peers[peer].sendQueue : &planner->peers[peer].recvQueue,
-      p2p);
-    planner->nTasksP2p += 1;
-
-    // Mark channels that need pre-connect
-    if (comm->rank != peer) {
-      if (!(isSendNotRecv ? planner->peers[peer].sendSeen : planner->peers[peer].recvSeen)) {
-        // planner->peers[peer].send/recvSeen is private to each comm, so we need to set it anyway.
-        (isSendNotRecv ? planner->peers[peer].sendSeen : planner->peers[peer].recvSeen) = true;
-        int round = 0;
-        while (peer != (isSendNotRecv ? comm->p2pSchedule[round].sendRank
-                                      : comm->p2pSchedule[round].recvRank)) {
-          round += 1;
-        }
-        uint8_t base = ncclP2pChannelBaseForRound(comm, round);
-        for (int c=0; c < comm->p2pnChannelsPerPeer; c++) {
-          int channelId = ncclP2pChannelForPart(comm->p2pnChannels, base, c);
-          if (isSendNotRecv) {
-            if (comm->channels[channelId].peers[peer]->send[1].hasSeen == 0) { // P2P uses only 1 connector
-              // the send/recv connector is shared among split shared comms. We need to set hasSeen to
-              // 1 in order to avoid duplicate connection setup if user group sendrecv ops with split
-              // shared comms together.
-              comm->channels[channelId].peers[peer]->send[1].hasSeen = 1;
-              comm->connectSend[peer] |= (1UL<<channelId);
-              ncclGroupCommPreconnect(comm);
-            }
-          } else {
-            if (comm->channels[channelId].peers[peer]->recv[1].hasSeen == 0) { // P2P uses only 1 connector
-              comm->channels[channelId].peers[peer]->recv[1].hasSeen = 1;
-              comm->connectRecv[peer] |= (1UL<<channelId);
-              ncclGroupCommPreconnect(comm);
-            }
-          }
-        }
-      }
-    }
-  } else {
-    // Empty collectives can be discarded.
-    if (info->count == 0) return ncclSuccess;
-
-    if (info->datatype == ncclFloat8e4m3 || info->datatype == ncclFloat8e5m2) {
-      if (comm->minCompCap < 90) {
-        WARN("FP8 reduction support begins with sm90 capable devices.");
-        return ncclInvalidArgument;
-      }
-    }
-
-    // Copy reduction op state from op handle into info struct here since the
-    // op handle may be destroyed before ncclGroupEnd().
-    struct ncclDevRedOpFull opDev;
-    NCCLCHECK(hostToDevRedOp(&opDev, info->op, info->datatype, comm));
-
-    if (comm->nRanks == 1) {
-      NCCLCHECK(ncclLaunchOneRank(info->recvbuff, info->sendbuff, info->count, opDev, info->datatype, info->stream));
-      return ncclSuccess;
-    } else {
-      // Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
-      ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
-      struct ncclTaskColl* t = ncclMemoryPoolAlloc<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, &comm->memPermanent);
-      t->func = info->coll;
-      t->sendbuff = info->sendbuff;
-      t->recvbuff = info->recvbuff;
-      t->count = info->count;
-      t->root = info->root;
-      t->datatype = info->datatype;
-      size_t elementSize = ncclTypeSize(t->datatype);
-      if (t->func == ncclFuncAllGather || t->func == ncclFuncBroadcast) {
-        t->count *= elementSize;
-        t->datatype = ncclInt8;
-        elementSize = 1;
-      }
-      t->trafficBytes = t->count*elementSize*ncclFuncTrafficPerByte(t->func, comm->nRanks);
-      t->opHost = info->op;
-      t->opDev = opDev; // C++ struct assignment
-      t->chunkSteps = info->chunkSteps;
-      t->sliceSteps = info->sliceSteps;
-      t->eActivationMask = __atomic_load_n(&ncclProfilerEventMask, __ATOMIC_RELAXED);
-
-      planner->nTasksColl += 1;
-      ncclTaskCollSorterInsert(&planner->collSorter, t, t->trafficBytes);
-    }
-  }
-
   if (info->stream != planner->streamRecent || planner->streams == nullptr) {
     planner->streamRecent = info->stream;
     struct ncclCudaStreamList* l = planner->streams;
@@ -2401,7 +2349,263 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
   return ncclSuccess;
 }
 
+static ncclResult_t p2pTaskAppend(
+    struct ncclComm* comm,
+    struct ncclInfo* info,
+    ncclFunc_t coll,
+    ncclFunc_t collAPI,
+    void* buff,
+    size_t count,
+    ncclDataType_t datatype,
+    int peer) {
+  struct ncclKernelPlanner *planner = &comm->planner;
+
+  // Determine peer and basic parameters.
+  ssize_t nBytes = count*ncclTypeSize(datatype);
+  bool isSendNotRecv = coll == ncclFuncSend;
+
+  // Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
+  ncclGroupCommJoin(comm, ncclGroupTaskTypeCollective);
+  info->coll = coll;
+  // Set capturing graph. Called here so that profiler can emit a group API event with this information
+  NCCLCHECK(ncclPlannerSetCapturingGraph(comm, info));
+  bool isGraphCaptured = ncclCudaGraphValid(planner->capturingGraph);
+  NCCLCHECK(ncclProfilerStartGroupApiEvent(info, isGraphCaptured));
+  NCCLCHECK(ncclProfilerRecordGroupApiEventState(ncclProfilerGroupStartApiStop));
+
+  NCCLCHECK(ncclProfilerStartP2pApiEvent(info, isGraphCaptured));
+
+  struct ncclTaskP2p* p2p = ncclMemoryPoolAlloc<struct ncclTaskP2p>(&comm->memPool_ncclTaskP2p, &comm->memPermanent);
+  p2p->func = coll;
+  p2p->collAPI = collAPI;
+  p2p->buff = buff;
+  p2p->count = count;
+  p2p->datatype = datatype;
+  p2p->root = peer;
+  p2p->bytes = nBytes;
+  p2p->eActivationMask = ncclProfilerApiState.eActivationMask;
+  p2p->groupApiEventHandle = ncclProfilerApiState.groupApiEventHandle;
+  p2p->p2pApiEventHandle = ncclProfilerApiState.p2pApiEventHandle;
+  ncclIntruQueueEnqueue(
+    isSendNotRecv ? &planner->peers[peer].sendQueue : &planner->peers[peer].recvQueue,
+    p2p);
+  planner->nTasksP2p += 1;
+  if (isSendNotRecv)
+    planner->nTasksP2pSend += 1;
+  else
+    planner->nTasksP2pRecv += 1;
+
+  // Mark channels that need pre-connect
+  if (comm->rank != peer) {
+    if (!(isSendNotRecv ? planner->peers[peer].sendSeen : planner->peers[peer].recvSeen)) {
+      // planner->peers[peer].send/recvSeen is private to each comm, so we need to set it anyway.
+      (isSendNotRecv ? planner->peers[peer].sendSeen : planner->peers[peer].recvSeen) = true;
+      int round = 0;
+      while (peer != (isSendNotRecv ? comm->p2pSchedule[round].sendRank
+                                    : comm->p2pSchedule[round].recvRank)) {
+        round += 1;
+      }
+      uint8_t base = ncclP2pChannelBaseForRound(comm, round);
+      for (int c=0; c < comm->p2pnChannelsPerPeer; c++) {
+        int channelId = ncclP2pChannelForPart(comm->p2pnChannels, base, c);
+        if (isSendNotRecv) {
+          if (comm->channels[channelId].peers[peer]->send[1].hasSeen == 0) { // P2P uses only 1 connector
+            // the send/recv connector is shared among split shared comms. We need to set hasSeen to
+            // 1 in order to avoid duplicate connection setup if user group sendrecv ops with split
+            // shared comms together.
+            comm->channels[channelId].peers[peer]->send[1].hasSeen = 1;
+            comm->channels[channelId].peers[peer]->send[1].p2pOnly = 1;
+            comm->connectSend[peer] |= (1UL<<channelId);
+            ncclGroupCommPreconnect(comm);
+          }
+        } else {
+          if (comm->channels[channelId].peers[peer]->recv[1].hasSeen == 0) { // P2P uses only 1 connector
+            comm->channels[channelId].peers[peer]->recv[1].hasSeen = 1;
+            comm->channels[channelId].peers[peer]->recv[1].p2pOnly = 1;
+            comm->connectRecv[peer] |= (1UL<<channelId);
+            ncclGroupCommPreconnect(comm);
+          }
+        }
+      }
+    }
+  }
+  ncclProfilerStopP2pApiEvent();
+  return ncclSuccess;
+}
+
+static ncclResult_t collTaskAppend(
+    struct ncclComm* comm,
+    struct ncclInfo* info,
+    struct ncclDevRedOpFull opDev) {
+  struct ncclKernelPlanner *planner = &comm->planner;
+
+  // Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
+  ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
+  // Set capturing graph. Called here so that profiler can emit a group API event with this information
+  NCCLCHECK(ncclPlannerSetCapturingGraph(comm, info));
+  bool isGraphCaptured = ncclCudaGraphValid(planner->capturingGraph);
+  NCCLCHECK(ncclProfilerStartGroupApiEvent(info, isGraphCaptured));
+  NCCLCHECK(ncclProfilerRecordGroupApiEventState(ncclProfilerGroupStartApiStop));
+  NCCLCHECK(ncclProfilerStartCollApiEvent(info, isGraphCaptured));
+  
+  struct ncclTaskColl* t = ncclMemoryPoolAlloc<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, &comm->memPermanent);
+  t->func = info->coll;
+  t->sendbuff = info->sendbuff;
+  t->recvbuff = info->recvbuff;
+  t->count = info->count;
+  t->root = info->root;
+  t->datatype = info->datatype;
+  size_t elementSize = ncclTypeSize(t->datatype);
+  if (t->func == ncclFuncAllGather || t->func == ncclFuncBroadcast) {
+    t->count *= elementSize;
+    t->datatype = ncclInt8;
+    elementSize = 1;
+  }
+  t->trafficBytes = t->count*elementSize*ncclFuncTrafficPerByte(t->func, comm->nRanks);
+  t->opHost = info->op;
+  t->opDev = opDev; // C++ struct assignment
+  t->chunkSteps = info->chunkSteps;
+  t->sliceSteps = info->sliceSteps;
+  t->eActivationMask = ncclProfilerApiState.eActivationMask;
+  t->groupApiEventHandle = ncclProfilerApiState.groupApiEventHandle;
+  t->collApiEventHandle = ncclProfilerApiState.collApiEventHandle;
+
+  planner->nTasksColl += 1;
+  ncclTaskCollSorterInsert(&planner->collSorter, t, t->trafficBytes);
+
+  ncclProfilerStopCollApiEvent();
+  return ncclSuccess;
+}
+
+static ncclResult_t ceCollTaskAppend(
+    struct ncclComm* comm,
+    struct ncclInfo* info,
+    struct ncclDevrWindow* sendWin,
+    struct ncclDevrWindow* recvWin,
+    struct ncclDevRedOpFull opDev) {
+  struct ncclKernelPlanner *planner = &comm->planner;
+  
+  // Check if CE needs initialization
+  if (comm->ceColl.baseUCSymReadyPtr == NULL && ncclIntruQueueEmpty(&comm->ceInitTaskQueue)) {
+    struct ncclCeInitTask* ceTask;
+    NCCLCHECK(ncclCalloc(&ceTask, 1));
+    ceTask->comm = comm;
+    ncclIntruQueueEnqueue(&comm->ceInitTaskQueue, ceTask);
+    ncclGroupCommJoin(comm, ncclGroupTaskTypeSymRegister);
+  }
+
+  // Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
+  ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
+  NCCLCHECK(ncclPlannerSetCapturingGraph(comm, info));
+  struct ncclTaskColl* t = ncclMemoryPoolAlloc<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, &comm->memPermanent);
+
+  t->func = info->coll;
+  t->sendbuff = info->sendbuff;
+  t->recvbuff = info->recvbuff;
+  t->count = info->count;
+  t->root = info->root;
+  t->datatype = info->datatype;
+  size_t elementSize = ncclTypeSize(t->datatype);
+  if (t->func == ncclFuncAllGather || t->func == ncclFuncBroadcast) {
+    t->count *= elementSize;
+    t->datatype = ncclInt8;
+    elementSize = 1;
+  }
+  t->trafficBytes = t->count*elementSize*ncclFuncTrafficPerByte(t->func, comm->nRanks);
+  t->opHost = info->op;
+  t->opDev = opDev; // C++ struct assignment
+  t->chunkSteps = info->chunkSteps;
+  t->sliceSteps = info->sliceSteps;
+  t->eActivationMask = __atomic_load_n(&ncclProfilerEventMask, __ATOMIC_RELAXED);
+  t->sendWin = sendWin;
+  t->recvWin = recvWin;
+
+  ncclIntruQueueEnqueue(&planner->collCeTaskQueue, t);
+
+  return ncclSuccess;
+}
+
+// Converts `info` to a task and adds it to `comm->planner`. The exception is
+// single-rank communicators, where collectives are issued as `ncclMemcpyAsync`s
+// and thus don't need a task.
+static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
+  ncclFunc_t collAPI = info->coll;
+
+  if (info->coll == ncclFuncSend || info->coll == ncclFuncRecv) {
+    NCCLCHECK(p2pTaskAppend(comm, info, info->coll, collAPI, (void*)info->recvbuff, info->count, info->datatype, info->root));
+  } else {
+    // Empty collectives can be discarded.
+    if (info->count == 0) return ncclSuccess;
+
+    if (info->datatype == ncclFloat8e4m3 || info->datatype == ncclFloat8e5m2) {
+      if (comm->minCompCap < 90 && info->coll != ncclFuncAllGather && info->coll != ncclFuncBroadcast && info->coll != ncclFuncAlltoAll && info->coll != ncclFuncScatter && info->coll != ncclFuncGather) {
+        WARN("FP8 reduction support begins with sm90 capable devices.");
+        return ncclInvalidArgument;
+      }
+    }
+
+    // Copy reduction op state from op handle into info struct here since the
+    // op handle may be destroyed before ncclGroupEnd().
+    struct ncclDevRedOpFull opDev;
+    NCCLCHECK(hostToDevRedOp(&opDev, info->op, info->datatype, comm));
+
+    if (comm->nRanks == 1) {
+      NCCLCHECK(ncclLaunchOneRank(info->recvbuff, info->sendbuff, info->count, opDev, info->datatype, info->stream));
+      return ncclSuccess;
+    } else {
+      struct ncclDevrWindow* sendWin;
+      struct ncclDevrWindow* recvWin;
+      ncclDevrFindWindow(comm, info->sendbuff, &sendWin);
+      ncclDevrFindWindow(comm, info->recvbuff, &recvWin);
+      bool ceImplemented = ncclCeImplemented(info->coll, info->op, info->datatype);
+      
+      // Append CE collective task if CE is supported and requested by user
+      if (comm->symmetricSupport && comm->nNodes == 1 && sendWin && recvWin && (sendWin->winFlags & recvWin->winFlags & NCCL_WIN_COLL_SYMMETRIC) && comm->config.CTAPolicy == NCCL_CTA_POLICY_ZERO && ceImplemented) {
+        NCCLCHECK(ceCollTaskAppend(comm, info, sendWin, recvWin, opDev));
+      }
+      // Append kernel-based collective
+      else {
+        if (info->coll == ncclFuncAlltoAll) {
+          for (int r=0; r<comm->nRanks; r++) {
+            NCCLCHECK(p2pTaskAppend(comm, info, ncclFuncSend, collAPI, (void*)((char*)info->sendbuff+r*info->count*ncclTypeSize(info->datatype)), info->count, info->datatype, r));
+            NCCLCHECK(p2pTaskAppend(comm, info, ncclFuncRecv, collAPI, (void*)((char*)info->recvbuff+r*info->count*ncclTypeSize(info->datatype)), info->count, info->datatype, r));
+          }
+        } else if (info->coll == ncclFuncGather){
+          size_t offset = 0;
+          NCCLCHECK(p2pTaskAppend(comm, info, ncclFuncSend, collAPI, (void*)info->sendbuff, info->count, info->datatype, info->root));
+          if (comm->rank == info->root) {
+            for (int r=0; r<comm->nRanks; r++) {
+              void* buff = (void*)((char*)info->recvbuff + offset);
+              NCCLCHECK(p2pTaskAppend(comm, info, ncclFuncRecv, collAPI, buff, info->count, info->datatype, r));
+              offset += info->count * ncclTypeSize(info->datatype);
+            }
+          }
+        } else if (info->coll == ncclFuncScatter) {
+          size_t offset = 0;
+          if (comm->rank == info->root) {
+            for (int r = 0; r < comm->nRanks; r++) {
+              void* buff = (void*)((char*)info->sendbuff + offset);
+              NCCLCHECK(p2pTaskAppend(comm, info, ncclFuncSend, collAPI, buff, info->count, info->datatype, r));
+              offset += info->count * ncclTypeSize(info->datatype);
+            }
+          }
+          NCCLCHECK(p2pTaskAppend(comm, info, ncclFuncRecv, collAPI, (void*)info->recvbuff, info->count, info->datatype, info->root));
+        } else {
+          NCCLCHECK(collTaskAppend(comm, info, opDev));
+        }
+      }
+    }
+  }
+
+  return ncclSuccess;
+}
+
 ncclResult_t ncclEnqueueCheck(struct ncclInfo* info) {
+  // Profiler - If a group API event has already started, update the profilerGroupDepth so that the depth
+  // updates correctly for implicit ncclGroupStartInternal and ncclGroupEndInternal calls
+  if (ncclProfilerApiState.profilerGroupDepth > 0) {
+    ncclProfilerApiState.profilerGroupDepth++;
+  }
   NCCLCHECK(ncclGroupStartInternal());
   ncclResult_t ret = ncclSuccess;
   int devOld = -1;
diff --git a/projects/rccl/src/graph/CMakeLists.txt b/projects/rccl/src/graph/CMakeLists.txt
new file mode 100644
index 0000000000..1dec7cbf7d
--- /dev/null
+++ b/projects/rccl/src/graph/CMakeLists.txt
@@ -0,0 +1,14 @@
+# Graph sources
+set(GRAPH_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/topo.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/tuning.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/xml.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/search.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/paths.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/connect.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/rings.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/trees.cc
+)
+
+# Add graph sources to parent scope
+set(GRAPH_SOURCES ${GRAPH_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/graph/connect.cc b/projects/rccl/src/graph/connect.cc
index 152739b0cc..c5fe959ae6 100644
--- a/projects/rccl/src/graph/connect.cc
+++ b/projects/rccl/src/graph/connect.cc
@@ -21,6 +21,7 @@ ncclResult_t ncclTopoPreset(struct ncclComm* comm, struct ncclTopoGraph** graphs
   int localRanks = comm->topo->nodes[GPU].count;
   int nChannels = comm->nChannels;
 
+  topoRanks->crossNicRing = graphs[NCCL_ALGO_RING]->crossNic;
   topoRanks->nvlsHeadNum = 0;
   for (int c=0; c<nChannels; c++) {
     struct ncclChannel* channel = comm->channels+c;
@@ -232,7 +233,6 @@ static ncclResult_t connectCollNet(struct ncclComm* comm, struct ncclTopoGraph*
     sprintf(line+strlen(line), "nUp %d nHeads %d ", nUp, nHeads);
     sprintf(line+strlen(line), "headRank %d out %d shift %d", channel->collnetDirect.headRank, channel->collnetDirect.out, channel->collnetDirect.shift);
     INFO(NCCL_GRAPH, "%s", line);
-    channel->collnetChain.depth = comm->nRanks/comm->nNodes;
   }
   free(heads);
   return ncclSuccess;
@@ -249,7 +249,7 @@ static ncclResult_t connectNvls(struct ncclComm* comm, int* nvlsHeads, int nHead
     if (nvlsHeads[h * comm->nNodes + comm->node] == comm->rank) headRank = h;
   }
 
-  for (int c=0; c<comm->nChannels; c++) {
+  for (int c=0; c<comm->nvlsChannels; c++) {
     struct ncclChannel* channel = comm->channels+c;
     channel->nvls.nHeads = nHeads;
     for (int h=0; h<nHeads; h++) channel->nvls.up[h] = comm->nRanks+1+h;
@@ -301,7 +301,7 @@ static ncclResult_t connectNvls(struct ncclComm* comm, int* nvlsHeads, int nHead
   }
   // Set prev/next in all channels (NVLS compute channels work
   // orthogonally to NVLS search channels).
-  for (int c=0; c<comm->nChannels; c++) {
+  for (int c=0; c<comm->nvlsChannels; c++) {
     struct ncclChannel* channel = comm->channels+c;
     channel->nvls.treeUp = treeUp[c%2];
     channel->nvls.treeDown[0] = channel->nvls.down;
@@ -389,17 +389,17 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
   NCCLCHECKGOTO(ncclCalloc(&treeToChild1, nNodes*MAXCHANNELS), ret, fail);
   NCCLCHECKGOTO(ncclCalloc(&nvlsHeads, nNodes*MAXCHANNELS), ret, fail);
 
-  // Alternate rings to avoid crossing rails
-  if (graphs[NCCL_ALGO_RING]->crossNic == 2 && (nChannels % 2) == 0) {
-    for (int r=0; r<comm->nRanks; r++) {
-      if (comm->rankToNode[r] % 2 == 1) {
-        // Exchange rings
-        for (int c=0; c<nChannels; c+=2) {
-          exchangeValues(allTopoRanks[r]->ringRecv+c, allTopoRanks[r]->ringRecv+(c^1));
-          exchangeValues(allTopoRanks[r]->ringSend+c, allTopoRanks[r]->ringSend+(c^1));
-          exchangeValues(allTopoRanks[r]->ringPrev+c, allTopoRanks[r]->ringPrev+(c^1));
-          exchangeValues(allTopoRanks[r]->ringNext+c, allTopoRanks[r]->ringNext+(c^1));
-        }
+  // Alternate rings to avoid crossing rails.
+  // crossNic values may differ between nodes, as they depend on the number of net devs and the NVLink bandwidth.
+  // Therefore, rings are only exchanged for ranks that obtained a solution with crossNic=2.
+  for (int r = 0; r < comm->nRanks; r++) {
+    if (allTopoRanks[r]->crossNicRing == 2 && (nChannels % 2) == 0 && (comm->rankToNode[r] % 2) == 1) {
+      // Exchange rings
+      for (int c=0; c<nChannels; c+=2) {
+        exchangeValues(allTopoRanks[r]->ringRecv+c, allTopoRanks[r]->ringRecv+(c^1));
+        exchangeValues(allTopoRanks[r]->ringSend+c, allTopoRanks[r]->ringSend+(c^1));
+        exchangeValues(allTopoRanks[r]->ringPrev+c, allTopoRanks[r]->ringPrev+(c^1));
+        exchangeValues(allTopoRanks[r]->ringNext+c, allTopoRanks[r]->ringNext+(c^1));
       }
     }
   }
@@ -459,7 +459,14 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
       int collNetNchannels = std::min(MAXCHANNELS, nChannels+nChannels/2);
       nChannels = comm->nChannels = copyChannels(comm, nChannels, collNetNchannels, ringPrev, ringNext);
     }
-    NCCLCHECKGOTO(connectCollNet(comm, graphs[NCCL_ALGO_COLLNET_DIRECT]), ret, fail);
+
+    for (int c = 0; c < comm->nChannels; c++) {
+      comm->channels[c].collnetChain.depth = comm->nRanks/comm->nNodes;
+    }
+
+    if (comm->maxLocalRanks <= NCCL_MAX_DIRECT_ARITY+1) {
+      NCCLCHECKGOTO(connectCollNet(comm, graphs[NCCL_ALGO_COLLNET_DIRECT]), ret, fail);
+    }
   }
 
   // Use 4 compute channels per search channel to reach peak BW on <8 PPN
@@ -490,9 +497,6 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
   if (shared && comm->nvlsChannels > parent->nvlsResources->nChannels) {
     comm->nvlsChannels = parent->nvlsResources->nChannels;
   }
-  if (comm->nChannels < comm->nvlsChannels) {
-    nChannels = comm->nChannels = copyChannels(comm, comm->nChannels, comm->nvlsChannels, ringPrev, ringNext);
-  }
   NCCLCHECKGOTO(connectNvls(comm, nvlsHeads, minHeadNum), ret, fail);
 #endif
   if (shared && comm->nChannels > parent->sharedRes->tpNChannels) {
diff --git a/projects/rccl/src/graph/paths.cc b/projects/rccl/src/graph/paths.cc
index 82c0d99726..86d185bc06 100644
--- a/projects/rccl/src/graph/paths.cc
+++ b/projects/rccl/src/graph/paths.cc
@@ -375,11 +375,15 @@ ncclResult_t ncclTopoCheckMNNVL(struct ncclTopoSystem* system, struct ncclPeerIn
   nvmlGpuFabricInfoV_t *fabricInfo1 = &info1->fabricInfo;
   nvmlGpuFabricInfoV_t *fabricInfo2 = &info2->fabricInfo;
   // A zero UUID means we don't have MNNVL fabric info
-  if ((((long *)&fabricInfo2->clusterUuid)[0]|((long *)fabricInfo2->clusterUuid)[1]) == 0) return ncclSuccess;
+  unsigned long uuid0 = 0;
+  unsigned long uuid1 = 0;
+  memcpy(&uuid0, fabricInfo2->clusterUuid, sizeof(uuid0));
+  memcpy(&uuid1, fabricInfo2->clusterUuid + sizeof(uuid0), sizeof(uuid1));
+  if ((uuid0 | uuid1) == 0) return ncclSuccess;
   if ((memcmp(fabricInfo1->clusterUuid, fabricInfo2->clusterUuid, NVML_GPU_FABRIC_UUID_LEN) == 0) &&
       (fabricInfo1->cliqueId == fabricInfo2->cliqueId)) {
     TRACE(NCCL_NET, "MNNVL matching peer 0x%lx UUID %lx.%lx cliqueId 0x%x",
-         info2->busId, ((long *)fabricInfo2->clusterUuid)[0], ((long *)fabricInfo2->clusterUuid)[1], fabricInfo2->cliqueId);
+         info2->busId, uuid0, uuid1, fabricInfo2->cliqueId);
     *ret = 1;
   }
   return ncclSuccess;
@@ -613,7 +617,7 @@ ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks,
   return ncclSuccess;
 }
 
-NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
+NCCL_PARAM(PxnC2c, "PXN_C2C", 1);
 
 ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm* comm) {
   // Precompute paths between GPUs/NICs.
@@ -793,7 +797,6 @@ void ncclTopoFree(struct ncclTopoSystem* system) {
   free(system);
 }
 
-NCCL_PARAM(NChannelsPerNetPeer, "NCHANNELS_PER_NET_PEER", -1);
 
 static ncclResult_t ncclTopoGetNchannels(struct ncclComm* comm, int g /*local gpu index*/, int peerRank, int* nChannels) {
   int peer;
@@ -815,8 +818,8 @@ static ncclResult_t ncclTopoGetNchannels(struct ncclComm* comm, int g /*local gp
     }
   } else {
     // Remote rank, use network
-    int nNetChannels = ncclParamNChannelsPerNetPeer();
-    if (nNetChannels == -1) {
+    int nNetChannels = comm->config.nChannelsPerNetPeer;
+    if (nNetChannels == NCCL_CONFIG_UNDEF_INT) {
        //start from 2 channels per NIC and reduce with scale
        nNetChannels = 2;
 
diff --git a/projects/rccl/src/graph/topo.cc b/projects/rccl/src/graph/topo.cc
index 8fdf54ea4b..3a87725f1d 100644
--- a/projects/rccl/src/graph/topo.cc
+++ b/projects/rccl/src/graph/topo.cc
@@ -8,6 +8,7 @@
 #include "graph.h"
 #include "topo.h"
 #include "comm.h"
+#include "nccl.h"
 #include "nvmlwrap.h"
 #include "coll_net.h"
 #include "transport.h"
@@ -15,6 +16,7 @@
 #include <fcntl.h>
 #include "cpuset.h"
 #include "bootstrap.h"
+#include <mutex>
 
 #define BUSID_SIZE (sizeof("0000:00:00.0"))
 #define BUSID_REDUCED_SIZE (sizeof("0000:00"))
@@ -404,7 +406,7 @@ ncclResult_t ncclTopoAddGpu(struct ncclXmlNode* xmlGpu, struct ncclTopoSystem* s
 
 #define PCI_BRIDGE_DEVICE_CLASS "0x060400"
 
-struct kvDict kvDictPciClass[] = { { PCI_BRIDGE_DEVICE_CLASS, PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
+struct kvDict kvDictPciClass[] = { { PCI_BRIDGE_DEVICE_CLASS, PCI }, {"0x080100", /*CX8 data direct*/PCI}, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
 struct kvDict kvDictPciGen[] = {
   { "2.5 GT/s", 15 }, { "5 GT/s", 30 }, { "8 GT/s", 60 }, { "16 GT/s", 120 }, { "32 GT/s", 240 }, /* Kernel 5.6 and earlier */
   { "2.5 GT/s PCIe", 15 }, { "5.0 GT/s PCIe", 30 }, { "8.0 GT/s PCIe", 60 }, { "16.0 GT/s PCIe", 120 }, { "32.0 GT/s PCIe", 240 }, { "64.0 GT/s PCIe", 480 },
@@ -982,8 +984,7 @@ ncclResult_t ncclTopoMakePciParent(struct ncclXml* xml, struct ncclXmlNode** par
   return ncclSuccess;
 }
 
-ncclResult_t ncclTopoMakeVnic(struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
-struct ncclXmlNode** physNetNodes, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
+ncclResult_t ncclTopoMakeVnic(struct ncclXml* xml, struct ncclTopoNetInfo* netInfo, ncclNetVDeviceProps_t* vProps, struct ncclXmlNode** physNetNodes) {
   if (vProps->ndevs > NCCL_NET_MAX_DEVS_PER_NIC) {
     WARN("TOPO/NET : Tried to merge too many NICs. %d > %d", vProps->ndevs, NCCL_NET_MAX_DEVS_PER_NIC);
     return ncclInternalError;
@@ -997,7 +998,7 @@ struct ncclXmlNode** physNetNodes, ncclResult_t (*makeVDevice)(int*, ncclNetVDev
 
   // Trigger the merge, then get the new device's properties
   int vDevIndex = 0;
-  ncclResult_t ret = makeVDevice(&vDevIndex, vProps);
+  ncclResult_t ret = netInfo->makeVDevice(&vDevIndex, vProps);
   if (ret != ncclSuccess) {
     INFO(NCCL_GRAPH|NCCL_INIT|NCCL_NET, "TOPO/NET : Tried merging multiple devices together and failed. vProps={ndevs=%d, devs=[%d %d %d %d]}. Set NCCL_NET_MERGE_LEVEL=LOC to disable NIC fusion.",
       vProps->ndevs, vProps->devs[0], vProps->devs[1], vProps->devs[2], vProps->devs[3]);
@@ -1015,9 +1016,10 @@ struct ncclXmlNode** physNetNodes, ncclResult_t (*makeVDevice)(int*, ncclNetVDev
   return ncclSuccess;
 }
 
-ncclResult_t ncclTopoForceMerge(struct ncclXml* xml, char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
+ncclResult_t ncclTopoForceMerge(struct ncclXml* xml, struct ncclTopoNetInfo* netInfo, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs) {
   ncclResult_t ret = ncclSuccess;
-  INFO(NCCL_ENV|NCCL_NET, "TOPO/NET : Force-fusing NICs using NCCL_NET_FORCE_MERGE=%s", str);
+  const char* str = netInfo->forceMerge;
+  INFO(NCCL_ENV | NCCL_NET, "TOPO/NET : Force-fusing NICs using NCCL_NET_FORCE_MERGE=%s", str);
   char* ncStr;
   NCCLCHECK(ncclCalloc(&ncStr, strlen(str)+1));
   strcpy(ncStr, str);
@@ -1053,7 +1055,7 @@ ncclResult_t ncclTopoForceMerge(struct ncclXml* xml, char* str, int* placedDevs,
       goto fail;
     }
 
-    ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
+    ret = ncclTopoMakeVnic(xml, netInfo, &vProps, physNetNodes);
     if (ret == ncclSuccess) {
       // Only set that a device is "placed" after successfully making a vNic (it's possible to exit before this)
       for (int i = 0; i < vProps.ndevs; i++) {
@@ -1075,7 +1077,7 @@ fail:
   goto exit;
 }
 
-ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
+ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, struct ncclTopoNetInfo* netInfo, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs) {
   // Compute the path type between each device
   int* paths = NULL;
   ncclResult_t res = ncclSuccess;
@@ -1105,7 +1107,7 @@ ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, int mergeLevel, int* placedD
       // Select each unplaced device "j" which is at most "mergeLevel" distance from "i", but not equal to "i"
       // (Don't merge the same device with itself)
       for (int j = 0; j < nPhysDevs; j++) {
-        if (paths[i*nPhysDevs + j] <= mergeLevel &&
+        if (paths[i*nPhysDevs + j] <= netInfo->mergeLevel &&
         placedDevs[j] == 0 && j != i) {
           vProps.devs[vProps.ndevs++] = j;
           placedDevs[j] = 1;
@@ -1119,7 +1121,7 @@ ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, int mergeLevel, int* placedD
         return ncclInternalError;
       }
 
-      ncclResult_t ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
+      ncclResult_t ret = ncclTopoMakeVnic(xml, netInfo, &vProps, physNetNodes);
 
       // Merging failed.
       // Mark all as unplaced and increase their distance to disconnected (PATH_DIS)
@@ -1157,6 +1159,92 @@ struct kvDict nicPathKvList[] = {
   { NULL, 0 }
 };
 
+
+ncclResult_t ncclTopoFindLinkWidthRec(ncclXmlNode* node, ncclXmlNode** physNetNodes, int ndevs, int* foundPhysNet, int* linkWidth) {
+  int myLinkWidth = 0;
+  if (strcmp(node->name, "pci") == 0) {
+    NCCLCHECK(xmlGetAttrInt(node, "link_width", &myLinkWidth));
+#ifdef ENABLE_TRACE
+    const char *busidAttr, *linkAttr;
+    NCCLCHECK(xmlGetAttrStr(node, "busid", &busidAttr));
+    NCCLCHECK(xmlGetAttr(node, "link_width", &linkAttr));
+    TRACE(NCCL_GRAPH, "Found link_width (%s)=%d for busid=%s", linkAttr, myLinkWidth, busidAttr);
+#endif
+  }
+
+  *foundPhysNet = 0;
+  // Detect if a physical child is found. This information will be propagated up the stack.
+  int devId = 0;
+  while (devId < ndevs && !(*foundPhysNet)) *foundPhysNet = (node == physNetNodes[devId++]);
+
+  int totalChildLinkWidth = 0;
+  for (int i = 0; i < node->nSubs; i++) {
+    ncclXmlNode* child = node->subs[i];
+    int found = 0;
+    int tempLinkWidth = 0;
+    NCCLCHECK(ncclTopoFindLinkWidthRec(child, physNetNodes, ndevs, &found, &tempLinkWidth));
+    if (found) {
+      *foundPhysNet = 1;
+      totalChildLinkWidth += tempLinkWidth;
+    }
+  }
+
+  if (*foundPhysNet == 0) {
+    // No child NICs were found, do not accrue any detected link_width
+    *linkWidth = 0;
+    INFO(NCCL_GRAPH, "Did not find child net device. Returning link_width=%d totalChildLinkWidth=%d", *linkWidth, totalChildLinkWidth);
+  } else if (totalChildLinkWidth == 0) {
+    // If a child NIC was found but no link_width was detected among the children, use my own link_width (I am the first pci node right above the physNetNode).
+    *linkWidth = myLinkWidth;
+    INFO(NCCL_GRAPH, "Found child net device for %s. Returning link_width=%d totalChildLinkWidth=%d", node->name, *linkWidth, totalChildLinkWidth);
+  } else {
+    // Standard recursive accrual of link_width: the minimum of this PCI node's width and the sum of its children's widths, or just that sum if this node has no width of its own.
+    *linkWidth = myLinkWidth > 0 ? std::min(myLinkWidth, totalChildLinkWidth) : totalChildLinkWidth;
+    INFO(NCCL_GRAPH, "Found child net device for %s. Returning link_width=%d totalChildLinkWidth=%d", node->name, *linkWidth, totalChildLinkWidth);
+  }
+
+  return ncclSuccess;
+}
+
+// DFS over nodes under common parent
+// Exclude link widths of non-physNetNodes chains
+ncclResult_t ncclTopoFindLinkWidth(ncclXmlNode* parent, ncclXmlNode** physNetNodes, int ndevs, int* linkWidth) {
+  *linkWidth = 0;
+  for (int i = 0; i < parent->nSubs; i++) {
+    ncclXmlNode* child = parent->subs[i];
+    int foundPhysNet = 0;
+    int childLinkWidth = 0;
+    NCCLCHECK(ncclTopoFindLinkWidthRec(child, physNetNodes, ndevs, &foundPhysNet, &childLinkWidth));
+    if (foundPhysNet) {
+      *linkWidth += childLinkWidth;
+    }
+  }
+
+  return ncclSuccess;
+}
+
+ncclResult_t ncclTopoWidenLinks(ncclXmlNode** physNetNodes, int ndevs, ncclXmlNode* parent) {
+  int sumLinkWidth = 0;
+  NCCLCHECK(ncclTopoFindLinkWidth(parent, physNetNodes, ndevs, &sumLinkWidth));
+  for (int i = 0; i < ndevs; i++) {
+    ncclXmlNode* temp = physNetNodes[i];
+    while (temp != parent) {
+      if (strcmp(temp->name, "pci") == 0) {
+        NCCLCHECK(xmlSetAttrInt(temp, "link_width", sumLinkWidth));
+        TRACE(NCCL_GRAPH, "Set link_width to %d for node %s", sumLinkWidth, temp->name);
+      }
+      temp = temp->parent;
+    }
+  }
+
+  if (strcmp(parent->name, "pci") == 0) {
+    NCCLCHECK(xmlSetAttrInt(parent, "link_width", sumLinkWidth));
+    TRACE(NCCL_GRAPH, "Set link_width to %d for node %s", sumLinkWidth, parent->name);
+  }
+
+  return ncclSuccess;
+}
+
 ncclResult_t ncclTopoGetVNicParent(struct ncclXml* xml, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclNetVDeviceProps_t* vProps, ncclXmlNode** parent) {
   ncclNetProperties_t props[NCCL_NET_MAX_DEVS_PER_NIC];
   ncclXmlNode* physNetNodes[NCCL_NET_MAX_DEVS_PER_NIC];
@@ -1170,54 +1258,50 @@ ncclResult_t ncclTopoGetVNicParent(struct ncclXml* xml, ncclResult_t (*getProper
 
   int path = PATH_LOC;
   NCCLCHECK(ncclTopoGetPath(physNetNodes, vProps->ndevs, &path, parent));
-  if (path == PATH_LOC) {
-    *parent = NULL;
-  } else if (parent && strcmp((*parent)->name, "pci") == 0) {
-    // Compare PCI class here to avoid NCCL WARN when the "class" attribute doesn't exist
-    const char* c;
-    NCCLCHECK(xmlGetAttrStr(*parent, "class", &c));
-    if (strcmp(c, PCI_BRIDGE_DEVICE_CLASS) == 0) {
+  if (path == PATH_PHB || path == PATH_PXB || path == PATH_PIX) {
+    INFO(NCCL_GRAPH, "Widening links");
+    NCCLCHECK(ncclTopoWidenLinks(physNetNodes, vProps->ndevs, *parent));
+  }
+
+  if (*parent) {
+    if (strcmp((*parent)->name, "pci") == 0) {
+      // Compare PCI class here to avoid NCCL WARN when the "class" attribute doesn't exist
+      const char* c;
+      NCCLCHECK(xmlGetAttrStr(*parent, "class", &c));
+      if (c && strcmp(c, PCI_BRIDGE_DEVICE_CLASS) == 0) {
+        // If the common parent is a PCI switch, we must reparent the new NIC under a made up pci device with a unique busid
+        NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
+      }
+    } else if (strcmp((*parent)->name, "cpu") == 0) {
       // If the common parent is a PCI switch, we must reparent the new NIC under a made up pci device with a unique busid
       NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
     }
   }
+
   TRACE(NCCL_GRAPH, "Selected parent %s with path %d", (*parent)->name, path);
   return ncclSuccess;
 }
 
-ncclResult_t ncclTopoMakeVNics(struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
+ncclResult_t ncclTopoMakeVNics(struct ncclXml* xml, struct ncclTopoNetInfo* netInfo, int physicalDevs) {
   int* placedDevs = NULL;
   struct ncclXmlNode** physNetNodes = NULL;
+  ncclNetProperties_t* props = NULL;
+  ncclResult_t res = ncclSuccess;
   if (physicalDevs == 0) return ncclSuccess;
 
-  ncclCalloc(&physNetNodes, physicalDevs);
-  ncclResult_t res = ncclSuccess;
-
-  ncclNetProperties_t* props = NULL;
-  ncclCalloc(&props, physicalDevs);
+  NCCLCHECK(ncclCalloc(&physNetNodes, physicalDevs));
+  NCCLCHECK(ncclCalloc(&placedDevs, physicalDevs));
+  NCCLCHECK(ncclCalloc(&props, physicalDevs));
   for (int i = 0; i < physicalDevs; i++) {
-    NCCLCHECKGOTO(getProperties(i, props + i), res, out);
+    NCCLCHECKGOTO(netInfo->getProperties(i, props + i), res, out);
     struct ncclXmlNode* physNetNode;
     NCCLCHECKGOTO(xmlFindTagKv(xml, "net", &physNetNode, "name", props[i].name), res, out);
     physNetNodes[i] = physNetNode;
     TRACE(NCCL_GRAPH, "Found physical ncclNet node %d %s", i,  props[i].name);
   }
 
-  // By default, don't merge any devices
-  int mergeLevel;
-  mergeLevel = PATH_PORT;
-  { // Avoids warnings related to jumping to "out"
-    const char* mergeLevelEnv = ncclGetEnv("NCCL_NET_MERGE_LEVEL");
-    if (mergeLevelEnv) kvConvertToInt(mergeLevelEnv, &mergeLevel, nicPathKvList);
-    char* forceMerge = (char*) ncclGetEnv("NCCL_NET_FORCE_MERGE");
-    NCCLCHECK(ncclCalloc(&placedDevs, physicalDevs));
-    memset(placedDevs, 0, sizeof(int)*physicalDevs);
-
-    if (forceMerge) {
-      NCCLCHECKGOTO(ncclTopoForceMerge(xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
-    }
-  }
-  NCCLCHECKGOTO(ncclTopoAutoMerge(xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
+  if (netInfo->forceMerge) NCCLCHECKGOTO(ncclTopoForceMerge(xml, netInfo, placedDevs, props, physNetNodes, physicalDevs), res, out);
+  NCCLCHECKGOTO(ncclTopoAutoMerge(xml, netInfo, placedDevs, props, physNetNodes, physicalDevs), res, out);
 
 out:
   free(physNetNodes);
@@ -1226,10 +1310,10 @@ out:
   return res;
 }
 
-static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int virtualNics, bool dmaBufSupport) {
+static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIndex, struct ncclTopoNetInfo* netInfo, int virtualNics) {
   for (int n = startIndex; n < endIndex; n++) {
     ncclNetProperties_t props;
-    NCCLCHECK(getProperties(n, &props));
+    NCCLCHECK(netInfo->getProperties(n, &props));
     struct ncclXmlNode* netNode = NULL;
     struct ncclXmlNode* parent = NULL;
     if (virtualNics) {
@@ -1237,7 +1321,7 @@ static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIn
       NCCLCHECK(xmlFindTagKv(xml, "net", &net, "name", props.name));
       // In the event of multithreaded use case, we need to re-discover the shared parent of the given devices for this vNIC
       // Only run this if the net doesn't exist locally - this may alter the XML state
-      if (net == NULL) NCCLCHECK(ncclTopoGetVNicParent(xml, getProperties, &props.vProps, &parent));
+      if (net == NULL) NCCLCHECK(ncclTopoGetVNicParent(xml, netInfo->getProperties, &props.vProps, &parent));
     }
 
     NCCLCHECK(ncclTopoFillNet(xml, props.pciPath, props.name, &netNode, parent));
@@ -1248,18 +1332,18 @@ static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIn
     NCCLCHECK(xmlSetAttrInt(netNode, "keep", 1));
     int dev;
     xmlGetAttrIntDefault(netNode, "dev", &dev, -1);
-    if (dev != -1 && dev != n) INFO(NCCL_GRAPH, "TOPO/NET : Changing %s dev index from %d to %d", netName, dev, n);
+    if (dev != -1 && dev != n) INFO(NCCL_GRAPH, "TOPO/NET : Changing %s dev index from %d to %d", netInfo->name, dev, n);
     NCCLCHECK(xmlSetAttrInt(netNode, "dev", n));
     NCCLCHECK(xmlInitAttrInt(netNode, "latency", props.latency));
     NCCLCHECK(xmlInitAttrInt(netNode, "speed", props.speed));
     NCCLCHECK(xmlInitAttrInt(netNode, "port", props.port));
     NCCLCHECK(xmlInitAttrUint64(netNode, "guid", props.guid));
     NCCLCHECK(xmlInitAttrInt(netNode, "maxconn", props.maxComms));
-    bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
-    INFO(NCCL_NET,"NET/%s : GPU Direct RDMA %s for HCA %d '%s'", netName, gdrSupport ? "Enabled" : "Disabled", n, props.name);
+    bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (netInfo->dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
+    INFO(NCCL_NET,"NET/%s : GPU Direct RDMA %s for HCA %d '%s'", netInfo->name, gdrSupport ? "Enabled" : "Disabled", n, props.name);
     NCCLCHECK(xmlInitAttrInt(netNode, "gdr", gdrSupport));
     // Only set coll if it's not 0
-    if (coll) NCCLCHECK(xmlInitAttrInt(netNode, "coll", coll));
+    if (netInfo->coll) NCCLCHECK(xmlInitAttrInt(netNode, "coll", netInfo->coll));
 
     const char* colAttr;
     NCCLCHECK(xmlGetAttr(netNode, "coll", &colAttr));
@@ -1272,51 +1356,45 @@ static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIn
 }
 
 // Calls to network plugin APIs should be protected. This function should be called inside a per-process lock.
-ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport) {
-  int usePhysicalDevices = (dumpXmlFile || makeVDevice == NULL);
-  if (state->nPhysicalNics == -1) NCCLCHECK(devices(&state->nPhysicalNics));
-  // Enumerate physical devices
-  NCCLCHECK(ncclTopoPopulateNics(xml, 0, state->nPhysicalNics, getProperties, netName, coll, false, dmaBufSupport));
+ncclResult_t ncclTopoProcessNet(ncclXml* xml, const char* dumpXmlFile, struct ncclTopoNetInfo* net) {
+  bool usePhysicalDevices = (dumpXmlFile || net->makeVDevice == NULL);
+  int nPhysicalNics, nVirtualNics;
+  NCCLCHECK(net->getDevCount(net->netPluginIndex, &nPhysicalNics, &nVirtualNics));
+  // List the physical devices in the topo
+  NCCLCHECK(ncclTopoPopulateNics(xml, 0, nPhysicalNics, net, /*virtual=*/false));
   if (!usePhysicalDevices) {
-    if (state->nVirtualNics == -1) {
-      NCCLCHECK(ncclTopoMakeVNics(xml, makeVDevice, getProperties, state->nPhysicalNics));
+    // Virtual devices are only created once per network
+    if (nVirtualNics == NCCL_UNDEF_DEV_COUNT) {
+      NCCLCHECK(ncclTopoMakeVNics(xml, net, nPhysicalNics));
+      // Update the number of virtual devices both locally and in the state tracking the plugin.
+      // Note: 0 is a valid number of virtual devices
       int nDevs;
-      NCCLCHECK(devices(&nDevs));
-      state->nVirtualNics = nDevs - state->nPhysicalNics;
+      NCCLCHECK(net->devices(&nDevs));
+      nVirtualNics = nDevs - nPhysicalNics;
+      NCCLCHECK(net->setVirtDevCount(net->netPluginIndex, nVirtualNics));
     }
-    if (state->nVirtualNics > 0) {
-      // Populate new devices
-      NCCLCHECK(ncclTopoPopulateNics(xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, true, dmaBufSupport));
+    // populate the virtual devices if any
+    if (nVirtualNics > 0) {
+      NCCLCHECK(ncclTopoPopulateNics(xml, nPhysicalNics, nPhysicalNics + nVirtualNics, net, /*virtual=*/true));
     }
   }
 
   return ncclSuccess;
 }
 
-static pthread_mutex_t netLock = PTHREAD_MUTEX_INITIALIZER;
-ncclTopoNetState netStates[NCCL_NET_MAX_PLUGINS] = {};
-ncclTopoNetState collNetStates[NCCL_NET_MAX_PLUGINS] = {};
-ncclResult_t ncclTopoGetSharedState(ncclTopoNetState** state, const char* name, ncclTopoNetState* states) {
-  INFO(NCCL_GRAPH, "Retrieving state for %s", name);
-  for (int i = 0; i < NCCL_NET_MAX_PLUGINS; i++) {
-    // Empty slot
-    if (states[i].name == NULL) {
-      states[i].nVirtualNics = -1;
-      states[i].nPhysicalNics = -1;
-      states[i].name = strdup(name);
-      *state = states + i;
-      INFO(NCCL_GRAPH, "Initialized state %d for %s", i, name);
-      return ncclSuccess;
-    // Found my slot
-    } else if (strcmp(states[i].name, name) == 0) {
-      *state = states + i;
-      return ncclSuccess;
-    }
+ncclResult_t ncclTopoGetFusionEnv(int* mergeLevel, const char** forceMerge) {
+  if (forceMerge) *forceMerge = ncclGetEnv("NCCL_NET_FORCE_MERGE");
+  const char* mergeLevelEnv = ncclGetEnv("NCCL_NET_MERGE_LEVEL");
+  if (mergeLevelEnv) {
+    kvConvertToInt(mergeLevelEnv, mergeLevel, nicPathKvList);
+  } else {
+    *mergeLevel = PATH_PORT;
   }
-  WARN("NET/TOPO : Couldn't find net with name %s", name);
-  return ncclInternalError;
+  return ncclSuccess;
 }
 
+static std::mutex netMutex;
+
 ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** system, const char* dumpXmlFile) {
   ncclResult_t ret = ncclSuccess;
   struct ncclXml* xml;
@@ -1324,7 +1402,7 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
   int* localRanks = NULL;
   struct ncclXml* rankXml;
   int localRank = -1, nLocalRanks = 0;
-  int netLockHeld = 0;
+  struct ncclTopoNetInfo netInfo = {0};
   NCCLCHECK(xmlAlloc(&xml, NCCL_TOPO_XML_MAX_NODES));
   const char* xmlTopoFile = ncclGetEnv("NCCL_TOPO_FILE");
   if (xmlTopoFile) {
@@ -1364,21 +1442,35 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
 
   // Auto-detect NICs if needed. net/collnet share the same xml/graph nodes,
   // so we start with collnet so that it has precedence.
-  pthread_mutex_lock(&netLock);
-  netLockHeld = 1;
-  INFO(NCCL_GRAPH, "TOPO/NET : Importing network plugins to topology");
-  ncclTopoNetState* state;
-  state = NULL;
-  if (collNetSupport(comm)) {
-    NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclCollNet->name, collNetStates), ret, fail);
-    NCCLCHECKGOTO(ncclTopoProcessNet(xml, 1, dumpXmlFile, state,
-      comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name, comm->dmaBufSupport), ret, fail);
+  {
+      std::lock_guard<std::mutex> lock(netMutex);
+      INFO(NCCL_GRAPH, "TOPO/NET : Importing network plugins to topology");
+      if (collNetSupport(comm)) {
+        netInfo.coll = 1;
+        netInfo.netPluginIndex = comm->netPluginIndex;
+        netInfo.dmaBufSupport = comm->dmaBufSupport;
+        netInfo.getDevCount = ncclCollNetGetDevCount;
+        netInfo.setVirtDevCount = ncclCollNetSetVirtDevCount;
+        netInfo.name = comm->ncclCollNet->name;
+        netInfo.getProperties = comm->ncclCollNet->getProperties;
+        netInfo.makeVDevice = comm->ncclCollNet->makeVDevice;
+        netInfo.devices = comm->ncclCollNet->devices;
+        NCCLCHECK(ncclTopoGetFusionEnv(&netInfo.mergeLevel, &netInfo.forceMerge));
+        NCCLCHECKGOTO(ncclTopoProcessNet(xml, dumpXmlFile, &netInfo), ret, fail);
+      }
+
+      netInfo.coll = 0;
+      netInfo.netPluginIndex = comm->netPluginIndex;
+      netInfo.dmaBufSupport = comm->dmaBufSupport;
+      netInfo.getDevCount = ncclNetGetDevCount;
+      netInfo.setVirtDevCount = ncclNetSetVirtDevCount;
+      netInfo.name = comm->ncclNet->name;
+      netInfo.getProperties = comm->ncclNet->getProperties;
+      netInfo.makeVDevice = comm->ncclNet->makeVDevice;
+      netInfo.devices = comm->ncclNet->devices;
+      NCCLCHECK(ncclTopoGetFusionEnv(&netInfo.mergeLevel, &netInfo.forceMerge));
+      NCCLCHECKGOTO(ncclTopoProcessNet(xml, dumpXmlFile, &netInfo), ret, fail);
   }
-  NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclNet->name, netStates), ret, fail);
-  NCCLCHECKGOTO(ncclTopoProcessNet(xml, 0, dumpXmlFile, state,
-    comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name, comm->dmaBufSupport), ret, fail);
-  pthread_mutex_unlock(&netLock);
-  netLockHeld = 0;
 
   // Remove XML branches which don't have a node with keep="1" (typically when importing a topology)
   NCCLCHECKGOTO(ncclTopoTrimXml(xml), ret, fail);
@@ -1436,7 +1528,6 @@ exit:
   free(xml);
   return ret;
 fail:
-  if (netLockHeld) pthread_mutex_unlock(&netLock);
   goto exit;
 }
 
@@ -1491,6 +1582,38 @@ ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *c
   return ncclSuccess;
 }
 
+enum netDevsPolicy {
+  NETDEVS_POLICY_AUTO = 0x0,
+  NETDEVS_POLICY_ALL = 0x1,
+  NETDEVS_POLICY_MAX = 0x2,
+  NETDEVS_POLICY_UNDEF = 0xffffffff
+};
+
+static enum netDevsPolicy netDevsPolicy = NETDEVS_POLICY_UNDEF;
+static int netDevsPolicyNum = -1;
+
+static void getNetDevsPolicyOnce() {
+  const char* envStr = ncclGetEnv("NCCL_NETDEVS_POLICY");
+  if (envStr) {
+    if (strcasecmp(envStr, "AUTO") == 0) {
+      netDevsPolicy = NETDEVS_POLICY_AUTO;
+    } else if (strcasecmp(envStr, "ALL") == 0) {
+      netDevsPolicy = NETDEVS_POLICY_ALL;
+    } else if (strncasecmp(envStr, "MAX:", strlen("MAX:")) == 0) {
+      int envNum = atoi(envStr + strlen("MAX:"));
+      if (envNum > 0) {
+        netDevsPolicy = NETDEVS_POLICY_MAX;
+        netDevsPolicyNum = envNum;
+      }
+    }
+    if (netDevsPolicy == NETDEVS_POLICY_UNDEF)
+      INFO(NCCL_ENV, "Unable to recognize NCCL_NETDEVS_POLICY=%s, using NCCL_NETDEVS_POLICY_AUTO instead.", envStr);
+    else
+      INFO(NCCL_ENV, "NCCL_NETDEVS_POLICY set by environment to %s", envStr);
+  }
+  if (netDevsPolicy == NETDEVS_POLICY_UNDEF) netDevsPolicy = NETDEVS_POLICY_AUTO;
+}
+
 ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int channelId, int64_t* id, int* dev) {
   int gpu;
   NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu, /*showWarn=*/true));
@@ -1503,13 +1626,30 @@ ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int ch
     return ncclInternalError;
   }
 
-  int localGpus[NCCL_TOPO_MAX_NODES];
-  int localGpuCount;
-  NCCLCHECK(ncclTopoGetLocal(system, NET, localNets[0], GPU, localGpus, &localGpuCount, NULL));
+  static pthread_once_t once = PTHREAD_ONCE_INIT;
+  pthread_once(&once,getNetDevsPolicyOnce);
+  int netsPerGpu = 0;
+  if (netDevsPolicy == NETDEVS_POLICY_AUTO) {
+    int localGpus[NCCL_TOPO_MAX_NODES];
+    int localGpuCount;
+    NCCLCHECK(ncclTopoGetLocal(system, NET, localNets[0], GPU, localGpus, &localGpuCount, NULL));
+    netsPerGpu = DIVUP(localNetCount, localGpuCount);
+  } else if (netDevsPolicy == NETDEVS_POLICY_ALL) {
+    netsPerGpu = localNetCount;
+  } else if (netDevsPolicy == NETDEVS_POLICY_MAX) {
+    if (netDevsPolicyNum <= 0) {
+      WARN("Invalid number of network devices = %d for policy MAX", netDevsPolicyNum);
+      return ncclInternalError;
+    }
+    netsPerGpu = std::min(netDevsPolicyNum, localNetCount);
+  } else {
+    WARN("Unknown netDevs policy");
+    return ncclInternalError;
+  }
 
   int net = system->nodes[GPU].nodes[gpu].gpu.dev;
   if (isPow2(localNetCount)) net = mirrorBits(net, localNetCount);
-  net += channelId%(DIVUP(localNetCount,localGpuCount));
+  net += channelId%(netsPerGpu);
   if (id) *id = system->nodes[NET].nodes[localNets[net%localNetCount]].id;
   if (dev) *dev = system->nodes[NET].nodes[localNets[net%localNetCount]].net.dev;
   return ncclSuccess;
@@ -1567,25 +1707,10 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
   cpu_set_t mask;
   SYSCHECK(sched_getaffinity(0, sizeof(cpu_set_t), &mask), "sched_getaffinity");
 
-#ifdef ENABLE_TRACE
-  {
-    char affinityStr[sizeof(cpu_set_t)*2];
-    TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev,
-          ncclCpusetToRangeStr(&mask, affinityStr, sizeof(affinityStr)));
-  }
-#endif
-
   // Get the affinity of the CPU close to our GPU.
   cpu_set_t cpuMask = cpu->cpu.affinity;
 
-#ifdef ENABLE_TRACE
-  {
-    char affinityStr[sizeof(cpu_set_t)*2];
-    TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev,
-          ncclCpusetToRangeStr(&cpuMask, affinityStr, sizeof(affinityStr)));
-  }
-#endif
-
+  // Get the final affinity
   cpu_set_t finalMask;
   if (ncclParamIgnoreCpuAffinity())
     // Ignore the CPU affinity set and use the GPU one instead
@@ -1596,12 +1721,22 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
 
   memcpy(affinity, &finalMask, sizeof(cpu_set_t));
 
-  // If there is a non empty set, use it to set affinity
+  // display the final affinity
+  char msg[1024] = "";
+  snprintf(msg + strlen(msg), sizeof(msg) - strlen(msg), "Affinity for GPU %d is ", gpu->gpu.dev);
   if (CPU_COUNT(&finalMask)) {
-    char affinityStr[sizeof(cpu_set_t)*2];
-    INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev,
-         ncclCpusetToRangeStr(&finalMask, affinityStr, sizeof(affinityStr)));
+    (void)ncclCpusetToRangeStr(&finalMask, msg + strlen(msg), sizeof(msg) - strlen(msg));
+  } else {
+    snprintf(msg + strlen(msg), sizeof(msg) - strlen(msg), "empty, ignoring");
   }
+  snprintf(msg + strlen(msg), sizeof(msg) - strlen(msg), ". (GPU affinity = ");
+  (void)ncclCpusetToRangeStr(&cpuMask, msg + strlen(msg), sizeof(msg) - strlen(msg));
+  if (!ncclParamIgnoreCpuAffinity()) {
+    snprintf(msg + strlen(msg), sizeof(msg) - strlen(msg), " ; CPU affinity = ");
+    (void)ncclCpusetToRangeStr(&mask, msg + strlen(msg), sizeof(msg) - strlen(msg));
+  }
+  snprintf(msg + strlen(msg), sizeof(msg) - strlen(msg), ").");
+  INFO(NCCL_INIT, "%s: %s", __func__, msg);
   return ncclSuccess;
 }
 
diff --git a/projects/rccl/src/graph/topo.h b/projects/rccl/src/graph/topo.h
index 9ef10ff2d3..49d408d95f 100644
--- a/projects/rccl/src/graph/topo.h
+++ b/projects/rccl/src/graph/topo.h
@@ -190,12 +190,26 @@ ncclResult_t ncclTopoGetGpuMinPath(struct ncclTopoSystem* system, int type, int*
 ncclResult_t ncclTopoGetGpuMaxPath(struct ncclTopoSystem* system, int type, int* max);
 ncclResult_t ncclTopoSplitNvLink(struct ncclTopoSystem* system, int* splitNvLink);
 
-struct ncclTopoNetState {
-  int nVirtualNics;
-  int nPhysicalNics;
+struct ncclTopoNetInfo {
+  bool coll;
+  // communicator-specific information
+  int netPluginIndex;
+  bool dmaBufSupport;
+  // NIC fusion
+  int mergeLevel;
+  const char* forceMerge;
+  // dev count tracking functions (not part of ncclNet)
+  ncclResult_t (*getDevCount)(int, int*, int*);
+  ncclResult_t (*setVirtDevCount)(int, int);
+  // ncclNet API functions
   const char* name;
+  ncclResult_t (*getProperties)(int, ncclNetProperties_t*);
+  ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*);
+  ncclResult_t (*devices)(int*);
 };
-ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport);
+
+ncclResult_t ncclTopoProcessNet(ncclXml* xml, const char* dumpXmlFile, struct ncclTopoNetInfo* net);
+ncclResult_t ncclTopoGetFusionEnv(int* mergeLevel, const char** forceMerge);
 
 #define NCCL_TOPO_XML_MAX_NODES 256
 #define NCCL_GRAPH_XML_MAX_NODES 4096
@@ -240,6 +254,8 @@ static ncclResult_t ncclTopoDevToRank(struct ncclTopoSystem* system, int dev, in
   return ncclInternalError;
 }
 
+extern struct kvDict nicPathKvList[];
+
 static ncclResult_t ncclTopoIdToNetDev(struct ncclTopoSystem* system, int64_t id, int* netDev) {
   *netDev = -1;
   for (int i=0; i<system->nodes[NET].count; i++) {
diff --git a/projects/rccl/src/graph/tuning.cc b/projects/rccl/src/graph/tuning.cc
index 8e99f18c36..bfb2798501 100644
--- a/projects/rccl/src/graph/tuning.cc
+++ b/projects/rccl/src/graph/tuning.cc
@@ -8,6 +8,7 @@
 #include "device.h"
 #include "comm.h"
 #include "topo.h"
+#include "nccl_tuner.h"
 
 NCCL_PARAM(Nthreads, "NTHREADS", -2);
 NCCL_PARAM(Ll128Nthreads, "LL128_NTHREADS", -2);
@@ -129,63 +130,72 @@ fail:
   goto exit;
 }
 
-// Latencies in us, Bandwidths in GB/s
-// Tree { LL, LL128, Simple } , Ring { LL, LL128, Simple }
-static const float baseLat  [NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
-       {  6.8, 14.0,  8.4 }, {  6.6, 14.0,  8.4 },  // Tree, Ring
-       {    0,    0,    0 }, {    0,    0,    0 },  // Collnet Direct, Chain
-       {    0,    0,    0 }, {    0,    0,    0 }}; // NVLS, NVLS Tree
+// NVLS efficiency factor.
+static const float nvlsEfficiency[NCCL_NUM_COMPCAPS] = {
+  0.0f, // Volta
+  0.0f, // Ampere
+  0.85f, // Hopper
+  0.74f, // Blackwell
+};
 
-// NVLink, PCI, Network
-#define NCCL_HW_NVLINK 0
-#define NCCL_HW_PCI 1
-#define NCCL_HW_NET 2
-static float hwLat [3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] =
-{ /* NVLINK */
-  { /* Tree (LL/LL128/Simple)*/ { .6, 1.25, 4.0 }, /* Ring (LL/LL128/Simple)*/ { .6, 1.9, 3.4 },
-    /* CollNetDirect (Simple)*/ { 0, 0, 3.7 }, /* CollNetChain (Simple)*/ { 0, 0, 2.8 },
-    /* NVLS */ { 0, 0, 25 }, /* NVLSTree */ { 0, 0, 25 } },
+// Default tuner constants
+static const ncclTunerConstants_t ncclTunerConstantsDefaults = {
+  .baseLatencies = {
+    {  6.8, 14.0,  8.4 }, {  6.6, 14.0,  8.4 },  // Tree, Ring
+    {    0,    0,    0 }, {    0,    0,    0 },  // Collnet Direct, Chain
+    {    0,    0,    0 }, {    0,    0,    0 },  // NVLS, NVLS Tree
+    {  8.0,  8.0,  8.0 }                         // PAT
+    },
+  .hwLatencies = {
+  /* NVLINK */
+  { { .6, 1.25, 4.0 }, { .6, 1.9, 3.4 }, /* Tree (LL/LL128/Simple), Ring (LL/LL128/Simple)*/
+    {  0,    0, 3.7 }, {  0,   0,  2.8 }, /* CollNetDirect (LL/LL128/Simple), CollNetChain (LL/LL128/Simple)*/
+    {  0,    0,  25 }, {  0,   0,  25 }, /* NVLS (LL/LL128/Simple), NVLSTree (LL/LL128/Simple)*/
+    {  0,    0, 4.0 } /* PAT (LL/LL128/Simple)*/
+    },
   /* PCI */
-  { /* Tree (LL/LL128/Simple)*/ { 1.0, 1.9, 4.0 }, /* Ring (LL/LL128/Simple)*/ { 1.0, 2.5, 5.7 },
-    /* CollNetDirect (Simple)*/ { 0, 0, 3.7 }, /* CollNetChain (Simple)*/ { 0, 0, 2.8 },
-    /* NVLS */ { 0, 0, 0 }, /* NVLSTree */ { 0, 0, 0 } },
+  { { 1.0, 1.9, 4.0 }, { 1.0, 2.5, 5.7 }, /* Tree (LL/LL128/Simple), Ring (LL/LL128/Simple)*/
+    {  0,    0, 3.7 }, {  0,   0,  2.8 }, /* CollNetDirect (LL/LL128/Simple), CollNetChain (LL/LL128/Simple)*/
+    {  0,    0,   0 }, {  0,   0,    0 }, /* NVLS (LL/LL128/Simple), NVLSTree (LL/LL128/Simple)*/
+    {  0,    0, 4.0 } /* PAT (LL/LL128/Simple)*/
+    },
   /* NET */
-  { /* Tree (LL/LL128/Simple)*/ { 5.0, 8.5, 14 }, /* Ring (LL/LL128/Simple)*/ { 2.7, 4.0, 14.0 },
-    /* CollNetDirect (Simple)*/ { 0, 0, 31 }, /* CollNetChain (Simple)*/ { 0, 0, 30 },
-    /* NVLS */ { 0, 0, 18 }, /* NVLSTree */ { 0, 0, 14 } }
-};
-
-/* Array indexes used below */
-#define VOLTA_COMPCAP_IDX 0
-#define AMPERE_COMPCAP_IDX 1
-#define HOPPER_COMPCAP_IDX 2
-#define BLACKWELL_COMPCAP_IDX 3
-
-// LL128 max BW per channel
-static const double llMaxBws[][3] = {
-  /* Volta-N1/Intel-N2/Intel-N4) */ {39.0, 39.0, 20.4},
-  /* Ampere-N1/AMD-N2/AMD-N4) */ {87.7, 22.5 /*avg of ring & tree*/, 19.0},
-  /* Hopper-N1/AMD-N2/AMD-N4) */ {141.0, 45.0 /*avg of ring & tree*/, 35.0},
-  /* Blackwell-N1/AMD-N2/AMD-N4) */ {2*141.0, 2*45.0 /*avg of ring & tree*/, 2*35.0},
-};
-
-static const double perChMaxRingLL128Bws[][3] = {
-  /* Volta (N1/N2/N4) */  {20.0, 20.0, 20.0},
-  /* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
-  /* Hopper (N1/N2/N4) */ {36.7, 36.7, 36.7},
-  /* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*36.7},
-};
-static const double perChMaxTreeLL128Bws[][3] = {
-  /* Volta (N1/N2/N4) */  {20.0, 20.0, 20.0},
-  /* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
-  /* Hopper (N1/N2/N4) */ {36.7, 36.7, 29.0},
-  /* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*29.0},
-};
-static const double perChMaxTreeBws[][3] = {
-  /* Volta (N1/N2/N4) */  {26.5, 18.5, 10.0},
-  /* Ampere (N1/N2/N4) */ {24.0, 23.6, 17.8},
-  /* Hopper (N1/N2/N4) */ {38.7, 41.4, 36.0},
-  /* Blackwell (N1/N2/N4) */ {2*38.7, 2*41.4, 2*36.0},
+  { { 5.0, 8.5, 14 }, { 2.7, 4.0, 14.0 }, /* Tree (LL/LL128/Simple), Ring (LL/LL128/Simple)*/
+    {   0,   0, 31 }, {   0,   0,   30 }, /* CollNetDirect (LL/LL128/Simple), CollNetChain (LL/LL128/Simple)*/
+    {   0,   0, 18 }, {   0,   0,   14 }, /* NVLS (LL/LL128/Simple), NVLSTree (LL/LL128/Simple)*/
+    {   0,   0, 14 } /* PAT (LL/LL128/Simple)*/
+    },
+  },
+  .llMaxBws = {
+     {39.0, 39.0, 20.4}, /* Volta-N1/Intel-N2/Intel-N4 */
+     {87.7, 22.5 /*avg of ring & tree*/, 19.0}, /* Ampere-N1/AMD-N2/AMD-N4 */
+     {141.0, 45.0 /*avg of ring & tree*/, 35.0}, /* Hopper-N1/AMD-N2/AMD-N4 */
+     {2*141.0, 2*45.0 /*avg of ring & tree*/, 2*35.0}, /* Blackwell-N1/AMD-N2/AMD-N4 */
+  },
+  .perChMaxRingLL128Bws = {
+    {20.0, 20.0, 20.0}, /* Volta (N1/N2/N4) */
+    {20.0, 20.0, 20.0}, /* Ampere (N1/N2/N4) */
+    {36.7, 36.7, 36.7}, /* Hopper (N1/N2/N4) */
+    {2*36.7, 2*36.7, 2*36.7}, /* Blackwell (N1/N2/N4) */
+  },
+  .perChMaxTreeLL128Bws = {
+    {20.0, 20.0, 20.0}, /* Volta (N1/N2/N4) */
+    {20.0, 20.0, 20.0}, /* Ampere (N1/N2/N4) */
+    {36.7, 36.7, 29.0}, /* Hopper (N1/N2/N4) */
+    {55.6, 31.67, 20.0}, /* Blackwell (N1/N2/N4) */
+  },
+  .perChMaxTreeBws = {
+    {26.5, 18.5, 10.0}, /* Volta (N1/N2/N4) */
+    {24.0, 23.6, 17.8}, /* Ampere (N1/N2/N4) */
+    {38.7, 41.4, 36.0}, /* Hopper (N1/N2/N4) */
+    {70.0, 42.8, 24.0}, /* Blackwell (N1/N2/N4) */
+  },
+  .perChMaxNVLSTreeBws = {
+    {26.5, 18.5, 10.0}, /* Volta (N1/N2/N4) */
+    {24.0, 23.6, 17.8}, /* Ampere (N1/N2/N4) */
+    {0.0, 57.7, 45.5}, /* Hopper (N1/N2/N4) */
+    {0.0, 96.0, 43.1} /* Blackwell (N1/N2/N4) */
+  }
 };
 
 NCCL_PARAM(PatEnable, "PAT_ENABLE", 2);
@@ -210,6 +220,13 @@ static float getNetOverhead(struct ncclComm* comm) {
 
 NCCL_PARAM(Ll128C2c, "LL128_C2C", 1);
 
+ncclResult_t ncclTopoInitTunerConstants(struct ncclComm* comm) {
+
+  comm->tunerConstants = ncclTunerConstantsDefaults;
+
+  return ncclSuccess;
+}
+
 ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCompCap, struct ncclTopoGraph** graphs) {
   int simpleDefaultThreads = (graphs[NCCL_ALGO_RING]->bwIntra*graphs[NCCL_ALGO_RING]->nChannels <= PCI_BW) ? 256 : NCCL_SIMPLE_MAX_NTHREADS;
   comm->maxThreads[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] =
@@ -229,17 +246,18 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
   int nRanks = comm->nRanks;
   if (nRanks <= 1) return ncclSuccess;
 
-  int compCapIndex = minCompCap >= 100 ? BLACKWELL_COMPCAP_IDX : (minCompCap >= 90 ? HOPPER_COMPCAP_IDX : minCompCap >= 80 ? AMPERE_COMPCAP_IDX : VOLTA_COMPCAP_IDX);
+  int compCapIndex = minCompCap >= 100 ? NCCL_BLACKWELL_COMPCAP_IDX : (minCompCap >= 90 ? NCCL_HOPPER_COMPCAP_IDX : minCompCap >= 80 ? NCCL_AMPERE_COMPCAP_IDX : NCCL_VOLTA_COMPCAP_IDX);
   int index2 = nNodes <= 2 ? nNodes-1 : 2;
   // LL: for single node, we look at GPU type; for multi-node, we look at CPU type
   int index1 = nNodes == 1 ? compCapIndex :
                (comm->cpuVendor == NCCL_TOPO_CPU_VENDOR_AMD || comm->cpuVendor == NCCL_TOPO_CPU_VENDOR_MIXED) ? 1 : 0;
-  double llMaxBw = llMaxBws[index1][index2];
-  double perChMaxTreeBw = perChMaxTreeBws[compCapIndex][index2];
-  double perChMaxRingLL128Bw = perChMaxRingLL128Bws[compCapIndex][index2];
-  double perChMaxTreeLL128Bw = perChMaxTreeLL128Bws[compCapIndex][index2];
+  double llMaxBw = comm->tunerConstants.llMaxBws[index1][index2];
+  double perChMaxTreeBw = comm->tunerConstants.perChMaxTreeBws[compCapIndex][index2];
+  double perChMaxRingLL128Bw = comm->tunerConstants.perChMaxRingLL128Bws[compCapIndex][index2];
+  double perChMaxTreeLL128Bw = comm->tunerConstants.perChMaxTreeLL128Bws[compCapIndex][index2];
+  double perChMaxNVLSTreeBw = comm->tunerConstants.perChMaxNVLSTreeBws[compCapIndex][index2];
   // De-penalize Tree/Simple latency on Power systems to favor Tree than Ring
-  if (comm->cpuArch == NCCL_TOPO_CPU_ARCH_POWER) hwLat[NCCL_HW_PCI][NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] = hwLat[NCCL_HW_PCI][NCCL_ALGO_RING][NCCL_PROTO_SIMPLE];
+  if (comm->cpuArch == NCCL_TOPO_CPU_ARCH_POWER) comm->tunerConstants.hwLatencies[NCCL_HW_PCI][NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] = comm->tunerConstants.hwLatencies[NCCL_HW_PCI][NCCL_ALGO_RING][NCCL_PROTO_SIMPLE];
   float ppn = (float)nRanks / nNodes;
 
   int intraHw[NCCL_NUM_ALGORITHMS], hw[NCCL_NUM_ALGORITHMS];
@@ -264,15 +282,22 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
             && a == NCCL_ALGO_PAT && (p != NCCL_PROTO_SIMPLE || ncclPatEnable(comm) == 0)) continue;
         int collnet = (a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) ? 1 : 0;
         float bw = nNodes <= 2 || collnet ? graphs[a]->bwIntra : graphs[a]->bwInter;
-        if (a == NCCL_ALGO_NVLS) {
+        if (a == NCCL_ALGO_NVLS_TREE || a == NCCL_ALGO_NVLS)
+        {
+          // NVLS/NVLStree needs at least 2 channels
+          if (graphs[a]->nChannels < 2 ) continue;
+          // Convert to NVLS busBW/channel
+          float intraBw = graphs[a]->bwIntra * nvlsEfficiency[compCapIndex] * (graphs[a]->nChannels - 1) / graphs[a]->nChannels;
+          // AllReduce pipelines two operations.
           if (coll == ncclFuncAllReduce) {
-            bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
+            intraBw *= 2.0f;
           } else {
-            // allgather and reducescatter
-            bw = std::min(graphs[a]->bwIntra * (ppn - 1.0f) / ppn, graphs[a]->bwInter * 0.9f);
+            intraBw *= (ppn - 1) / ppn;
           }
-        }
-        if (a == NCCL_ALGO_NVLS_TREE) bw = std::min(graphs[a]->bwIntra, nNodes <= 2 ? graphs[a]->bwInter : graphs[a]->bwInter/2);
+          // Handle 2 node case of NVLSTree
+          float interBw = graphs[a]->bwInter * ((nNodes <= 2 && a == NCCL_ALGO_NVLS_TREE) ? 2 : 1);
+          bw = std::min( {intraBw, interBw, a == NCCL_ALGO_NVLS_TREE ? (float)perChMaxNVLSTreeBw : std::numeric_limits<float>::max()} );
+        }
         float busBw = graphs[a]->nChannels * bw;
 
         // Various model refinements
@@ -320,27 +345,26 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
         // Convert bus BW to algorithm BW
         if (!(a != NCCL_ALGO_RING && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter))) {
           float ratio = 1.0f;
-          if (a == NCCL_ALGO_RING) ratio *= (1.0 * nRanks) / nsteps;
-          else if (a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) ratio *= 5.0/6.0;
+          if (a == NCCL_ALGO_RING || a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) ratio *= (1.0 * nRanks) / nsteps;
           else ratio *= .5;
           busBw *= ratio;
         }
         comm->bandwidths[coll][a][p] = busBw;
-        comm->latencies[coll][a][p] = baseLat[a][p];
-        float intraLat = hwLat[intraHw[a]][a][p];
+        comm->latencies[coll][a][p] = comm->tunerConstants.baseLatencies[a][p];
+        float intraLat = comm->tunerConstants.hwLatencies[intraHw[a]][a][p];
         // With ppn=1 latencies are fully exposed, use the Tree network latency
-        float interLat = ppn == 1 ? hwLat[NCCL_HW_NET][NCCL_ALGO_TREE][p] : hwLat[NCCL_HW_NET][a][p];
+        float interLat = ppn == 1 ? comm->tunerConstants.hwLatencies[NCCL_HW_NET][NCCL_ALGO_TREE][p] : comm->tunerConstants.hwLatencies[NCCL_HW_NET][a][p];
         interLat += graphs[a]->latencyInter;
         // Also add the flush extra latency
         if (p == NCCL_PROTO_SIMPLE) interLat += graphs[a]->latencyInter;
 
         if (a == NCCL_ALGO_RING) {
-          float lat = hwLat[hw[a]][a][p];
+          float lat = comm->tunerConstants.hwLatencies[hw[a]][a][p];
           if ((coll == ncclFuncReduce || coll == ncclFuncBroadcast)) {
             if (graphs[a]->sameChannels) {
               comm->latencies[coll][a][p] += lat;
             } else {
-              if (p == NCCL_PROTO_SIMPLE) lat = hwLat[hw[a]][NCCL_ALGO_TREE][p]; // Add some chunk latency, waiting for proper chunk modeling
+              if (p == NCCL_PROTO_SIMPLE) lat = comm->tunerConstants.hwLatencies[hw[a]][NCCL_ALGO_TREE][p]; // Add some chunk latency, waiting for proper chunk modeling
               comm->latencies[coll][a][p] += nsteps*lat;
             }
           } else {
@@ -371,8 +395,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
           comm->latencies[coll][a][p] += intraLat + 2 * log2i(nNodes) * interLat;
         } else if (a == NCCL_ALGO_PAT) {
           if (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) {
-            comm->latencies[coll][a][p] = 8 // Base time
-              + log2i(nNodes) * (interLat/3.5) // Log latency
+            comm->latencies[coll][a][p] += log2i(nNodes) * (interLat/3.5) // Log latency
               + nRanks * 2.8; // Still a linear part; hopefully we'll manage to remove it at some point.
           }
         }
diff --git a/projects/rccl/src/graph/xml.cc b/projects/rccl/src/graph/xml.cc
index 96b0c9a7c5..0101206277 100644
--- a/projects/rccl/src/graph/xml.cc
+++ b/projects/rccl/src/graph/xml.cc
@@ -917,31 +917,33 @@ ncclResult_t ncclTopoFillNet(struct ncclXml* xml, const char* pciPath, const cha
 
   if (*netNode != NULL) return ncclSuccess;
 
-  const char* pciSysPath = pciPath;
-  if (pciSysPath) {
-    char subSystem[PATH_MAX];
-    NCCLCHECK(ncclTopoGetSubsystem(pciSysPath, subSystem));
-    // This is not a PCI device (virtual, usb, ...).
-    if (strcmp(subSystem, "pci") != 0) {
-      INFO(NCCL_NET|NCCL_GRAPH, "Topology detection: network path %s is not a PCI device (%s). Attaching to first CPU", pciSysPath, subSystem);
-      pciSysPath = NULL;
-    }
-  }
-
   struct ncclXmlNode* parent = NULL;
   if (forceParent) {
     parent = forceParent;
-  } else if (pciSysPath) {
-    int offset;
-    for (offset=strlen(pciSysPath)-1; pciSysPath[offset] != '/'; offset--);
-    char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
-    strcpy(busId, pciSysPath+offset+1);
-    NCCLCHECK(ncclTopoGetPciNode(xml, busId, &parent));
-    NCCLCHECK(xmlSetAttrIfUnset(parent, "class", "0x02"));
-    NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
   } else {
-    // Virtual NIC, no PCI device, attach to first CPU
-    NCCLCHECK(xmlFindTag(xml, "cpu", &parent));
+    const char* pciSysPath = pciPath;
+    if (pciSysPath) {
+      char subSystem[PATH_MAX];
+      NCCLCHECK(ncclTopoGetSubsystem(pciSysPath, subSystem));
+      // This is not a PCI device (virtual, usb, ...).
+      if (strcmp(subSystem, "pci") != 0 && !forceParent) {
+        INFO(NCCL_NET | NCCL_GRAPH, "Topology detection: network path (name = %s) %s is not a PCI device (%s). Attaching to first CPU", netName, pciSysPath, subSystem);
+        pciSysPath = NULL;
+      }
+    }
+
+    if (pciSysPath) {
+      int offset;
+      for (offset = strlen(pciSysPath) - 1; pciSysPath[offset] != '/'; offset--);
+      char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
+      strcpy(busId, pciSysPath + offset + 1);
+      NCCLCHECK(ncclTopoGetPciNode(xml, busId, &parent));
+      NCCLCHECK(xmlSetAttrIfUnset(parent, "class", "0x02"));
+      NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
+    } else {
+      // Virtual NIC, no PCI device, attach to first CPU
+      NCCLCHECK(xmlFindTag(xml, "cpu", &parent));
+    }
   }
 
   struct ncclXmlNode* nicNode = NULL;
diff --git a/projects/rccl/src/graph/xml.h b/projects/rccl/src/graph/xml.h
index ad9f0faffe..ac9ef7286d 100644
--- a/projects/rccl/src/graph/xml.h
+++ b/projects/rccl/src/graph/xml.h
@@ -124,6 +124,13 @@ static ncclResult_t xmlGetAttrUint64(struct ncclXmlNode* node, const char* attrN
   return ncclSuccess;
 }
 
+static ncclResult_t xmlGetAttrUint64Default(struct ncclXmlNode* node, const char* attrName, uint64_t* value, uint64_t defaultValue) {
+  const char* str;
+  NCCLCHECK(xmlGetAttr(node, attrName, &str));
+  *value = str ? strtoull(str, NULL, 0) : defaultValue;
+  return ncclSuccess;
+}
+
 static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrName, int64_t* value) {
   const char* str;
   NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
diff --git a/projects/rccl/src/group.cc b/projects/rccl/src/group.cc
index 08ac54e9e3..aa2824412a 100644
--- a/projects/rccl/src/group.cc
+++ b/projects/rccl/src/group.cc
@@ -11,6 +11,9 @@
 #include "channel.h"
 #include <assert.h>
 #include "bootstrap.h"
+#include "ce_coll.h"
+#include "profiler.h"
+#include "nvtx.h"
 
 #define GROUP_MAX_RECLAIM_STEPS 10
 
@@ -90,7 +93,7 @@ ncclResult_t ncclAsyncJobComplete(struct ncclAsyncJob* job) {
 NCCL_API(ncclResult_t, ncclGroupStart);
 ncclResult_t ncclGroupStart() {
   ncclResult_t ret = ncclSuccess;
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
 
   NCCLCHECK(ncclGroupStartInternal());
   TRACE_CALL("ncclGroupStart()");
@@ -100,7 +103,7 @@ ncclResult_t ncclGroupStart() {
 NCCL_API(ncclResult_t, ncclGroupEnd);
 ncclResult_t ncclGroupEnd() {
   ncclResult_t ret = ncclSuccess;
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
   NCCLCHECKGOTO(ncclGroupEndInternal(), ret, exit);
   TRACE_CALL("ncclGroupEnd()");
 exit:
@@ -110,7 +113,7 @@ exit:
 NCCL_API(ncclResult_t, ncclGroupSimulateEnd, ncclSimInfo_t* simInfo);
 ncclResult_t ncclGroupSimulateEnd(ncclSimInfo_t* simInfo) {
   ncclResult_t ret = ncclSuccess;
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
   NCCLCHECKGOTO(ncclGroupEndInternal(simInfo), ret, exit);
   TRACE_CALL("ncclGroupSimulateEnd()");
 exit:
@@ -123,64 +126,87 @@ struct ncclPreconnectJob {
   bool* algoNeedConnect;
 };
 
+struct ncclPrepareTasksAndCollPreconnectJob {
+  struct ncclAsyncJob base;
+  struct ncclComm* comm;
+  ncclSimInfo_t* simInfo;
+};
+
 ncclResult_t ncclP2PPreconnectFunc(struct ncclAsyncJob* job_) {
   struct ncclPreconnectJob* job = (struct ncclPreconnectJob*)job_;
   struct ncclComm* comm = job->comm;
   CUDACHECK(cudaSetDevice(comm->cudaDev));
-  if (CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
+  if (!job_->isThreadMain && CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
   NCCLCHECK(ncclTransportP2pSetup(comm, NULL, 1));
   return ncclSuccess;
 }
 
+static ncclResult_t ncclCollPreconnect(struct ncclComm* comm, bool* algoNeedConnect) {
+  for (int i = 0; i < NCCL_NUM_ALGORITHMS; ++i) {
+    if (algoNeedConnect[i]) {
+      switch (i) {
+        case NCCL_ALGO_RING: {
+          NCCLCHECK(ncclTransportRingConnect(comm));
+          break;
+        }
+        case NCCL_ALGO_TREE: {
+          NCCLCHECK(ncclTransportTreeConnect(comm));
+          break;
+        }
+        case NCCL_ALGO_NVLS: {
+          /* If we are using NVLS_TREE algo, we must mark NVLS algo to set up
+           * NVLS intra-node buffer */
+          NCCLCHECK(ncclNvlsBufferSetup(comm));
+          break;
+        }
+        case NCCL_ALGO_NVLS_TREE: {
+          NCCLCHECK(ncclNvlsTreeConnect(comm));
+          break;
+        }
+        case NCCL_ALGO_COLLNET_CHAIN: {
+          NCCLCHECK(ncclCollNetChainBufferSetup(comm));
+          break;
+        }
+        case NCCL_ALGO_COLLNET_DIRECT: {
+          NCCLCHECK(ncclCollNetDirectBufferSetup(comm));
+          break;
+        }
+        case NCCL_ALGO_PAT: {
+          NCCLCHECK(ncclTransportPatConnect(comm));
+          break;
+        }
+        // Yes, this is dead code.  That's fine...
+        // coverity[dead_error_begin]
+        default: {
+          NCCLCHECK(ncclInternalError);
+        }
+      }
+    }
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclPrepareTasksAndCollPreconnectFunc(struct ncclAsyncJob* job_) {
+  struct ncclPrepareTasksAndCollPreconnectJob* job = (ncclPrepareTasksAndCollPreconnectJob*)job_;
+  struct ncclComm* comm = job->comm;
+  bool needConnect;
+  bool algoNeedConnect[NCCL_NUM_ALGORITHMS];
+  memset(algoNeedConnect, 0, sizeof(bool)*NCCL_NUM_ALGORITHMS);
+  CUDACHECK(cudaSetDevice(comm->cudaDev));
+  if (!job_->isThreadMain && CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
+  NCCLCHECK(ncclPrepareTasks(comm, algoNeedConnect, &needConnect, job->simInfo));
+  if (comm->cuMemSupport && needConnect) NCCLCHECK(ncclCollPreconnect(comm, algoNeedConnect));
+  return ncclSuccess;
+}
+
 ncclResult_t ncclCollPreconnectFunc(struct ncclAsyncJob* job_) {
   struct ncclPreconnectJob* job = (struct ncclPreconnectJob*)job_;
   struct ncclComm* comm = job->comm;
   ncclResult_t ret = ncclSuccess;
 
-  CUDACHECK(cudaSetDevice(comm->cudaDev));
-  if (CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
-  for (int i = 0; i < NCCL_NUM_ALGORITHMS; ++i) {
-    if (job->algoNeedConnect[i]) {
-      switch (i) {
-        case NCCL_ALGO_RING: {
-          NCCLCHECKGOTO(ncclTransportRingConnect(comm), ret, fail);
-          break;
-        }
-        case NCCL_ALGO_TREE: {
-          NCCLCHECKGOTO(ncclTransportTreeConnect(comm), ret, fail);
-          break;
-        }
-        case NCCL_ALGO_NVLS: {
-          /* If we are using NVLS_TREE algo, we must mark NVLS algo to set up
-           * NVLS intra-node buffer */
-          NCCLCHECKGOTO(ncclNvlsBufferSetup(comm), ret, fail);
-          break;
-        }
-        case NCCL_ALGO_NVLS_TREE: {
-          NCCLCHECKGOTO(ncclNvlsTreeConnect(comm), ret, fail);
-          break;
-        }
-        case NCCL_ALGO_COLLNET_CHAIN: {
-          NCCLCHECKGOTO(ncclCollNetChainBufferSetup(comm), ret, fail);
-          break;
-        }
-        case NCCL_ALGO_COLLNET_DIRECT: {
-          NCCLCHECKGOTO(ncclCollNetDirectBufferSetup(comm), ret, fail);
-          break;
-        }
-        case NCCL_ALGO_PAT: {
-          NCCLCHECKGOTO(ncclTransportPatConnect(comm), ret, fail);
-          break;
-        }
-        // Yes, it's a dead code.  That's fine...
-        // coverity[dead_error_begin]
-        default: {
-          ret = ncclInternalError;
-          goto fail;
-        }
-      }
-    }
-  }
+  if (!job_->isThreadMain) CUDACHECK(cudaSetDevice(comm->cudaDev));
+  if (!job_->isThreadMain && CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
+  NCCLCHECKGOTO(ncclCollPreconnect(comm, job->algoNeedConnect), ret, fail);
 
 exit:
   free(job->algoNeedConnect);
@@ -194,52 +220,33 @@ struct ncclGroupSymmetricJob {
   struct ncclComm* comm;
 };
 
-NCCL_PARAM(WinStride, "WIN_STRIDE", -1);
-
 ncclResult_t ncclCommGroupRegisterSymmetric(struct ncclAsyncJob* job_) {
   struct ncclGroupSymmetricJob* job = (struct ncclGroupSymmetricJob*)job_;
   struct ncclComm* comm = job->comm;
   ncclResult_t ret = ncclSuccess;
 
   CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
-  if (comm->baseStride == 0) {
-    cudaStream_t hostStream;
-    // first time to allocate symmetric VA space.
-    // calling into this function means symmetric is supported.
-    struct ncclSymDevBase* symBase = NULL;
-    size_t size = ncclSymDevBase::size(comm->localRanks);
-    if (ncclParamWinStride() != -1) {
-      comm->baseStride = ncclParamWinStride();
-    } else {
-      size_t maxStride = 0;
-      for (int r = 0; r < comm->nRanks; ++r)
-        if (comm->peerInfo[r].totalGlobalMem > maxStride) maxStride = comm->peerInfo[r].totalGlobalMem;
-      comm->baseStride = maxStride;
-    }
-    INFO(NCCL_INIT, "rank %d base stride %zuGB total VM %zuGB", comm->rank, comm->baseStride >> 30, (comm->baseStride * comm->localRanks) >> 30);
-    NCCLCHECKGOTO(ncclIpcSymmetricInit(comm), ret, fail);
-    NCCLCHECKGOTO(ncclNvlsSymmetricInit(comm), ret, fail);
-    comm->symAllocHead = 0;
 
-    // Allocate symmetric memory for NCCL internal usage
-    NCCLCHECKGOTO(ncclCommSymmetricAllocInternal(comm, size, alignof(struct ncclSymDevBase), (void**)&symBase), ret, fail);
-    assert((void*)symBase == (void*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride));
-    NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), ret, fail);
-    CUDACHECKGOTO(cudaMemsetAsync(symBase, 0, size, hostStream), ret, fail);
-    CUDACHECKGOTO(cudaStreamSynchronize(hostStream), ret, fail);
-    NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false), ret, fail);
-
-    comm->symDevComm.base = (struct ncclSymDevBase*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride);
-    comm->symDevComm.baseMc = (struct ncclSymDevBase*)comm->baseMCSymPtr;
-    comm->symDevComm.nRanks = comm->localRanks;
-    comm->symDevComm.nRanks_rcp32 = idivRcp32(comm->localRanks);
-    comm->symDevComm.rank = comm->localRank;
-    comm->symDevComm.stride4G = comm->baseStride >> 32;
+  while (!ncclIntruQueueEmpty(&comm->devrState.regTaskQueue)) {
+    struct ncclDevrRegTask* task = ncclIntruQueueDequeue(&comm->devrState.regTaskQueue);
+    NCCLCHECKGOTO(ncclDevrWindowRegisterInGroup(
+      comm, task->userPtr, task->userSize, task->winFlags, task->outWinDev),
+      ret, fail);
+    free(task);
   }
 
-  while (!ncclIntruQueueEmpty(&comm->symRegTaskQueue)) {
-    struct ncclSymRegTask* task = ncclIntruQueueDequeue(&comm->symRegTaskQueue);
-    NCCLCHECKGOTO(ncclCommSymmetricRegisterInternal(comm, task->buff, task->baseSize, task->alignment, task->memHandle, task->regHandle), ret, fail);
+  while (!ncclIntruQueueEmpty(&comm->devrState.commCreateTaskQueue)) {
+    struct ncclDevrCommCreateTask* task = ncclIntruQueueDequeue(&comm->devrState.commCreateTaskQueue);
+    NCCLCHECKGOTO(ncclDevrCommCreateInternal(
+      comm, (struct ncclDevCommRequirements const*)task->reqs, task->outDevComm),
+      ret, fail);
+    freeDevCommRequirements(task->reqs); // free additional task memory for reqs
+    free(task);
+  }
+
+  while (!ncclIntruQueueEmpty(&comm->ceInitTaskQueue)) {
+    struct ncclCeInitTask* task = ncclIntruQueueDequeue(&comm->ceInitTaskQueue);
+    NCCLCHECKGOTO(ncclCeInit(task->comm), ret, fail);
     free(task);
   }
 
@@ -296,7 +303,11 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
             comm->planner.unlaunchedPlansHead = plan->next;
             CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), result, failure);
             NCCLCHECKGOTO(ncclLaunchKernelBefore_NoUncapturedCuda(comm, plan), result, failure);
-            NCCLCHECKGOTO(ncclLaunchKernel(comm, plan), result, failure);
+            if (plan->isCeColl) {
+              NCCLCHECKGOTO(ncclLaunchCeColl(comm, plan), result, failure);
+            } else {
+              NCCLCHECKGOTO(ncclLaunchKernel(comm, plan), result, failure);
+            }
           }
           // Barrier reduction input indicates if we require further rounds.
           if (useBarrier) ncclCommIntraBarrierIn(comm, comm->planner.unlaunchedPlansHead != nullptr ? 1 : 0);
@@ -392,6 +403,12 @@ static ncclResult_t asyncJobLaunch(struct ncclIntruQueue<struct ncclAsyncJob, &n
 
   if (!ncclIntruQueueEmpty(asyncJobsMain)) {
     struct ncclAsyncJob* job = ncclIntruQueueHead(asyncJobsMain);
+    if (job->next == nullptr) {
+      job->isThreadMain = true;
+      ncclAsyncJobMain(job);
+      job->state = ncclGroupJobJoined;
+      return job->result;
+    }
     do {
       PTHREADCHECKGOTO(pthread_create(&job->thread, nullptr, ncclAsyncJobMain, job), "pthread_create", ret, fail);
       job = job->next;
@@ -444,6 +461,51 @@ fail:
   goto exit;
 }
 
+NCCL_PARAM(SingleProcMemRegEnable, "SINGLE_PROC_MEM_REG_ENABLE", 0);
+
+static ncclResult_t ncclPrepareTasksAndCollPreconnect(struct ncclComm* comm, ncclSimInfo_t* simInfo, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncCollJobs) {
+  if (ncclParamSingleProcMemRegEnable()) {
+    struct ncclPrepareTasksAndCollPreconnectJob* job;
+    NCCLCHECK(ncclCalloc(&job, 1));
+    job->base.func = ncclPrepareTasksAndCollPreconnectFunc;
+    job->base.undo = nullptr;
+    job->base.destructor = free;
+    job->base.state = ncclGroupJobRunning;
+    job->base.abortFlag = comm->abortFlag;
+    job->base.abortFlagDev = comm->abortFlagDev;
+    job->comm = comm;
+    job->simInfo = simInfo;
+    ncclIntruQueueEnqueue(asyncCollJobs, &job->base);
+  } else {
+    bool needConnect = false;
+    bool algoNeedConnect[NCCL_NUM_ALGORITHMS];
+    memset(algoNeedConnect, 0, sizeof(bool) * NCCL_NUM_ALGORITHMS);
+
+    CUDACHECK(cudaSetDevice(comm->cudaDev));
+    NCCLCHECK(ncclPrepareTasks(comm, algoNeedConnect, &needConnect, simInfo));
+
+    if (comm->cuMemSupport && needConnect) {
+      ncclResult_t ret;
+      struct ncclPreconnectJob* job;
+      NCCLCHECK(ncclCalloc(&job, 1));
+      job->base.func = ncclCollPreconnectFunc;
+      job->base.undo = nullptr;
+      job->base.destructor = free;
+      job->base.state = ncclGroupJobRunning;
+      job->base.abortFlag = comm->abortFlag;
+      job->base.abortFlagDev = comm->abortFlagDev;
+      job->comm = comm;
+      if ((ret = ncclCalloc(&job->algoNeedConnect, NCCL_NUM_ALGORITHMS))) {
+        free(job);
+        NCCLCHECK(ret);
+      }
+      memcpy(job->algoNeedConnect, algoNeedConnect, sizeof(bool) * NCCL_NUM_ALGORITHMS);
+      ncclIntruQueueEnqueue(asyncCollJobs, &job->base);
+    }
+  }
+  return ncclSuccess;
+}
+
 static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInfo = NULL) {
   ncclResult_t ret = ncclSuccess;
   struct ncclGroupJob *gjob = (struct ncclGroupJob*) job_;
@@ -518,27 +580,7 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
       // at the same time.
       comm = cliqueHead;
       do {
-        bool needConnect = false;
-        bool algoNeedConnect[NCCL_NUM_ALGORITHMS];
-        memset(algoNeedConnect, 0, sizeof(bool) * NCCL_NUM_ALGORITHMS);
-
-        CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
-        NCCLCHECKGOTO(ncclPrepareTasks(comm, algoNeedConnect, &needConnect, simInfo), ret, fail);
-
-        if (comm->cuMemSupport && needConnect) {
-          struct ncclPreconnectJob* job;
-          NCCLCHECKGOTO(ncclCalloc(&job, 1), ret, fail);
-          job->base.func = ncclCollPreconnectFunc;
-          job->base.undo = nullptr;
-          job->base.destructor = free;
-          job->base.state = ncclGroupJobRunning;
-          job->base.abortFlag = comm->abortFlag;
-          job->base.abortFlagDev = comm->abortFlagDev;
-          job->comm = comm;
-          NCCLCHECKGOTO(ncclCalloc(&job->algoNeedConnect, NCCL_NUM_ALGORITHMS), ret, fail);
-          memcpy(job->algoNeedConnect, algoNeedConnect, sizeof(bool) * NCCL_NUM_ALGORITHMS);
-          ncclIntruQueueEnqueue(&asyncCollJobs, &job->base);
-        }
+        NCCLCHECKGOTO(ncclPrepareTasksAndCollPreconnect(comm, simInfo, &asyncCollJobs), ret, fail);
         comm = comm->groupNext[ncclGroupTaskTypeCollective];
       } while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
       // connect
@@ -617,6 +659,13 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
     goto exit;
   }
 
+  if (ncclProfilerApiState.profilerGroupDepth > 0) {
+    ncclProfilerApiState.profilerGroupDepth--;
+  }
+  if (ncclProfilerApiState.profilerGroupDepth == 0) {
+    NCCLCHECK(ncclProfilerRecordGroupApiEventState(ncclProfilerGroupEndApiStart));
+  }
+
   if ((--ncclGroupDepth) > 0) goto exit;
 
   if ((ret = ncclGroupError) != ncclSuccess) goto fail;
@@ -701,6 +750,8 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
   groupLocalResetJobState();
 
 exit:
+  // Profiler group API start is called inside taskAppend to get graph capture information for the event
+  NCCLCHECK(ncclProfilerStopGroupApiEvent());
   return ret;
 fail:
   if (groupJob) {
diff --git a/projects/rccl/src/include/allocator.h b/projects/rccl/src/include/allocator.h
index 189c3d4e25..05da29a62a 100644
--- a/projects/rccl/src/include/allocator.h
+++ b/projects/rccl/src/include/allocator.h
@@ -7,7 +7,55 @@
 #ifndef NCCL_ALLOCATOR_H_
 #define NCCL_ALLOCATOR_H_
 
-ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr);
-ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr);
+////////////////////////////////////////////////////////////////////////////////
+// ncclSpace: Allocates contiguous segments of non-negative integers. Useful
+// as a memory allocator when we can't put allocator state within the memory
+// being allocated.
+
+struct ncclSpace {
+  int count;
+  int capacity;
+  int64_t* cuts;
+};
+
+void ncclSpaceConstruct(struct ncclSpace* a);
+void ncclSpaceDestruct(struct ncclSpace* a);
+ncclResult_t ncclSpaceAlloc(struct ncclSpace* a, int64_t spaceLimit, int64_t objSize, int objAlign, int64_t* outObjOffset);
+ncclResult_t ncclSpaceFree(struct ncclSpace* a, int64_t objOffset, int64_t objSize);
+
+
+////////////////////////////////////////////////////////////////////////////////
+// ncclShadowPool: Allocates device-side objects, their host-side shadows, and
+// maintains the device->host object address mapping.
+
+struct ncclShadowObject;
+struct ncclShadowPage;
+struct ncclShadowPool {
+  int count, hbits;
+  struct ncclShadowObject** table;
+  cudaMemPool_t memPool;
+  struct ncclShadowPage* pages;
+};
+
+void ncclShadowPoolConstruct(struct ncclShadowPool*);
+ncclResult_t ncclShadowPoolDestruct(struct ncclShadowPool*);
+ncclResult_t ncclShadowPoolAlloc(struct ncclShadowPool*, size_t size, void** outDevObj, void** outHostObj, cudaStream_t stream);
+ncclResult_t ncclShadowPoolFree(struct ncclShadowPool*, void* devObj, cudaStream_t stream);
+ncclResult_t ncclShadowPoolToHost(struct ncclShadowPool*, void* devObj, void** outHostObj);
+
+template<typename T>
+static inline ncclResult_t ncclShadowPoolAlloc(struct ncclShadowPool* pool, T** outDevObj, T** outHostObj, cudaStream_t stream) {
+  void* devObj;
+  void* hostObj;
+  ncclResult_t got = ncclShadowPoolAlloc(pool, sizeof(T), &devObj, &hostObj, stream);
+  if (outDevObj) *outDevObj = (T*)devObj;
+  if (outHostObj) *outHostObj = (T*)hostObj;
+  return got;
+}
+
+template<typename T>
+static inline ncclResult_t ncclShadowPoolToHost(struct ncclShadowPool* pool, T* devObj, T** hostObj) {
+  return ncclShadowPoolToHost(pool, (void*)devObj, (void**)hostObj);
+}
 
 #endif
diff --git a/projects/rccl/src/include/bitops.h b/projects/rccl/src/include/bitops.h
index 71053ed493..badc91b500 100644
--- a/projects/rccl/src/include/bitops.h
+++ b/projects/rccl/src/include/bitops.h
@@ -41,6 +41,9 @@ constexpr static __host__ __device__ Int maxval(Int a, Int b, More ...more) {
   #endif
 }
 
+#define BIT(x) (1UL << (x))
+#define MASK(x) ((1UL << (x)) - 1UL)
+
 #define DIVUP(x, y) \
     (((x)+(y)-1)/(y))
 
@@ -68,14 +71,26 @@ static __host__ __device__ constexpr Z roundDown(X x, Y y) {
 }
 
 // assumes second argument is a power of 2
-template<typename X, typename Z = decltype(X()+int())>
-static __host__ __device__ constexpr Z alignUp(X x, int a) {
-  return (x + a-1) & Z(-a);
+template<typename X, typename Y, typename Z = decltype(X()+Y())>
+static __host__ __device__ constexpr Z alignUp(X x, Y a) {
+  return (x + a-1) & -Z(a);
 }
+template<typename T>
+static __host__ __device__ T* alignUp(T* x, size_t a) {
+  static_assert(sizeof(T) == 1, "Only single byte types allowed.");
+  return reinterpret_cast<T*>((reinterpret_cast<uintptr_t>(x) + a-1) & -uintptr_t(a));
+}
+
 // assumes second argument is a power of 2
-template<typename X, typename Z = decltype(X()+int())>
-static __host__ __device__ constexpr Z alignDown(X x, int a) {
-  return x & Z(-a);
+template<typename X, typename Y, typename Z = decltype(X()+Y())>
+static __host__ __device__ constexpr Z alignDown(X x, Y a) {
+  return x & -Z(a);
+}
+
+template<typename T>
+static __host__ __device__ T* alignDown(T* x, size_t a) {
+  static_assert(sizeof(T) == 1, "Only single byte types allowed.");
+  return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(x) & -uintptr_t(a));
 }
 
 template<typename Int>
@@ -341,7 +356,7 @@ static __host__ UInt reverseSubBits(UInt x) {
     default: static_assert(8*sizeof(UInt) <= 64, "Unsupported integer type.");
     }
     return reverseSubBits<UInt, 8>(x);
-  } else if (nSubBits == 1) {
+  } else if (nSubBits <= 1) {
     return x;
   } else {
     UInt m = UInt(-1)/((UInt(1)<<(nSubBits/2))+1);
diff --git a/projects/rccl/src/include/ce_coll.h b/projects/rccl/src/include/ce_coll.h
new file mode 100644
index 0000000000..e47effb8c1
--- /dev/null
+++ b/projects/rccl/src/include/ce_coll.h
@@ -0,0 +1,76 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef NCCL_CE_COLL_H_
+#define NCCL_CE_COLL_H_
+
+#include "nccl.h"
+#include "nccl_common.h"
+#include "bitops.h"
+
+// Memory operations per rank for different synchronization protocols
+#define NCCL_CE_SYNC_OPS_PER_RANK_MC 2
+#define NCCL_CE_SYNC_OPS_PER_RANK_UC 3
+
+struct ncclCeColl {
+  uint8_t* baseUCSymReadyPtr;
+  uint8_t* baseUCSymComplPtr;
+  size_t baseUCSymReadyOffset;
+  size_t baseUCSymComplOffset;
+  uint32_t ceSeqNum;
+  bool useCompletePtr;
+  uint32_t intraBatchSyncFreq;
+  uint64_t intraBatchSyncMsgThreshold;
+  struct ncclDevrWindow* ceSyncWin;
+};
+
+struct ncclCeInitTask {
+  struct ncclCeInitTask *next;
+  struct ncclComm* comm;
+};
+
+struct alignas(16) ncclCeCollArgs {
+  ncclFunc_t func;
+  int rootRank;
+  size_t nElts;
+  size_t eltSize;
+  uint8_t* sendBuff;
+  uint8_t* recvBuff;
+  struct ncclDevrWindow* sendWin;
+  struct ncclDevrWindow* recvWin;
+};
+
+struct ncclCeBatchOpsParams {
+  void** dsts;
+  void** srcs;
+  size_t* sizes;
+  size_t numOps;
+  bool intraBatchSync;
+#if CUDART_VERSION >= 12080
+  cudaMemcpyAttributes* attrs;
+  size_t* attrIdxs;
+  size_t numAttrs;
+#endif
+};
+
+bool ncclCeImplemented(ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
+
+ncclResult_t ncclCeInit(struct ncclComm* comm);
+
+ncclResult_t ncclCeFinalize(struct ncclComm* comm);
+
+ncclResult_t ncclMemOpSync(struct ncclComm* comm, cudaStream_t stream);
+
+ncclResult_t ncclLaunchCeColl(struct ncclComm* comm, struct ncclKernelPlan* plan);
+
+ncclResult_t ncclCeAllGather(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream);
+
+ncclResult_t ncclCeScatter(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream);
+
+ncclResult_t ncclCeGather(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream);
+
+ncclResult_t ncclCeAlltoAll(struct ncclComm* comm, struct ncclCeCollArgs* args, cudaStream_t stream);
+#endif /* NCCL_CE_COLL_H_ */
diff --git a/projects/rccl/src/include/channel.h b/projects/rccl/src/include/channel.h
index ee9aa6d0b8..bd34f54c13 100644
--- a/projects/rccl/src/include/channel.h
+++ b/projects/rccl/src/include/channel.h
@@ -17,15 +17,16 @@ ncclResult_t initCollnetChannel(struct ncclComm* comm, int channelId, struct ncc
 ncclResult_t freeChannel(struct ncclChannel* channel, int nRanks, int collnetNRanks, int nvlsNRanks);
 
 inline uint8_t ncclP2pChannelBaseForRound(struct ncclComm* comm, int p2pRound) {
+  int base;
   if (comm->nNodes > 1) {
     int nodeDelta = p2pRound/comm->maxLocalRanks;
     int localDelta = p2pRound%comm->maxLocalRanks;
-    int base = nodeDelta*divUp(comm->maxLocalRanks, NCCL_MAX_DEV_WORK_P2P_PER_BATCH);
+    base = nodeDelta*divUp(comm->maxLocalRanks, NCCL_MAX_DEV_WORK_P2P_PER_BATCH);
     base += localDelta/NCCL_MAX_DEV_WORK_P2P_PER_BATCH;
-    return base & 0xff;
   } else {
-    return p2pRound & 0xff;
+    base = p2pRound;
   }
+  return reverseBits(base, log2Up(comm->p2pnChannels));
 }
 
 #endif
diff --git a/projects/rccl/src/include/coll_net.h b/projects/rccl/src/include/coll_net.h
index affbf0a24a..574fd95eb6 100644
--- a/projects/rccl/src/include/coll_net.h
+++ b/projects/rccl/src/include/coll_net.h
@@ -16,7 +16,7 @@ typedef char collNetHandle_t[NCCL_NET_HANDLE_MAXSIZE];
 static const char* collNetName(struct ncclComm* comm) { return comm->ncclCollNet->name; }
 static ncclResult_t collNetDevices(struct ncclComm* comm, int* ndev) { NCCLCHECK(comm->ncclCollNet->devices(ndev)); return ncclSuccess; }
 static ncclResult_t collNetGetProperties(struct ncclComm* comm, int dev, ncclNetProperties_t* props) { NCCLCHECK(comm->ncclCollNet->getProperties(dev, props)); return ncclSuccess; }
-static ncclResult_t collNetListen(struct ncclComm* comm, int dev, void* handle, void** listenComm) { NCCLCHECK(comm->ncclCollNet->listen(dev, handle, listenComm)); return ncclSuccess; }
+static ncclResult_t collNetListen(struct ncclComm* comm, int dev, void* handle, void** listenComm) { NCCLCHECK(comm->ncclCollNet->listen(comm->collNetContext, dev, handle, listenComm)); return ncclSuccess; }
 static ncclResult_t collNetConnect(struct ncclComm* comm, void* handles[], int nranks, int rank, void* listenComm, void** collComm) { NCCLCHECK(comm->ncclCollNet->connect(handles, nranks, rank, listenComm, collComm)); return ncclSuccess; }
 static ncclResult_t collNetReduceSupport(struct ncclComm* comm, ncclDataType_t dataType, ncclRedOp_t redOp, int* supported) { NCCLCHECK(comm->ncclCollNet->reduceSupport(dataType, redOp, supported)); return ncclSuccess; }
 static ncclResult_t collNetRegMr(struct ncclComm* comm, void* collComm, void* data, size_t size, int type, void** mhandle) { NCCLCHECK(comm->ncclCollNet->regMr(collComm, data, size, type, mhandle)); return ncclSuccess; }
@@ -29,6 +29,7 @@ static ncclResult_t collNetIflush(struct ncclComm* comm, void* collComm, void* d
 static ncclResult_t collNetTest(struct ncclComm* comm, void* request, int* done, int* size) { NCCLCHECK(comm->ncclCollNet->test(request, done, size)); return ncclSuccess; }
 static ncclResult_t collNetCloseColl(struct ncclComm* comm, void* collComm) { NCCLCHECK(comm->ncclCollNet->closeColl(collComm)); return ncclSuccess; }
 static ncclResult_t collNetCloseListen(struct ncclComm* comm, void* listenComm) { NCCLCHECK(comm->ncclCollNet->closeListen(listenComm)); return ncclSuccess; }
+static ncclResult_t collNetFinalize(struct ncclComm* comm, void* ctx) { NCCLCHECK(comm->ncclCollNet->finalize(ctx)); return ncclSuccess; }
 
 static int collNetSupport(struct ncclComm* comm) { return comm->ncclCollNet != nullptr ? 1 : 0; }
 
diff --git a/projects/rccl/src/include/collectives.h b/projects/rccl/src/include/collectives.h
index c68b0418cf..038eb8dd16 100644
--- a/projects/rccl/src/include/collectives.h
+++ b/projects/rccl/src/include/collectives.h
@@ -8,7 +8,7 @@
 #define NCCL_COLLECTIVES_H_
 
 #include "nccl.h"
-#include "nccl_common.h"
+#include "nccl_tuner.h"
 #include "device.h"
 
 #define NCCL_MAX_NET_SIZE (1024*1024*1024L) // Rather than send INT_MAX which is 2G-1, send a power of two.
@@ -18,10 +18,16 @@
 #define ALLREDUCE_CHUNKSTEPS (NCCL_STEPS/2)
 #define ALLGATHER_SLICESTEPS (NCCL_STEPS/4)
 #define ALLGATHER_CHUNKSTEPS (NCCL_STEPS/2)
+#define ALLTOALL_SLICESTEPS 1
+#define ALLTOALL_CHUNKSTEPS 1
 #define REDUCESCATTER_SLICESTEPS (NCCL_STEPS/4)
 #define REDUCESCATTER_CHUNKSTEPS (NCCL_STEPS/2)
 #define BROADCAST_SLICESTEPS 1
 #define BROADCAST_CHUNKSTEPS 1
+#define GATHER_SLICESTEPS 1
+#define GATHER_CHUNKSTEPS 1
+#define SCATTER_SLICESTEPS 1
+#define SCATTER_CHUNKSTEPS 1
 #define REDUCE_SLICESTEPS 1
 #define REDUCE_CHUNKSTEPS 1
 #define NCCL_MAX_SLICE_PER_CHUNK 2  // max value for CHUNKSTEPS/SLICESTEPS, must accord with above
diff --git a/projects/rccl/src/include/comm.h b/projects/rccl/src/include/comm.h
index 1378e0765e..22faf36829 100644
--- a/projects/rccl/src/include/comm.h
+++ b/projects/rccl/src/include/comm.h
@@ -18,6 +18,9 @@
 #include "graph.h"
 #include "profiler.h"
 #include "allocator.h"
+#include "dev_runtime.h"
+#include "sym_kernels.h"
+#include "ce_coll.h"
 
 #if CUDART_VERSION < 9000
 struct cudaLaunchParams {
@@ -198,12 +201,14 @@ struct ncclTaskColl {
   int32_t nMaxChannels:8;
   int32_t nWarps:8;
   int32_t algorithm:8, protocol:8;
-  uint32_t isCollnet:1, isNvls:1;
-  uint32_t devFuncId:30;
+  uint32_t isCollnet:1, isNvls:1, isSymLast:1;
+  uint32_t devFuncId:29;
   int regBufType;
   // number of elements in planner->ipcMemQueue associated with this collective
   int nCleanupQueueElts;
 
+  struct ncclDevrWindow* sendWin;
+  struct ncclDevrWindow* recvWin;
   void* sendMhandle;
   void* recvMhandle;
   void** sendNetHandles;
@@ -217,12 +222,16 @@ struct ncclTaskColl {
 
   // Profiler plugin
   int eActivationMask;
+  void* groupApiEventHandle;
+  void* collApiEventHandle;
   void* eventHandle;
   uint8_t nChannels;
 };
+
 struct ncclTaskP2p {
   struct ncclTaskP2p* next;
   ncclFunc_t func;
+  ncclFunc_t collAPI;
   void* buff;
   size_t count;
   ncclDataType_t datatype;
@@ -231,6 +240,8 @@ struct ncclTaskP2p {
 
   // Profiler plugin
   int eActivationMask;
+  void* groupApiEventHandle;
+  void* p2pApiEventHandle;
   void* eventHandle;
   uint8_t nChannels;
 };
@@ -246,12 +257,14 @@ struct ncclKernelPlan {
   bool persistent; // aka captured in a graph
   bool isHostCbEnq;
   bool isSymColl;
+  bool isCeColl;
   enum ncclDevWorkStorageType workStorageType;
   bool kernelSpecialized;
   void* kernelFn;
   union {
     struct ncclDevKernelArgs* kernelArgs;
-    struct ncclSymDevArgs* kernelSymArgs;
+    void* kernelSymArgs;
+    struct ncclCeCollArgs* ceCollArgs;
   };
   size_t kernelArgsSize;
   uint64_t channelMask; // bitset of which channels are present
@@ -270,6 +283,8 @@ struct ncclKernelPlan {
   struct ncclIntruQueue<struct ncclProxyOp, &ncclProxyOp::enqNext> proxyOpQueue;
 
   // Profiler plugin
+  void* groupApiEventHandle;
+  void* kernelLaunchEventHandle;
   void* groupEventHandle;
 };
 
@@ -360,9 +375,8 @@ struct ncclKernelPlanner {
   struct ncclTaskCollSorter collSorter;
   struct Peer* peers/*[nRanks]*/;
   int nTasksColl, nTasksP2p;
+  int nTasksP2pSend, nTasksP2pRecv;
   bool persistent;
-  bool isSymColl;
-
   // The list of user streams aggregated over all tasks present.
   struct ncclCudaStreamList* streams;
   // The most recent user stream. Ignored if streams==nullptr
@@ -378,6 +392,8 @@ struct ncclKernelPlanner {
   //////////////////////////////////////////////////////////////////////////////
 
   struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next> collTaskQueue;
+  struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next> collCeTaskQueue;
+  struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next> collSymTaskQueue;
   struct ncclIntruQueue<struct ncclWorkList, &ncclWorkList::next> collWorkQueue;
   struct ncclIntruQueue<struct ncclWorkList, &ncclWorkList::next> tmpCollWorkQueue;
   struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next> collCleanupQueue;
@@ -417,6 +433,8 @@ typedef enum ncclGroupTaskType {
   ncclGroupTaskTypeNum = 2,
 } ncclGroupTaskType_t;
 
+struct ncclCommSymTeams;
+
 struct ncclComm {
   uint64_t startMagic;
   struct ncclMemoryStack memPermanent, memScoped;
@@ -436,10 +454,12 @@ struct ncclComm {
   bool peerInfoValid;
 
   ncclNet_t* ncclNet;
+  void* netContext;
   int netPluginIndex;
   int ncclNetVer;
   ncclNetDeviceType netDeviceType;
   ncclCollNet_t* ncclCollNet;
+  void* collNetContext;
   void* bootstrap;
   // Bitmasks for ncclTransportP2pSetup
   uint64_t* connectSend;
@@ -472,6 +492,7 @@ struct ncclComm {
   int localRank;
   int localRanks;
   int maxLocalRanks;
+  int minLocalRanks;
   int* rankToNode;
   int* rankToLocalRank;
   int* localRankToRank;
@@ -482,6 +503,9 @@ struct ncclComm {
   struct cliqueInfo clique; // Our MNNVL clique information
   int cliqueRank; // Our rank within the MNNVL clique
 
+  // NVL Domain info
+  ncclNvlDomainInfo_v5_t nvlDomainInfo;
+
   bool checkPointers;
   bool dmaBufSupport;
 
@@ -508,7 +532,8 @@ struct ncclComm {
   int p2pChunkSize;
   int nvlsChunkSize;
 
-  // Algorithm/Protocols thresholds
+  // Tuner values
+  ncclTunerConstants_t tunerConstants;
   ssize_t threadThresholds[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float latencies[NCCL_NUM_FUNCTIONS][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
   float bandwidths[NCCL_NUM_FUNCTIONS][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
@@ -527,8 +552,7 @@ struct ncclComm {
   uint32_t destroyFlag;
 
   // Device side of the communicator (for cudaFree's)
-  struct ncclDevComm* devComm; // actually = &ncclDevCommAndChannels::comm
-  struct ncclSymDevComm symDevComm;
+  struct ncclKernelComm* devComm; // actually = &ncclKernelCommAndChannels::comm
 
   uint32_t workArgsBytes; // max size of kernel args
   uint32_t workFifoBytes; // size of workFifoBuf, power of 2
@@ -624,6 +648,10 @@ struct ncclComm {
   uint64_t seqNumber[NCCL_NUM_FUNCTIONS];
   struct ncclProfilerProxy profiler;
 
+  // CE Collective
+  struct ncclCeColl ceColl;
+  struct ncclIntruQueue<struct ncclCeInitTask, &ncclCeInitTask::next> ceInitTaskQueue;
+  
   // buffer registration cache
   struct ncclRegCache regCache;
   int isAllNvlink;
@@ -632,13 +660,10 @@ struct ncclComm {
   bool useNetPXN;
   bool useGdr;
   int splitCount;
-  // symmetric buffer
-  uint8_t* baseUCSymPtr;
-  uint8_t* baseMCSymPtr;
-  size_t baseStride;
-  size_t symAllocHead;
-  CUmemGenericAllocationHandle symMCHandle;
-  struct ncclIntruQueue<struct ncclSymRegTask, &ncclSymRegTask::next> symRegTaskQueue;
+
+  struct ncclDevrState devrState; // The symmetric runtime state
+  struct ncclSymkState symkState; // The symmetric kernels state (built on previous)
+
   uint64_t endMagic;
 };
 
diff --git a/projects/rccl/src/include/core.h b/projects/rccl/src/include/core.h
index a1754beeb1..2ce1d8e780 100644
--- a/projects/rccl/src/include/core.h
+++ b/projects/rccl/src/include/core.h
@@ -16,6 +16,7 @@
 
 #ifdef PROFAPI
 #define NCCL_API(ret, func, args...)        \
+    extern "C"                              \
     __attribute__ ((visibility("default"))) \
     __attribute__ ((alias(#func)))          \
     ret p##func (args);                     \
diff --git a/projects/rccl/src/include/cpuset.h b/projects/rccl/src/include/cpuset.h
index 99e3edf4d4..df936a31e1 100644
--- a/projects/rccl/src/include/cpuset.h
+++ b/projects/rccl/src/include/cpuset.h
@@ -1,5 +1,5 @@
 /*************************************************************************
- * Copyright (c) 2018-2020, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2018-2025, NVIDIA CORPORATION. All rights reserved.
  *
  * See LICENSE.txt for license information
  ************************************************************************/
@@ -7,54 +7,38 @@
 #ifndef NCCL_CPUSET_H_
 #define NCCL_CPUSET_H_
 
-// Convert local_cpus, e.g. 0003ff,f0003fff to cpu_set_t
+#include "nccl.h"
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <sched.h>
 
-static int hexToInt(char c) {
-  int v = c - '0';
-  if (v < 0) return -1;
-  if (v > 9) v = 10 + c - 'a';
-  if ((v < 0) || (v > 15)) return -1;
-  return v;
-}
+// Convert local_cpus, e.g. 0003ff,f0003fff to cpu_set_t.
+// The bitmask is divided into 32-bit chunks, each represented by up to 8 hex digits.
+#define U32_LEN 32 // using uint32_t
+#define CPU_SET_N_U32 (CPU_SETSIZE / U32_LEN)
 
-#define CPU_SET_N_U32 (sizeof(cpu_set_t)/sizeof(uint32_t))
+static ncclResult_t ncclStrToCpuset(const char* maskStr, cpu_set_t* set) {
+  uint32_t cpumasks[CPU_SET_N_U32] = {0};
 
-static ncclResult_t ncclStrToCpuset(const char* str, cpu_set_t* mask) {
-  uint32_t cpumasks[CPU_SET_N_U32];
-  int m = CPU_SET_N_U32-1;
-  cpumasks[m] = 0;
-  for (int o=0; o<strlen(str); o++) {
-    char c = str[o];
-    if (c == ',') {
-      m--;
-      cpumasks[m] = 0;
-    } else {
-      int v = hexToInt(c);
-      if (v == -1) break;
-      cpumasks[m] <<= 4;
-      cpumasks[m] += v;
+  // transform the string into an array of 32 bit masks, starting with the highest mask
+  int m = CPU_SET_N_U32;
+  char* str = strdup(maskStr);
+  char* token = strtok(str, ",");
+  while (token != NULL && m > 0) {
+    cpumasks[--m] = strtoul(token, NULL, /*base = hex*/ 16);
+    token = strtok(NULL, ",");
+  }
+  free(str);
+
+  // list all the CPUs as part of the CPU set, starting with the lowest mask (= current value of m)
+  CPU_ZERO(set);
+  for (int a = 0; (a + m) < CPU_SET_N_U32; a++) {
+    // each mask is U32_LEN CPUs, list them all if the bit is on
+    for (int i = 0; i < U32_LEN; ++i) {
+      if (cpumasks[a + m] & (1UL << i)) CPU_SET(i + a * U32_LEN, set);
     }
   }
-  // Copy cpumasks to mask
-  for (int a=0; m<CPU_SET_N_U32; a++,m++) {
-    memcpy(((uint32_t*)mask)+a, cpumasks+m, sizeof(uint32_t));
-  }
-  return ncclSuccess;
-}
-
-static ncclResult_t ncclCpusetToStr(cpu_set_t* mask, char* str) {
-  int c = 0;
-  uint8_t* m8 = (uint8_t*)mask;
-  for (int o=sizeof(cpu_set_t)-1; o>=0; o--) {
-    if (c == 0 && m8[o] == 0) continue;
-    sprintf(str+c, "%02x", m8[o]);
-    c+=2;
-    if (o && o%4 == 0) {
-      sprintf(str+c, ",");
-      c++;
-    }
-  }
-  str[c] = '\0';
   return ncclSuccess;
 }
 
@@ -83,4 +67,31 @@ static char* ncclCpusetToRangeStr(cpu_set_t* mask, char* str, size_t len) {
   return str;
 }
 
+static ncclResult_t ncclStrListToCpuset(const char* userStr, cpu_set_t* mask) {
+  // reset the CPU set
+  CPU_ZERO(mask);
+  const char delim[] = ",";
+  char* str = strdup(userStr);
+  char* token = strtok(str, delim);
+  while (token != NULL) {
+    uint64_t cpu = strtoull(token, NULL, 0);
+    CPU_SET(cpu, mask);
+    token = strtok(NULL, delim);
+  }
+  free(str);
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclCpusetToStrList(cpu_set_t* mask, char* str, size_t len) {
+  if (len == 0) return ncclSuccess;
+  str[0] = '\0';
+  int count = 0;
+  for (uint64_t id = 0; id < CPU_SETSIZE; ++id) {
+    if (CPU_ISSET(id, mask)) {
+      snprintf(str + strlen(str), len - strlen(str), "%s%lu", (count++ == 0) ? "" : ",", id);
+    }
+  }
+  return ncclSuccess;
+}
+
 #endif
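Reviewer note on the cpuset.h hunk above: a minimal, standalone sketch of the same mask-to-cpu_set_t conversion, assuming a Linux/glibc target. The helper name exampleMaskToCpuset is hypothetical and not part of the patch. Unlike the patched function, this sketch rejects malformed tokens instead of silently parsing them as 0, and it avoids strtok, which is not reentrant.

```cpp
#include <sched.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not in the patch): parse a sysfs-style mask such as
// "0003ff,f0003fff" into a cpu_set_t. The rightmost group holds CPUs 0..31.
static bool exampleMaskToCpuset(const std::string& maskStr, cpu_set_t* set) {
  assert(set != nullptr);
  // Split on ',' into 32-bit groups, leftmost (highest) group first.
  std::vector<uint32_t> groups;
  size_t pos = 0;
  while (pos <= maskStr.size()) {
    size_t comma = maskStr.find(',', pos);
    if (comma == std::string::npos) comma = maskStr.size();
    std::string tok = maskStr.substr(pos, comma - pos);
    if (tok.empty()) return false;  // empty string or empty group
    char* end = nullptr;
    unsigned long v = std::strtoul(tok.c_str(), &end, 16);
    if (*end != '\0' || v > 0xffffffffUL) return false;  // not pure hex, or too wide
    groups.push_back(static_cast<uint32_t>(v));
    pos = comma + 1;
  }
  if (groups.size() * 32 > static_cast<size_t>(CPU_SETSIZE)) return false;

  CPU_ZERO(set);
  // Walk the groups from the lowest (last in the string) to the highest.
  for (size_t g = 0; g < groups.size(); ++g) {
    uint32_t bits = groups[groups.size() - 1 - g];
    for (int i = 0; i < 32; ++i) {
      if (bits & (1u << i)) CPU_SET(g * 32 + i, set);
    }
  }
  return true;
}
```

With the mask from the header comment, "0003ff,f0003fff", the low group selects CPUs 0..13 and 28..31 and the high group selects CPUs 32..41.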
diff --git a/projects/rccl/src/include/cudawrap.h b/projects/rccl/src/include/cudawrap.h
index 2edc60f213..f05f13e43e 100644
--- a/projects/rccl/src/include/cudawrap.h
+++ b/projects/rccl/src/include/cudawrap.h
@@ -114,6 +114,12 @@ DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate, 12010);
 DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity, 12010);
 DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind, 12010);
 #endif
+/* Stream-MemOp support */
+DECLARE_CUDA_PFN_EXTERN(cuStreamBatchMemOp, 11070);
+DECLARE_CUDA_PFN_EXTERN(cuStreamWaitValue32, 11070);
+DECLARE_CUDA_PFN_EXTERN(cuStreamWaitValue64, 11070);
+DECLARE_CUDA_PFN_EXTERN(cuStreamWriteValue32, 11070);
+DECLARE_CUDA_PFN_EXTERN(cuStreamWriteValue64, 11070);
 #endif
 
 ncclResult_t ncclCudaLibraryInit(void);
diff --git a/projects/rccl/src/include/debug.h b/projects/rccl/src/include/debug.h
index 4e50cbf5a7..3822e87600 100644
--- a/projects/rccl/src/include/debug.h
+++ b/projects/rccl/src/include/debug.h
@@ -17,6 +17,7 @@
 #define NCCL_THREAD_NAMELEN 16
 
 extern int ncclDebugLevel;
+extern uint64_t ncclDebugMask;
 extern FILE *ncclDebugFile;
 
 void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *filefunc, int line, const char *fmt, ...) __attribute__ ((format (printf, 5, 6)));
@@ -27,11 +28,30 @@ extern char ncclLastError[];
 
 #define VERSION(...) ncclDebugLog(NCCL_LOG_VERSION, NCCL_ALL, __FILE__, __LINE__, __VA_ARGS__)
 #define WARN(...) ncclDebugLog(NCCL_LOG_WARN, NCCL_ALL, __FILE__, __LINE__, __VA_ARGS__)
-#define INFO(FLAGS, ...) ncclDebugLog(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
-#define TRACE_CALL(...) ncclDebugLog(NCCL_LOG_TRACE, NCCL_CALL, __func__, __LINE__, __VA_ARGS__)
+
+#define INFO(FLAGS, ...) \
+    do{ \
+        int level = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE); \
+        if((level >= NCCL_LOG_INFO && ((unsigned long)(FLAGS) & ncclDebugMask)) || (level < 0)) \
+            ncclDebugLog(NCCL_LOG_INFO, (unsigned long)(FLAGS), __func__, __LINE__, __VA_ARGS__); \
+    } while(0)
+
+#define TRACE_CALL(...) \
+    do { \
+        int level = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE); \
+        if((level >= NCCL_LOG_TRACE && (NCCL_CALL & ncclDebugMask)) || (level < 0)) { \
+            ncclDebugLog(NCCL_LOG_TRACE, NCCL_CALL, __func__, __LINE__, __VA_ARGS__); \
+        } \
+    } while (0)
 
 #ifdef ENABLE_TRACE
-#define TRACE(FLAGS, ...) ncclDebugLog(NCCL_LOG_TRACE, (FLAGS), __func__, __LINE__, __VA_ARGS__)
+#define TRACE(FLAGS, ...) \
+    do { \
+        int level = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE); \
+        if ((level >= NCCL_LOG_TRACE && ((unsigned long)(FLAGS) & ncclDebugMask)) || (level < 0)) { \
+            ncclDebugLog(NCCL_LOG_TRACE, (unsigned long)(FLAGS), __func__, __LINE__, __VA_ARGS__); \
+        } \
+    } while (0)
 #else
 #define TRACE(...)
 #endif
diff --git a/projects/rccl/src/include/dev_runtime.h b/projects/rccl/src/include/dev_runtime.h
new file mode 100644
index 0000000000..5f6e66e338
--- /dev/null
+++ b/projects/rccl/src/include/dev_runtime.h
@@ -0,0 +1,92 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef NCCL_DEVICE_RUNTIME_H_
+#define NCCL_DEVICE_RUNTIME_H_
+#include "nccl.h"
+#include "nccl_device.h"
+#include "nccl_common.h"
+#include "allocator.h"
+#include "bitops.h"
+#include "utils.h"
+
+////////////////////////////////////////////////////////////////////////////////
+// ncclDevr[_]: runtime implementation for the symmetric API.
+
+struct ncclDevrMemory;
+struct ncclDevrWindow {
+  struct ncclDevrMemory* memory;
+  void* userPtr;
+  size_t size;
+  size_t bigOffset; // Offset in big VA space.
+  int winFlags;
+  void* localRegHandle;
+  struct ncclWindow_vidmem* vidmem;
+};
+struct ncclDevrWindowSorted;
+struct ncclDevrTeam;
+
+struct ncclDevrRegTask {
+  struct ncclDevrRegTask *next;
+  void* userPtr;
+  size_t userSize;
+  int winFlags;
+  ncclWindow_t* outWinDev;
+};
+
+struct ncclDevrCommCreateTask {
+  struct ncclDevrCommCreateTask *next;
+  struct ncclDevCommRequirements* reqs;
+  struct ncclDevComm* outDevComm;
+};
+
+struct ncclDevrState {
+  // Like localRank/localRanks except "lsa" ranks must be consecutive in the world
+  // and all lsa subsets have the same number of ranks. If any condition is
+  // false then the lsa team is just the singleton of self.
+  int lsaSelf;
+  int lsaSize;
+  int* lsaRankList;
+
+  size_t granularity; // cuMemGetAllocationGranularity
+  struct ncclDevrMemory* memHead;
+  struct ncclDevrWindowSorted* winSorted;
+  int winSortedCapacity, winSortedCount;
+  struct ncclDevrTeam* teamHead;
+  size_t bigSize; // size of our big logical space (128GB?)
+  struct ncclSpace bigSpace; // allocates our big VA space.
+  void* lsaFlatBase; // base ptr for all lsa ranks big VA's concatenated together: size = lsaSize*bigSize
+  struct ncclShadowPool shadows;
+  struct ncclDevCommWindowTable* windowTable;
+
+  struct ncclIntruQueue<struct ncclDevrRegTask, &ncclDevrRegTask::next> regTaskQueue;
+  struct ncclIntruQueue<struct ncclDevrCommCreateTask, &ncclDevrCommCreateTask::next> commCreateTaskQueue;
+};
+
+// We assume ncclComm has a `ncclDevrState devrState` member.
+ncclResult_t ncclDevrInitOnce(struct ncclComm* comm);
+ncclResult_t ncclDevrFinalize(struct ncclComm* comm);
+
+// If found *outWinHost will be populated and *outWinId >= 0, otherwise *outWinId == -1
+ncclResult_t ncclDevrFindWindow(struct ncclComm* comm, void const* userPtr, struct ncclDevrWindow** outWin);
+
+ncclResult_t ncclDevrWindowRegisterInGroup(
+  struct ncclComm* comm, void* ptr, size_t size, int winFlags, ncclWindow_t* outWinDev
+);
+
+ncclResult_t ncclDevrCommCreateInternal(
+  struct ncclComm* comm, struct ncclDevCommRequirements const* reqs, struct ncclDevComm* outDevComm
+);
+void freeDevCommRequirements(
+  struct ncclDevCommRequirements* reqs
+);
+
+// Get the corresponding pointer in another lsa rank's symmetric memory window
+ncclResult_t ncclDevrGetLsaRankPtr(struct ncclComm* comm, struct ncclDevrWindow* winHost, size_t offset, int lsaRank, void** outPtr);
+
+// Get the multicast address for a given team
+ncclResult_t ncclDevrGetLsaTeamPtrMC(struct ncclComm* comm, struct ncclDevrWindow* winHost, size_t offset, struct ncclTeam lsaTeam, void** outPtr);
+#endif
diff --git a/projects/rccl/src/include/device.h b/projects/rccl/src/include/device.h
index 2c5ce1029a..9ffc260950 100644
--- a/projects/rccl/src/include/device.h
+++ b/projects/rccl/src/include/device.h
@@ -8,9 +8,8 @@
 #define NCCL_DEVICE_H_
 
 #include "nccl.h"
-#include "nccl_common.h"
+#include "nccl_tuner.h"
 #include "bitops.h"
-#include "symmetric.h"
 #include <algorithm>
 #include <stdint.h>
 #include <sys/types.h>
@@ -159,6 +158,7 @@ struct ncclProxyConnector {
 struct ncclConnector {
   int connected;
   int hasSeen;
+  int p2pOnly;
   struct ncclProxyConnector proxyConn;
   struct ncclTransportComm* transportComm;
   void* transportResources;
@@ -228,7 +228,7 @@ struct ncclChannelPeer {
   int refCount;
 };
 
-struct ncclDevComm;
+struct ncclKernelComm;
 
 struct alignas(16) ncclDevWorkP2p {
   void *sendAddr, *recvAddr;
@@ -267,16 +267,10 @@ inline __host__ uint8_t ncclP2pChannelBaseForRound(struct ncclComm* comm, int p2
 // ncclP2pChannelToPart and ncclP2pChannelForPart are inverses. The device code
 // uses ncclP2pChannelToPart to determine which part "this" channel is responsible for.
 inline __host__ int ncclP2pChannelForPart(int nP2pChannels, int base, int part) {
-  // Only works because nP2pChannels is pow2
-  int nChannelsLog2 = countOneBits(nP2pChannels-1);
-  int delta = reverseBits(part, nChannelsLog2);
-  return (base + delta) & (nP2pChannels-1);
+  return (base + part) & (nP2pChannels-1);
 }
 inline __device__ int ncclP2pChannelToPart(int nP2pChannels, int base, int channel) {
-  // Only works because nP2pChannels is pow2
-  int nChannelsLog2 = countOneBits(nP2pChannels-1);
-  int delta = (channel-base) & (nP2pChannels-1);
-  return reverseBits(delta, nChannelsLog2);
+  return (channel - base) & (nP2pChannels-1);
 }
 
 struct alignas(16) ncclDevWorkColl {
@@ -413,7 +407,7 @@ struct ncclDevProfiler {
   } data[MAX_PROFILER_EVENTS_PER_CHANNEL];
 };
 
-struct ncclDevComm {
+struct ncclKernelComm {
   int rank;
   int nRanks;
   int node;
@@ -436,8 +430,8 @@ struct ncclDevComm {
   struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
 };
 
-struct alignas(16) ncclDevCommAndChannels {
-  struct ncclDevComm comm;
+struct alignas(16) ncclKernelCommAndChannels {
+  struct ncclKernelComm comm;
   struct ncclDevChannel channels[MAXCHANNELS];
 };
 
@@ -448,7 +442,7 @@ enum ncclDevWorkStorageType: uint8_t {
 };
 
 struct alignas(16) ncclDevKernelArgs {
-  struct ncclDevComm* comm;
+  struct ncclKernelComm* comm;
   uint64_t channelMask;
   enum ncclDevWorkStorageType workStorageType;
   uint32_t workMask;
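Reviewer note on the device.h hunk above: the simplified ncclP2pChannelForPart / ncclP2pChannelToPart pair drops the bit-reversal step but still masks with (nP2pChannels-1), so it still relies on nP2pChannels being a power of two even though that comment was removed. The sketch below is a host-only illustration with hypothetical names (channelForPart, channelToPart, exampleMappingsAreInverses) that copies the two expressions and checks that they remain inverses of each other.

```cpp
#include <cassert>

// Host-only copies of the two expressions from the patch, for illustration.
constexpr int channelForPart(int nP2pChannels, int base, int part) {
  return (base + part) & (nP2pChannels - 1);
}
constexpr int channelToPart(int nP2pChannels, int base, int channel) {
  return (channel - base) & (nP2pChannels - 1);
}

// Returns true when ToPart(ForPart(part)) == part for every base and part.
// The mask acts as a modulo only when nP2pChannels is a power of two.
bool exampleMappingsAreInverses(int nP2pChannels) {
  assert(nP2pChannels > 0 && (nP2pChannels & (nP2pChannels - 1)) == 0);
  for (int base = 0; base < nP2pChannels; ++base) {
    for (int part = 0; part < nP2pChannels; ++part) {
      int channel = channelForPart(nP2pChannels, base, part);
      if (channelToPart(nP2pChannels, base, channel) != part) return false;
    }
  }
  return true;
}
```

For example, with 8 channels and base 6, part 3 maps to channel 1, and channel 1 maps back to part 3.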
diff --git a/projects/rccl/src/include/graph.h b/projects/rccl/src/include/graph.h
index 7475e5a7b2..6b926717e7 100644
--- a/projects/rccl/src/include/graph.h
+++ b/projects/rccl/src/include/graph.h
@@ -115,6 +115,7 @@ ncclResult_t ncclTopoPrintGraph(struct ncclTopoSystem* system, struct ncclTopoGr
 ncclResult_t ncclTopoDumpGraphs(struct ncclTopoSystem* system, int ngraphs, struct ncclTopoGraph** graphs);
 
 struct ncclTopoRanks {
+  int crossNicRing;
   int ringRecv[MAXCHANNELS];
   int ringSend[MAXCHANNELS];
   int ringPrev[MAXCHANNELS];
@@ -131,6 +132,7 @@ ncclResult_t ncclTopoPreset(struct ncclComm* comm, struct ncclTopoGraph** graphs
 ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePatterns,
     struct ncclTopoRanks** allTopoRanks, int* rings, struct ncclTopoGraph** graphs, struct ncclComm* parent);
 
+ncclResult_t ncclTopoInitTunerConstants(struct ncclComm* comm);
 ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCompCap, struct ncclTopoGraph** graphs);
 ncclResult_t ncclTopoGetAlgoTime(struct ncclComm* comm, int coll, int algorithm, int protocol, size_t nBytes, int numPipeOps, float* time);
 
diff --git a/projects/rccl/src/include/group.h b/projects/rccl/src/include/group.h
index 033a187da3..6e317c6c4d 100644
--- a/projects/rccl/src/include/group.h
+++ b/projects/rccl/src/include/group.h
@@ -43,6 +43,7 @@ struct ncclAsyncJob {
   uint32_t* childAbortFlagDev; /* point to child abortFlagDev */
   ncclComm_t comm;
   int destroyFlag;
+  bool isThreadMain;
 };
 
 ncclResult_t ncclAsyncLaunch(
diff --git a/projects/rccl/src/include/nccl_common.h b/projects/rccl/src/include/nccl_common.h
index 0f387c15ee..0a38421516 100644
--- a/projects/rccl/src/include/nccl_common.h
+++ b/projects/rccl/src/include/nccl_common.h
@@ -7,6 +7,11 @@
 #ifndef NCCL_DEBUG_H_
 #define NCCL_DEBUG_H_
 
+// Workaround for libstdc++ trying to force public visibility of std:: symbols.  We don't want to do that in libnccl.so.
+#include <bits/c++config.h>
+#undef _GLIBCXX_VISIBILITY
+#define _GLIBCXX_VISIBILITY(V)
+
 #include <cstdint>
 
 typedef enum {
@@ -60,24 +65,11 @@ typedef enum {
   ncclFuncSendRecv = 5,
   ncclFuncSend = 6,
   ncclFuncRecv = 7,
-  ncclNumFuncs = 8
+  ncclFuncAlltoAll = 8,
+  ncclFuncScatter = 9,
+  ncclFuncGather = 10,
+  ncclNumFuncs = 11
 } ncclFunc_t;
 
-#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*/PAT
-#define NCCL_ALGO_UNDEF -1
-#define NCCL_ALGO_TREE 0
-#define NCCL_ALGO_RING 1
-#define NCCL_ALGO_COLLNET_DIRECT 2
-#define NCCL_ALGO_COLLNET_CHAIN 3
-#define NCCL_ALGO_NVLS 4
-#define NCCL_ALGO_NVLS_TREE 5
-#define NCCL_ALGO_PAT 6
 
-#define NCCL_NUM_PROTOCOLS 3 // Simple/LL/LL128
-#define NCCL_PROTO_UNDEF -1
-#define NCCL_PROTO_LL 0
-#define NCCL_PROTO_LL128 1
-#define NCCL_PROTO_SIMPLE 2
-
-#define NCCL_ALGO_PROTO_IGNORE -1.0
 #endif
diff --git a/projects/rccl/src/include/nccl_device.h b/projects/rccl/src/include/nccl_device.h
new file mode 100644
index 0000000000..88b2531d19
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device.h
@@ -0,0 +1,15 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "nccl_device/impl/comm__funcs.h"
+#include "nccl_device/coop.h"
+#include "nccl_device/impl/core__funcs.h"
+#include "nccl_device/impl/ll_a2a__funcs.h"
+#include "nccl_device/impl/mem_barrier__funcs.h"
+//#include "nccl_device/net_barrier__funcs.h"
+//#include "nccl_device/net_scratch_a2a__funcs.h"
+//#include "nccl_device/barrier__funcs.h"
+#include "nccl_device/impl/ptr__funcs.h"
diff --git a/projects/rccl/src/include/nccl_device/README.md b/projects/rccl/src/include/nccl_device/README.md
new file mode 100644
index 0000000000..bf1728d47d
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/README.md
@@ -0,0 +1,32 @@
+This directory has been structured to make it easy for users to read the headers to learn the API. The files adjacent
+to this README are meant for humans. They contain the essential declarations like which types exist and function prototypes and comments
+indicating the contract/usage. Everything else goes into the "impl/" subdirectory. Most modules are stratified into three layers:
+
+1) "foo.h" Public API declarations.
+2) "impl/foo__types.h" struct definitions. Has #include of layer 1.
+3) "impl/foo__funcs.h" inline functions. Has #include of layer 2.
+
+The include dependencies should be acyclic for layers 1 and 2 since order matters for declarations and types. Layer 3 though
+can freely have cycles amongst itself ("impl/foo__funcs.h" and "impl/bar__funcs.h" can mutually include each other) since
+functions can be defined in any order once declared.
+
+Translation units should just include "nccl_device.h" to ensure they get all the "impl/foo__funcs.h". But if a translation unit wants
+to be more specific as to which module it pulls in it should include "impl/foo__funcs.h".
+
+One of the nastier reasons this structure is required is C++ defaulted function parameters:
+
+```
+// +++ in foo.h +++
+struct Foo; // defined in some __types.h
+
+// +++ in "impl/foo__types.h" +++
+struct Foo { int x; };
+
+// +++ in "bar.h" +++
+// Prototype function where default value is default construction of Foo. Since
+// Foo would be incomplete if just including "foo.h" the compiler errors because
+// it can't reason about the {}.
+// I was able to solve this by including "impl/foo__types.h" instead.
+#include "impl/foo__types.h"
+void bar(Foo arg = {});
+```
diff --git a/projects/rccl/src/include/nccl_device/comm.h b/projects/rccl/src/include/nccl_device/comm.h
new file mode 100644
index 0000000000..d989ce1f63
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/comm.h
@@ -0,0 +1,10 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_COMM_H_
+#define _NCCL_DEVICE_COMM_H_
+#include "core.h"
+#endif
diff --git a/projects/rccl/src/include/nccl_device/coop.h b/projects/rccl/src/include/nccl_device/coop.h
new file mode 100644
index 0000000000..9a8d4b0a86
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/coop.h
@@ -0,0 +1,152 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_COOP_H_
+#define _NCCL_DEVICE_COOP_H_
+#include "utility.h"
+
+// ncclCoop[Foo]: NCCL's versions of CUDA's Cooperative Groups. They conform
+// to just this subset of the CUDA API:
+//   int Coop::thread_rank();
+//   int Coop::size();
+//   int Coop::num_threads(); // same as size()
+//   void Coop::sync();
+
+#if __CUDACC__
+template<int nThreadsPow2>
+struct ncclCoopTile { // An aligned pow2 set of threads within the warp.
+  static_assert(nccl::utility::isPow2(nThreadsPow2) && nThreadsPow2 <= 32, "Condition required");
+
+  NCCL_DEVICE_INLINE int thread_rank() const {
+    return nccl::utility::lane() % nThreadsPow2;
+  }
+  NCCL_DEVICE_INLINE constexpr int size() const { return nThreadsPow2; }
+  NCCL_DEVICE_INLINE constexpr int num_threads() const { return nThreadsPow2; }
+
+  NCCL_DEVICE_INLINE uint32_t laneMask() const {
+    return (-1u>>(32-nThreadsPow2))<<(nccl::utility::lane() & -nThreadsPow2);
+  }
+  NCCL_DEVICE_INLINE void sync() {
+    __syncwarp(laneMask());
+  }
+};
+#endif
+
+#if __CUDACC__
+typedef ncclCoopTile<1> ncclCoopThread;
+typedef ncclCoopTile<32> ncclCoopWarp;
+#endif
+
+#if __CUDACC__
+struct ncclCoopLanes { // Some lanes of this warp.
+  uint32_t lmask;
+  
+  NCCL_DEVICE_INLINE constexpr ncclCoopLanes(uint32_t lmask=-1u): lmask(lmask) {}
+
+  NCCL_DEVICE_INLINE int thread_rank() const {
+    return __popc(lmask & nccl::utility::lanemask_lt());
+  }
+  NCCL_DEVICE_INLINE int size() const {
+    return __popc(lmask);
+  }
+  NCCL_DEVICE_INLINE int num_threads() const {
+    return __popc(lmask);
+  }
+  NCCL_DEVICE_INLINE void sync() {
+    __syncwarp(lmask);
+  }
+};
+#endif
+
+#if __CUDACC__
+// A set of consecutive warps that the user has also supplied with a unique
+// id from [0..15]. It is an error for two different warp spans with the same
+// id to be in a collective concurrently.
+struct ncclCoopWarpSpan {
+  uint32_t warp0:8, nWarps:8, id:8;
+
+  NCCL_DEVICE_INLINE constexpr ncclCoopWarpSpan(int warp0, int nWarps, int id):
+    warp0(warp0), nWarps(nWarps), id(id) {
+  }
+  
+  NCCL_DEVICE_INLINE int thread_rank() const {
+    return threadIdx.x - 32*warp0;
+  }
+  NCCL_DEVICE_INLINE int size() const {
+    return 32*nWarps;
+  }
+  NCCL_DEVICE_INLINE int num_threads() const {
+    return 32*nWarps;
+  }
+
+  NCCL_DEVICE_INLINE void sync() {
+    //asm volatile("barrier.sync %0, %1;" :: "r"(1+id), "r"(32*nWarps) : "memory");
+    __barrier_sync_count(1+id, 32*nWarps);
+  }
+};
+#endif
+
+#if __CUDACC__
+struct ncclCoopCta {
+  NCCL_DEVICE_INLINE int thread_rank() const { return threadIdx.x; }
+  NCCL_DEVICE_INLINE int size() const { return blockDim.x; }
+  NCCL_DEVICE_INLINE int num_threads() const { return blockDim.x; }
+  NCCL_DEVICE_INLINE void sync() { __syncthreads(); }
+};
+#endif
+
+#if __CUDACC__
+template<int nThreadsPow2>
+NCCL_DEVICE_INLINE uint32_t ncclCoopLaneMask(ncclCoopTile<nThreadsPow2> coop) {
+  return coop.laneMask();
+}
+NCCL_DEVICE_INLINE uint32_t ncclCoopLaneMask(ncclCoopLanes coop) {
+  return coop.lmask;
+}
+NCCL_DEVICE_INLINE uint32_t ncclCoopLaneMask(ncclCoopWarpSpan coop) {
+  return -1u;
+}
+NCCL_DEVICE_INLINE uint32_t ncclCoopLaneMask(ncclCoopCta coop) {
+  return -1u;
+}
+#endif
+
+#if __CUDACC__
+// ncclCoopIsThread:
+// Whether it is known at compile time that the given coop is a single thread.
+template<int nThreads>
+NCCL_DEVICE_INLINE constexpr bool ncclCoopIsThread(ncclCoopTile<nThreads>) {
+  return nThreads == 1;
+}
+NCCL_DEVICE_INLINE constexpr bool ncclCoopIsThread(ncclCoopLanes) { return false; }
+NCCL_DEVICE_INLINE constexpr bool ncclCoopIsThread(ncclCoopWarpSpan) { return false; }
+NCCL_DEVICE_INLINE constexpr bool ncclCoopIsThread(ncclCoopCta) { return false; }
+#endif
+
+#if __CUDACC__
+// Pick threads of our warp that are safe to use collectively.
+NCCL_DEVICE_INLINE ncclCoopLanes ncclCoopCoalesced() {
+  return ncclCoopLanes{__activemask()};
+}
+#endif
+
+#if __CUDACC__
+// Pick threads of our warp that are safe to use collectively given that this
+// is a collective on the provided cooperative group.
+template<typename Coop>
+NCCL_DEVICE_INLINE ncclCoopTile<32> ncclCoopCoalesced(Coop) {
+  return ncclCoopTile<32>();
+}
+NCCL_DEVICE_INLINE ncclCoopLanes ncclCoopCoalesced(ncclCoopLanes coop) {
+  return coop;
+}
+template<int nThreads>
+NCCL_DEVICE_INLINE ncclCoopTile<nThreads> ncclCoopCoalesced(ncclCoopTile<nThreads> coop) {
+  return coop;
+}
+#endif
+
+#endif
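Reviewer note on the coop.h hunk above: the lane-mask arithmetic in ncclCoopTile::laneMask() can be checked on the host. The sketch below uses hypothetical names (exampleTileLaneMask, exampleTileThreadRank) and takes the lane index as a parameter in place of nccl::utility::lane(). It assumes nThreadsPow2 is a power of two no larger than 32, as the static_assert in the patch requires.

```cpp
#include <cassert>
#include <cstdint>

// Host-side copy of the laneMask() expression: nThreadsPow2 consecutive bits,
// shifted to the start of the aligned tile that contains `lane`.
constexpr uint32_t exampleTileLaneMask(int nThreadsPow2, int lane) {
  return (~0u >> (32 - nThreadsPow2)) << (lane & -nThreadsPow2);
}

// Host-side copy of thread_rank(): position of `lane` within its tile.
constexpr int exampleTileThreadRank(int nThreadsPow2, int lane) {
  return lane % nThreadsPow2;
}
```

For a 4-thread tile, lane 5 belongs to the tile covering lanes 4..7, so its mask is 0xF0 and its rank within the tile is 1.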
diff --git a/projects/rccl/src/include/nccl_device/core.h b/projects/rccl/src/include/nccl_device/core.h
new file mode 100644
index 0000000000..dd41d69250
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/core.h
@@ -0,0 +1,150 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_CORE_H_
+#define _NCCL_DEVICE_CORE_H_
+#include <nccl.h>
+#include "coop.h"
+#include "utility.h"
+
+struct ncclDevComm;
+typedef struct ncclDevComm ncclDevComm_t;
+
+struct ncclTeam;
+typedef struct ncclTeam ncclTeam_t;
+
+// typedef struct ncclWindow_vidmem* ncclWindow_t; // in nccl.h
+
+struct ncclMultimemHandle;
+typedef struct ncclMultimemHandle ncclMultimemHandle_t;
+
+typedef uint32_t ncclDevResourceHandle;
+typedef ncclDevResourceHandle ncclDevResourceHandle_t;
+
+struct ncclLsaBarrierHandle;
+typedef struct ncclLsaBarrierHandle ncclLsaBarrierHandle_t;
+
+struct ncclLLA2AHandle;
+typedef struct ncclLLA2AHandle ncclLLA2AHandle_t;
+
+struct ncclTeam {
+  int nRanks, rank, stride;
+};
+
+#if __cplusplus
+template<typename T> struct ncclSymPtr;
+#endif
+
+#if __cplusplus
+struct ncclTeamTagWorld {};
+struct ncclTeamTagLsa {};
+struct ncclTeamTagRail {};
+#endif
+
+struct ncclDevCommRequirements;
+typedef struct ncclDevCommRequirements ncclDevCommRequirements_t;
+
+struct ncclDevResourceRequirements;
+typedef struct ncclDevResourceRequirements ncclDevResourceRequirements_t;
+
+struct ncclTeamRequirements;
+typedef struct ncclTeamRequirements ncclTeamRequirements_t;
+
+struct ncclDevCommRequirements {
+  ncclDevResourceRequirements_t* resourceRequirementsList;
+  ncclTeamRequirements_t* teamRequirementsList;
+
+  bool lsaMultimem; // Enable multimem on lsa team
+
+  int lsaBarrierCount;
+};
+
+struct ncclDevResourceRequirements {
+  ncclDevResourceRequirements_t* next;
+  size_t bufferSize, bufferAlign;
+  ncclDevResourceHandle_t* outBufferHandle; // If non-null, target assigned during ncclDevCommCreate.
+};
+
+struct ncclTeamRequirements {
+  ncclTeamRequirements_t* next;
+  ncclTeam_t team;
+  bool multimem;
+  ncclMultimemHandle_t* outMultimemHandle; // If non-null, target assigned during ncclDevCommCreate.
+};
+
+NCCL_EXTERN_C __host__ ncclResult_t ncclDevCommCreate(ncclComm_t, ncclDevCommRequirements_t const*, ncclDevComm_t* outDevComm);
+NCCL_EXTERN_C __host__ ncclResult_t ncclDevCommDestroy(ncclComm_t, ncclDevComm_t const* devComm);
+
+////////////////////////////////////////////////////////////////////////////////
+// Team API:
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE ncclTeam ncclTeamWorld(ncclDevComm const&);
+#endif
+NCCL_EXTERN_C __host__ ncclTeam_t ncclTeamWorld(ncclComm_t);
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE ncclTeam ncclTeamLsa(ncclDevComm const&);
+#endif
+NCCL_EXTERN_C __host__ ncclTeam_t ncclTeamLsa(ncclComm_t);
+
+NCCL_EXTERN_C NCCL_HOST_DEVICE_INLINE bool ncclTeamRankIsMember(ncclTeam_t a, ncclTeam_t b, int bPeer);
+NCCL_EXTERN_C NCCL_HOST_DEVICE_INLINE int ncclTeamRankToTeam(ncclTeam_t a, ncclTeam_t b, int bPeer);
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE int ncclTeamRankToWorld(ncclDevComm const&, ncclTeam, int rank);
+#endif
+NCCL_EXTERN_C __host__ int ncclTeamRankToWorld(ncclComm_t, ncclTeam_t, int rank);
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE int ncclTeamRankToLsa(ncclDevComm const&, ncclTeam, int rank);
+#endif
+NCCL_EXTERN_C __host__ int ncclTeamRankToLsa(ncclComm_t, ncclTeam_t, int rank);
+
+NCCL_EXTERN_C NCCL_HOST_DEVICE_INLINE ncclTeam_t ncclTeamInnerFactor(ncclTeam_t parent, int innerSize);
+NCCL_EXTERN_C NCCL_HOST_DEVICE_INLINE ncclTeam_t ncclTeamOuterFactor(ncclTeam_t parent, int innerSize);
+
+// Interpret each team as a set of ranks. This function assumes that `subset`
+// is a subset of `parent`. Thus the number of ranks in the set difference of
+// `parent` minus `subset` is `parent.nRanks - subset.nRanks`. Given `index` this
+// function returns the index'th element of `parent` minus `subset`.
+NCCL_EXTERN_C NCCL_HOST_DEVICE_INLINE int ncclTeamRankInDifference(ncclTeam_t parent, ncclTeam_t subset, int index);
+
+// Equivalent to ncclTeamOuterFactor of lsa team.
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE ncclTeam ncclTeamRail(ncclDevComm const&);
+#endif
+NCCL_EXTERN_C __host__ ncclTeam_t ncclTeamRail(ncclComm_t);
+
+// Get offset of resource buffer within `comm.resourceWindow`.
+NCCL_EXTERN_C NCCL_HOST_DEVICE_INLINE size_t ncclGetResourceBufferOffset(ncclDevResourceHandle_t);
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE ncclSymPtr<char> ncclGetResourceBuffer(ncclDevComm const&, ncclDevResourceHandle);
+#endif
+
+////////////////////////////////////////////////////////////////////////////////
+// Window API:
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetLocalPointer(ncclWindow_t w, size_t offset);
+NCCL_DEVICE_INLINE void* ncclGetLsaPointer(ncclWindow_t w, size_t offset, int peer);
+NCCL_DEVICE_INLINE void* ncclGetPeerPointer(ncclWindow_t w, size_t offset, int peer);
+NCCL_DEVICE_INLINE void* ncclGetPeerPointer(ncclWindow_t w, size_t offset, ncclTeam tm, int peer);
+NCCL_DEVICE_INLINE void* ncclGetMultimemPointer(ncclWindow_t w, size_t offset, ncclMultimemHandle mmHandle);
+NCCL_DEVICE_INLINE void* ncclGetLsaMultimemPointer(ncclWindow_t w, size_t offset, ncclDevComm const&);
+#endif
+
+#if __CUDACC__
+// Convenience for combining ncclGet***Pointer() with resource handle.
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferLocalPointer(ncclDevComm const&, ncclDevResourceHandle);
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferLsaPointer(ncclDevComm const&, ncclDevResourceHandle, int peer);
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferPeerPointer(ncclDevComm const&, ncclDevResourceHandle, ncclTeam, int peer);
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferMultimemPointer(ncclDevComm const&, ncclDevResourceHandle, ncclMultimemHandle);
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferLsaMultimemPointer(ncclDevComm const&, ncclDevResourceHandle);
+#endif
+
+#endif // _NCCL_DEVICE_CORE_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/comm__funcs.h b/projects/rccl/src/include/nccl_device/impl/comm__funcs.h
new file mode 100644
index 0000000000..0bfe90c91d
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/comm__funcs.h
@@ -0,0 +1,10 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_COMM__FUNCS_H_
+#define _NCCL_DEVICE_COMM__FUNCS_H_
+#include "comm__types.h"
+#endif // _NCCL_DEVICE_COMM__FUNCS_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/comm__types.h b/projects/rccl/src/include/nccl_device/impl/comm__types.h
new file mode 100644
index 0000000000..680d7055b9
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/comm__types.h
@@ -0,0 +1,40 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_COMM__TYPES_H_
+#define _NCCL_DEVICE_COMM__TYPES_H_
+#include "../comm.h"
+#include "core__types.h"
+#include "mem_barrier__types.h"
+#include "ll_a2a__types.h"
+
+struct ncclDevCommWindowTable;
+#if __cplusplus
+struct ncclDevCommWindowTable {
+  struct Entry {
+    uintptr_t base, size;
+    ncclWindow_t window;
+  } entries[32];
+  struct ncclDevCommWindowTable* next;
+};
+#endif
+
+struct ncclDevComm {
+  int rank, nRanks;
+  uint32_t nRanks_rcp32;
+  int lsaRank, lsaSize;
+  uint32_t lsaSize_rcp32;
+
+  struct ncclDevCommWindowTable* windowTable;
+
+  ncclWindow_t resourceWindow;
+  struct ncclWindow_vidmem resourceWindow_inlined;
+
+  ncclMultimemHandle_t lsaMultimem;
+  ncclLsaBarrierHandle_t lsaBarrier;
+};
+
+#endif // _NCCL_DEVICE_COMM__TYPES_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/core__funcs.h b/projects/rccl/src/include/nccl_device/impl/core__funcs.h
new file mode 100644
index 0000000000..1087cd2892
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/core__funcs.h
@@ -0,0 +1,210 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_CORE__FUNCS_H_
+#define _NCCL_DEVICE_CORE__FUNCS_H_
+#include "core__types.h"
+#include "comm__types.h"
+#include "ptr__types.h"
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE ncclTeam ncclTeamWorld(ncclDevComm const &comm) {
+  ncclTeam ans;
+  ans.nRanks = comm.nRanks;
+  ans.rank = comm.rank;
+  ans.stride = 1;
+  return ans;
+}
+#endif
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE ncclTeam ncclTeamLsa(ncclDevComm const &comm) {
+  ncclTeam ans;
+  ans.nRanks = comm.lsaSize;
+  ans.rank = comm.lsaRank;
+  ans.stride = 1;
+  return ans;
+}
+#endif
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE ncclTeam ncclTeamRail(ncclDevComm const& comm) {
+  ncclTeam ans;
+  ans.nRanks = nccl::utility::idivFast32(comm.nRanks, comm.lsaSize, comm.lsaSize_rcp32);
+  ans.rank = nccl::utility::idivFast32(comm.rank, comm.lsaSize, comm.lsaSize_rcp32);
+  ans.stride = comm.lsaSize;
+  return ans;
+}
+#endif
+
+NCCL_HOST_DEVICE_INLINE bool ncclTeamRankIsMember(ncclTeam_t a, ncclTeam_t b, int brank) {
+  int wrank = (brank - b.rank)*b.stride;
+  uint32_t adelta = wrank/a.stride;
+  uint32_t amod = wrank%a.stride;
+  int arank = a.rank + adelta;
+  return 0 <= arank && arank < a.nRanks && amod == 0;
+}
+
+NCCL_HOST_DEVICE_INLINE int ncclTeamRankToTeam(ncclTeam_t a, ncclTeam_t b, int brank) {
+  int wrank = (brank - b.rank)*b.stride;
+  uint32_t adelta = wrank/a.stride;
+  //uint32_t amod = wrank%a.stride;
+  int arank = a.rank + adelta;
+  return arank;
+}
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE int ncclTeamRankToWorld(ncclDevComm const& comm, ncclTeam tm, int rank) {
+  return comm.rank + (rank - tm.rank)*tm.stride;
+}
+#endif
+
+#if __cplusplus
+NCCL_HOST_DEVICE_INLINE int ncclTeamRankToLsa(ncclDevComm const& comm, ncclTeam tm, int rank) {
+  return comm.lsaRank + (rank - tm.rank)*tm.stride;
+}
+#endif
+
+NCCL_HOST_DEVICE_INLINE ncclTeam_t ncclTeamInnerFactor(ncclTeam_t parent, int innerSize) {
+  ncclTeam_t ans;
+  ans.nRanks = innerSize;
+  ans.rank = parent.rank%innerSize;
+  ans.stride = parent.stride;
+  return ans;
+}
+
+NCCL_HOST_DEVICE_INLINE ncclTeam_t ncclTeamOuterFactor(ncclTeam_t parent, int innerSize) {
+  ncclTeam_t ans;
+  ans.nRanks = parent.nRanks/innerSize;
+  ans.rank = parent.rank/innerSize;
+  ans.stride = parent.stride*innerSize;
+  return ans;
+}
+
+NCCL_HOST_DEVICE_INLINE int ncclTeamRankInDifference(ncclTeam_t parent, ncclTeam_t subset, int index) {
+  int stride = subset.stride/parent.stride;
+  int below = parent.rank - subset.rank*stride;
+  if (stride < 0) {
+    stride = -stride;
+    below -= (subset.nRanks-1)*stride;
+  }
+  if (index < below) {
+    return index;
+  } else if (index-below < (subset.nRanks-1)*(stride-1)) {
+    return below + 1 + ((index-below)/(stride-1))*stride + (index-below)%(stride-1);
+  } else {
+    return below + 1 + (subset.nRanks-1)*stride + (index - below - (subset.nRanks-1)*(stride-1));
+  }
+}
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetLocalPointer(ncclWindow_t w, size_t offset) {
+  char* base = nccl::utility::loadConst(&w->lsaFlatBase);
+  uint32_t stride4G = nccl::utility::loadConst(&w->stride4G);
+  int i = nccl::utility::loadConst(&w->lsaRank);
+  return (void*)(nccl::utility::add4G(base, i*stride4G) + offset);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetLsaPointer(ncclWindow_t w, size_t offset, int peer) {
+  char* base = nccl::utility::loadConst(&w->lsaFlatBase);
+  uint32_t stride4G = nccl::utility::loadConst(&w->stride4G);
+  int i = peer;
+  return (void*)(nccl::utility::add4G(base, i*stride4G) + offset);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetPeerPointer(ncclWindow_t w, size_t offset, int peer) {
+  char* base = nccl::utility::loadConst(&w->lsaFlatBase);
+  uint32_t stride4G = nccl::utility::loadConst(&w->stride4G);
+  int worldRank = nccl::utility::loadConst(&w->worldRank);
+  int lsaRank = nccl::utility::loadConst(&w->lsaRank);
+  int i = lsaRank + (peer - worldRank);
+  return (void*)(nccl::utility::add4G(base, i*stride4G) + offset);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetPeerPointer(ncclWindow_t w, size_t offset, ncclTeam tm, int peer) {
+  char* base = nccl::utility::loadConst(&w->lsaFlatBase);
+  uint32_t stride4G = nccl::utility::loadConst(&w->stride4G);
+  int lsaRank = nccl::utility::loadConst(&w->lsaRank);
+  int i = lsaRank + (peer - tm.rank)*tm.stride;
+  return (void*)(nccl::utility::add4G(base, i*stride4G) + offset);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetMultimemPointer(ncclWindow_t w, size_t offset, ncclMultimemHandle mm) {
+  void* ptr = mm.mcBasePtr;
+  ptr = reinterpret_cast<char(*)[4096]>(ptr) + nccl::utility::loadConst(&w->mcOffset4K);
+  return (void*)((char*)ptr + offset);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetLsaMultimemPointer(ncclWindow_t w, size_t offset, ncclDevComm const& comm) {
+  return ncclGetMultimemPointer(w, offset, comm.lsaMultimem);
+}
+#endif
+
+NCCL_HOST_DEVICE_INLINE size_t ncclGetResourceBufferOffset(ncclDevResourceHandle_t h) {
+  return ((size_t)h)*128; // resource handles count 128-byte units
+}
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferLocalPointer(ncclDevComm const& comm, ncclDevResourceHandle h) {
+  void* lsaFlatBase = comm.resourceWindow_inlined.lsaFlatBase;
+  uint32_t stride4G = comm.resourceWindow_inlined.stride4G;
+  void* local = nccl::utility::add4G(lsaFlatBase, comm.lsaRank*stride4G);
+  return (void*)(reinterpret_cast<char(*)[128]>(local) + h);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferLsaPointer(ncclDevComm const& comm, ncclDevResourceHandle h, int peer) {
+  int r = peer;
+  void* lsaFlatBase = comm.resourceWindow_inlined.lsaFlatBase;
+  uint32_t stride4G = comm.resourceWindow_inlined.stride4G;
+  void* local = nccl::utility::add4G(lsaFlatBase, r*stride4G);
+  return (void*)(reinterpret_cast<char(*)[128]>(local) + h);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferPeerPointer(ncclDevComm const& comm, ncclDevResourceHandle h, ncclTeam team, int peer) {
+  int r = comm.lsaRank + (peer - team.rank)*team.stride;
+  void* lsaFlatBase = comm.resourceWindow_inlined.lsaFlatBase;
+  uint32_t stride4G = comm.resourceWindow_inlined.stride4G;
+  void* local = nccl::utility::add4G(lsaFlatBase, r*stride4G);
+  return (void*)(reinterpret_cast<char(*)[128]>(local) + h);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferMultimemPointer(ncclDevComm const& comm, ncclDevResourceHandle h, ncclMultimemHandle mm) {
+  void* ptr = mm.mcBasePtr;
+  ptr = reinterpret_cast<char(*)[4096]>(ptr) + comm.resourceWindow_inlined.mcOffset4K;
+  ptr = reinterpret_cast<char(*)[128]>(ptr) + h;
+  return ptr;
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void* ncclGetResourceBufferLsaMultimemPointer(ncclDevComm const& comm, ncclDevResourceHandle h) {
+  return ncclGetResourceBufferMultimemPointer(comm, h, comm.lsaMultimem);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE ncclSymPtr<char> ncclGetResourceBuffer(ncclDevComm const& comm, ncclDevResourceHandle h) {
+  return ncclSymPtr<char>(comm.resourceWindow, size_t(h)*128);
+}
+#endif
+
+#endif // _NCCL_DEVICE_CORE__FUNCS_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/core__types.h b/projects/rccl/src/include/nccl_device/impl/core__types.h
new file mode 100644
index 0000000000..d2d1350b12
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/core__types.h
@@ -0,0 +1,26 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_CORE__TYPES_H_
+#define _NCCL_DEVICE_CORE__TYPES_H_
+#include "../core.h"
+
+// nccl.h has: typedef ncclWindow_vidmem* ncclWindow_t;
+struct ncclWindow_vidmem {
+  void* winHost;
+  //ncclGinWindow_t ginWin;
+  char* lsaFlatBase; // pointer to first byte for rank 0 of lsa team
+  int lsaRank;
+  int worldRank;
+  uint32_t stride4G;
+  uint32_t mcOffset4K;
+};
+
+struct ncclMultimemHandle {
+  void* mcBasePtr;
+};
+
+#endif // _NCCL_DEVICE_CORE__TYPES_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/ll_a2a__funcs.h b/projects/rccl/src/include/nccl_device/impl/ll_a2a__funcs.h
new file mode 100644
index 0000000000..39bdf7a29b
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/ll_a2a__funcs.h
@@ -0,0 +1,229 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_LL_A2A__FUNCS_H_
+#define _NCCL_DEVICE_LL_A2A__FUNCS_H_
+#include "ll_a2a__types.h"
+#include "comm__types.h"
+#include "../utility.h"
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE ncclLLA2ASession<Coop>::ncclLLA2ASession(
+    Coop coop, ncclDevComm const& comm, ncclTeam team,
+    ncclLLA2AHandle handle, uint32_t block, int maxElts,
+    bool multimem, ncclMultimemHandle mmHandle
+  ):
+  ncclLLA2ASession_internal<Coop>{
+    coop, comm, team, handle, (int)block, /*pitch=*/maxElts,
+    multimem, mmHandle, /*epoch=*/0, /*slotsOffset=*/0
+  } {
+  uint4* line = (uint4*)ncclGetResourceBufferLocalPointer(comm, handle.bufHandle);
+  line += block*(1 + 2*handle.nSlots);
+  this->epoch = line->x + 2; // inverse of the destructor's save (epoch - 2)
+  this->slotsOffset = this->calcSlotOffset();
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE ncclLLA2ASession<Coop>::~ncclLLA2ASession() {
+  uint4* line = (uint4*)ncclGetResourceBufferLocalPointer(this->comm, this->handle.bufHandle);
+  line += this->block*(1 + 2*this->handle.nSlots);
+  if (this->coop.thread_rank() == 0) line->x = this->epoch - 2;
+  this->coop.sync();
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+template<typename T>
+NCCL_DEVICE_INLINE void ncclLLA2ASession<Coop>::send(int peer, int elt, T data) {
+  using nccl::utility::divUp;
+  union { T tmp; uint32_t u32[divUp(sizeof(T), 8)][2]; };
+  tmp = data;
+  uint4* buf = (uint4*)ncclGetResourceBufferPeerPointer(this->comm, this->handle.bufHandle, this->team, peer);
+  buf += this->slotsOffset + elt;
+  #pragma unroll
+  for (int u=0; u < divUp(sizeof(T), 8); u++) {
+    asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" ::
+      "l"(buf + u*this->pitch),
+      "r"(u32[u][0]), "r"(u32[u][1]), "r"(this->epoch)
+    );
+  }
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+template<typename T>
+NCCL_DEVICE_INLINE void ncclLLA2ASession<Coop>::bcast(int elt, T data) {
+  using nccl::utility::divUp;
+  if (this->multimem) {
+    union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
+    tmp = data;
+    uint4* bufmc = (uint4*)ncclGetResourceBufferMultimemPointer(this->comm, this->handle.bufHandle, this->mmHandle);
+    bufmc += this->slotsOffset + elt;
+    #pragma unroll
+    for (int u=0; u < divUp(sizeof(T), 8); u++) {
+      asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" ::
+        "l"(bufmc + this->pitch*u),
+        "r"(u32[u][0]), "r"(u32[u][1]), "r"(this->epoch)
+      );
+    }
+  } else {
+    union { T tmp; uint32_t u32[divUp(sizeof(T), 8)][2]; };
+    tmp = data;
+    int dr = 0;
+    int r = this->team.rank;
+    #pragma unroll 1
+    for (; dr+8 <= this->team.nRanks; dr += 8) {
+      #pragma unroll
+      for (int ur=0; ur < 8; ur++) {
+        uint4* buf = (uint4*)ncclGetResourceBufferPeerPointer(this->comm, this->handle.bufHandle, this->team, r);
+        buf += this->slotsOffset + elt;
+        #pragma unroll
+        for (int u=0; u < divUp(sizeof(T),8); u++) {
+          asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" ::
+            "l"(buf + u*this->pitch),
+            "r"(u32[u][0]), "r"(u32[u][1]), "r"(this->epoch)
+          );
+        }
+        r += 1;
+        if (r == this->team.nRanks) r = 0;
+      }
+    }
+    #pragma unroll
+    for (int ur=0; ur < 8; ur++, dr++) {
+      if (dr == this->team.nRanks) break;
+      uint4* buf = (uint4*)ncclGetResourceBufferPeerPointer(this->comm, this->handle.bufHandle, this->team, r);
+      buf += this->slotsOffset + elt;
+      #pragma unroll
+      for (int u=0; u < divUp(sizeof(T),8); u++) {
+        asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" ::
+          "l"(buf + u*this->pitch),
+          "r"(u32[u][0]), "r"(u32[u][1]), "r"(this->epoch)
+        );
+      }
+      r += 1;
+      if (r == this->team.nRanks) r = 0;
+    }
+  }
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+template<typename T>
+NCCL_DEVICE_INLINE T ncclLLA2ASession<Coop>::recv(int elt) {
+  T ret[1];
+  this->template recvUnrolled</*MinEltCount=*/1, /*MaxEltCount=*/1>(elt, 1, 0, ret);
+  return ret[0];
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+template<int MinEltCount, int MaxEltCount, typename T>
+NCCL_DEVICE_INLINE void ncclLLA2ASession<Coop>::recvUnrolled(int eltStart, int eltCount, int eltStride, T(&elts)[MaxEltCount]) {
+  using nccl::utility::divUp;
+  uint4* buf = (uint4*)ncclGetResourceBufferLocalPointer(this->comm, this->handle.bufHandle);
+  buf += this->slotsOffset + eltStart;
+
+  uint4 tmp[MaxEltCount][divUp(sizeof(T), 8)];
+  #pragma unroll 1
+  while (true) {
+    #pragma unroll
+    for (int u=0; u < MaxEltCount; u++) {
+      if (u < MinEltCount || u < eltCount) {
+        #pragma unroll
+        for (int v=0; v < divUp(sizeof(T), 8); v++) {
+          asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];"
+            : "=r"(tmp[u][v].x), "=r"(tmp[u][v].y), "=r"(tmp[u][v].z), "=r"(tmp[u][v].w)
+            : "l"(buf + u*eltStride + v*this->pitch));
+        }
+      }
+    }
+    bool okAll = true;
+    #pragma unroll
+    for (int u=0; u < MaxEltCount; u++) {
+      #pragma unroll
+      for (int v=0; v < divUp(sizeof(T), 8); v++) {
+        if (u < MinEltCount || u < eltCount) {
+          bool ok = tmp[u][v].y == this->epoch &&
+                    tmp[u][v].w == this->epoch;
+          okAll &= ok;
+        }
+      }
+    }
+    if (__builtin_expect(okAll, true)) break;
+  }
+
+  #pragma unroll
+  for (int u=0; u < MaxEltCount; u++) {
+    if (MinEltCount <= u && u == eltCount) break;
+    union { T val; uint32_t u32[divUp(sizeof(T), 8)][2]; };
+    #pragma unroll
+    for (int v=0; v < divUp(sizeof(T), 8); v++) {
+      u32[v][0] = tmp[u][v].x;
+      u32[v][1] = tmp[u][v].z;
+    }
+    elts[u] = val;
+  }
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+template<int Unroll, typename Elt, typename EltToAcc, typename Reduce>
+NCCL_DEVICE_INLINE auto ncclLLA2ASession<Coop>::recvReduce(
+    int eltStart, int eltCount, int eltStride, EltToAcc eltToAcc, Reduce reduce
+  ) -> decltype(eltToAcc(nccl::utility::declval<Elt>())) {
+  using Acc = decltype(eltToAcc(nccl::utility::declval<Elt>()));
+  Acc acc;
+  int i = 0;
+  #pragma unroll 1
+  for (; i+Unroll <= eltCount; i += Unroll) {
+    Elt got[Unroll];
+    this->template recvUnrolled</*Min=*/Unroll>(eltStart + i*eltStride, Unroll, eltStride, got);
+    Acc acc0 = eltToAcc(got[0]);
+    acc = i==0 ? acc0 : reduce(acc, acc0);
+    #pragma unroll
+    for (int j=1; j < Unroll; j++) acc = reduce(acc, eltToAcc(got[j]));
+  }
+  if (i < eltCount) {
+    Elt got[Unroll];
+    this->template recvUnrolled</*Min=*/1>(eltStart + i*eltStride, eltCount-i, eltStride, got);
+    Acc acc0 = eltToAcc(got[0]);
+    acc = i==0 ? acc0 : reduce(acc, acc0);
+    #pragma unroll
+    for (int j=1; j < Unroll-1; j++) {
+      if (i+j < eltCount) acc = reduce(acc, eltToAcc(got[j]));
+    }
+  }
+  return acc;
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE void ncclLLA2ASession<Coop>::endEpoch(Coop) {
+  if (__builtin_expect(this->epoch >= -2u, false)) {
+    this->coop.sync();
+    uint4* buf = (uint4*)ncclGetResourceBufferLocalPointer(this->comm, this->handle.bufHandle);
+    buf += this->slotsOffset;
+    #pragma unroll 4
+    for (int i=this->coop.thread_rank(); i < this->handle.nSlots; i += this->coop.size()) {
+      buf[i] = uint4{0, 0, 0, 0};
+    }
+  }
+  this->coop.sync();
+  this->epoch += (this->epoch == -1u) ? 3 : 1; // wraps 0xFFFFFFFF to 2, never reusing epochs 0 and 1
+  this->slotsOffset = this->calcSlotOffset();
+}
+#endif
+
+#endif // _NCCL_DEVICE_LL_A2A__FUNCS_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/ll_a2a__types.h b/projects/rccl/src/include/nccl_device/impl/ll_a2a__types.h
new file mode 100644
index 0000000000..501777acff
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/ll_a2a__types.h
@@ -0,0 +1,37 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_LL_A2A__TYPES_H_
+#define _NCCL_DEVICE_LL_A2A__TYPES_H_
+#include "../ll_a2a.h"
+#include "core__types.h"
+
+struct ncclLLA2AHandle {
+  ncclDevResourceHandle_t bufHandle;
+  uint32_t nSlots;
+};
+
+#if __CUDACC__
+template<typename Coop>
+struct ncclLLA2ASession_internal {
+  Coop coop;
+  ncclDevComm const& comm;
+  ncclTeam team;
+  ncclLLA2AHandle handle;
+  int block;
+  int pitch;
+  bool multimem;
+  ncclMultimemHandle mmHandle;
+  uint32_t epoch;
+  uint32_t slotsOffset;
+
+  NCCL_DEVICE_INLINE uint32_t calcSlotOffset() const {
+    return block*(1 + 2*handle.nSlots) + 1 + (epoch & 1)*handle.nSlots; // +1 skips the block's epoch line; epoch parity picks one of two slot banks
+  }
+};
+#endif
+
+#endif // _NCCL_DEVICE_LL_A2A__TYPES_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/mem_barrier__funcs.h b/projects/rccl/src/include/nccl_device/impl/mem_barrier__funcs.h
new file mode 100644
index 0000000000..86a5d0fbc1
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/mem_barrier__funcs.h
@@ -0,0 +1,126 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_MEM_BARRIER__FUNCS_H_
+#define _NCCL_DEVICE_MEM_BARRIER__FUNCS_H_
+#include "mem_barrier__types.h"
+#include "comm__types.h"
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE ncclLsaBarrierSession<Coop>::ncclLsaBarrierSession(
+    Coop coop, ncclDevComm const& comm, ncclTeam team,
+    ncclLsaBarrierHandle handle, uint32_t index,
+    bool multimem, ncclMultimemHandle mmHandle
+  ):
+  ncclLsaBarrierSession_internal<Coop>{
+    coop, comm, team, handle, (int)index,
+#if CUDART_VERSION >= 12060
+    multimem,
+#else // Workaround (WAR) for an issue with ptxas in CTK < 12.6
+    /*multimem=*/false,
+#endif
+    mmHandle, /*epoch=*/0
+  } {
+  uint32_t* state = (uint32_t*)ncclGetResourceBufferLocalPointer(comm, handle.bufHandle);
+  this->epoch = state[(this->multimem ? 0 : 1)*this->handle.nBarriers + this->index];
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE ncclLsaBarrierSession<Coop>::ncclLsaBarrierSession(
+    Coop coop, ncclDevComm const& comm, ncclTeamTagLsa, uint32_t index, bool multimem
+  ): ncclLsaBarrierSession(
+    coop, comm, ncclTeamLsa(comm), comm.lsaBarrier, index, multimem, comm.lsaMultimem
+  ) {
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE ncclLsaBarrierSession<Coop>::~ncclLsaBarrierSession() {
+  uint32_t* state = (uint32_t*)ncclGetResourceBufferLocalPointer(this->comm, this->handle.bufHandle);
+  if (this->coop.thread_rank() == 0) {
+#if __CUDA_ARCH__ == 1200 && CUDART_VERSION < 13000
+    // Workaround (WAR) for a compiler issue with CTK < 13.0
+    if (this->index == 0)
+      state[(this->multimem ? 0 : 1)*this->handle.nBarriers] = this->epoch;
+    else
+#endif
+    state[(this->multimem ? 0 : 1)*this->handle.nBarriers + this->index] = this->epoch;
+  }
+  this->coop.sync();
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE void ncclLsaBarrierSession<Coop>::arrive(Coop, cuda::memory_order order) {
+  this->coop.sync();
+  if (this->multimem) {
+  #if __CUDA_ARCH__ >= 900
+    if (this->coop.thread_rank() == 0) {
+      uint32_t* inbox = this->mcInbox(/*multimem=*/true);
+      if (nccl::utility::releaseOrderOf(order) != cuda::memory_order_relaxed) {
+        asm volatile("multimem.red.release.sys.add.u32 [%0],1;" :: "l"(inbox));
+      } else {
+        asm volatile("multimem.red.relaxed.sys.add.u32 [%0],1;" :: "l"(inbox));
+      }
+    }
+  #endif
+  } else {
+    #pragma unroll 1
+    for (int i = this->coop.thread_rank(); i < this->team.nRanks-1; i += this->coop.size()) {
+      int peer = i + (this->team.rank <= i ? 1 : 0);
+      cuda::atomic_ref<uint32_t> inbox(*this->ucInbox(peer, this->team.rank));
+      inbox.store(this->epoch+1, nccl::utility::releaseOrderOf(order));
+    }
+  }
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE void ncclLsaBarrierSession<Coop>::wait(Coop, cuda::memory_order order) {
+  if (this->multimem) {
+  #if __CUDA_ARCH__ >= 900
+    if (this->coop.thread_rank() == 0) {
+      cuda::atomic_ref<uint32_t> inbox(*this->mcInbox(/*multimem=*/false));
+      #pragma unroll 1
+      while (true) {
+        uint32_t got = inbox.load(nccl::utility::acquireOrderOf(order));
+        if (got - (this->epoch + this->team.nRanks) <= uint32_t(-1)>>1) break; // wraparound-safe test for got >= epoch + nRanks
+      }
+      this->epoch += this->team.nRanks;
+    }
+  #endif
+  } else {
+    #pragma unroll 1
+    for (int i = this->coop.thread_rank(); i < this->team.nRanks-1; i += this->coop.size()) {
+      int peer = i + (this->team.rank <= i ? 1 : 0);
+      cuda::atomic_ref<uint32_t> inbox(*this->ucInbox(this->team.rank, peer));
+      #pragma unroll 1
+      while (true) {
+        uint32_t got = inbox.load(nccl::utility::acquireOrderOf(order));
+        if (got - (this->epoch + 1) <= uint32_t(-1)>>1) break; // wraparound-safe test for got >= epoch + 1
+      }
+    }
+    this->epoch += 1;
+  }
+  this->coop.sync();
+}
+#endif
+
+#if __CUDACC__
+template<typename Coop>
+NCCL_DEVICE_INLINE void ncclLsaBarrierSession<Coop>::sync(Coop coop, cuda::memory_order order) {
+  this->arrive(coop, order);
+  this->wait(coop, order);
+}
+#endif
+
+#endif // _NCCL_DEVICE_MEM_BARRIER__FUNCS_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/mem_barrier__types.h b/projects/rccl/src/include/nccl_device/impl/mem_barrier__types.h
new file mode 100644
index 0000000000..8498cd6ba7
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/mem_barrier__types.h
@@ -0,0 +1,46 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_MEM_BARRIER__TYPES_H_
+#define _NCCL_DEVICE_MEM_BARRIER__TYPES_H_
+#include "../mem_barrier.h"
+#include "core__types.h"
+
+struct ncclLsaBarrierHandle {
+  ncclDevResourceHandle_t bufHandle;
+  int nBarriers;
+};
+
+#if __CUDACC__
+template<typename Coop>
+struct ncclLsaBarrierSession_internal {
+  Coop coop;
+  ncclDevComm const& comm;
+  ncclTeam team;
+  ncclLsaBarrierHandle handle;
+  int index;
+  bool multimem;
+  ncclMultimemHandle mmHandle;
+  uint32_t epoch;
+
+  NCCL_DEVICE_INLINE uint32_t* mcInbox(bool multimem) {
+    uint32_t* state;
+    if (multimem) { // multicast
+      state = (uint32_t*)ncclGetResourceBufferMultimemPointer(comm, handle.bufHandle, mmHandle);
+    } else { // unicast
+      state = (uint32_t*)ncclGetResourceBufferLocalPointer(comm, handle.bufHandle);
+    }
+    return state + 2*handle.nBarriers + index;
+  }
+
+  NCCL_DEVICE_INLINE uint32_t* ucInbox(int owner, int peer) {
+    uint32_t* state = (uint32_t*)ncclGetResourceBufferPeerPointer(comm, handle.bufHandle, team, owner);
+    return state + 3*handle.nBarriers + index*team.nRanks + peer;
+  }
+};
+#endif
+
+#endif // _NCCL_DEVICE_MEM_BARRIER__TYPES_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/ptr__funcs.h b/projects/rccl/src/include/nccl_device/impl/ptr__funcs.h
new file mode 100644
index 0000000000..ef33634e43
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/ptr__funcs.h
@@ -0,0 +1,157 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_PTR__FUNCS_H_
+#define _NCCL_DEVICE_PTR__FUNCS_H_
+#include "ptr__types.h"
+#include "core__funcs.h"
+#include "comm__types.h"
+
+#if __cplusplus
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE constexpr ncclSymPtr<T>::ncclSymPtr(ncclWindow_t window, size_t offset):
+  window(window), offset(offset) {
+}
+
+template<typename T>
+template<typename U>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>::operator ncclSymPtr<U>() const {
+  return {window, offset};
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator+=(int d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) + d);
+  return *this;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator+=(unsigned int d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) + d);
+  return *this;
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator+=(long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) + d);
+  return *this;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator+=(unsigned long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) + d);
+  return *this;
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator+=(long long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) + d);
+  return *this;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator+=(unsigned long long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) + d);
+  return *this;
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator-=(int d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) - d);
+  return *this;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator-=(unsigned int d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) - d);
+  return *this;
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator-=(long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) - d);
+  return *this;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator-=(unsigned long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) - d);
+  return *this;
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator-=(long long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) - d);
+  return *this;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& ncclSymPtr<T>::operator-=(unsigned long long d) {
+  offset = reinterpret_cast<size_t>(reinterpret_cast<T*>(offset) - d);
+  return *this;
+}
+
+#if __CUDACC__
+template<typename T>
+NCCL_DEVICE_INLINE T* ncclSymPtr<T>::localPtr() const {
+  return (T*)ncclGetLocalPointer(window, offset);
+}
+#endif
+
+#if __CUDACC__
+template<typename T>
+NCCL_DEVICE_INLINE T* ncclSymPtr<T>::lsaPtr(int peer) const {
+  return (T*)ncclGetLsaPointer(window, offset, peer);
+}
+#endif
+
+#if __CUDACC__
+template<typename T>
+NCCL_DEVICE_INLINE T* ncclSymPtr<T>::peerPtr(int peer) const {
+  return (T*)ncclGetPeerPointer(window, offset, peer);
+}
+#endif
+
+#if __CUDACC__
+template<typename T>
+NCCL_DEVICE_INLINE T* ncclSymPtr<T>::peerPtr(ncclTeam team, int peer) const {
+  return (T*)ncclGetPeerPointer(window, offset, team, peer);
+}
+#endif
+
+#if __CUDACC__
+template<typename T>
+NCCL_DEVICE_INLINE T* ncclSymPtr<T>::multimemPtr(ncclMultimemHandle mmHandle) const {
+  return (T*)ncclGetMultimemPointer(window, offset, mmHandle);
+}
+#endif
+
+#if __CUDACC__
+template<typename T>
+NCCL_DEVICE_INLINE T* ncclSymPtr<T>::lsaMultimemPtr(ncclDevComm const& comm) const {
+  return (T*)ncclGetLsaMultimemPointer(window, offset, comm);
+}
+#endif
+
+template<typename T, typename Int>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T> operator+(ncclSymPtr<T> p, Int d) {
+  return p += d;
+}
+template<typename T, typename Int>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T> operator-(ncclSymPtr<T> p, Int d) {
+  return p -= d;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ptrdiff_t operator-(ncclSymPtr<T> a, ncclSymPtr<T> b) {
+  return reinterpret_cast<T*>(a.offset) - reinterpret_cast<T*>(b.offset);
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE bool operator==(ncclSymPtr<T> a, ncclSymPtr<T> b) {
+  return a.window == b.window && a.offset == b.offset;
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE bool operator!=(ncclSymPtr<T> a, ncclSymPtr<T> b) {
+  return a.window != b.window || a.offset != b.offset;
+}
+
+#endif // __cplusplus
+#endif // _NCCL_DEVICE_PTR__FUNCS_H_
diff --git a/projects/rccl/src/include/nccl_device/impl/ptr__types.h b/projects/rccl/src/include/nccl_device/impl/ptr__types.h
new file mode 100644
index 0000000000..3f9a1a0f87
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/impl/ptr__types.h
@@ -0,0 +1,11 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_PTR__TYPES_H_
+#define _NCCL_DEVICE_PTR__TYPES_H_
+#include "../ptr.h"
+#include "core__types.h"
+#endif // _NCCL_DEVICE_PTR__TYPES_H_
diff --git a/projects/rccl/src/include/nccl_device/ll_a2a.h b/projects/rccl/src/include/nccl_device/ll_a2a.h
new file mode 100644
index 0000000000..db3a517b75
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/ll_a2a.h
@@ -0,0 +1,53 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_LL_A2A_H_
+#define _NCCL_DEVICE_LL_A2A_H_
+#include "impl/core__types.h"
+
+struct ncclLLA2AHandle;
+
+NCCL_EXTERN_C __host__ int ncclLLA2ACalcSlots(int maxElts, int maxEltSize);
+
+NCCL_EXTERN_C __host__ ncclResult_t ncclLLA2ACreateRequirement(int nBlocks, int nSlots, ncclLLA2AHandle_t* outHandle, ncclDevResourceRequirements_t* outReq);
+
+#if __CUDACC__
+template<typename Coop>
+struct ncclLLA2ASession_internal;
+
+template<typename Coop>
+struct ncclLLA2ASession: ncclLLA2ASession_internal<Coop> {
+  NCCL_DEVICE_INLINE ncclLLA2ASession(Coop, ncclDevComm const&, ncclTeam, ncclLLA2AHandle, uint32_t block, int maxElts, bool multimem=false, ncclMultimemHandle mmHandle={});
+
+  NCCL_DEVICE_INLINE ~ncclLLA2ASession();
+
+  ncclLLA2ASession(ncclLLA2ASession const&) = delete; // Sessions are not copyable
+  
+  template<typename T>
+  NCCL_DEVICE_INLINE void send(int peer, int slot, T data);
+
+  template<typename T>
+  NCCL_DEVICE_INLINE void bcast(int slot, T data);
+
+  template<typename T>
+  NCCL_DEVICE_INLINE T recv(int slot);
+
+  template<int MinEltCount, int MaxEltCount, typename T>
+  NCCL_DEVICE_INLINE void recvUnrolled(int eltStart, int eltCount, int eltStride, T(&vals)[MaxEltCount]);
+
+  template<int Unroll, typename Elt, typename EltToAcc, typename Reduce>
+  NCCL_DEVICE_INLINE auto recvReduce(int eltStart, int eltCount, int eltStride, EltToAcc eltToAcc, Reduce red)
+    -> decltype(eltToAcc(nccl::utility::declval<Elt>())) ;
+  
+  // End an alltoall region. For every peer in team, you must have done both of
+  // the following, each of which can be accomplished using any thread in coop:
+  //  1. Targeted that peer with at least one send().
+  //  2. Received from a slot targeted by that peer.
+  NCCL_DEVICE_INLINE void endEpoch(Coop);
+};
+#endif
+
+#endif // _NCCL_DEVICE_LL_A2A_H_
diff --git a/projects/rccl/src/include/nccl_device/mem_barrier.h b/projects/rccl/src/include/nccl_device/mem_barrier.h
new file mode 100644
index 0000000000..ea90cc6f64
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/mem_barrier.h
@@ -0,0 +1,35 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_MEM_BARRIER_H_
+#define _NCCL_DEVICE_MEM_BARRIER_H_
+#include "impl/core__types.h"
+
+struct ncclLsaBarrierHandle;
+
+NCCL_EXTERN_C __host__ ncclResult_t ncclLsaBarrierCreateRequirement(ncclTeam_t, int nBarriers, ncclLsaBarrierHandle_t* outHandle, ncclDevResourceRequirements_t* outReq);
+
+#if __CUDACC__
+template<typename Coop>
+struct ncclLsaBarrierSession_internal;
+
+template<typename Coop>
+struct ncclLsaBarrierSession: ncclLsaBarrierSession_internal<Coop> {
+  NCCL_DEVICE_INLINE ncclLsaBarrierSession(Coop, ncclDevComm const&, ncclTeam, ncclLsaBarrierHandle, uint32_t index, bool multimem=false, ncclMultimemHandle mmHandle={});
+
+  NCCL_DEVICE_INLINE ncclLsaBarrierSession(Coop, ncclDevComm const&, ncclTeamTagLsa, uint32_t index, bool multimem=false);
+
+  NCCL_DEVICE_INLINE ~ncclLsaBarrierSession();
+
+  ncclLsaBarrierSession(ncclLsaBarrierSession const&) = delete; // Sessions are not copyable
+
+  NCCL_DEVICE_INLINE void arrive(Coop, cuda::memory_order);
+  NCCL_DEVICE_INLINE void wait(Coop, cuda::memory_order);
+  NCCL_DEVICE_INLINE void sync(Coop, cuda::memory_order);
+};
+#endif
+
+#endif // _NCCL_DEVICE_MEM_BARRIER_H_
diff --git a/projects/rccl/src/include/nccl_device/ptr.h b/projects/rccl/src/include/nccl_device/ptr.h
new file mode 100644
index 0000000000..4b8914c887
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/ptr.h
@@ -0,0 +1,61 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_PTR_H_
+#define _NCCL_DEVICE_PTR_H_
+#include "core.h"
+#include <stdint.h>
+
+#if __cplusplus
+template<typename T>
+struct ncclSymPtr {
+  using ElementType = T;
+  ncclWindow_t window;
+  size_t offset;
+
+  NCCL_HOST_DEVICE_INLINE constexpr ncclSymPtr(ncclWindow_t window=nullptr, size_t offset=0);
+
+  template<typename U>
+  NCCL_HOST_DEVICE_INLINE operator ncclSymPtr<U>() const;
+
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator+=(int d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator+=(unsigned int d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator+=(long d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator+=(unsigned long d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator+=(long long d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator+=(unsigned long long d);
+
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator-=(int d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator-=(unsigned int d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator-=(long d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator-=(unsigned long d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator-=(long long d);
+  NCCL_HOST_DEVICE_INLINE ncclSymPtr<T>& operator-=(unsigned long long d);
+
+  #if __CUDACC__
+  NCCL_DEVICE_INLINE T* localPtr() const;
+  NCCL_DEVICE_INLINE T* lsaPtr(int peer) const;
+  NCCL_DEVICE_INLINE T* peerPtr(int peer) const;
+  NCCL_DEVICE_INLINE T* peerPtr(ncclTeam team, int peer) const;
+  NCCL_DEVICE_INLINE T* multimemPtr(ncclMultimemHandle mmHandle) const;
+  NCCL_DEVICE_INLINE T* lsaMultimemPtr(ncclDevComm const&) const;
+  #endif
+};
+
+template<typename T, typename Int>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T> operator+(ncclSymPtr<T> p, Int d);
+template<typename T, typename Int>
+NCCL_HOST_DEVICE_INLINE ncclSymPtr<T> operator-(ncclSymPtr<T> p, Int d);
+template<typename T>
+NCCL_HOST_DEVICE_INLINE ptrdiff_t operator-(ncclSymPtr<T> a, ncclSymPtr<T> b);
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE bool operator==(ncclSymPtr<T> a, ncclSymPtr<T> b);
+template<typename T>
+NCCL_HOST_DEVICE_INLINE bool operator!=(ncclSymPtr<T> a, ncclSymPtr<T> b);
+#endif
+
+#endif
diff --git a/projects/rccl/src/include/nccl_device/utility.h b/projects/rccl/src/include/nccl_device/utility.h
new file mode 100644
index 0000000000..b98a0d973c
--- /dev/null
+++ b/projects/rccl/src/include/nccl_device/utility.h
@@ -0,0 +1,352 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef _NCCL_DEVICE_UTILITY_H_
+#define _NCCL_DEVICE_UTILITY_H_
+
+#if __CUDACC__
+  #define NCCL_DEVICE_INLINE __device__ __forceinline__
+  #define NCCL_HOST_DEVICE_INLINE __host__ __device__ __forceinline__
+#else
+  #ifndef __host__
+    #define __host__
+  #endif
+  #define NCCL_DEVICE_INLINE
+  #define NCCL_HOST_DEVICE_INLINE inline __attribute__((always_inline))
+#endif
+
+#if __cplusplus
+#define NCCL_EXTERN_C extern "C"
+#else
+#define NCCL_EXTERN_C
+#endif
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#if __CUDACC__
+#include <cuda/atomic>
+#endif
+
+#if __cplusplus
+namespace nccl {
+namespace utility {
+
+template<typename T>
+T&& declval() noexcept {
+  static_assert(sizeof(T)!=sizeof(T), "You can't evaluate declval.");
+}
+
+template<typename X, typename Y, typename Z = decltype(X()+Y())>
+NCCL_HOST_DEVICE_INLINE constexpr Z divUp(X x, Y y) {
+  return (x+y-1)/y;
+}
+
+template<typename X, typename Y, typename Z = decltype(X()+Y())>
+NCCL_HOST_DEVICE_INLINE constexpr Z roundUp(X x, Y y) {
+  return (x+y-1) - (x+y-1)%y;
+}
+template<typename X, typename Y, typename Z = decltype(X()+Y())>
+NCCL_HOST_DEVICE_INLINE constexpr Z roundDown(X x, Y y) {
+  return x - x%y;
+}
+
+// assumes second argument is a power of 2
+template<typename X, typename Y, typename Z = decltype(X()+Y())>
+NCCL_HOST_DEVICE_INLINE constexpr Z alignUp(X x, Y a) {
+  return (x + a-1) & -Z(a);
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE T* alignUp(T* x, size_t a) {
+  static_assert(sizeof(T) == 1, "Only single byte types allowed.");
+  return reinterpret_cast<T*>((reinterpret_cast<uintptr_t>(x) + a-1) & -uintptr_t(a));
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE void* alignUp(void const* x, size_t a) {
+  return reinterpret_cast<void*>((reinterpret_cast<uintptr_t>(x) + a-1) & -uintptr_t(a));
+}
+
+// assumes second argument is a power of 2
+template<typename X, typename Y, typename Z = decltype(X()+int())>
+NCCL_HOST_DEVICE_INLINE constexpr Z alignDown(X x, Y a) {
+  return x & -Z(a);
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE T* alignDown(T* x, size_t a) {
+  static_assert(sizeof(T) == 1, "Only single byte types allowed.");
+  return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(x) & -uintptr_t(a));
+}
+template<typename T>
+NCCL_HOST_DEVICE_INLINE void* alignDown(void const* x, size_t a) {
+  return reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(x) & -uintptr_t(a));
+}
+
+template<typename T>
+NCCL_HOST_DEVICE_INLINE T add4G(T base, int delta4G) {
+  union { uint32_t u32[2]; T tmp; };
+  tmp = base;
+  u32[1] += delta4G;
+  return tmp;
+}
+
+
+template<typename Int>
+NCCL_HOST_DEVICE_INLINE constexpr bool isPow2(Int x) {
+  return (x & (x-1)) == 0;
+}
+
+// Produce the reciprocal of x for use in idivFast32()/idivFast64() and friends.
+NCCL_HOST_DEVICE_INLINE constexpr uint32_t idivRcp32(uint32_t x) {
+  return uint32_t(-1)/x + isPow2(x);
+}
+NCCL_HOST_DEVICE_INLINE constexpr uint64_t idivRcp64(uint64_t x) {
+  return uint64_t(-1)/x + isPow2(x);
+}
+
+NCCL_HOST_DEVICE_INLINE uint32_t mul32hi(uint32_t a, uint32_t b) {
+#if __CUDA_ARCH__
+  return __umulhi(a, b);
+#else
+  return uint64_t(a)*b >> 32;
+#endif
+}
+NCCL_HOST_DEVICE_INLINE uint64_t mul64hi(uint64_t a, uint64_t b) {
+#if __CUDA_ARCH__
+  return __umul64hi(a, b);
+#else
+  return (uint64_t)(((unsigned __int128)a)*b >> 64);
+#endif
+}
+
+// Produce the reciprocal of x*y given their respective reciprocals. This incurs
+// no integer division on device.
+NCCL_HOST_DEVICE_INLINE uint32_t imulRcp32(uint32_t x, uint32_t xrcp, uint32_t y, uint32_t yrcp) {
+  if (xrcp == 0) return yrcp;
+  if (yrcp == 0) return xrcp;
+  uint32_t rcp = mul32hi(xrcp, yrcp);
+  uint32_t rem = -x*y*rcp;
+  if (x*y <= rem) rcp += 1;
+  return rcp;
+}
+NCCL_HOST_DEVICE_INLINE uint64_t imulRcp64(uint64_t x, uint64_t xrcp, uint64_t y, uint64_t yrcp) {
+  if (xrcp == 0) return yrcp;
+  if (yrcp == 0) return xrcp;
+  uint64_t rcp = mul64hi(xrcp, yrcp);
+  uint64_t rem = -x*y*rcp;
+  if (x*y <= rem) rcp += 1;
+  return rcp;
+}
+
+// Fast unsigned integer division where divisor has precomputed reciprocal.
+// idivFast32(x, y, idivRcp32(y)) == x/y (likewise for the 64-bit variants)
+NCCL_HOST_DEVICE_INLINE void idivmodFast32(uint32_t *quo, uint32_t *rem, uint32_t x, uint32_t y, uint32_t yrcp) {
+  uint32_t q = yrcp == 0 ? x : mul32hi(x, yrcp);
+  uint32_t r = x - y*q;
+  if (r >= y) { q += 1; r -= y; }
+  *quo = q;
+  *rem = r;
+}
+NCCL_HOST_DEVICE_INLINE void idivmodFast64(uint64_t *quo, uint64_t *rem, uint64_t x, uint64_t y, uint64_t yrcp) {
+  uint64_t q = yrcp == 0 ? x : mul64hi(x, yrcp);
+  uint64_t r = x - y*q;
+  if (r >= y) { q += 1; r -= y; }
+  *quo = q;
+  *rem = r;
+}
+
+NCCL_HOST_DEVICE_INLINE uint32_t idivFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
+  uint32_t q, r;
+  idivmodFast32(&q, &r, x, y, yrcp);
+  return q;
+}
+NCCL_HOST_DEVICE_INLINE uint64_t idivFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
+  uint64_t q, r;
+  idivmodFast64(&q, &r, x, y, yrcp);
+  return q;
+}
+
+NCCL_HOST_DEVICE_INLINE uint32_t imodFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
+  uint32_t q, r;
+  idivmodFast32(&q, &r, x, y, yrcp);
+  return r;
+}
+NCCL_HOST_DEVICE_INLINE uint64_t imodFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
+  uint64_t q, r;
+  idivmodFast64(&q, &r, x, y, yrcp);
+  return r;
+}
+
+#if __CUDACC__
+// Precomputed integer reciprocals for denominator values 1..64 inclusive.
+// Pass these to idivFast64() for fast division on the GPU.
+NCCL_DEVICE_INLINE uint64_t idivRcp64_upto64(int x) {
+  static constexpr uint64_t table[65] = {
+    idivRcp64(0x01), idivRcp64(0x01), idivRcp64(0x02), idivRcp64(0x03),
+    idivRcp64(0x04), idivRcp64(0x05), idivRcp64(0x06), idivRcp64(0x07),
+    idivRcp64(0x08), idivRcp64(0x09), idivRcp64(0x0a), idivRcp64(0x0b),
+    idivRcp64(0x0c), idivRcp64(0x0d), idivRcp64(0x0e), idivRcp64(0x0f),
+    idivRcp64(0x10), idivRcp64(0x11), idivRcp64(0x12), idivRcp64(0x13),
+    idivRcp64(0x14), idivRcp64(0x15), idivRcp64(0x16), idivRcp64(0x17),
+    idivRcp64(0x18), idivRcp64(0x19), idivRcp64(0x1a), idivRcp64(0x1b),
+    idivRcp64(0x1c), idivRcp64(0x1d), idivRcp64(0x1e), idivRcp64(0x1f),
+    idivRcp64(0x20), idivRcp64(0x21), idivRcp64(0x22), idivRcp64(0x23),
+    idivRcp64(0x24), idivRcp64(0x25), idivRcp64(0x26), idivRcp64(0x27),
+    idivRcp64(0x28), idivRcp64(0x29), idivRcp64(0x2a), idivRcp64(0x2b),
+    idivRcp64(0x2c), idivRcp64(0x2d), idivRcp64(0x2e), idivRcp64(0x2f),
+    idivRcp64(0x30), idivRcp64(0x31), idivRcp64(0x32), idivRcp64(0x33),
+    idivRcp64(0x34), idivRcp64(0x35), idivRcp64(0x36), idivRcp64(0x37),
+    idivRcp64(0x38), idivRcp64(0x39), idivRcp64(0x3a), idivRcp64(0x3b),
+    idivRcp64(0x3c), idivRcp64(0x3d), idivRcp64(0x3e), idivRcp64(0x3f),
+    idivRcp64(0x40)
+  };
+  return table[x];
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE uint32_t idivRcp32_upto64(int x) {
+  return idivRcp64_upto64(x)>>32;
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE void fenceAcquireGpu() {
+  static __device__ int dummy;
+  int tmp;
+  asm volatile("ld.acquire.gpu.s32 %0,[%1];" : "=r"(tmp) : "l"(&dummy) : "memory");
+  dummy = tmp;
+}
+NCCL_DEVICE_INLINE void fenceReleaseGpu() {
+  cuda::atomic_thread_fence(cuda::memory_order_release, cuda::thread_scope_device);
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE cuda::memory_order acquireOrderOf(cuda::memory_order ord) {
+  return ord == cuda::memory_order_release ? cuda::memory_order_relaxed :
+         ord == cuda::memory_order_acq_rel ? cuda::memory_order_acquire :
+         ord;
+}
+NCCL_DEVICE_INLINE cuda::memory_order releaseOrderOf(cuda::memory_order ord) {
+  return ord == cuda::memory_order_acquire ? cuda::memory_order_relaxed :
+         ord == cuda::memory_order_acq_rel ? cuda::memory_order_release :
+         ord;
+}
+#endif
+
+#if __CUDACC__
+NCCL_DEVICE_INLINE int lane() {
+  int ret;
+  asm("mov.u32 %0, %%laneid;" : "=r"(ret));
+  return ret;
+}
+NCCL_DEVICE_INLINE unsigned int lanemask_lt() {
+  unsigned int ret;
+  asm("mov.u32 %0, %%lanemask_lt;" : "=r"(ret));
+  return ret;
+}
+#endif
+
+#if __CUDACC__
+// Load anything, but cache like it's constant memory.
+template<typename T>
+NCCL_DEVICE_INLINE T loadConst(T const *p) {
+  if (alignof(T) == 1) {
+    union { uint8_t part[sizeof(T)]; T ret; };
+    for (int i=0; i < (int)sizeof(T); i++) part[i] = __ldg((uint8_t const*)p + i);
+    return ret;
+  } else if (alignof(T) == 2) {
+    union { uint16_t part[sizeof(T)/2]; T ret; };
+    for (int i=0; i < (int)sizeof(T)/2; i++) part[i] = __ldg((uint16_t const*)p + i);
+    return ret;
+  } else if (alignof(T) == 4) {
+    union { uint32_t part[sizeof(T)/4]; T ret; };
+    for (int i=0; i < (int)sizeof(T)/4; i++) part[i] = __ldg((uint32_t const*)p + i);
+    return ret;
+  } else if (alignof(T) == 8) {
+    union { uint64_t part[sizeof(T)/8]; T ret; };
+    for (int i=0; i < (int)sizeof(T)/8; i++) part[i] = __ldg((uint64_t const*)p + i);
+    return ret;
+  } else { // alignof(T) >= 16
+    union { ulonglong2 part[sizeof(T)/16]; T ret; };
+    for (int i=0; i < (int)sizeof(T)/16; i++) part[i] = __ldg((ulonglong2 const*)p + i);
+    return ret;
+  }
+}
+#endif
+
+////////////////////////////////////////////////////////////////////////////////
+// Optional<T>: Holds a T that may or may not be constructed. An Optional
+// constructed with a Present<Arg...> will have its T constructed via the
+// T::T(Arg...) constructor. An Optional constructed with an Absent will not
+// have its T constructed.
+
+template<int ...vals>
+struct IntSeq {};
+
+template<int n, int m, int ...i>
+struct IntSeqUpTo: IntSeqUpTo<n, m+1, i..., m> {};
+template<int n, int ...i>
+struct IntSeqUpTo<n, n, i...> { using Type = IntSeq<i...>; };
+
+// Present<Arg...>: Packs a list of arguments together to be passed to Optional<T>.
+template<typename ...Arg>
+struct Present;
+template<>
+struct Present<> {};
+template<typename H, typename ...T>
+struct Present<H, T...> {
+  H h;
+  Present<T...> t;
+
+  NCCL_HOST_DEVICE_INLINE H get(IntSeq<0>) {
+    return static_cast<H>(h);
+  }
+  template<int i>
+  NCCL_HOST_DEVICE_INLINE decltype(auto) get(IntSeq<i>) {
+    return t.get(IntSeq<i-1>{});
+  }
+};
+
+NCCL_HOST_DEVICE_INLINE Present<> present() {
+  return Present<>{};
+}
+template<typename H, typename ...T>
+NCCL_HOST_DEVICE_INLINE Present<H&&, T&&...> present(H&& h, T&& ...t) {
+  return Present<H&&, T&&...>{static_cast<H&&>(h), present(static_cast<T&&>(t)...)};
+}
+
+struct Absent {};
+
+template<typename T>
+struct Optional {
+  bool present; // Is `thing` constructed.
+  union { T thing; };
+
+  // Construct with absent thing:
+  NCCL_HOST_DEVICE_INLINE constexpr Optional(): present(false) {}
+  NCCL_HOST_DEVICE_INLINE constexpr Optional(Absent): present(false) {}
+
+  // Helper constructor
+  template<int ...i, typename ...Arg>
+  NCCL_HOST_DEVICE_INLINE Optional(Present<Arg...> args, IntSeq<i...>):
+    present(true),
+    thing{args.get(IntSeq<i>())...} {
+  }
+  // Construct with present thing:
+  template<typename ...Arg>
+  NCCL_HOST_DEVICE_INLINE Optional(Present<Arg...> args):
+    Optional(args, typename IntSeqUpTo<sizeof...(Arg), 0>::Type()) {
+  }
+
+  NCCL_HOST_DEVICE_INLINE ~Optional() {
+    if (present) thing.~T();
+  }
+};
+
+}}
+#endif // __cplusplus
+#endif
diff --git a/projects/rccl/src/include/net.h b/projects/rccl/src/include/net.h
index 552e9bcb43..f13eebb064 100644
--- a/projects/rccl/src/include/net.h
+++ b/projects/rccl/src/include/net.h
@@ -12,10 +12,16 @@
 #include "comm.h"
 #include "checks.h"
 
+#define NCCL_UNDEF_DEV_COUNT -1
+
 typedef char ncclNetHandle_t[NCCL_NET_HANDLE_MAXSIZE];
 
 ncclResult_t ncclNetInit(struct ncclComm* comm);
 ncclResult_t ncclNetFinalize(struct ncclComm* comm);
+ncclResult_t ncclNetGetDevCount(int netPluginIndex, int* nPhysDev, int* nVirtDev);
+ncclResult_t ncclNetSetVirtDevCount(int netPluginIndex, int nVirtDev);
+ncclResult_t ncclCollNetGetDevCount(int netPluginIndex, int* nPhysDev, int* nVirtDev);
+ncclResult_t ncclCollNetSetVirtDevCount(int netPluginIndex, int nVirtDev);
 
 // Test whether the current GPU support GPU Direct RDMA.
 ncclResult_t ncclGpuGdrSupport(struct ncclComm* comm, int* gdrSupport);
diff --git a/projects/rccl/src/include/net_device.h b/projects/rccl/src/include/net_device.h
index c3a79e35cd..99ae9c38bd 100644
--- a/projects/rccl/src/include/net_device.h
+++ b/projects/rccl/src/include/net_device.h
@@ -12,7 +12,7 @@
 
 // Arbitrary version number - A given NCCL build will only be compatible with a single device networking plugin
 // version. NCCL will check the supplied version number from net->getProperties() and compare to its internal version.
-#define NCCL_NET_DEVICE_UNPACK_VERSION 0x7  
+#define NCCL_NET_DEVICE_UNPACK_VERSION 0x7
 
 typedef enum {NCCL_NET_DEVICE_HOST=0, NCCL_NET_DEVICE_UNPACK=1} ncclNetDeviceType;
 
@@ -27,6 +27,7 @@ typedef struct {
 typedef ncclNetDeviceHandle_v7_t ncclNetDeviceHandle_v8_t;
 typedef ncclNetDeviceHandle_v8_t ncclNetDeviceHandle_v9_t;
 typedef ncclNetDeviceHandle_v9_t ncclNetDeviceHandle_v10_t;
-typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_t;
+typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_v11_t;
+typedef ncclNetDeviceHandle_v11_t ncclNetDeviceHandle_t;
 
 #endif
diff --git a/projects/rccl/src/include/nvmlwrap.h b/projects/rccl/src/include/nvmlwrap.h
index 72fbf9ce2a..ce8925ef91 100644
--- a/projects/rccl/src/include/nvmlwrap.h
+++ b/projects/rccl/src/include/nvmlwrap.h
@@ -253,6 +253,24 @@ typedef nvmlGpuFabricInfo_v2_t nvmlGpuFabricInfoV_t;
 */
 #define nvmlGpuFabricInfo_v2 NVML_STRUCT_VERSION(GpuFabricInfo, 2)
 
+/**
+ * Structure to store platform information (v2)
+ */
+typedef struct
+{
+    unsigned int version;                       //!< the API version number
+    unsigned char ibGuid[16];                   //!< Infiniband GUID reported by platform (for Blackwell, ibGuid is 8 bytes so indices 8-15 are zero)
+    unsigned char chassisSerialNumber[16];      //!< Serial number of the chassis containing this GPU (for Blackwell it is 13 bytes so indices 13-15 are zero)
+    unsigned char slotNumber;                   //!< The slot number in the chassis containing this GPU (includes switches)
+    unsigned char trayIndex;                    //!< The tray index within the compute slots in the chassis containing this GPU (does not include switches)
+    unsigned char hostId;                       //!< Index of the node within the slot containing this GPU
+    unsigned char peerType;                     //!< Platform indicated NVLink-peer type (e.g. switch present or not)
+    unsigned char moduleId;                     //!< ID of this GPU within the node
+} nvmlPlatformInfo_v2_t;
+
+typedef nvmlPlatformInfo_v2_t nvmlPlatformInfo_t;
+#define nvmlPlatformInfo_v2 NVML_STRUCT_VERSION(PlatformInfo, 2)
+
 /**
  * Confidential Compute Feature Status values
  */
@@ -270,6 +288,7 @@ typedef struct nvmlConfComputeSystemState_st {
  */
 #define NVML_CC_SYSTEM_MULTIGPU_NONE 0
 #define NVML_CC_SYSTEM_MULTIGPU_PROTECTED_PCIE 1
+#define NVML_CC_SYSTEM_MULTIGPU_NVLE 2
 
 /**
  * Confidential Compute System settings
@@ -303,6 +322,7 @@ extern ncclNvmlDevicePairInfo ncclNvmlDevicePairs[ncclNvmlMaxDevices][ncclNvmlMa
 struct ncclNvmlCCStatus {
     bool CCEnabled;
     bool multiGpuProtectedPCIE;
+    bool multiGpuNVLE;
 };
 
 // All ncclNvmlFoo() functions call ncclNvmlEnsureInitialized() implicitly.
@@ -320,6 +340,7 @@ ncclResult_t ncclNvmlDeviceGetCudaComputeCapability(nvmlDevice_t device, int* ma
 ncclResult_t ncclNvmlDeviceGetP2PStatus(nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuP2PCapsIndex_t p2pIndex, nvmlGpuP2PStatus_t* p2pStatus);
 ncclResult_t ncclNvmlDeviceGetFieldValues(nvmlDevice_t device, int valuesCount, nvmlFieldValue_t *values);
 ncclResult_t ncclNvmlDeviceGetGpuFabricInfoV(nvmlDevice_t device, nvmlGpuFabricInfoV_t *gpuFabricInfo);
+ncclResult_t ncclNvmlDeviceGetPlatformInfo(nvmlDevice_t device, nvmlPlatformInfo_t *platformInfo);
 ncclResult_t ncclNvmlGetCCStatus(struct ncclNvmlCCStatus *status);
 
 #endif // End include guard
diff --git a/projects/rccl/src/include/nvtx.h b/projects/rccl/src/include/nvtx.h
index de50dfe2ee..8f20be43de 100644
--- a/projects/rccl/src/include/nvtx.h
+++ b/projects/rccl/src/include/nvtx.h
@@ -9,6 +9,8 @@
 
 #include "nvtx3/nvtx3.hpp"
 
+#include "param.h"
+
 #if __cpp_constexpr >= 201304L && !defined(NVTX3_CONSTEXPR_IF_CPP14)
 #define NVTX3_CONSTEXPR_IF_CPP14 constexpr
 #else
@@ -32,15 +34,20 @@
 #define NVTX_SID_CommSplit            13
 #define NVTX_SID_CommFinalize         14
 #define NVTX_SID_CommShrink           15
+#define NVTX_SID_AlltoAll             16
+#define NVTX_SID_Gather               17
+#define NVTX_SID_Scatter              18
 // When adding new schema IDs, DO NOT re-use/overlap with the enum schema ID below!
 
 // Define static schema ID for the reduction operation.
-#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 16 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
+#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 19 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
 
 extern const nvtxDomainHandle_t ncclNvtxDomainHandle;
 
 struct nccl_domain{static constexpr char const* name{"NCCL"};};
 
+extern int64_t ncclParamNvtxDisable();
+
 /// @brief Register an NVTX payload schema for static-size payloads.
 class payload_schema {
  public:
@@ -74,6 +81,32 @@ class payload_schema {
     nullptr, 0, 0, 0, 0, nullptr};
 };
 
+class ncclOptionalNvtxScopedRange
+{
+ public:
+  void push(const nvtx3::event_attributes& attr) noexcept {
+    // pushed must not be true already, but it's too expensive to check
+    pushed = true;
+    nvtxDomainRangePushEx(nvtx3::domain::get<nccl_domain>(), attr.get());
+  }
+
+  ~ncclOptionalNvtxScopedRange() noexcept {
+    if (!pushed) {
+      return;
+    }
+    nvtxDomainRangePop(nvtx3::domain::get<nccl_domain>());
+  }
+
+  ncclOptionalNvtxScopedRange() = default;
+  ncclOptionalNvtxScopedRange(ncclOptionalNvtxScopedRange const&) = delete;
+  ncclOptionalNvtxScopedRange& operator=(ncclOptionalNvtxScopedRange const&) = delete;
+  ncclOptionalNvtxScopedRange(ncclOptionalNvtxScopedRange&&) = delete;
+  ncclOptionalNvtxScopedRange& operator=(ncclOptionalNvtxScopedRange&&) = delete;
+
+ private:
+  bool pushed = false;
+};
+
 // Convenience macro to give the payload parameters a scope.
 #define NVTX3_PAYLOAD(...) __VA_ARGS__
 
@@ -81,26 +114,43 @@ class payload_schema {
 // @param N NCCL API name without the `nccl` prefix.
 // @param T name of the used NVTX payload schema without "Schema" suffix.
 // @param P payload parameters/entries
-#define NVTX3_FUNC_WITH_PARAMS(N, T, P) \
-  constexpr uint64_t schemaId = NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START + NVTX_SID_##N; \
-  static const payload_schema schema{T##Schema, std::extent<decltype(T##Schema)>::value - 1, \
-    schemaId, sizeof(T)}; \
-  static ::nvtx3::v1::registered_string_in<nccl_domain> const nvtx3_func_name__{__func__}; \
-  const T _payload = {P}; \
-  nvtxPayloadData_t nvtx3_bpl__[] = {{schemaId, sizeof(_payload), &_payload}}; \
-  ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__, nvtx3_bpl__}; \
-  ::nvtx3::v1::scoped_range_in<nccl_domain> const nvtx3_range__{nvtx3_func_attr__};
+#define NVTX3_FUNC_WITH_PARAMS(N, T, P)                                                          \
+  ncclOptionalNvtxScopedRange nvtx3_range__;                                                     \
+  if (!ncclParamNvtxDisable())                                                                   \
+  {                                                                                              \
+    constexpr uint64_t schemaId = NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START + NVTX_SID_##N; \
+    static const payload_schema                                                                  \
+        schema{T##Schema, std::extent<decltype(T##Schema)>::value - 1, schemaId, sizeof(T)};     \
+    static ::nvtx3::v1::registered_string_in<nccl_domain> const nvtx3_func_name__{__func__};     \
+    const T _payload = {P};                                                                      \
+    nvtxPayloadData_t nvtx3_bpl__[] = {{schemaId, sizeof(_payload), &_payload}};                 \
+    ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__, nvtx3_bpl__};       \
+    nvtx3_range__.push(nvtx3_func_attr__);                                                       \
+  }
+
+#define NCCL_NVTX3_FUNC_RANGE \
+  ncclOptionalNvtxScopedRange nvtx3_range__; \
+  if (!ncclParamNvtxDisable()) { \
+    static ::nvtx3::v1::registered_string_in<nccl_domain> const nvtx3_func_name__{__func__}; \
+    static ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__}; \
+    nvtx3_range__.push(nvtx3_func_attr__); \
+  }
 
 /// @brief Creates an NVTX range with extended payload using the RAII pattern.
 /// @tparam PayloadType Data type of the payload.
 template <typename PayloadType>
-class ncclNvtxRange {
+class ncclOptionalNvtxPayloadRange {
  public:
-  explicit ncclNvtxRange(const nvtxEventAttributes_t* evtAttr) noexcept {
-    nvtxDomainRangePushEx(nvtx3::domain::get<nccl_domain>(), evtAttr);
+  void push(const nvtx3::event_attributes& attr) noexcept {
+    // pushed must not be true already, but it's too expensive to check
+    pushed = true;
+    nvtxDomainRangePushEx(nvtx3::domain::get<nccl_domain>(), attr.get());
   }
 
-  ~ncclNvtxRange() noexcept {
+  ~ncclOptionalNvtxPayloadRange() noexcept {
+    if (!pushed) {
+      return;
+    }
     if (payloadData.payload) {
       nvtxRangePopPayload(nvtx3::domain::get<nccl_domain>(), &payloadData, 1);
     } else {
@@ -113,25 +163,34 @@ class ncclNvtxRange {
     payloadData = {schemaId, sizeof(PayloadType), &payload};
   }
 
-  ncclNvtxRange() = delete;
-  ncclNvtxRange(ncclNvtxRange const&) = default;
-  ncclNvtxRange& operator=(ncclNvtxRange const&) = default;
-  ncclNvtxRange(ncclNvtxRange&&) = default;
-  ncclNvtxRange& operator=(ncclNvtxRange&&) = default;
+  ncclOptionalNvtxPayloadRange() = default;
+  ncclOptionalNvtxPayloadRange(ncclOptionalNvtxPayloadRange const&) = delete;
+  ncclOptionalNvtxPayloadRange& operator=(ncclOptionalNvtxPayloadRange const&) = delete;
+  ncclOptionalNvtxPayloadRange(ncclOptionalNvtxPayloadRange&&) = delete;
+  ncclOptionalNvtxPayloadRange& operator=(ncclOptionalNvtxPayloadRange&&) = delete;
 
   // Holds the payload data.
   PayloadType payload{};
 
+  bool isPushed() const noexcept {
+    return pushed;
+  }
+
  private:
+  bool pushed = false;
   nvtxPayloadData_t payloadData = {NVTX_PAYLOAD_ENTRY_TYPE_INVALID, 0, NULL};
 };
 
 // Create an NVTX range with the function name as the range name. Use RAII pattern.
 // @param T Type ID of the NVTX payload (pointer for variable-size payloads).
-#define NVTX3_RANGE(T) \
-  static ::nvtx3::v1::registered_string_in<nccl_domain> const nvtx3_func_name__{__func__}; \
-  ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__}; \
-  ncclNvtxRange<T> nvtx3_range__{nvtx3_func_attr__.get()};
+#define NVTX3_RANGE(T)                                                                       \
+  ncclOptionalNvtxPayloadRange<T> nvtx3_range__;                                             \
+  if (!ncclParamNvtxDisable())                                                               \
+  {                                                                                          \
+    static ::nvtx3::v1::registered_string_in<nccl_domain> const nvtx3_func_name__{__func__}; \
+    ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__};                \
+    nvtx3_range__.push(nvtx3_func_attr__);                                                   \
+  }
 
 // Add static-size payload to the NVTX range created with `NVTX3_RANGE()`,
 // which must be in this or an outer scope.
@@ -139,6 +198,9 @@ class ncclNvtxRange {
 // @param S name of the used NVTX payload schema.
 // @param P payload parameters/entries
 #define NVTX3_RANGE_ADD_PAYLOAD(N, S, P) do { \
+  if (!nvtx3_range__.isPushed()) { \
+    break; \
+  } \
   constexpr uint64_t schema_id = NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START + NVTX_SID_##N; \
   static const payload_schema schema{S, std::extent<decltype(S)>::value - 1, schema_id, \
     sizeof(nvtx3_range__.payload)}; \
diff --git a/projects/rccl/src/include/nvtx3/nvToolsExtCounters.h b/projects/rccl/src/include/nvtx3/nvToolsExtCounters.h
index 00e2b7f8f9..e24ab0e042 100644
--- a/projects/rccl/src/include/nvtx3/nvToolsExtCounters.h
+++ b/projects/rccl/src/include/nvtx3/nvToolsExtCounters.h
@@ -332,4 +332,4 @@ NVTX_DECLSPEC void NVTX_API nvtxCountersSubmitBatchEx(
 }
 #endif /* __cplusplus */
 
-#endif /* NVTOOLSEXT_COUNTERS_H */
\ No newline at end of file
+#endif /* NVTOOLSEXT_COUNTERS_H */
diff --git a/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsCounters.h b/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsCounters.h
index f97624a076..6334bfc5dc 100644
--- a/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsCounters.h
+++ b/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsCounters.h
@@ -85,4 +85,4 @@ typedef struct nvtxSemanticsCounter_v1 {
     } limits;
 } nvtxSemanticsCounter_t;
 
-#endif /* NVTX_SEMANTIC_ID_COUNTERS_V1 */
\ No newline at end of file
+#endif /* NVTX_SEMANTIC_ID_COUNTERS_V1 */
diff --git a/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsScope.h b/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsScope.h
index eed6f30959..e6d1c5f26a 100644
--- a/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsScope.h
+++ b/projects/rccl/src/include/nvtx3/nvToolsExtSemanticsScope.h
@@ -27,4 +27,4 @@ typedef struct nvtxSemanticsScope_v1
     uint64_t scopeId;
 } nvtxSemanticsScope_t;
 
-#endif /* NVTX_SEMANTIC_ID_SCOPE_V1 */
\ No newline at end of file
+#endif /* NVTX_SEMANTIC_ID_SCOPE_V1 */
diff --git a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtHelperMacros.h b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtHelperMacros.h
index 00fc817687..6fca4801b3 100644
--- a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtHelperMacros.h
+++ b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtHelperMacros.h
@@ -28,4 +28,4 @@
 #define NVTX_EXT_HELPER_UNUSED_ARGS(...) \
     NVTX_EXT_CONCAT(_NVTX_EXT_VOIDIFY, NVTX_EXT_NUM_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#endif /* NVTX_EXT_HELPER_MACROS_H */
\ No newline at end of file
+#endif /* NVTX_EXT_HELPER_MACROS_H */
diff --git a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImpl.h b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImpl.h
index 79bb0c1c5a..56dcab6929 100644
--- a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImpl.h
+++ b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImpl.h
@@ -96,4 +96,4 @@ NVTX_LINKONCE_DEFINE_GLOBAL nvtxExtGlobals1_t NVTX_VERSIONED_IDENTIFIER(nvtxExtG
 } /* extern "C" */
 #endif /* __cplusplus */
 
-#endif /* NVTX_EXT_IMPL_H */
\ No newline at end of file
+#endif /* NVTX_EXT_IMPL_H */
diff --git a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImplCounters_v1.h b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImplCounters_v1.h
index 0f6ff96679..c34fa83924 100644
--- a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImplCounters_v1.h
+++ b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtImplCounters_v1.h
@@ -145,4 +145,4 @@ NVTX_EXT_COUNTERS_IMPL_FN_V1(void, nvtxCountersSubmitBatchEx,
 } /* extern "C" */
 #endif /* __cplusplus */
 
-#endif /* NVTX_EXT_IMPL_COUNTERS_V1 */
\ No newline at end of file
+#endif /* NVTX_EXT_IMPL_COUNTERS_V1 */
diff --git a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadHelperInternal.h b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadHelperInternal.h
index 71e30bc378..9d07f5b1fa 100644
--- a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadHelperInternal.h
+++ b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadHelperInternal.h
@@ -269,4 +269,4 @@
 
 /*** END: Helper for `NVTX_PAYLOAD_STATIC_SCHEMA_{INIT,CREATE}` */
 
-#endif /* NVTX_EXT_PAYLOAD_HELPER_INTERNAL_H */
\ No newline at end of file
+#endif /* NVTX_EXT_PAYLOAD_HELPER_INTERNAL_H */
diff --git a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadTypeInfo.h b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadTypeInfo.h
index 6a30e6633a..eeb227a5a6 100644
--- a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadTypeInfo.h
+++ b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtPayloadTypeInfo.h
@@ -148,4 +148,4 @@ NVTX_EXT_PAYLOAD_VERSIONED_ID(nvtxExtPayloadTypeInfo)[NVTX_PAYLOAD_ENTRY_TYPE_IN
 };
 
 #undef nvtx_alignof
-#undef nvtx_alignof2
\ No newline at end of file
+#undef nvtx_alignof2
diff --git a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtTypes.h b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtTypes.h
index bcad095a0c..6be0ac7963 100644
--- a/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtTypes.h
+++ b/projects/rccl/src/include/nvtx3/nvtxDetail/nvtxExtTypes.h
@@ -41,4 +41,4 @@ typedef struct nvtxExtModuleInfo_t
 
 typedef int (NVTX_API * NvtxExtInitializeInjectionFunc_t)(nvtxExtModuleInfo_t* moduleInfo);
 
-#endif /* NVTXEXTTYPES_H */
\ No newline at end of file
+#endif /* NVTXEXTTYPES_H */
diff --git a/projects/rccl/src/include/nvtx_payload_schemas.h b/projects/rccl/src/include/nvtx_payload_schemas.h
index 89a41d4b50..587c1a2a4d 100644
--- a/projects/rccl/src/include/nvtx_payload_schemas.h
+++ b/projects/rccl/src/include/nvtx_payload_schemas.h
@@ -90,6 +90,13 @@ NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsAllGather, static cons
   )
 )
 
+NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsAlltoAll, static constexpr,
+  NCCL_NVTX_PAYLOAD_ENTRIES(
+    (uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr),
+    (size_t, bytes, TYPE_SIZE, nccl_nvtxMsgSizeStr)
+  )
+)
+
 NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsAllReduce, static constexpr,
   NCCL_NVTX_PAYLOAD_ENTRIES(
     (uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr),
@@ -106,6 +113,14 @@ NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsBroadcast, static cons
   )
 )
 
+NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsGather, static constexpr,
+  NCCL_NVTX_PAYLOAD_ENTRIES(
+    (uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr),
+    (size_t, bytes, TYPE_SIZE, nccl_nvtxMsgSizeStr),
+    (int, root, TYPE_INT, "Root")
+  )
+)
+
 NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsReduce, static constexpr,
   NCCL_NVTX_PAYLOAD_ENTRIES(
     (uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr),
@@ -123,6 +138,14 @@ NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsReduceScatter, static
   )
 )
 
+NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsScatter, static constexpr,
+  NCCL_NVTX_PAYLOAD_ENTRIES(
+    (uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr),
+    (size_t, bytes, TYPE_SIZE, nccl_nvtxMsgSizeStr),
+    (int, root, TYPE_INT, "Root")
+  )
+)
+
 // Used in NCCL APIs `ncclSend` and `ncclRecv`.
 NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsSendRecv, static constexpr,
   NCCL_NVTX_PAYLOAD_ENTRIES(
diff --git a/projects/rccl/src/include/plugin/nccl_net.h b/projects/rccl/src/include/plugin/nccl_net.h
index 18d1486d74..d92a21b4ea 100644
--- a/projects/rccl/src/include/plugin/nccl_net.h
+++ b/projects/rccl/src/include/plugin/nccl_net.h
@@ -16,6 +16,7 @@
 //Maximum value NCCL can accept for maxP2pBytes and maxCollBytes net properties
 #define NCCL_MAX_NET_SIZE_BYTES (1*1024*1024*1024*1024L)
 #define NCCL_NET_OPTIONAL_RECV_COMPLETION 0x1
+#define NCCL_NET_MULTI_REQUEST 0x2
 
 #define MAX_NET_SIZE (1024*1024*1024L) // Rather than send INT_MAX which is 2G-1, send a power of two.
 #define MAX_COLLNET_SIZE (512*1024*1024L) //Set for initial collent plugins when size was not dynamically queried
@@ -32,22 +33,24 @@
 #define NCCL_NET_MAX_PLUGINS 16
 #endif
 
+#define NCCL_NET_MAX_DEVS_PER_NIC 4
+
+#include "net/net_v11.h"
 #include "net/net_v10.h"
 #include "net/net_v9.h"
 #include "net/net_v8.h"
 #include "net/net_v7.h"
 #include "net/net_v6.h"
 
-typedef ncclNet_v10_t ncclNet_t;
-typedef ncclCollNet_v10_t ncclCollNet_t;
-typedef ncclNetSGE_v10_t ncclNetSGE_t;
-typedef ncclNetProperties_v10_t ncclNetProperties_t;
-typedef ncclNetVDeviceProps_v10_t ncclNetVDeviceProps_t;
-typedef ncclNetCommConfig_v10_t ncclNetCommConfig_t;
+typedef ncclNet_v11_t ncclNet_t;
+typedef ncclCollNet_v11_t ncclCollNet_t;
+typedef ncclNetSGE_v11_t ncclNetSGE_t;
+typedef ncclNetProperties_v11_t ncclNetProperties_t;
+typedef ncclNetAttr_v11_t ncclNetAttr_t;
+typedef ncclNetVDeviceProps_v11_t ncclNetVDeviceProps_t;
+typedef ncclNetCommConfig_v11_t ncclNetCommConfig_t;
 
-#define NCCL_NET_MAX_DEVS_PER_NIC NCCL_NET_MAX_DEVS_PER_NIC_V10
-
-#define NCCL_NET_PLUGIN_SYMBOL ncclNetPlugin_v10
-#define NCCL_COLLNET_PLUGIN_SYMBOL ncclCollNetPlugin_v10
+#define NCCL_NET_PLUGIN_SYMBOL ncclNetPlugin_v11
+#define NCCL_COLLNET_PLUGIN_SYMBOL ncclCollNetPlugin_v11
 
 #endif // end include guard
diff --git a/projects/rccl/src/include/plugin/nccl_profiler.h b/projects/rccl/src/include/plugin/nccl_profiler.h
index 710aac4d5a..5ce77c3ee1 100644
--- a/projects/rccl/src/include/plugin/nccl_profiler.h
+++ b/projects/rccl/src/include/plugin/nccl_profiler.h
@@ -8,14 +8,18 @@
 #define NCCL_PROFILER_H_
 
 enum {
-  ncclProfileGroup     = (1 << 0),  // group event type
-  ncclProfileColl      = (1 << 1),  // host collective call event type
-  ncclProfileP2p       = (1 << 2),  // host point-to-point call event type
-  ncclProfileProxyOp   = (1 << 3),  // proxy operation event type
-  ncclProfileProxyStep = (1 << 4),  // proxy step event type
-  ncclProfileProxyCtrl = (1 << 5),  // proxy control event type
-  ncclProfileKernelCh  = (1 << 6),  // kernel channel event type
-  ncclProfileNetPlugin = (1 << 7),  // network plugin-defined, events
+  ncclProfileGroup          = (1 << 0),  // group event type
+  ncclProfileColl           = (1 << 1),  // host collective call event type
+  ncclProfileP2p            = (1 << 2),  // host point-to-point call event type
+  ncclProfileProxyOp        = (1 << 3),  // proxy operation event type
+  ncclProfileProxyStep      = (1 << 4),  // proxy step event type
+  ncclProfileProxyCtrl      = (1 << 5),  // proxy control event type
+  ncclProfileKernelCh       = (1 << 6),  // kernel channel event type
+  ncclProfileNetPlugin      = (1 << 7),  // network plugin-defined events
+  ncclProfileGroupApi       = (1 << 8),  // Group API events
+  ncclProfileCollApi        = (1 << 9),  // Collective API events
+  ncclProfileP2pApi         = (1 << 10), // Point-to-Point API events
+  ncclProfileKernelLaunch   = (1 << 11), // Kernel launch events
 };
 
 typedef enum {
@@ -50,22 +54,28 @@ typedef enum {
 
   /* Kernel event states */
   ncclProfilerKernelChStop             = 22,
+
+  /* Group API States */
+  ncclProfilerGroupStartApiStop        = 23,
+  ncclProfilerGroupEndApiStart         = 24
 } ncclProfilerEventState_t;
 
 typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
 typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
 typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
 typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
+typedef ncclProfilerEventState_t ncclProfilerEventState_v5_t;
 
 #include <cstdint>
+#include "profiler/profiler_v5.h"
 #include "profiler/profiler_v4.h"
 #include "profiler/profiler_v3.h"
 #include "profiler/profiler_v2.h"
 #include "profiler/profiler_v1.h"
 
-typedef ncclProfiler_v4_t ncclProfiler_t;
-typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
-typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
+typedef ncclProfiler_v5_t ncclProfiler_t;
+typedef ncclProfilerEventDescr_v5_t ncclProfilerEventDescr_t;
+typedef ncclProfilerEventStateArgs_v5_t ncclProfilerEventStateArgs_t;
 
 #define NCCL_PROFILER_NET_VER_BITS  (16)
 #define NCCL_PROFILER_NET_VER_MASK  (~0U >> NCCL_PROFILER_NET_VER_BITS)
diff --git a/projects/rccl/src/include/plugin/nccl_tuner.h b/projects/rccl/src/include/plugin/nccl_tuner.h
index f2401890d5..fbd87b58f0 100644
--- a/projects/rccl/src/include/plugin/nccl_tuner.h
+++ b/projects/rccl/src/include/plugin/nccl_tuner.h
@@ -11,12 +11,49 @@
 #include "nccl.h"
 #include "nccl_common.h"
 
+#include "tuner/tuner_v5.h"
 #include "tuner/tuner_v4.h"
 #include "tuner/tuner_v3.h"
 #include "tuner/tuner_v2.h"
 
-typedef ncclTuner_v4_t ncclTuner_t;
+typedef ncclTuner_v5_t ncclTuner_t;
+typedef ncclTunerConstants_v5_t ncclTunerConstants_t;
+typedef ncclNvlDomainInfo_v5_t ncclNvlDomainInfo_t;
 
-#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v4"
+#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v5"
+
+#define NCCL_ALGO_UNDEF -1
+#define NCCL_ALGO_TREE 0
+#define NCCL_ALGO_RING 1
+#define NCCL_ALGO_COLLNET_DIRECT 2
+#define NCCL_ALGO_COLLNET_CHAIN 3
+#define NCCL_ALGO_NVLS 4
+#define NCCL_ALGO_NVLS_TREE 5
+#define NCCL_ALGO_PAT 6
+#define NCCL_NUM_ALGORITHMS NCCL_NUM_ALGORITHMS_V5 // Tree/Ring/CollNet*/NVLS*/PAT
+
+#define NCCL_PROTO_UNDEF -1
+#define NCCL_PROTO_LL 0
+#define NCCL_PROTO_LL128 1
+#define NCCL_PROTO_SIMPLE 2
+#define NCCL_NUM_PROTOCOLS NCCL_NUM_PROTOCOLS_V5 // Simple/LL/LL128
+
+#define NCCL_ALGO_PROTO_IGNORE -1.0
+
+#define NCCL_HW_NVLINK 0
+#define NCCL_HW_PCI 1
+#define NCCL_HW_NET 2
+#define NCCL_NUM_HW_LINKS NCCL_NUM_HW_LINKS_V5
+
+#define NCCL_VOLTA_COMPCAP_IDX 0
+#define NCCL_AMPERE_COMPCAP_IDX 1
+#define NCCL_HOPPER_COMPCAP_IDX 2
+#define NCCL_BLACKWELL_COMPCAP_IDX 3
+#define NCCL_NUM_COMPCAPS NCCL_NUM_COMPCAPS_V5
+
+#define NCCL_TUNING_SCALE_1NODE 0
+#define NCCL_TUNING_SCALE_2NODES 1
+#define NCCL_TUNING_SCALE_4NODES 2
+#define NCCL_NUM_TUNING_SCALES NCCL_NUM_TUNING_SCALES_V5
 
 #endif
diff --git a/projects/rccl/src/include/plugin/net/net_v10.h b/projects/rccl/src/include/plugin/net/net_v10.h
index ada6d482ef..2e9187b0a0 100644
--- a/projects/rccl/src/include/plugin/net/net_v10.h
+++ b/projects/rccl/src/include/plugin/net/net_v10.h
@@ -5,11 +5,9 @@
 #ifndef NET_V10_H_
 #define NET_V10_H_
 
-#define NCCL_NET_MAX_DEVS_PER_NIC_V10 4
-
 typedef struct {
   int ndevs;
-  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V10];
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
 } ncclNetVDeviceProps_v10_t;
 
 #define NCCL_NET_TRAFFIC_CLASS_UNDEF -1
diff --git a/projects/rccl/src/include/plugin/net/net_v11.h b/projects/rccl/src/include/plugin/net/net_v11.h
new file mode 100644
index 0000000000..68e100637e
--- /dev/null
+++ b/projects/rccl/src/include/plugin/net/net_v11.h
@@ -0,0 +1,188 @@
+/*
+ * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
+ */
+
+#ifndef NET_V11_H_
+#define NET_V11_H_
+
+typedef struct {
+  int ndevs;
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
+} ncclNetVDeviceProps_v11_t;
+
+#define NCCL_NET_TRAFFIC_CLASS_UNDEF -1
+
+typedef struct {
+  // Plugin-specific TC value
+  int trafficClass;
+} ncclNetCommConfig_v11_t;
+
+
+typedef struct {
+  char* name;                      // Used mostly for logging.
+  char* pciPath;                   // Path to the PCI device in /sys.
+  uint64_t guid;                   // Unique identifier for the NIC chip. Important for
+                                   // cards with multiple PCI functions (Physical or virtual).
+  int ptrSupport;                  // [NCCL_PTR_HOST|NCCL_PTR_CUDA|NCCL_PTR_DMABUF]
+  int regIsGlobal;                 // regMr is not tied to a particular comm
+  int forceFlush;                  // Force a flush on receives
+  int speed;                       // Port speed in Mbps.
+  int port;                        // Port number.
+  float latency;                   // Network latency
+  int maxComms;                    // Maximum number of comms we can create
+  int maxRecvs;                    // Maximum number of grouped receives.
+  ncclNetDeviceType netDeviceType; // Network offload type
+  int netDeviceVersion;            // Version number for network offload
+  ncclNetVDeviceProps_v11_t vProps;
+  size_t maxP2pBytes;              // Max transfer size for point-to-point operations
+  size_t maxCollBytes;             // Max transfer size for collective operations
+  int maxMultiRequestSize;         // Maximum number of requests supported in a single multi-request.
+} ncclNetProperties_v11_t;
+
+#define NCCL_NET_ATTR_UNDEF -1
+
+#define NCCL_NET_ATTR_INIT { \
+  { NCCL_NET_ATTR_UNDEF, NCCL_NET_ATTR_UNDEF, NCCL_NET_ATTR_UNDEF, NCCL_NET_ATTR_UNDEF }, /* sendCommAttr */ \
+  { NCCL_NET_ATTR_UNDEF, NCCL_NET_ATTR_UNDEF, NCCL_NET_ATTR_UNDEF, NCCL_NET_ATTR_UNDEF }, /* recvCommAttr */ \
+  (uint32_t)NCCL_NET_ATTR_UNDEF, /* op */ \
+  (uint32_t)NCCL_NET_ATTR_UNDEF, /* algo */ \
+  (uint32_t)NCCL_NET_ATTR_UNDEF, /* proto */ \
+}
+
+typedef struct {
+  int32_t maxConcurrentPeers;
+  int32_t minConcurrentPeers;
+  int32_t maxFlowsPerPeer;
+  int32_t minFlowsPerPeer;
+} ncclNetCommAttr_v11_t;
+
+typedef struct {
+  ncclNetCommAttr_v11_t sendCommAttr;
+  ncclNetCommAttr_v11_t recvCommAttr;
+  uint32_t op;
+  uint32_t algo;
+  uint32_t proto;
+} ncclNetAttr_v11_t;
+
+typedef struct {
+  // Name of the network (mainly for logs)
+  const char* name;
+  // Initialize the network.
+  ncclResult_t (*init)(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
+  // Return the number of adapters.
+  ncclResult_t (*devices)(int* ndev);
+  // Get various device properties.
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v11_t* props);
+  // Create a receiving object and provide a handle to connect to it. The
+  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
+  // between ranks to create a connection.
+  ncclResult_t (*listen)(void* ctx, int dev, void* handle, void** listenComm);
+  // Connect to a handle and return a sending comm object for that peer.
+  // This call must not block for the connection to be established, and instead
+  // should return successfully with sendComm == NULL with the expectation that
+  // it will be called again until sendComm != NULL.
+  // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
+  ncclResult_t (*connect)(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v11_t** sendDevComm);
+  // Finalize connection establishment after remote peer has called connect.
+  // This call must not block for the connection to be established, and instead
+  // should return successfully with recvComm == NULL with the expectation that
+  // it will be called again until recvComm != NULL.
+  // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
+  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v11_t** recvDevComm);
+  // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
+  // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
+  ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
+  /* DMA-BUF support */
+  ncclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
+  ncclResult_t (*deregMr)(void* comm, void* mhandle);
+  // Asynchronous send to a peer.
+  // May return request == NULL if the call cannot be performed (or would block)
+  ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* phandle, void** request);
+  // Asynchronous recv from a peer.
+  // May return request == NULL if the call cannot be performed (or would block)
+  ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** phandles, void** request);
+  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
+  // visible to the GPU
+  ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);
+  // Test whether a request is complete. If sizes is not NULL, it returns the
+  // number of bytes sent/received.
+  ncclResult_t (*test)(void* request, int* done, int* sizes);
+  // Close and free send/recv comm objects
+  ncclResult_t (*closeSend)(void* sendComm);
+  ncclResult_t (*closeRecv)(void* recvComm);
+  ncclResult_t (*closeListen)(void* listenComm);
+
+  // Copy the given mhandle to a dptr in a format usable by this plugin's device code
+  ncclResult_t (*getDeviceMr)(void* comm, void* mhandle, void** dptr_mhandle);
+
+  // Notify the plugin that a recv has completed by the device
+  ncclResult_t (*irecvConsumed)(void* recvComm, int n, void* request);
+
+  // Virtual NIC APIs. makeVDevice will create a virtual NIC given the specified properties, and tell the caller
+  // what index this new vNIC exists at
+  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v11_t* props);
+  // Finalize the network.
+  ncclResult_t (*finalize)(void* ctx);
+
+  ncclResult_t (*setNetAttr)(void* ctx, ncclNetAttr_v11_t* netAttr);
+} ncclNet_v11_t;
+
+typedef struct {
+  void* mhandle;
+  void* address;
+  size_t size;
+} ncclNetSGE_v11_t;
+
+typedef struct {
+  // Name of the collective network (mainly for logs)
+  const char* name;
+  // Initialize the collective network.
+  ncclResult_t (*init)(void** ctx, uint64_t commId, ncclDebugLogger_t logFunction);
+  // Return the number of adapters capable of doing collective operations.
+  // If ndev returns 0, all other functions might be set to NULL.
+  ncclResult_t (*devices)(int* ndev);
+  // Get various device properties.
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v11_t* props);
+  // Create a receiving object and provide a handle to connect to it. The
+  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
+  // between ranks to create connections.
+  ncclResult_t (*listen)(void* ctx, int dev, void* handle, void** listenComm);
+  // Create a group for collective operations. handles have been created
+  // using listen() above. rank indicates caller's rank in the collective network.
+  ncclResult_t (*connect)(void* handles[], int nranks, int rank, void* listenComm, void** collComm);
+  // Returns whether a reduction operation on a data type is supported.
+  // 1 for supported, 0 otherwise.
+  ncclResult_t (*reduceSupport)(ncclDataType_t dataType, ncclRedOp_t redOp, int* supported);
+  // Register/Deregister memory. Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
+  ncclResult_t (*regMr)(void* collComm, void* data, size_t size, int type, void** mhandle);
+  /* DMA-BUF support */
+  ncclResult_t (*regMrDmaBuf)(void* collComm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
+  ncclResult_t (*deregMr)(void* collComm, void* mhandle);
+  // Performs an asynchronous allreduce operation on the collective group.
+  // May return request == NULL if the call cannot be performed (or would block).
+  ncclResult_t (*iallreduce)(void* collComm, void* sendData, void* recvData, size_t count,
+      ncclDataType_t dataType, ncclRedOp_t redOp, void* sendMhandle, void* recvMhandle, void** request);
+  ncclResult_t (*iallgather)(void* collComm, void* sendData, int nRecvParts, ncclNetSGE_v11_t* recvParts,
+                             size_t bytesPerRank, size_t windowOffset, size_t windowBytes,
+                             void* sendMhandle, void** request);
+  ncclResult_t (*ireducescatter)(void* collComm, int nSendParts, ncclNetSGE_v11_t* sendParts, void* recvData,
+                                 size_t bytesPerRank, size_t windowOffset, size_t windowBytes,
+                                 ncclDataType_t dataType, ncclRedOp_t redOp,
+                                 void* recvMhandle, void** request);
+  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
+  // visible to the GPU
+  ncclResult_t (*iflush)(void* collComm, void* data, int size, void* mhandle, void** request);
+  // Test whether a request is complete. If size is not NULL, it returns the
+  // number of bytes sent/received.
+  ncclResult_t (*test)(void* request, int* done, int* size);
+  // Close and free collective comm objects
+  ncclResult_t (*closeColl)(void* collComm);
+  ncclResult_t (*closeListen)(void* listenComm);
+
+  // Create a virtual NIC given the specified properties, which can be accessed at device index d
+  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v11_t* props);
+  // Finalize the collective network.
+  ncclResult_t (*finalize)(void* ctx);
+} ncclCollNet_v11_t;
+
+#endif // end include guard
diff --git a/projects/rccl/src/include/plugin/net/net_v9.h b/projects/rccl/src/include/plugin/net/net_v9.h
index ce9d917484..ef054bbe63 100644
--- a/projects/rccl/src/include/plugin/net/net_v9.h
+++ b/projects/rccl/src/include/plugin/net/net_v9.h
@@ -7,11 +7,9 @@
 #ifndef NET_V9_H_
 #define NET_V9_H_
 
-#define NCCL_NET_MAX_DEVS_PER_NIC_V9 4
-
 typedef struct {
   int ndevs;
-  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V9];
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
 } ncclNetVDeviceProps_v9_t;
 
 typedef struct {
diff --git a/projects/rccl/src/include/plugin/plugin.h b/projects/rccl/src/include/plugin/plugin.h
index 300e436a09..83b58e985a 100644
--- a/projects/rccl/src/include/plugin/plugin.h
+++ b/projects/rccl/src/include/plugin/plugin.h
@@ -21,4 +21,6 @@ void* ncclOpenProfilerPluginLib(const char* name);
 void* ncclGetNetPluginLib(enum ncclPluginType type);
 ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type);
 
+extern char* ncclPluginLibPaths[];
+
 #endif
diff --git a/projects/rccl/src/include/plugin/profiler/profiler_v5.h b/projects/rccl/src/include/plugin/profiler/profiler_v5.h
new file mode 100644
index 0000000000..dab1db9e1a
--- /dev/null
+++ b/projects/rccl/src/include/plugin/profiler/profiler_v5.h
@@ -0,0 +1,151 @@
+/*************************************************************************
+ * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef PROFILER_V5_H_
+#define PROFILER_V5_H_
+
+typedef struct {
+  uint64_t type;                // event type descriptor: ncclProfileColl, ...
+  void* parentObj;              // pointer to the profiler parent object (for a coll event, this is the group)
+  int rank;                     // originating rank
+  union {
+    struct {
+      bool graphCaptured;
+      int groupDepth;
+    } groupApi;
+
+    struct {
+      const char* func;
+      size_t count;
+      const char* datatype;
+      int root;
+      void* stream;
+      bool graphCaptured;
+    } collApi;
+
+    struct {
+      const char* func;
+      size_t count;
+      const char* datatype;
+      void* stream;
+      bool graphCaptured;
+    } p2pApi;
+
+    struct {
+      void* stream;
+    } kernelLaunch;
+
+    struct {
+      uint64_t seqNumber;
+      const char* func;
+      void const* sendBuff;
+      void* recvBuff;
+      size_t count;
+      int root;
+      const char* datatype;
+      uint8_t nChannels;
+      uint8_t nWarps;
+      const char* algo;
+      const char* proto;
+      void* parentGroup; // for backward compatibility with v4
+    } coll;
+
+    struct {
+      const char* func;
+      void* buff;
+      const char* datatype;
+      size_t count;
+      int peer;
+      uint8_t nChannels;
+      void* parentGroup; // for backward compatibility with v4
+    } p2p;
+
+    struct {
+      pid_t pid;                // pid of the originating process
+      uint8_t channelId;        // channel id for this proxy operation
+      int peer;                 // remote rank for send/recv
+      int nSteps;               // number of steps for this proxy operation
+      int chunkSize;            // amount of data transferred by this proxy operation
+      int isSend;
+    } proxyOp;
+
+    struct {
+      int step;
+    } proxyStep;
+
+    struct {
+      uint8_t channelId;
+      uint64_t pTimer;          // start timestamp from GPU globaltimer
+    } kernelCh;
+
+    struct {
+      int64_t id;
+      void* data;
+    } netPlugin;
+  };
+} ncclProfilerEventDescr_v5_t;
+
+typedef union {
+  struct {
+    size_t transSize;
+  } proxyStep;
+
+  struct {
+    int appendedProxyOps;
+  } proxyCtrl;
+
+  struct {
+    void* data;
+  } netPlugin;
+
+  struct {
+    uint64_t pTimer;
+  } kernelCh;
+} ncclProfilerEventStateArgs_v5_t;
+
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  //  - commId         : communicator id
+  //  - commName       : user assigned communicator name
+  //  - nNodes         : number of nodes in communicator
+  //  - nranks         : number of ranks in communicator
+  //  - rank           : rank identifier in communicator
+  //  - logfn          : logger function
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, uint64_t commId, int* eActivationMask, const char* commName, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v5_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v5_t eState, ncclProfilerEventStateArgs_v5_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v5_t;
+
+#endif
diff --git a/projects/rccl/src/include/plugin/tuner/tuner_v5.h b/projects/rccl/src/include/plugin/tuner/tuner_v5.h
new file mode 100644
index 0000000000..9e621f842e
--- /dev/null
+++ b/projects/rccl/src/include/plugin/tuner/tuner_v5.h
@@ -0,0 +1,87 @@
+/*************************************************************************
+ * Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2023, Meta Platforms, Inc. and affiliates.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef TUNER_V5_H_
+#define TUNER_V5_H_
+
+// NVL domain information struct
+typedef struct {
+  int nNvlDomains;                    // number of NVLink domains
+  int minRanksPerNvlDomain;           // minimum ranks across all NVLink domains
+  int maxRanksPerNvlDomain;           // maximum ranks across all NVLink domains
+} ncclNvlDomainInfo_v5_t;
+
+#define NCCL_NUM_ALGORITHMS_V5 7 // Tree/Ring/CollNet*/NVLS*/PAT
+#define NCCL_NUM_PROTOCOLS_V5 3 // Simple/LL/LL128
+#define NCCL_NUM_HW_LINKS_V5 3
+#define NCCL_NUM_COMPCAPS_V5 4
+#define NCCL_NUM_TUNING_SCALES_V5 3
+
+typedef struct {
+  double baseLatencies [NCCL_NUM_ALGORITHMS_V5][NCCL_NUM_PROTOCOLS_V5];
+  double hwLatencies [NCCL_NUM_HW_LINKS_V5][NCCL_NUM_ALGORITHMS_V5][NCCL_NUM_PROTOCOLS_V5];
+
+  double llMaxBws [NCCL_NUM_COMPCAPS_V5][NCCL_NUM_TUNING_SCALES_V5];
+  double perChMaxRingLL128Bws [NCCL_NUM_COMPCAPS_V5][NCCL_NUM_TUNING_SCALES_V5];
+  double perChMaxTreeLL128Bws [NCCL_NUM_COMPCAPS_V5][NCCL_NUM_TUNING_SCALES_V5];
+  double perChMaxTreeBws [NCCL_NUM_COMPCAPS_V5][NCCL_NUM_TUNING_SCALES_V5];
+  double perChMaxNVLSTreeBws [NCCL_NUM_COMPCAPS_V5][NCCL_NUM_TUNING_SCALES_V5];
+
+
+} ncclTunerConstants_v5_t;
+
+// API to be implemented by external tuner
+typedef struct {
+  // Name of the tuner
+  const char* name;
+
+  // Initializes tuner states.
+  // Inputs:
+  //   - commId: communicator identifier
+  //   - nRanks: number of ranks in current communicator. Each communicator initializes its own tuner.
+  //   - nNodes: number of nodes in current communicator.
+  //   - logFunction: a logFunction can be useful to integrate logging together with NCCL core.
+  //   - nvlDomainInfo: NVL domain information struct
+  // Outputs:
+  //   - context: tuner context object
+  // Input/Output:
+  //   - constants: tuner constants
+  ncclResult_t (*init)(void** ctx, uint64_t commId, size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction,
+                      ncclNvlDomainInfo_v5_t* nvlDomainInfo, ncclTunerConstants_v5_t* constants);
+
+  // Gets info (algo, protocol, number of ctas and threads) for a given collective.
+  // Inputs:
+  //   - context: tuner context object
+  //   - collType: collective type, e.g., allreduce, allgather…
+  //   - nBytes: collective size in bytes
+  //   - numPipeOps: number of operations in the group
+  //   - numAlgo: number of algorithms in collCostTable
+  //   - numProto: number of protocols in collCostTable
+  //   - regBuff: non-zero if the user buffer can be registered
+  //
+  // Outputs:
+  //   - nChannels: number of channels (hence SMs) to be used.
+  //
+  // InOut:
+  //   - collCostTable: collective cost table, generated by NCCL core, containing algo|proto|time entries for collType.
+  //                    NCCL core sets ignored algo/proto cost table entries to -1.0 (NCCL_ALGO_PROTO_IGNORE).
+  //
+  // If getCollInfo() does not return ncclSuccess, NCCL will fall back to the
+  // default tuning for the given collective.
+  // Also, the plugin may leave all outputs unset, or set both the algorithm
+  // and the protocol; it must not set only one of the two.
+  // Unset fields will be set automatically by NCCL.
+  ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
+                              int numPipeOps, float** collCostTable, int numAlgo, int numProto,
+                              int regBuff, int* nChannels);
+
+  // Terminates the plugin and cleans up any resources that the plugin allocated.
+  // context: tuner context object
+  ncclResult_t (*finalize)(void* context);
+} ncclTuner_v5_t;
+
+#endif
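
A minimal sketch, not part of this patch, of how an external tuner might honor the `getCollInfo` contract documented in `ncclTuner_v5_t` above. The constants `kNumAlgos`, `kNumProtos`, `kAlgoRing`, `kProtoSimple`, `kIgnore` and the function `preferRingSimple` are hypothetical stand-ins so the example compiles without NCCL headers; the Ring and Simple index values are assumptions. It assumes the cost table is a dense `[algo][proto]` array passed through a `float**` parameter, with entries disabled by NCCL core set to -1.0.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; a real plugin would take these from the NCCL headers.
constexpr int kNumAlgos = 7;      // mirrors NCCL_NUM_ALGORITHMS_V5
constexpr int kNumProtos = 3;     // mirrors NCCL_NUM_PROTOCOLS_V5
constexpr int kAlgoRing = 1;      // assumed index of the Ring algorithm
constexpr int kProtoSimple = 2;   // assumed index of the Simple protocol
constexpr float kIgnore = -1.0f;  // mirrors NCCL_ALGO_PROTO_IGNORE

// Prefers Ring/Simple by giving it the lowest cost, unless NCCL core marked
// that entry as ignored. Returns true if the table was changed. nChannels is
// deliberately left untouched so that NCCL chooses the channel count itself.
bool preferRingSimple(float** collCostTable, int numAlgo, int numProto, int* nChannels) {
  assert(collCostTable != nullptr);
  (void)nChannels;
  if (kAlgoRing >= numAlgo || kProtoSimple >= numProto) return false;
  // Assumption: the table is a dense [numAlgo][kNumProtos] float array.
  float (*table)[kNumProtos] = reinterpret_cast<float (*)[kNumProtos]>(collCostTable);
  if (table[kAlgoRing][kProtoSimple] == kIgnore) return false;
  table[kAlgoRing][kProtoSimple] = 0.0f;
  return true;
}
```

Setting the algorithm and protocol together through a single table entry matches the rule above that a plugin must not set only one of the two.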
diff --git a/projects/rccl/src/include/profiler.h b/projects/rccl/src/include/profiler.h
index 2fb6a7d38a..f7f9980b56 100644
--- a/projects/rccl/src/include/profiler.h
+++ b/projects/rccl/src/include/profiler.h
@@ -28,12 +28,48 @@ struct ncclProfilerProxy {
   struct ncclProxyConnector recvProxyConn[MAXCHANNELS];
 };
 
+enum groupApiState {
+  ncclProfilerGroupApiStartStateReset   = 0,
+  ncclProfilerGroupApiStartStateStarted = 1,
+  ncclProfilerGroupApiStartStateStopped = 2,
+};
+
+// Used by the profiler to track state for API events
+typedef struct ncclProfilerApiState {
+  int profilerGroupDepth;
+  int eActivationMask;
+  groupApiState state;
+  void *groupApiEventHandle;
+  // Tracks the latest API event handles for p2p/collectives
+  void* p2pApiEventHandle;
+  void *collApiEventHandle;
+} ncclProfilerApiState_t;
+
+extern __thread ncclProfilerApiState_t ncclProfilerApiState;
+
 extern int ncclProfilerEventMask;
 
 // Plugin Init/Finalize Wrappers
 ncclResult_t ncclProfilerPluginInit(struct ncclComm* comm);
 ncclResult_t ncclProfilerPluginFinalize(struct ncclComm* comm);
 
+// Profiler Start/Stop/Record wrappers for ncclGroupStart and ncclGroupEnd API calls
+ncclResult_t ncclProfilerStartGroupApiEvent(struct ncclInfo *info, bool isGraphCaptured);
+ncclResult_t ncclProfilerStopGroupApiEvent();
+ncclResult_t ncclProfilerRecordGroupApiEventState(ncclProfilerEventState_t eState);
+
+// Profiler Start/Stop wrappers for P2P API calls
+ncclResult_t ncclProfilerStartP2pApiEvent(struct ncclInfo *info, bool isGraphCaptured);
+ncclResult_t ncclProfilerStopP2pApiEvent();
+
+// Profiler Start/Stop wrappers for Collective API calls
+ncclResult_t ncclProfilerStartCollApiEvent(struct ncclInfo *info, bool isGraphCaptured);
+ncclResult_t ncclProfilerStopCollApiEvent();
+
+// Kernel Launch Start/Stop Event Wrappers
+ncclResult_t ncclProfilerStartKernelLaunchEvent(struct ncclKernelPlan* plan, cudaStream_t stream);
+ncclResult_t ncclProfilerStopKernelLaunchEvent(struct ncclKernelPlan* plan);
+
 // Profiler Start/Stop Group Wrappers
 ncclResult_t ncclProfilerStartGroupEvent(struct ncclKernelPlan* plan);
 ncclResult_t ncclProfilerStopGroupEvent(struct ncclKernelPlan* plan);
diff --git a/projects/rccl/src/include/proxy.h b/projects/rccl/src/include/proxy.h
index 772aa206c4..4613ada49d 100644
--- a/projects/rccl/src/include/proxy.h
+++ b/projects/rccl/src/include/proxy.h
@@ -69,6 +69,7 @@ struct ncclProxyOp {
   uint8_t /*ncclDataType_t*/ dtype;
   uint8_t /*ncclDevRedOp_t*/ redOp;
   uint8_t /*ncclFunc_t*/ coll;
+  uint8_t /*ncclFunc_t*/ collAPI;
   uint8_t /*ncclPattern_t*/ pattern;
   uint8_t protocol;
   uint8_t algorithm;
@@ -81,6 +82,8 @@ struct ncclProxyOp {
   int isOneRPN;
   RingAlgorithm *ringAlgo;
   union ncclProxyOpSpecifics specifics;
+  int nChannels;
+  int nPeers;
 
   // Profiler plugin
   union {
@@ -175,11 +178,14 @@ struct ncclProxyArgs {
   uint8_t /*ncclDevRedOp_t*/ redOp;
   uint8_t /*ncclPattern_t*/ pattern;
   uint8_t /*ncclFunc_t*/ coll;
+  uint8_t /*ncclFunc_t*/ collAPI;
   uint8_t protocol;
   uint8_t algorithm;
   int state;
   char* sharedBuff[NCCL_STEPS];
   int sharedSize[NCCL_STEPS];
+  int nChannels;
+  int nPeers;
 
   int idle;
 
@@ -338,6 +344,11 @@ struct ncclProxyState {
   // Progress thread
   struct ncclProxyProgressState progressState;
 
+  // Network plugin
+  void* netContext;
+  ncclNetAttr_t netAttr;
+  void* collNetContext;
+
   // Profiler plugin
   void* profilerContext;
 
diff --git a/projects/rccl/src/include/register.h b/projects/rccl/src/include/register.h
index 231cbfc34f..edfc722dee 100644
--- a/projects/rccl/src/include/register.h
+++ b/projects/rccl/src/include/register.h
@@ -1,3 +1,9 @@
+/*************************************************************************
+ * Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #ifndef NCCL_REGISTER_H_
 #define NCCL_REGISTER_H_
 
@@ -29,15 +35,6 @@ struct ncclRegNetHandles {
   struct ncclRegNetHandles* next;
 };
 
-struct ncclSymRegTask {
-  struct ncclSymRegTask *next;
-  void* buff;
-  size_t baseSize;
-  CUmemGenericAllocationHandle memHandle;
-  struct ncclReg* regHandle;
-  size_t alignment;
-};
-
 struct ncclReg {
   // common attributes
   uintptr_t begAddr, endAddr; // page aligned
@@ -58,10 +55,6 @@ struct ncclReg {
   // general ipc reg
   struct ncclPeerRegIpcAddr regIpcAddrs;
   struct ncclIpcRegInfo* ipcInfos[NCCL_MAX_LOCAL_RANKS];
-  // symmetric reg
-  void* baseSymPtr;
-  size_t symSize;
-  int winFlags;
 };
 
 struct ncclRegCache {
@@ -70,14 +63,9 @@ struct ncclRegCache {
   uintptr_t pageSize;
 };
 
-struct ncclWindow {
-  struct ncclReg* handle;
-};
-
 ncclResult_t ncclRegCleanup(struct ncclComm* comm);
 ncclResult_t ncclCommGraphRegister(const ncclComm_t comm, void* buff, size_t size, void** handle);
 ncclResult_t ncclCommGraphDeregister(const ncclComm_t comm, struct ncclReg *handle);
 ncclResult_t ncclRegLocalIsValid(struct ncclReg *reg, bool *isValid);
-ncclResult_t ncclCommSymmetricRegisterInternal(struct ncclComm* comm, void* buff, size_t baseSize, size_t alignment, CUmemGenericAllocationHandle memHandle, struct ncclReg* regHandle);
 
 #endif
diff --git a/projects/rccl/src/include/register_inline.h b/projects/rccl/src/include/register_inline.h
index fb7641b131..76181c4ac5 100644
--- a/projects/rccl/src/include/register_inline.h
+++ b/projects/rccl/src/include/register_inline.h
@@ -1,3 +1,9 @@
+/*************************************************************************
+ * Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #ifndef NCCL_REGISTER_INLINE_H_
 #define NCCL_REGISTER_INLINE_H_
 
@@ -18,16 +24,5 @@ static inline ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data,
   }
 }
 
-static inline ncclResult_t ncclRegFindSymmetric(struct ncclComm* comm, const void* data, size_t size, void** symPtr, struct ncclReg** outReg) {
-  struct ncclReg* regRecord = NULL;
-  *symPtr = NULL;
-  *outReg = NULL;
-  NCCLCHECK(ncclRegFind(comm, data, size, &regRecord));
-  if (regRecord && regRecord->baseSymPtr) {
-    *symPtr = (void*)((uintptr_t)regRecord->baseSymPtr + (uintptr_t)data - (uintptr_t)regRecord->begAddr);
-    *outReg = regRecord;
-  }
-  return ncclSuccess;
-}
 
 #endif
diff --git a/projects/rccl/src/include/scheduler.h b/projects/rccl/src/include/scheduler.h
new file mode 100644
index 0000000000..9ee9bb2323
--- /dev/null
+++ b/projects/rccl/src/include/scheduler.h
@@ -0,0 +1,17 @@
+/*************************************************************************
+ * Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef NCCL_SCHEDULER_H_
+#define NCCL_SCHEDULER_H_
+
+#include "nccl.h"
+#include "comm.h"
+#include "sym_kernels.h"
+
+ncclResult_t ncclMakeSymmetricTaskList(struct ncclComm* comm, struct ncclTaskColl* task, struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next>* symTaskQueue, struct ncclTaskColl** remainTasksHead);
+ncclResult_t ncclSymmetricTaskScheduler(struct ncclComm* comm, struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next>* symTaskQueue, struct ncclKernelPlan* plan);
+
+#endif // NCCL_SCHEDULER_H_
diff --git a/projects/rccl/src/include/shm.h b/projects/rccl/src/include/shm.h
index 223d873465..b944241a4e 100644
--- a/projects/rccl/src/include/shm.h
+++ b/projects/rccl/src/include/shm.h
@@ -1,3 +1,9 @@
+/*************************************************************************
+ * Copyright (c) 2016-2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #ifndef NCCL_SHM_H_
 #define NCCL_SHM_H_
 
diff --git a/projects/rccl/src/include/shmutils.h b/projects/rccl/src/include/shmutils.h
index 097b4c6577..199b9f7175 100644
--- a/projects/rccl/src/include/shmutils.h
+++ b/projects/rccl/src/include/shmutils.h
@@ -15,8 +15,8 @@ ncclResult_t ncclShmClose(ncclShmHandle_t handle);
 ncclResult_t ncclShmUnlink(ncclShmHandle_t handle);
 
 struct ncclShmemCollBuff {
-  volatile size_t *cnt[2];
-  volatile void *ptr[2];
+  size_t *cnt[2];
+  void *ptr[2];
   int round;
   size_t maxTypeSize;
 };
diff --git a/projects/rccl/src/include/sym_kernels.h b/projects/rccl/src/include/sym_kernels.h
new file mode 100644
index 0000000000..4e742eff7b
--- /dev/null
+++ b/projects/rccl/src/include/sym_kernels.h
@@ -0,0 +1,112 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef NCCL_SYM_KERNELS_H_
+#define NCCL_SYM_KERNELS_H_
+#include "nccl.h"
+#include "nccl_device.h"
+#include "nccl_common.h"
+#include "device.h"
+
+////////////////////////////////////////////////////////////////////////////////
+// ncclSymk[Foo]: Kernels built on the device API
+
+#define NCCL_SYM_KERNEL_CELL_SIZE 1024 // cell size in bytes; must be at least 16
+
+constexpr int ncclSymkMaxBlocks = 64;
+constexpr int ncclSymkMaxThreads = 512;
+constexpr int ncclSymkLLMaxEltSize = 8;
+
+constexpr __host__ __device__ int ncclSymkLLMaxSlots(int eltSize = ncclSymkLLMaxEltSize) {
+  return ncclSymkMaxThreads*ncclSymkLLMaxEltSize/eltSize;
+}
+
+enum ncclSymkKernelId {
+  ncclSymkKernelId_AllReduce_AGxLL_R,
+  ncclSymkKernelId_AllReduce_AGxLLMC_R,
+  ncclSymkKernelId_AllReduce_RSxLD_AGxST,
+  ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC,
+  ncclSymkKernelId_AllReduce_RSxNet_ARxMC_AGxNet,
+
+  ncclSymkKernelId_AllGather_LL,
+  ncclSymkKernelId_AllGather_LLMC,
+  ncclSymkKernelId_AllGather_ST,
+  ncclSymkKernelId_AllGather_STMC,
+
+  ncclSymkKernelId_ReduceScatter_LL,
+  ncclSymkKernelId_ReduceScatter_LD,
+  ncclSymkKernelId_ReduceScatter_LDMC,
+
+  ncclSymkKernelId_Count
+};
+
+struct ncclSymkDevComm {
+  struct ncclDevComm devComm;
+  struct ncclLLA2AHandle lsaLLA2A;
+};
+
+struct ncclSymkState {
+  bool initialized;
+  struct ncclSymkDevComm kcomm;
+};
+
+struct ncclSymkChannelWorkRange {
+  uint16_t workHi; // inclusive index of my ending work
+  uint16_t fracHi; // 16-bit fraction in (0.0, 1.0] indicating where my part ends
+};
+
+// 16-byte aligned
+struct alignas(16) ncclSymkDevWork {
+  uint64_t redOpArg; // must be collectively uniform
+  size_t nElts;
+  struct ncclWindow_vidmem* inputWin, *outputWin;
+  size_t inputOff, outputOff; // these = origUserOffset + cbdPartOffset
+  uint64_t rootRank;
+  uint64_t sChannelId:16, nChannels:16, padding:32;
+};
+
+struct alignas(16) ncclSymkDevWorkArgs {
+  struct ncclSymkDevComm kcomm;
+  int nMaxChannels;
+  // Trailing data follows this struct, each array starting on a 16-byte boundary:
+  //   ncclSymkChannelWorkRange channelWorkRange[nChannels];
+  //   ncclSymkDevWork works[nWorks];
+  // aux functions
+  __host__ static constexpr size_t calcArgsSize(int nChannels, int nWorks) {
+    return alignUp(sizeof(struct ncclSymkDevWorkArgs), 16) + alignUp(nChannels * sizeof(struct ncclSymkChannelWorkRange), 16) + nWorks * sizeof(struct ncclSymkDevWork);
+  }
+  __host__ __device__ struct ncclSymkChannelWorkRange* getWorkRange() const {
+    return (struct ncclSymkChannelWorkRange*)((uint8_t*)this + alignUp(sizeof(struct ncclSymkDevWorkArgs), 16));
+  }
+  __host__ __device__ struct ncclSymkDevWork* getWorks(int nChannels) const {
+    return (struct ncclSymkDevWork*)((uint8_t*)this->getWorkRange() + alignUp(nChannels * sizeof(struct ncclSymkChannelWorkRange), 16));
+  }
+};
+
+union ncclSymkDevWorkArgs4K {
+  struct ncclSymkDevWorkArgs args;
+  char buf4K[4096];
+};
+
+// We assume ncclComm contains a field: `ncclSymkState symkState`
+ncclResult_t ncclSymkInitOnce(struct ncclComm* comm);
+ncclResult_t ncclSymkFinalize(struct ncclComm* comm);
+
+bool ncclSymkAvailable(struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red,
+                       ncclDataType_t ty, size_t nElts);
+ncclResult_t ncclSymkPickKernel(struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty,
+                                size_t nEltsTotal, size_t nEltsMax, int nWorks,
+                                float* estTimeUs, ncclSymkKernelId* kernelId, int* nBlocks, int* nWarps);
+
+ncclResult_t ncclSymkMakeDevWork(struct ncclComm* comm, struct ncclTaskColl* task, struct ncclSymkDevWork* outDevWork);
+
+// Generated by src/device/symmetric/generate.py
+extern int const ncclSymkKernelCount;
+extern void* const ncclSymkKernelList[];
+void* ncclSymkGetKernelPtr(ncclSymkKernelId kernelId, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
+const char* ncclSymkKernelIdToString(int kernelId);
+
+#endif
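
A minimal sketch, not part of this patch, of the variable-length layout that `ncclSymkDevWorkArgs::calcArgsSize`, `getWorkRange` and `getWorks` describe above: a header, then the channel work ranges, then the work descriptors, with the first two regions padded to 16 bytes. `alignUpTo`, `WorkRange`, `DevWork`, `ArgsLayout` and `layoutArgs` are hypothetical stand-ins, and `headerBytes` is a parameter because `sizeof(ncclSymkDevWorkArgs)` depends on `ncclDevComm`, which is not shown in this header.

```cpp
#include <cstddef>
#include <cstdint>

// Rounds x up to a multiple of a (a must be a power of two); stand-in for NCCL's alignUp.
constexpr size_t alignUpTo(size_t x, size_t a) { return (x + a - 1) & ~(a - 1); }

// Stand-in with the same fields as ncclSymkChannelWorkRange.
struct WorkRange { uint16_t workHi; uint16_t fracHi; };

// Stand-in with the same field sizes as ncclSymkDevWork (windows shown as void*).
struct alignas(16) DevWork {
  uint64_t redOpArg;
  size_t nElts;
  void* inputWin;
  void* outputWin;
  size_t inputOff, outputOff;
  uint64_t rootRank;
  uint64_t sChannelId : 16, nChannels : 16, padding : 32;
};

// Byte offsets of each region, measured from the start of the args blob.
struct ArgsLayout { size_t rangesOff, worksOff, totalBytes; };

constexpr ArgsLayout layoutArgs(size_t headerBytes, int nChannels, int nWorks) {
  size_t rangesOff = alignUpTo(headerBytes, 16);  // where getWorkRange() points
  size_t worksOff = rangesOff + alignUpTo(nChannels * sizeof(WorkRange), 16);  // getWorks()
  return ArgsLayout{rangesOff, worksOff, worksOff + nWorks * sizeof(DevWork)};  // calcArgsSize()
}
```

Since `ncclSymkDevWorkArgs4K` reserves a 4096-byte buffer, a caller would presumably compare `totalBytes` against that limit before packing work items.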
diff --git a/projects/rccl/src/include/symmetric.h b/projects/rccl/src/include/symmetric.h
deleted file mode 100644
index 7a189bcca6..0000000000
--- a/projects/rccl/src/include/symmetric.h
+++ /dev/null
@@ -1,90 +0,0 @@
-#ifndef NCCL_DEVICE_SYMMETRIC_H_
-#define NCCL_DEVICE_SYMMETRIC_H_
-
-#include "nccl.h"
-#include "nccl_common.h"
-#include "bitops.h"
-
-constexpr int ncclSymMaxBlocks = 64;
-constexpr int ncclSymMaxThreads = 512;
-constexpr int ncclSymLLMaxEltSize = 64;
-
-constexpr __host__ __device__ int ncclSymLLMaxSlots(int eltSize = ncclSymLLMaxEltSize) {
-  return ncclSymMaxThreads*ncclSymLLMaxEltSize/eltSize;
-}
-
-constexpr __host__ __device__ int ncclSymLLEpochSize(int nRanks) {
-  return /*LL Overhead*/2 * maxval(ncclSymMaxThreads*nRanks*8, ncclSymLLMaxSlots(ncclSymLLMaxEltSize)*ncclSymLLMaxEltSize);
-}
-
-struct alignas(16) ncclSymDevBase {
-  uint32_t llEpoch[ncclSymMaxBlocks];
-  uint32_t barEpochMc[ncclSymMaxBlocks], barEpochUc[ncclSymMaxBlocks];
-  uint32_t barInboxMc[ncclSymMaxBlocks];
-  uint32_t barInboxPerPeer[];
-
-  static constexpr size_t size(int nRanks) {
-    return sizeof(ncclSymDevBase) +
-           alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16) +
-           ncclSymMaxBlocks * /*epochs=*/2 * ncclSymLLEpochSize(nRanks);
-  }
-};
-
-static __device__ uint4* ncclSymDevBase_getLLBuf(struct ncclSymDevBase* base, int nRanks, int block, uint32_t epoch) {
-  // Get pointer to buffer trailing the header struct.
-  char* ans = (char*)(base + 1);
-  // Skip over barInboxPerPeer[]
-  ans += alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16);
-  // Skip to our block
-  int epochSize = ncclSymLLEpochSize(nRanks);
-  ans += block * /*epochs=*/2 * epochSize;
-  ans += (epoch & 1)*epochSize;
-  return (uint4*)ans;
-}
-
-struct ncclSymDevComm {
-  ncclSymDevBase* base;
-  ncclSymDevBase* baseMc;
-  uint32_t stride4G;
-  int nRanks, rank;
-  uint32_t nRanks_rcp32; // idivRcp32(nRanks)
-};
-
-struct alignas(16) ncclSymDevArgs {
-  struct ncclSymDevComm comm;
-  int rootRank;
-  uint64_t redOpArg; // must be collectively uniform
-  size_t nElts;
-  char* input;
-  char* output;
-};
-
-enum ncclSymKernelId {
-  ncclSymKernelId_AllReduce_AGxLL_R,
-  ncclSymKernelId_AllReduce_AGxLLMC_R,
-  ncclSymKernelId_AllReduce_RSxLD_AGxST,
-  ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC,
-
-  ncclSymKernelId_AllGather_LL,
-  ncclSymKernelId_AllGather_LLMC,
-  ncclSymKernelId_AllGather_ST,
-  ncclSymKernelId_AllGather_STMC,
-
-  ncclSymKernelId_ReduceScatter_LL,
-  ncclSymKernelId_ReduceScatter_LD,
-  ncclSymKernelId_ReduceScatter_LDMC,
-
-  ncclSymKernelId_Count
-};
-
-bool ncclSymImplemented(ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
-
-ncclResult_t ncclSymPickKernel(struct ncclComm* comm, ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts, float* estTimeUs, ncclSymKernelId* kernelId, int* nBlocks, int* nWarps);
-
-// Generated by src/device/symmetric/generate.py
-extern int const ncclSymKernelCount;
-extern void* const ncclSymKernelList[];
-void* ncclSymGetKernelPtr(ncclSymKernelId kernelId, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
-const char* ncclSymKernelIdToString(int kernelId);
-
-#endif
diff --git a/projects/rccl/src/include/transport.h b/projects/rccl/src/include/transport.h
index a9971a74fb..39e479e24e 100644
--- a/projects/rccl/src/include/transport.h
+++ b/projects/rccl/src/include/transport.h
@@ -162,13 +162,9 @@ ncclResult_t ncclRegisterCollBuffers(struct ncclComm* comm, struct ncclTaskColl*
 ncclResult_t ncclRegisterCollNvlsBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
 ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels);
 
-ncclResult_t ncclIpcSymmetricInit(struct ncclComm* comm);
-ncclResult_t ncclIpcSymmetricMap(struct ncclComm* comm, size_t offset, size_t size, CUmemGenericAllocationHandle memHandle, void** symPtr);
-ncclResult_t ncclIpcSymmetricFree(struct ncclComm* comm, size_t size, void* symPtr);
-ncclResult_t ncclIpcSymmetricFinalize(struct ncclComm* comm);
-ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm);
-ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr);
-ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr);
-ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm);
+#if CUDART_VERSION >= 12010
+ncclResult_t ncclNvlsGroupCreate(struct ncclComm *comm, CUmulticastObjectProp *prop, int rank, unsigned int nranks, CUmemGenericAllocationHandle *mcHandle, char *shareableHandle);
+ncclResult_t ncclNvlsGroupConnect(struct ncclComm *comm, char *shareableHandle, int rank, CUmemGenericAllocationHandle *mcHandle);
+#endif
 
 #endif
diff --git a/projects/rccl/src/include/utils.h b/projects/rccl/src/include/utils.h
index bfed2722cc..46389985f2 100644
--- a/projects/rccl/src/include/utils.h
+++ b/projects/rccl/src/include/utils.h
@@ -551,4 +551,6 @@ T* ncclIntruQueueMpscAbandon(ncclIntruQueueMpsc<T,next>* me) {
     return head;
   }
 }
+
+ncclResult_t ncclBitsToString(uint32_t bits, uint32_t mask, const char* (*toStr)(int), char *buf, size_t bufLen, const char *wildcard);
 #endif
diff --git a/projects/rccl/src/init.cc b/projects/rccl/src/init.cc
index af784c02d1..ebf942c021 100644
--- a/projects/rccl/src/init.cc
+++ b/projects/rccl/src/init.cc
@@ -27,10 +27,14 @@
 #include <dlfcn.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/resource.h>
 #include <unistd.h>
 #include "param.h"
 #include "nvtx_payload_schemas.h"
 #include "utils.h"
+#include <mutex>
+#include "ce_coll.h"
+#include "nvtx.h"
 
 #define STR2(v) #v
 #define STR(v) STR2(v)
@@ -54,6 +58,9 @@ NCCL_PARAM(WinEnable, "WIN_ENABLE", 1);
 NCCL_PARAM(CollnetEnable, "COLLNET_ENABLE", NCCL_CONFIG_UNDEF_INT);
 NCCL_PARAM(CtaPolicy, "CTA_POLICY", NCCL_CONFIG_UNDEF_INT);
 NCCL_PARAM(NvlsChannels, "NVLS_NCHANNELS", NCCL_CONFIG_UNDEF_INT);
+NCCL_PARAM(SetCpuStackSize, "SET_CPU_STACK_SIZE", 1);
+
+extern int64_t ncclParamSingleProcMemRegEnable();
 
 static ncclResult_t commReclaim(ncclComm_t comm);
 
@@ -70,11 +77,50 @@ ncclResult_t initGdrCopy() {
   return ncclSuccess;
 }
 
+// The default Linux stack size (8MB) is safe.
+#define SAFE_STACK_SIZE (8192*1024)
+
+static ncclResult_t setCpuStackSize() {
+  if (ncclParamSetCpuStackSize() != 0) {
+    // Query the stack size used for newly launched threads.
+    pthread_attr_t attr;
+    size_t stackSize;
+    PTHREADCHECK(pthread_attr_init(&attr), "pthread_attr_init");
+    PTHREADCHECK(pthread_attr_getstacksize(&attr, &stackSize), "pthread_attr_getstacksize");
+
+    if (stackSize < SAFE_STACK_SIZE) {
+      // GNU libc normally uses RLIMIT_STACK as the default pthread stack size, unless it's set to "unlimited" --
+      // in that case a fallback value of 2MB (!) is used.
+
+      // Query the actual resource limit so that we can distinguish between the settings of 2MB and unlimited.
+      struct rlimit stackLimit;
+      char buf[30];
+      SYSCHECK(getrlimit(RLIMIT_STACK, &stackLimit), "getrlimit");
+      if (stackLimit.rlim_cur == RLIM_INFINITY)
+        strcpy(buf, "unlimited");
+      else
+        snprintf(buf, sizeof(buf), "%luKB", (unsigned long)(stackLimit.rlim_cur/1024));
+      INFO(NCCL_INIT|NCCL_ENV, "Stack size limit (%s) is unsafe; will use %dKB for newly launched threads",
+           buf, SAFE_STACK_SIZE/1024);
+
+      // Change the default pthread stack size (via a nonportable API, which will become necessary if we switch
+      // to C++ threads).
+      PTHREADCHECK(pthread_attr_setstacksize(&attr, SAFE_STACK_SIZE), "pthread_attr_setstacksize");
+      PTHREADCHECK(pthread_setattr_default_np(&attr), "pthread_setattr_default_np");
+    }
+
+    PTHREADCHECK(pthread_attr_destroy(&attr), "pthread_attr_destroy");
+  }
+
+  return ncclSuccess;
+}
+
 static ncclResult_t initResult = ncclSuccess;
-static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
+static std::once_flag initOnceFlag;
 
 static void initOnceFunc() {
   initEnv();
+  setCpuStackSize();
   initGdrCopy();
   // Always initialize bootstrap network
   NCCLCHECKGOTO(bootstrapNetInit(), initResult, exit);
@@ -84,7 +130,7 @@ exit:;
 }
 
 static ncclResult_t ncclInit() {
-  pthread_once(&initOnceControl, initOnceFunc);
+  std::call_once(initOnceFlag, initOnceFunc);
   return initResult;
 }
 
@@ -180,10 +226,12 @@ static ncclResult_t commFree(ncclComm_t comm) {
   if (comm == NULL)
     return ncclSuccess;
 
-  if (comm->symmetricSupport && comm->symDevComm.base) {
-    NCCLCHECK(ncclCommSymmetricFreeInternal(comm, comm->baseUCSymPtr + comm->rank * comm->baseStride));
-  }
+  NCCLCHECK(ncclCeFinalize(comm));
 
+  if (comm->symmetricSupport) {
+    NCCLCHECK(ncclSymkFinalize(comm));
+    NCCLCHECK(ncclDevrFinalize(comm));
+  }
   NCCLCHECK(ncclRasCommFini(comm));
 
   /* in commReclaim, we have guaranteed only last rank which calls ncclCommDestroy() will
@@ -263,10 +311,6 @@ static ncclResult_t commFree(ncclComm_t comm) {
 
   NCCLCHECK(ncclRegCleanup(comm));
 
-  if (comm->symmetricSupport) {
-    NCCLCHECK(ncclNvlsSymmetricFinalize(comm));
-    NCCLCHECK(ncclIpcSymmetricFinalize(comm));
-  }
   INFO(NCCL_INIT,"comm %p rank %d nranks %d cudaDev %d busId %lx - %s COMPLETE", comm, comm->rank, comm->nRanks, comm->cudaDev, comm->busId, abort ? "Abort" : "Destroy");
 
   commPoison(comm); // poison comm before free to avoid comm reuse.
@@ -414,6 +458,7 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
 
   ncclIntruQueueMpscConstruct(&comm->callbackQueue);
   ncclIntruQueueConstruct(&comm->legacyRegCleanupQueue);
+  ncclIntruQueueConstruct(&comm->ceInitTaskQueue);
 
   comm->regCache.pageSize = sysconf(_SC_PAGESIZE);
 
@@ -436,8 +481,8 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
 static ncclResult_t devCommSetup(ncclComm_t comm) {
   ncclResult_t ret = ncclSuccess;
   int nRanks = comm->nRanks;
-  struct ncclDevCommAndChannels tmpCommAndChans;
-  struct ncclDevCommAndChannels *devCommAndChans = NULL;
+  struct ncclKernelCommAndChannels tmpCommAndChans;
+  struct ncclKernelCommAndChannels *devCommAndChans = NULL;
   struct ncclNvmlCCStatus ccStatus;
   bool ccEnable;
   cudaStream_t deviceStream;
@@ -465,7 +510,7 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
   comm->workArgsBytes = std::min<size_t>(ncclParamWorkArgsBytes(), ncclMaxKernelArgsSize(comm->cudaArch));
 
   memset(&ccStatus, 0, sizeof(ccStatus));
-  ccEnable = (ncclSuccess == ncclNvmlGetCCStatus(&ccStatus)) && (ccStatus.CCEnabled || ccStatus.multiGpuProtectedPCIE);
+  ccEnable = (ncclSuccess == ncclNvmlGetCCStatus(&ccStatus)) && (ccStatus.CCEnabled || ccStatus.multiGpuProtectedPCIE || ccStatus.multiGpuNVLE);
   if (ccEnable) {
     comm->workFifoBytes = 0;
   } else {
@@ -582,14 +627,28 @@ static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, u
     info->fabricInfo.state = NVML_GPU_FABRIC_STATE_NOT_SUPPORTED;
     (void) ncclNvmlDeviceGetGpuFabricInfoV(nvmlDev, &info->fabricInfo);
     if (info->fabricInfo.state != NVML_GPU_FABRIC_STATE_NOT_SUPPORTED) {
+      unsigned long uuid0 = 0;
+      unsigned long uuid1 = 0;
       if (ncclParamMNNVLUUID() != -1) {
-        ((long*)&info->fabricInfo.clusterUuid)[0] = ncclParamMNNVLUUID();
-        ((long*)&info->fabricInfo.clusterUuid)[1] = ncclParamMNNVLUUID();
+        unsigned long temp_uuid0 = (unsigned long)ncclParamMNNVLUUID();
+        unsigned long temp_uuid1 = (unsigned long)ncclParamMNNVLUUID();
+        memcpy(info->fabricInfo.clusterUuid, &temp_uuid0, sizeof(temp_uuid0));
+        memcpy(info->fabricInfo.clusterUuid + sizeof(temp_uuid0), &temp_uuid1, sizeof(temp_uuid1));
       }
-      if (ncclParamMNNVLCliqueId() != -1) info->fabricInfo.cliqueId = ncclParamMNNVLCliqueId();
+      memcpy(&uuid0, info->fabricInfo.clusterUuid, sizeof(uuid0));
+      memcpy(&uuid1, info->fabricInfo.clusterUuid + sizeof(uuid0), sizeof(uuid1));
+      if (ncclParamMNNVLCliqueId() == -2) {
+        nvmlPlatformInfo_t platformInfo = { 0 };
+        NCCLCHECK(ncclNvmlDeviceGetPlatformInfo(nvmlDev, &platformInfo));
+        INFO(NCCL_INIT, "MNNVL rack serial %s slot %d tray %d hostId %d peerType %d moduleId %d",
+             platformInfo.chassisSerialNumber, platformInfo.slotNumber, platformInfo.trayIndex,
+             platformInfo.hostId, platformInfo.peerType, platformInfo.moduleId);
+        // Use a hash of the rack serial number to partition the NVLD clique
+        info->fabricInfo.cliqueId = getHash(platformInfo.chassisSerialNumber, sizeof(platformInfo.chassisSerialNumber));
+      } else if (ncclParamMNNVLCliqueId() != -1) info->fabricInfo.cliqueId = ncclParamMNNVLCliqueId();
       INFO(NCCL_INIT, "MNNVL busId 0x%lx fabric UUID %lx.%lx cliqueId 0x%x state %d healthMask 0x%x",
            info->busId,
-           ((long *)&info->fabricInfo.clusterUuid)[0], ((long *)&info->fabricInfo.clusterUuid)[1],
+           uuid0, uuid1,
            info->fabricInfo.cliqueId, info->fabricInfo.state, info->fabricInfo.healthMask);
     }
   }
@@ -670,6 +729,18 @@ NCCL_PARAM(MNNVLEnable, "MNNVL_ENABLE", 2);
 #define TIMER_INIT_ALLOC 7
 #define TIMERS_INIT_COUNT 8
 
+static ncclResult_t initNvlDomainInfo(struct ncclComm* comm) {
+  // Initialize NVLink domain info
+  comm->nvlDomainInfo.nNvlDomains = comm->nNodes;
+  comm->nvlDomainInfo.minRanksPerNvlDomain = comm->minLocalRanks;
+  comm->nvlDomainInfo.maxRanksPerNvlDomain = comm->maxLocalRanks;
+
+  TRACE(NCCL_INIT, "NVLink domains: %d domains, min ranks per domain: %d, max ranks per domain: %d",
+        comm->nNodes, comm->nvlDomainInfo.minRanksPerNvlDomain, comm->nvlDomainInfo.maxRanksPerNvlDomain);
+
+  return ncclSuccess;
+}
+
 static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* parent, uint64_t timers[TIMERS_INIT_COUNT]) {
   // We use 2 AllGathers
   // 1. { peerInfo, comm, compCap}
@@ -781,6 +852,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
 
     // Buffer Registration is not supported with MNNVL
     if (comm->MNNVL) comm->nvlsRegSupport = 0;
+    else if (ncclParamSingleProcMemRegEnable()) comm->nvlsRegSupport = 1;
 
     TRACE(NCCL_INIT,"pidHash[%d] %lx intraProcRank %d intraProcRanks %d intraProcRank0 %d",
         rank, comm->peerInfo[rank].pidHash, intraProcRank, intraProcRanks, intraProcRank0);
@@ -969,10 +1041,12 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
     comm->rankToLocalRank[r] = comm->nodeRanks[node].localRanks;
     comm->nodeRanks[node].localRanks++;
   }
+  comm->minLocalRanks = INT_MAX;
   // Allocate ranks arrays for each node
   for (int n=0; n<comm->nNodes; n++) {
     NCCLCHECKGOTO(ncclCalloc(&comm->nodeRanks[n].localRankToRank, comm->nodeRanks[n].localRanks), ret, fail);
     comm->maxLocalRanks = std::max(comm->maxLocalRanks, comm->nodeRanks[n].localRanks);
+    comm->minLocalRanks = std::min(comm->minLocalRanks, comm->nodeRanks[n].localRanks);
     comm->nodeRanks[n].localRanks = 0;
   }
   // And fill the ranks arrays
@@ -985,6 +1059,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
   comm->localRank = comm->rankToLocalRank[rank];
   comm->localRanks = comm->nodeRanks[comm->node].localRanks;
 
+  NCCLCHECKGOTO(initNvlDomainInfo(comm), ret, fail);
+
   TRACE(NCCL_INIT,"hostHash[%d] %lx localRank %d localRanks %d localRank0 %d",
         rank, comm->peerInfo[rank].hostHash, comm->localRank, comm->localRanks, comm->localRankToRank[0]);
   if (comm->localRank == -1 || comm->localRankToRank[0] == -1 || comm->localRanks == 0) {
@@ -1227,6 +1303,11 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
   TRACE(NCCL_INIT, "rank %d nranks %d - CONNECTED %d RINGS AND TREES", rank, nranks, comm->nChannels);
 
   // Compute time models for algorithm and protocol combinations
+  NCCLCHECKGOTO(ncclTopoInitTunerConstants(comm), ret, fail);
+  NCCLCHECKGOTO(ncclTunerPluginLoad(comm), ret, fail);
+  if (comm->tuner) {
+    NCCLCHECKGOTO(comm->tuner->init(&comm->tunerContext, comm->commHash, comm->nRanks, comm->nNodes, ncclDebugLog, &comm->nvlDomainInfo, &comm->tunerConstants), ret, fail);
+  }
   NCCLCHECKGOTO(ncclTopoTuneModel(comm, comm->minCompCap, comm->maxCompCap, graphs), ret, fail);
 
   INFO(NCCL_INIT, "%d coll channels, %d collnet channels, %d nvls channels, %d p2p channels, %d p2p channels per peer", comm->nChannels, comm->nChannels, comm->nvlsChannels, comm->p2pnChannels, comm->p2pnChannelsPerPeer);
@@ -1248,7 +1329,10 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
   }
 
   comm->symmetricSupport = comm->isAllDirectP2p && comm->nNodes == 1 && ncclParamWinEnable() && ncclCuMemEnable();
-  comm->baseStride = 0;
+  comm->devrState.bigSize = 0;
+
+  comm->ceColl.baseUCSymReadyPtr = NULL;
+  comm->ceColl.baseUCSymComplPtr = NULL;
 
   // Call devCommSetup before the last barrier, making sure we don't have a thread running in front and starting to
   // launch NCCL kernels before all cuda mem allocation is complete. That could cause a deadlock.
@@ -1287,6 +1371,10 @@ NCCL_PARAM(MaxCTAs, "MAX_CTAS", NCCL_CONFIG_UNDEF_INT);
 NCCL_PARAM(MinCTAs, "MIN_CTAS", NCCL_CONFIG_UNDEF_INT);
 #define NCCL_MAX_CGA_CLUSTER_SIZE 8
 
+NCCL_PARAM(NChannelsPerNetPeer, "NCHANNELS_PER_NET_PEER", NCCL_CONFIG_UNDEF_INT);
+NCCL_PARAM(NvlinkUtilCentricSchedEnable, "NVLINK_UTIL_CENTRIC_SCHED_ENABLE", 0);
+
+
 #define NCCL_COMMINIT_FUNCNAME_LEN 128
 struct ncclCommInitRankAsyncJob {
   struct ncclAsyncJob base;
@@ -1416,15 +1504,15 @@ static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
       // Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
       if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
     }
-    timers[TIMER_INIT_ALLOC] = clockNano();
-    NCCLCHECKGOTO(commAlloc(comm, job->parent, job->nranks, job->myrank), res, fail);
-    timers[TIMER_INIT_ALLOC] = clockNano() - timers[TIMER_INIT_ALLOC];
     // child hash obtained from (parent hash, split count, color)
     uint64_t hacc[2] = {1, 1};
     eatHash(hacc, &job->parent->commHash);
     eatHash(hacc, &job->splitCount);
     eatHash(hacc, &job->color);
     comm->commHash = digestHash(hacc);
+    timers[TIMER_INIT_ALLOC] = clockNano();
+    NCCLCHECKGOTO(commAlloc(comm, job->parent, job->nranks, job->myrank), res, fail);
+    timers[TIMER_INIT_ALLOC] = clockNano() - timers[TIMER_INIT_ALLOC];
     INFO(NCCL_INIT, "%s comm %p rank %d nranks %d cudaDev %d nvmlDev %d busId %lx parent %p splitCount %d color %d key %d- Init START", job->funcName,
          comm, comm->rank, comm->nRanks, comm->cudaDev, comm->nvmlDev, comm->busId, job->parent, job->splitCount, job->color, job->key);
     timers[TIMER_INIT_BOOTSTRAP] = clockNano();
@@ -1433,11 +1521,11 @@ static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
     // debug info, no commId was used
     commIdHash = 0;
   } else {
+    // obtain a unique hash using the first commId
+    comm->commHash = commIdHash = getHash(job->commId->internal, NCCL_UNIQUE_ID_BYTES);
     timers[TIMER_INIT_ALLOC] = clockNano();
     NCCLCHECKGOTO(commAlloc(comm, NULL, job->nranks, job->myrank), res, fail);
     timers[TIMER_INIT_ALLOC] = clockNano() - timers[TIMER_INIT_ALLOC];
-    // obtain a unique hash using the first commId
-    comm->commHash = commIdHash = getHash(job->commId->internal, NCCL_UNIQUE_ID_BYTES);
     INFO(NCCL_INIT, "%s comm %p rank %d nranks %d cudaDev %d nvmlDev %d busId %lx commId 0x%llx - Init START", job->funcName,
          comm, comm->rank, comm->nRanks, comm->cudaDev, comm->nvmlDev, comm->busId, commIdHash);
     timers[TIMER_INIT_BOOTSTRAP] = clockNano();
@@ -1447,10 +1535,6 @@ static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
   comm->cudaArch = cudaArch;
 
   NCCLCHECKGOTO(initTransportsRank(comm, job->parent, timers), res, fail);
-  NCCLCHECKGOTO(ncclTunerPluginLoad(comm), res, fail);
-  if (comm->tuner) {
-    NCCLCHECK(comm->tuner->init(comm->nRanks, comm->nNodes, ncclDebugLog, &comm->tunerContext));
-  }
 
   // update communicator state
   comm->initState = ncclSuccess;
@@ -1511,8 +1595,10 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
   int ctaPolicyEnv;
   int shrinkShareEnv;
   int nvlsCTAsEnv;
+  int nChannelsPerNetPeerEnv;
+  int nvlinkUtilCentricSchedEnableEnv;
 
-  /* override configuration from env variable. */
+  /* override configuration with env variable. */
   blockingEnv = ncclParamCommBlocking();
   if (blockingEnv == 0 || blockingEnv == 1)
     comm->config.blocking = blockingEnv;
@@ -1541,6 +1627,23 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
       comm->config.maxCTAs = maxCTAsEnv;
   }
 
+  /* override configuration with env variable. */
+  nChannelsPerNetPeerEnv = ncclParamNChannelsPerNetPeer();
+  if (nChannelsPerNetPeerEnv != NCCL_CONFIG_UNDEF_INT) {
+    if (nChannelsPerNetPeerEnv <= 0)
+      INFO(NCCL_ENV, "NCCL_NCHANNELS_PER_NET_PEER %d is too low, leaving it set at %d", nChannelsPerNetPeerEnv, comm->config.nChannelsPerNetPeer);
+    else
+      comm->config.nChannelsPerNetPeer = nChannelsPerNetPeerEnv;
+  }
+
+  nvlinkUtilCentricSchedEnableEnv = ncclParamNvlinkUtilCentricSchedEnable();
+  if (nvlinkUtilCentricSchedEnableEnv != NCCL_CONFIG_UNDEF_INT) {
+    if (nvlinkUtilCentricSchedEnableEnv != 0 && nvlinkUtilCentricSchedEnableEnv != 1)
+      INFO(NCCL_ENV, "NCCL_NVLINK_UTIL_CENTRIC_SCHED_ENABLE %d is not valid, leaving it set at %d", nvlinkUtilCentricSchedEnableEnv, comm->config.nvlinkCentricSched);
+    else
+      comm->config.nvlinkCentricSched = nvlinkUtilCentricSchedEnableEnv;
+  }
+
   envNetName = ncclGetEnv("NCCL_NET");
   if (envNetName)
     tmpNetName = envNetName;
@@ -1608,7 +1711,7 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
     comm->config.collnetEnable = 0;
   }
 
-  if (comm->config.CTAPolicy < NCCL_CTA_POLICY_DEFAULT || comm->config.CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY) {
+  if (comm->config.CTAPolicy < NCCL_CTA_POLICY_DEFAULT || comm->config.CTAPolicy > NCCL_CTA_POLICY_ZERO) {
     INFO(NCCL_ENV, "CTAPolicy %d is not a valid value, set it to %d", comm->config.CTAPolicy, NCCL_CTA_POLICY_DEFAULT);
     comm->config.CTAPolicy = NCCL_CTA_POLICY_DEFAULT;
   }
@@ -1617,6 +1720,7 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
     INFO(NCCL_ENV, "nvlsCTAs %d is not a valid value, NCCL will decide the default value automatically", comm->config.nvlsCTAs);
     comm->config.nvlsCTAs = NCCL_CONFIG_UNDEF_INT;
   }
+
   return ret;
 }
 
@@ -1668,6 +1772,10 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
       internalConfigPtr->shrinkShare = defaultConfig.shrinkShare;
       internalConfigPtr->nvlsCTAs = defaultConfig.nvlsCTAs;
     }
+    if (internalConfigPtr->version < NCCL_VERSION(2, 28, 0)) {
+      internalConfigPtr->nChannelsPerNetPeer = defaultConfig.nChannelsPerNetPeer;
+      internalConfigPtr->nvlinkCentricSched = defaultConfig.nvlinkCentricSched;
+    }
   }
 
   /* check input config attributes, -1 means user-undefined and we should use default value from NCCL. */
@@ -1706,7 +1814,7 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
   }
 
   if (internalConfigPtr->CTAPolicy != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->CTAPolicy < NCCL_CTA_POLICY_DEFAULT ||
-    internalConfigPtr->CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY)) {
+    internalConfigPtr->CTAPolicy > NCCL_CTA_POLICY_ZERO)) {
     WARN("Invalid config policy attribute value %d", internalConfigPtr->CTAPolicy);
     ret = ncclInvalidArgument;
     goto fail;
@@ -1724,6 +1832,18 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
     goto fail;
   }
 
+  if (internalConfigPtr->nChannelsPerNetPeer != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->nChannelsPerNetPeer <= 0 || internalConfigPtr->nChannelsPerNetPeer > MAXCHANNELS)) {
+    WARN("Invalid config nChannelsPerNetPeer attribute value %d", internalConfigPtr->nChannelsPerNetPeer);
+    ret = ncclInvalidArgument;
+    goto fail;
+  }
+
+  if (internalConfigPtr->nvlinkCentricSched != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->nvlinkCentricSched != 0 && internalConfigPtr->nvlinkCentricSched != 1) {
+    WARN("Invalid config nvlinkCentricSched attribute value %d", internalConfigPtr->nvlinkCentricSched);
+    ret = ncclInvalidArgument;
+    goto fail;
+  }
+
   /* default config value can be tuned on different platform. */
   NCCL_CONFIG_DEFAULT(internalConfigPtr, blocking, NCCL_CONFIG_UNDEF_INT, 1, "Blocking", "%d");
   NCCL_CONFIG_DEFAULT(internalConfigPtr, cgaClusterSize, NCCL_CONFIG_UNDEF_INT, 4, "CGA cluster size", "%d");
@@ -1737,6 +1857,9 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
   NCCL_CONFIG_DEFAULT(internalConfigPtr, CTAPolicy, NCCL_CONFIG_UNDEF_INT, NCCL_CTA_POLICY_DEFAULT, "CTA policy flags", "%d");
   NCCL_CONFIG_DEFAULT(internalConfigPtr, shrinkShare, NCCL_CONFIG_UNDEF_INT, 0, "shrinkShare", "%d");
   NCCL_CONFIG_DEFAULT(internalConfigPtr, nvlsCTAs, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "nvlsCTAs", "%d");
+  NCCL_CONFIG_DEFAULT(internalConfigPtr, nChannelsPerNetPeer, NCCL_CONFIG_UNDEF_INT,
+                      NCCL_CONFIG_UNDEF_INT, "nChannelsPerNetPeer", "%d");
+  NCCL_CONFIG_DEFAULT(internalConfigPtr, nvlinkCentricSched, NCCL_CONFIG_UNDEF_INT, 0, "nvlinkCentricSched", "%d");
 
   /* assign config to communicator */
   comm->config.blocking = internalConfigPtr->blocking;
@@ -1751,6 +1874,8 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
   comm->config.CTAPolicy = internalConfigPtr->CTAPolicy;
   comm->config.shrinkShare = internalConfigPtr->shrinkShare;
   comm->config.nvlsCTAs = internalConfigPtr->nvlsCTAs;
+  comm->config.nChannelsPerNetPeer = internalConfigPtr->nChannelsPerNetPeer;
+  comm->config.nvlinkCentricSched = internalConfigPtr->nvlinkCentricSched;
   NCCLCHECKGOTO(envConfigOverride(comm), ret, fail);
 
 exit:
@@ -1779,8 +1904,8 @@ static ncclResult_t ncclCommInitRankDev(ncclComm_t* newcomm, int nranks, int nId
   NCCLCHECKGOTO(ncclInit(), res, fail);
 
   if (ncclDebugLevel > NCCL_LOG_WARN || (ncclDebugLevel != NCCL_LOG_NONE && myrank == 0)) {
-    static pthread_once_t once = PTHREAD_ONCE_INIT;
-    pthread_once(&once, showVersion);
+    static std::once_flag once;
+    std::call_once(once, showVersion);
   }
   // Make sure the CUDA runtime is initialized.
   CUDACHECKGOTO(cudaFree(NULL), res, fail);
@@ -2054,7 +2179,7 @@ fail:
 static ncclResult_t commCleanup(ncclComm_t comm) {
   CUDACHECK(cudaSetDevice(comm->cudaDev));
   if (comm->tuner != NULL) {
-    NCCLCHECK(comm->tuner->destroy(comm->tunerContext));
+    NCCLCHECK(comm->tuner->finalize(comm->tunerContext));
     NCCLCHECK(ncclTunerPluginUnload(comm));
   }
   NCCLCHECK(commFree(comm));
@@ -2158,7 +2283,7 @@ static ncclResult_t commReclaim(struct ncclAsyncJob* job_) {
 NCCL_API(ncclResult_t, ncclCommDestroy, ncclComm_t comm);
 ncclResult_t ncclCommDestroy(ncclComm_t comm) {
   if (comm == NULL) {
-    NVTX3_FUNC_RANGE_IN(nccl_domain);
+    NCCL_NVTX3_FUNC_RANGE;
     return ncclSuccess;
   }
 
@@ -2210,6 +2335,10 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
   if (comm == NULL) {
     return ncclSuccess;
   }
+
+  INFO(NCCL_INIT, "comm %p rank %d nRanks %d cudaDev %d busId %lx - Abort START",
+      comm, comm->rank, comm->nRanks, comm->cudaDev, comm->busId);
+
   NCCLCHECK(ncclGroupStartInternal());
   // Ask anything that might still be running on the device to quit
   NCCLCHECK(setCommAbortFlags(comm,1));
@@ -2418,7 +2547,7 @@ ncclResult_t ncclCommGetAsyncError(ncclComm_t comm, ncclResult_t *asyncError) {
 
 NCCL_API(ncclResult_t, ncclCommCount, const ncclComm_t comm, int* count);
 ncclResult_t ncclCommCount(const ncclComm_t comm, int* count) {
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
 
   NCCLCHECK(CommCheck(comm, "CommCount", "comm"));
   NCCLCHECK(PtrCheck(count, "CommCount", "count"));
@@ -2432,7 +2561,7 @@ ncclResult_t ncclCommCount(const ncclComm_t comm, int* count) {
 
 NCCL_API(ncclResult_t, ncclCommCuDevice, const ncclComm_t comm, int* devid);
 ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* devid) {
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
 
   NCCLCHECK(CommCheck(comm, "CommCuDevice", "comm"));
   NCCLCHECK(PtrCheck(devid, "CommCuDevice", "devid"));
@@ -2445,7 +2574,7 @@ ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* devid) {
 
 NCCL_API(ncclResult_t, ncclCommUserRank, const ncclComm_t comm, int* rank);
 ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank) {
-  NVTX3_FUNC_RANGE_IN(nccl_domain);
+  NCCL_NVTX3_FUNC_RANGE;
 
   NCCLCHECK(CommCheck(comm, "CommUserRank", "comm"));
   NCCLCHECK(PtrCheck(rank, "CommUserRank", "rank"));
diff --git a/projects/rccl/src/init_nvtx.cc b/projects/rccl/src/init_nvtx.cc
index 1cb1277d2f..b7005123b4 100644
--- a/projects/rccl/src/init_nvtx.cc
+++ b/projects/rccl/src/init_nvtx.cc
@@ -1,5 +1,12 @@
+/*************************************************************************
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #include "nccl.h"
 #include "nvtx.h"
+#include "param.h"
 
 static constexpr const nvtxPayloadEnum_t NvtxEnumRedSchema[] = {
   {"Sum", ncclSum, 0},
@@ -9,9 +16,15 @@ static constexpr const nvtxPayloadEnum_t NvtxEnumRedSchema[] = {
   {"Avg", ncclAvg, 0}
 };
 
+NCCL_PARAM(NvtxDisable, "NVTX_DISABLE", 0);
+
 // Must be called before the first call to any reduction operation.
 void initNvtxRegisteredEnums() {
   // Register schemas and strings
+  if (ncclParamNvtxDisable()) {
+    return;
+  }
+
   constexpr const nvtxPayloadEnumAttr_t eAttr {
     .fieldMask = NVTX_PAYLOAD_ENUM_ATTR_ENTRIES | NVTX_PAYLOAD_ENUM_ATTR_NUM_ENTRIES |
       NVTX_PAYLOAD_ENUM_ATTR_SIZE | NVTX_PAYLOAD_ENUM_ATTR_SCHEMA_ID,
diff --git a/projects/rccl/src/misc/CMakeLists.txt b/projects/rccl/src/misc/CMakeLists.txt
new file mode 100644
index 0000000000..984becc5fd
--- /dev/null
+++ b/projects/rccl/src/misc/CMakeLists.txt
@@ -0,0 +1,20 @@
+# Misc sources
+set(MISC_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/strongstream.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/socket.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/ibvwrap.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/mlx5dvsymbols.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/mlx5dvwrap.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/cudawrap.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/param.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/ipcsocket.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/utils.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/shmutils.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/nvmlwrap.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/argcheck.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/gdrwrap.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/ibvsymbols.cc
+)
+
+# Add misc sources to parent scope
+set(MISC_SOURCES ${MISC_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/misc/cudawrap.cc b/projects/rccl/src/misc/cudawrap.cc
index 5b66fea929..1ecb35fb2a 100644
--- a/projects/rccl/src/misc/cudawrap.cc
+++ b/projects/rccl/src/misc/cudawrap.cc
@@ -9,6 +9,7 @@
 #include "debug.h"
 #include "param.h"
 #include "cudawrap.h"
+#include <mutex>
 
 // This env var (NCCL_CUMEM_ENABLE) toggles cuMem API usage
 NCCL_PARAM(CuMemEnable, "CUMEM_ENABLE", -2);
@@ -153,6 +154,12 @@ DECLARE_CUDA_PFN(cuMulticastCreate, 12010);
 DECLARE_CUDA_PFN(cuMulticastGetGranularity, 12010);
 DECLARE_CUDA_PFN(cuMulticastUnbind, 12010);
 #endif
+/* Stream MemOp support */
+DECLARE_CUDA_PFN(cuStreamBatchMemOp, 11070);
+DECLARE_CUDA_PFN(cuStreamWaitValue32, 11070);
+DECLARE_CUDA_PFN(cuStreamWaitValue64, 11070);
+DECLARE_CUDA_PFN(cuStreamWriteValue32, 11070);
+DECLARE_CUDA_PFN(cuStreamWriteValue64, 11070);
 #endif
 
 #define CUDA_DRIVER_MIN_VERSION 11030
@@ -238,11 +245,17 @@ static ncclResult_t cudaPfnFuncLoader(void) {
   LOAD_SYM(cuMulticastGetGranularity, 12010, 1);
   LOAD_SYM(cuMulticastUnbind, 12010, 1);
 #endif
+/* Stream MemOp support */
+  LOAD_SYM(cuStreamBatchMemOp, 11070, 1);
+  LOAD_SYM(cuStreamWaitValue32, 11070, 1);
+  LOAD_SYM(cuStreamWaitValue64, 11070, 1);
+  LOAD_SYM(cuStreamWriteValue32, 11070, 1);
+  LOAD_SYM(cuStreamWriteValue64, 11070, 1);
   return ncclSuccess;
 }
 #endif
 
-static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
+static std::once_flag initOnceFlag;
 static ncclResult_t initResult;
 
 static void initOnceFunc() {
@@ -295,6 +308,6 @@ error:
 }
 
 ncclResult_t ncclCudaLibraryInit() {
-  pthread_once(&initOnceControl, initOnceFunc);
+  std::call_once(initOnceFlag, initOnceFunc);
   return initResult;
 }
diff --git a/projects/rccl/src/misc/gdrwrap.cc b/projects/rccl/src/misc/gdrwrap.cc
index 3b46759c6e..cef254cf2a 100644
--- a/projects/rccl/src/misc/gdrwrap.cc
+++ b/projects/rccl/src/misc/gdrwrap.cc
@@ -5,6 +5,7 @@
  ************************************************************************/
 
 #include "gdrwrap.h"
+#include <mutex>
 
 #ifndef GDR_DIRECT
 #include "core.h"
@@ -47,7 +48,7 @@ pthread_mutex_t gdrLock = PTHREAD_MUTEX_INITIALIZER;
     *cast = tmp;                                         \
   } while (0)
 
-static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
+static std::once_flag initOnceFlag;
 static ncclResult_t initResult;
 
 static void initOnceFunc(void) {
@@ -97,7 +98,7 @@ teardown:
 
 
 ncclResult_t wrap_gdr_symbols(void) {
-  pthread_once(&initOnceControl, initOnceFunc);
+  std::call_once(initOnceFlag, initOnceFunc);
   return initResult;
 }
 
diff --git a/projects/rccl/src/misc/ibvwrap.cc b/projects/rccl/src/misc/ibvwrap.cc
index 59f52e3208..6d6586e78d 100644
--- a/projects/rccl/src/misc/ibvwrap.cc
+++ b/projects/rccl/src/misc/ibvwrap.cc
@@ -7,6 +7,7 @@
 #include "ibvwrap.h"
 #include <sys/types.h>
 #include <unistd.h>
+#include <mutex>
 
 #ifdef NCCL_BUILD_RDMA_CORE
 #include <infiniband/verbs.h>
@@ -15,12 +16,12 @@
 #endif
 #include "ibvsymbols.h"
 
-static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
+static std::once_flag initOnceFlag;
 static ncclResult_t initResult;
 struct ncclIbvSymbols ibvSymbols;
 
 ncclResult_t wrap_ibv_symbols(void) {
-  pthread_once(&initOnceControl,
+  std::call_once(initOnceFlag,
                [](){ initResult = buildIbvSymbols(&ibvSymbols); });
   return initResult;
 }
diff --git a/projects/rccl/src/misc/mlx5dvwrap.cc b/projects/rccl/src/misc/mlx5dvwrap.cc
index 930ed5d2e1..af4f41dff3 100644
--- a/projects/rccl/src/misc/mlx5dvwrap.cc
+++ b/projects/rccl/src/misc/mlx5dvwrap.cc
@@ -7,6 +7,7 @@
 #include "mlx5/mlx5dvwrap.h"
 #include <sys/types.h>
 #include <unistd.h>
+#include <mutex>
 
 #ifdef NCCL_BUILD_MLX5DV
 #include <infiniband/mlx5dv.h>
@@ -15,12 +16,12 @@
 #endif
 #include "mlx5/mlx5dvsymbols.h"
 
-static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
+static std::once_flag initOnceFlag;
 static ncclResult_t initResult;
 struct ncclMlx5dvSymbols mlx5dvSymbols;
 
 ncclResult_t wrap_mlx5dv_symbols(void) {
-  pthread_once(&initOnceControl,
+  std::call_once(initOnceFlag,
                [](){ initResult = buildMlx5dvSymbols(&mlx5dvSymbols); });
   return initResult;
 }
@@ -28,7 +29,7 @@ ncclResult_t wrap_mlx5dv_symbols(void) {
 /* CHECK_NOT_NULL: helper macro to check for NULL symbol */
 #define CHECK_NOT_NULL(container, internal_name) \
   if (container.internal_name == NULL) { \
-     WARN("lib wrapper not initialized."); \
+     WARN("NET/MLX5: lib wrapper not initialized."); \
      return ncclInternalError; \
   }
 
@@ -36,16 +37,7 @@ ncclResult_t wrap_mlx5dv_symbols(void) {
   CHECK_NOT_NULL(container, internal_name); \
   retval = container.call; \
   if (retval == error_retval) { \
-    WARN("Call to " name " failed with error %s", strerror(errno)); \
-    return ncclSystemError; \
-  } \
-  return ncclSuccess;
-
-#define MLX5DV_INT_CHECK_RET_ERRNO(container, internal_name, call, success_retval, name) \
-  CHECK_NOT_NULL(container, internal_name); \
-  int ret = container.call; \
-  if (ret != success_retval) { \
-    INFO(NCCL_NET, "Call to " name " failed with error %s errno %d", strerror(ret), ret); \
+    WARN("NET/MLX5: Call to " name " failed with error %s", strerror(errno)); \
     return ncclSystemError; \
   } \
   return ncclSuccess;
@@ -57,8 +49,14 @@ bool wrap_mlx5dv_is_supported(struct ibv_device *device) {
   return mlx5dvSymbols.mlx5dv_internal_is_supported(device);
 }
 
-ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len) {
-  MLX5DV_INT_CHECK_RET_ERRNO(mlx5dvSymbols, mlx5dv_internal_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path(context, buf, buf_len), 0, "mlx5dv_get_data_direct_sysfs_path");
+ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context* context, char* buf, size_t buf_len) {
+  CHECK_NOT_NULL(mlx5dvSymbols, mlx5dv_internal_get_data_direct_sysfs_path);
+  int ret = mlx5dvSymbols.mlx5dv_internal_get_data_direct_sysfs_path(context, buf, buf_len);
+  if (ret == 0) return ncclSuccess;
+  /* ENODEV can happen if the device is not data-direct but mlx5 is used. It is not an error. */
+  if (ret == ENODEV) return ncclInvalidArgument;
+  INFO(NCCL_NET, "NET/MLX5: Call to mlx5dv_internal_get_data_direct_sysfs_path failed with error %s errno %d", strerror(ret), ret);
+  return ncclSystemError;
 }
 
 /* DMA-BUF support */
@@ -72,4 +70,4 @@ struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t off
     return NULL;
   }
   return mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access);
-}
\ No newline at end of file
+}
diff --git a/projects/rccl/src/misc/nvmlwrap.cc b/projects/rccl/src/misc/nvmlwrap.cc
index 66ba2d4c85..d26a6facf5 100644
--- a/projects/rccl/src/misc/nvmlwrap.cc
+++ b/projects/rccl/src/misc/nvmlwrap.cc
@@ -41,6 +41,7 @@ namespace {
   NCCL_NVML_FN(nvmlDeviceGetFieldValues, nvmlReturn_t, (nvmlDevice_t device, int valuesCount, nvmlFieldValue_t *values))
   // MNNVL support
   NCCL_NVML_FN(nvmlDeviceGetGpuFabricInfoV, nvmlReturn_t, (nvmlDevice_t device, nvmlGpuFabricInfoV_t *gpuFabricInfo))
+  NCCL_NVML_FN(nvmlDeviceGetPlatformInfo, nvmlReturn_t, (nvmlDevice_t device, nvmlPlatformInfo_t *platformInfo))
   // CC support
   NCCL_NVML_FN(nvmlSystemGetConfComputeState, nvmlReturn_t, (nvmlConfComputeSystemState_t *state));
   NCCL_NVML_FN(nvmlSystemGetConfComputeSettings, nvmlReturn_t, (nvmlSystemConfComputeSettings_t *setting));
@@ -95,6 +96,7 @@ ncclResult_t ncclNvmlEnsureInitialized() {
       {(void**)&pfn_nvmlDeviceGetFieldValues, "nvmlDeviceGetFieldValues"},
       // MNNVL support
       {(void**)&pfn_nvmlDeviceGetGpuFabricInfoV, "nvmlDeviceGetGpuFabricInfoV"},
+      {(void**)&pfn_nvmlDeviceGetPlatformInfo, "nvmlDeviceGetPlatformInfo"},
       // CC support
       {(void**)&pfn_nvmlSystemGetConfComputeState, "nvmlSystemGetConfComputeState"},
       {(void**)&pfn_nvmlSystemGetConfComputeSettings, "nvmlSystemGetConfComputeSettings"}
@@ -298,6 +300,15 @@ ncclResult_t ncclNvmlDeviceGetGpuFabricInfoV(nvmlDevice_t device, nvmlGpuFabricI
   return ncclSuccess;
 }
 
+ncclResult_t ncclNvmlDeviceGetPlatformInfo(nvmlDevice_t device, nvmlPlatformInfo_t *platformInfo) {
+  NCCLCHECK(ncclNvmlEnsureInitialized());
+  std::lock_guard<std::mutex> locked(lock);
+  platformInfo->version = nvmlPlatformInfo_v2;
+  NVMLTRY(nvmlDeviceGetPlatformInfo, device, platformInfo);
+  return ncclSuccess;
+}
+
+
 ncclResult_t ncclNvmlGetCCStatus(struct ncclNvmlCCStatus *status) {
   NCCLCHECK(ncclNvmlEnsureInitialized());
   std::lock_guard<std::mutex> locked(lock);
@@ -314,6 +325,10 @@ ncclResult_t ncclNvmlGetCCStatus(struct ncclNvmlCCStatus *status) {
       status->multiGpuProtectedPCIE = true;
     else
       status->multiGpuProtectedPCIE = false;
+    if (ccInfo.settingV12040.multiGpuMode == NVML_CC_SYSTEM_MULTIGPU_NVLE)
+      status->multiGpuNVLE = true;
+    else
+      status->multiGpuNVLE = false;
   } else if (pfn_nvmlSystemGetConfComputeState != NULL) {
     NVMLTRY(nvmlSystemGetConfComputeState, &ccInfo.settingV12020);
     if (ccInfo.settingV12020.ccFeature == NVML_CC_SYSTEM_FEATURE_ENABLED)
@@ -321,9 +336,11 @@ ncclResult_t ncclNvmlGetCCStatus(struct ncclNvmlCCStatus *status) {
     else
       status->CCEnabled = false;
     status->multiGpuProtectedPCIE = false;
+    status->multiGpuNVLE = false;
   } else {
     status->CCEnabled = false;
     status->multiGpuProtectedPCIE = false;
+    status->multiGpuNVLE = false;
   }
   return ncclSuccess;
 }
diff --git a/projects/rccl/src/misc/param.cc b/projects/rccl/src/misc/param.cc
index d7c324fe94..9060b00667 100644
--- a/projects/rccl/src/misc/param.cc
+++ b/projects/rccl/src/misc/param.cc
@@ -15,6 +15,7 @@
 #include <sys/types.h>
 #include <unistd.h>
 #include <pthread.h>
+#include <mutex>
 #include <pwd.h>
 
 const char* userHomeDir() {
@@ -67,13 +68,13 @@ static void initEnvFunc() {
 }
 
 void initEnv() {
-  static pthread_once_t once = PTHREAD_ONCE_INIT;
-  pthread_once(&once, initEnvFunc);
+  static std::once_flag once;
+  std::call_once(once, initEnvFunc);
 }
 
 void ncclLoadParam(char const* env, int64_t deftVal, int64_t uninitialized, int64_t* cache) {
-  static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
-  pthread_mutex_lock(&mutex);
+  static std::mutex mutex;
+  std::lock_guard<std::mutex> lock(mutex);
   if (__atomic_load_n(cache, __ATOMIC_RELAXED) == uninitialized) {
     const char* str = ncclGetEnv(env);
     int64_t value = deftVal;
@@ -89,7 +90,6 @@ void ncclLoadParam(char const* env, int64_t deftVal, int64_t uninitialized, int6
     }
     __atomic_store_n(cache, value, __ATOMIC_RELAXED);
   }
-  pthread_mutex_unlock(&mutex);
 }
 
 const char* ncclGetEnv(const char* name) {
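Note on the param.cc hunks above (illustrative only, not part of the patch): the switch from pthread_mutex_lock/unlock to std::lock_guard means the lock is released on every exit path without an explicit unlock. The sketch below shows that pattern for a cached environment lookup. The names demoLoadParam, demoCache, demoUnsetCache and kUninit are hypothetical, std::atomic stands in for the __atomic builtins used by ncclLoadParam, and the strtoll parsing is an assumption about how the value is read.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <mutex>

// Hypothetical sentinel meaning "not loaded yet".
constexpr int64_t kUninit = INT64_MIN;

// Hypothetical caches, one per parameter.
static std::atomic<int64_t> demoCache{kUninit};
static std::atomic<int64_t> demoUnsetCache{kUninit};

// Read an integer parameter from the environment once, then serve it from the cache.
int64_t demoLoadParam(const char* env, int64_t deftVal, std::atomic<int64_t>& cache) {
  assert(env != nullptr);
  static std::mutex mutex;
  std::lock_guard<std::mutex> lock(mutex);  // unlocked automatically on every return
  int64_t value = cache.load(std::memory_order_relaxed);
  if (value != kUninit) return value;       // early return needs no manual unlock
  value = deftVal;
  if (const char* str = std::getenv(env)) {
    char* end = nullptr;
    errno = 0;
    long long parsed = std::strtoll(str, &end, 0);
    // Keep the default when the string is not a clean integer.
    if (errno == 0 && end != str && *end == '\0') value = parsed;
  }
  cache.store(value, std::memory_order_relaxed);
  return value;
}
```

Once a value is cached, later changes to the environment variable are not observed, which matches the load-once intent of the original function.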
diff --git a/projects/rccl/src/misc/shmutils.cc b/projects/rccl/src/misc/shmutils.cc
index eb9cd10156..59adedf242 100644
--- a/projects/rccl/src/misc/shmutils.cc
+++ b/projects/rccl/src/misc/shmutils.cc
@@ -114,8 +114,11 @@ ncclResult_t ncclShmOpen(char* shmPath, size_t shmPathSize, size_t shmSize, void
   }
 
   if (devShmPtr) {
+    cudaStreamCaptureMode mode = cudaStreamCaptureModeRelaxed;
+    CUDACHECKGOTO(cudaThreadExchangeStreamCaptureMode(&mode), ret, fail);
     CUDACHECKGOTO(cudaHostRegister((void*)hptr, realShmSize, cudaHostRegisterPortable | cudaHostRegisterMapped), ret, fail);
     CUDACHECKGOTO(cudaHostGetDevicePointer(&dptr, (void*)hptr, 0), ret, fail);
+    CUDACHECKGOTO(cudaThreadExchangeStreamCaptureMode(&mode), ret, fail);
   }
 
   shmHandleInit(fd, shmPath, shmSize, realShmSize, hptr, dptr, create, tmphandle);
@@ -182,34 +185,36 @@ ncclResult_t ncclShmUnlink(ncclShmHandle_t handle) {
 
 ncclResult_t ncclShmemAllgather(struct ncclComm *comm, struct ncclShmemCollBuff *shmem, void *sendbuff, void *recvbuff, size_t typeSize) {
   ncclResult_t ret = ncclSuccess;
-  int curRound;
-  size_t mycnt;
+  int nextRound = shmem->round + 1;
+  int curIndex = shmem->round % 2;
+  bool done;
+  int index = 0;
+  size_t maxTypeSize = shmem->maxTypeSize;
 
-  if (comm == NULL || shmem == NULL || sendbuff == NULL || recvbuff == NULL || shmem->maxTypeSize < typeSize) {
+  if (comm == NULL || shmem == NULL || sendbuff == NULL || recvbuff == NULL || maxTypeSize < typeSize) {
     ret = ncclInvalidArgument;
     goto exit;
   }
 
-  curRound = shmem->round;
-  memcpy((char*)shmem->ptr[curRound] + comm->localRank * typeSize, sendbuff, typeSize);
-  /* sync among local ranks */
-  mycnt = __atomic_add_fetch(shmem->cnt[curRound], 1, __ATOMIC_ACQ_REL);
-  if (mycnt == comm->localRanks) {
-    *shmem->cnt[curRound ^ 1] = 0; /* prepare next round */
-    __atomic_store_n(shmem->cnt[curRound], comm->localRanks + 1, __ATOMIC_RELEASE); /* release everyone */
-  } else {
-    uint64_t t0 = clockNano();
-    while(__atomic_load_n(shmem->cnt[curRound], __ATOMIC_ACQUIRE) != comm->localRanks + 1) {
-      if (clockNano() - t0 >= 5 * 1000) sched_yield();
-      if (__atomic_load_n(comm->abortFlag, __ATOMIC_ACQUIRE) == 1) {
-        ret = ncclInternalError;
-        goto exit;
+  memcpy((char*)shmem->ptr[curIndex] + comm->localRank * maxTypeSize, sendbuff, typeSize);
+  /* overwrite this rank's counter from the earlier round with nextRound to signal its arrival in this round */
+  __atomic_store_n((int*)((char*)shmem->cnt[curIndex] + CACHE_LINE_SIZE * comm->localRank), nextRound, __ATOMIC_RELEASE);
+
+  do {
+    done = true;
+    for (int i = index; i < comm->localRanks; ++i) {
+      if (i != comm->localRank && __atomic_load_n((int*)((char*)shmem->cnt[curIndex] + CACHE_LINE_SIZE * i), __ATOMIC_ACQUIRE) < nextRound) {
+        done = false;
+        index = i;
+        break;
       }
     }
-  }
+  } while (!done);
 
-  memcpy(recvbuff, (const void*)shmem->ptr[curRound], comm->localRanks * typeSize);
-  shmem->round ^= 1;
+  for (int i = 0; i < comm->localRanks; ++i) {
+    memcpy((uint8_t*)recvbuff + i * typeSize, (uint8_t*)shmem->ptr[curIndex] + i * maxTypeSize, typeSize);
+  }
+  shmem->round = nextRound;
 
 exit:
   return ret;
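Note on the ncclShmemAllgather rewrite above (illustrative only, not part of the patch): the new code replaces the shared arrival counter with one monotonically increasing round counter per local rank, each in its own cache line, and alternates between two buffers by round parity. The sketch below walks through that scheme in a single thread so the barrier condition is easy to follow. All Demo* names, kRanks and kSlotBytes are hypothetical, and std::atomic stands in for the __atomic builtins.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

constexpr int kRanks = 3;       // hypothetical local rank count
constexpr int kSlotBytes = 64;  // stands in for CACHE_LINE_SIZE and maxTypeSize

struct DemoShm {
  // Two data buffers and two flag arrays, selected by round parity.
  alignas(64) unsigned char data[2][kRanks * kSlotBytes] = {};
  struct alignas(64) Flag { std::atomic<int> round{0}; };
  Flag arrived[2][kRanks];
};

struct DemoRank {
  explicit DemoRank(int r) : rank(r), round(0) {}
  int rank;
  int round;  // private per-rank round counter, like shmem->round
};

// Step 1: write this rank's contribution, then publish its arrival (release).
void demoArrive(DemoShm& shm, const DemoRank& me, int value) {
  assert(me.rank >= 0 && me.rank < kRanks);
  const int cur = me.round % 2;
  std::memcpy(shm.data[cur] + me.rank * kSlotBytes, &value, sizeof(value));
  shm.arrived[cur][me.rank].round.store(me.round + 1, std::memory_order_release);
}

// Step 2: one polling pass; true once every rank has reached this round.
bool demoAllArrived(const DemoShm& shm, const DemoRank& me) {
  const int cur = me.round % 2;
  for (int i = 0; i < kRanks; ++i) {
    if (shm.arrived[cur][i].round.load(std::memory_order_acquire) < me.round + 1) return false;
  }
  return true;
}

// Step 3: gather every contribution and advance the private round counter.
void demoCollect(const DemoShm& shm, DemoRank& me, int* out) {
  const int cur = me.round % 2;
  for (int i = 0; i < kRanks; ++i) {
    std::memcpy(&out[i], shm.data[cur] + i * kSlotBytes, sizeof(int));
  }
  me.round += 1;
}

// Single-threaded walk-through over several rounds. Returns true if the barrier
// stayed closed until the last arrival and every rank gathered the right values.
bool demoWalkthrough(int rounds) {
  DemoShm shm;
  DemoRank ranks[kRanks] = {DemoRank(0), DemoRank(1), DemoRank(2)};
  bool ok = true;
  for (int round = 0; round < rounds; ++round) {
    for (int r = 0; r < kRanks; ++r) {
      demoArrive(shm, ranks[r], 100 * round + r);
      ok = ok && (demoAllArrived(shm, ranks[0]) == (r == kRanks - 1));
    }
    for (int r = 0; r < kRanks; ++r) {
      int out[kRanks];
      demoCollect(shm, ranks[r], out);
      for (int i = 0; i < kRanks; ++i) ok = ok && (out[i] == 100 * round + i);
    }
  }
  return ok;
}
```

Because a rank can only start round r+2 after all ranks have arrived in round r+1, the buffer for round r is never overwritten while a slower rank is still reading it. That is why two buffers appear to be sufficient here.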
diff --git a/projects/rccl/src/misc/socket.cc b/projects/rccl/src/misc/socket.cc
index d066d28293..5633fef3ed 100644
--- a/projects/rccl/src/misc/socket.cc
+++ b/projects/rccl/src/misc/socket.cc
@@ -149,6 +149,9 @@ static ncclResult_t findInterfaces(const char* prefixList, char* names, union nc
     if (family != AF_INET && family != AF_INET6)
       continue;
 
+    /* Only consider running interfaces, i.e. UP and physically attached. */
+    if (!(interface->ifa_flags & IFF_RUNNING)) continue;
+
     TRACE(NCCL_INIT|NCCL_NET,"Found interface %s:%s", interface->ifa_name, ncclSocketToString((union ncclSocketAddress *) interface->ifa_addr, line));
 
     /* Allow the caller to force the socket family type */
@@ -377,11 +380,12 @@ ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs,
         NCCLCHECK(ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, nIfs));
       }
     }
-    // Then look for anything else (but not docker or lo)
-    if (*nIfs == 0) NCCLCHECK(findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
+    // Then look for anything else (but not docker, lo, or virbr)
+    if (*nIfs == 0) NCCLCHECK(findInterfaces("^docker,lo,virbr", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
     // Finally look for docker, then lo.
     if (*nIfs == 0) NCCLCHECK(findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
     if (*nIfs == 0) NCCLCHECK(findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
+    if (*nIfs == 0) NCCLCHECK(findInterfaces("virbr", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
   }
   return ncclSuccess;
 }
diff --git a/projects/rccl/src/misc/strongstream.cc b/projects/rccl/src/misc/strongstream.cc
index 1766f41673..d92b506cb7 100644
--- a/projects/rccl/src/misc/strongstream.cc
+++ b/projects/rccl/src/misc/strongstream.cc
@@ -8,6 +8,7 @@
 #include "cudawrap.h"
 #include "checks.h"
 #include "param.h"
+#include <mutex>
 
 #if CUDART_VERSION >= 13000
 #define cudaStreamGetCaptureInfo_v3 cudaStreamGetCaptureInfo
@@ -27,14 +28,14 @@ struct ncclStrongStreamCapture {
 ////////////////////////////////////////////////////////////////////////////////
 
 static ncclCudaContext* cxtListHead = nullptr;
-static pthread_mutex_t cxtListLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex cxtListMutex;
 
 ncclResult_t ncclCudaContextTrack(struct ncclCudaContext** out) {
   ncclResult_t result = ncclSuccess;
   CUcontext hcontext;
   CUCHECK(cuCtxGetCurrent(&hcontext));
 
-  pthread_mutex_lock(&cxtListLock);
+  std::lock_guard<std::mutex> lock(cxtListMutex);
   struct ncclCudaContext* p = cxtListHead;
   while (1) {
     if (p == nullptr) {
@@ -53,13 +54,12 @@ ncclResult_t ncclCudaContextTrack(struct ncclCudaContext** out) {
     p = p->next;
   }
 leave:
-  pthread_mutex_unlock(&cxtListLock);
   *out = p;
   return ncclSuccess;
 }
 
 void ncclCudaContextDrop(struct ncclCudaContext* cxt) {
-  pthread_mutex_lock(&cxtListLock);
+  std::lock_guard<std::mutex> lock(cxtListMutex);
   if (0 == --cxt->refCount) {
     struct ncclCudaContext** pp = &cxtListHead;
     while (*pp != cxt) pp = &(*pp)->next;
@@ -68,7 +68,6 @@ void ncclCudaContextDrop(struct ncclCudaContext* cxt) {
     ncclStrongStreamDestruct(&cxt->launchOrder);
     free(cxt);
   }
-  pthread_mutex_unlock(&cxtListLock);
 }
 
 ////////////////////////////////////////////////////////////////////////////////
diff --git a/projects/rccl/src/misc/utils.cc b/projects/rccl/src/misc/utils.cc
index bb59947e46..7e7179411f 100644
--- a/projects/rccl/src/misc/utils.cc
+++ b/projects/rccl/src/misc/utils.cc
@@ -10,6 +10,7 @@
 #include "nvmlwrap.h"
 
 #include <stdlib.h>
+#include <mutex>
 
 // Get current Compute Capability
 int ncclCudaCompCap() {
@@ -107,8 +108,8 @@ static void getHostHashOnce() {
   hostHashValue = getHash(hostHash, strlen(hostHash));
 }
 uint64_t getHostHash(void) {
-  static pthread_once_t once = PTHREAD_ONCE_INIT;
-  pthread_once(&once, getHostHashOnce);
+  static std::once_flag once;
+  std::call_once(once, getHostHashOnce);
   return hostHashValue;
 }
 
@@ -289,3 +290,28 @@ void ncclMemoryStackDestruct(struct ncclMemoryStack* me) {
     h = h1;
   }
 }
+
+/* return concatenated string representing each set bit */
+ncclResult_t ncclBitsToString(uint32_t bits, uint32_t mask, const char* (*toStr)(int), char *buf, size_t bufLen, const char *wildcard) {
+  if (!buf || !bufLen)
+    return ncclInvalidArgument;
+
+  bits &= mask;
+
+  // print wildcard value if all bits set
+  if (wildcard && bits == mask) {
+    snprintf(buf, bufLen, "%s", wildcard);
+    return ncclSuccess;
+  }
+
+  // Add each set bit to string
+  size_t pos = 0;
+  buf[0] = '\0';  // keep buf a valid (empty) string when no bits are set
+  for (int i = 0; bits && pos < bufLen; i++, bits >>= 1) {  // stop once output is truncated
+    if (bits & 1) {
+      pos += snprintf(buf + pos, bufLen - pos, "%s%s", pos > 0 ? "|" : "", toStr(i));
+    }
+  }
+
+  return ncclSuccess;
+}
diff --git a/projects/rccl/src/mnnvl.cc b/projects/rccl/src/mnnvl.cc
index 34a18b80a4..fb41106ab7 100644
--- a/projects/rccl/src/mnnvl.cc
+++ b/projects/rccl/src/mnnvl.cc
@@ -36,7 +36,11 @@ ncclResult_t ncclMnnvlCheck(struct ncclComm* comm) {
     nvmlGpuFabricInfoV_t *fabricInfo2 = &comm->peerInfo[i].fabricInfo;
     // Check if the cluster UUID and cliqueId match
     // A zero UUID means we don't have MNNVL fabric info - disable MNNVL
-    if ((((long *)&fabricInfo2->clusterUuid)[0]|((long *)fabricInfo2->clusterUuid)[1]) == 0) return ncclSuccess;
+    unsigned long uuid0 = 0;
+    unsigned long uuid1 = 0;
+    memcpy(&uuid0, fabricInfo2->clusterUuid, sizeof(uuid0));
+    memcpy(&uuid1, fabricInfo2->clusterUuid + sizeof(uuid0), sizeof(uuid1));
+    if ((uuid0 | uuid1) == 0) return ncclSuccess;
     if ((memcmp(fabricInfo1->clusterUuid, fabricInfo2->clusterUuid, NVML_GPU_FABRIC_UUID_LEN) == 0) &&
         (fabricInfo1->cliqueId == fabricInfo2->cliqueId)) {
       if (i == comm->rank) {
diff --git a/projects/rccl/src/nccl.h.in b/projects/rccl/src/nccl.h.in
index 292a839143..0c53c826e7 100644
--- a/projects/rccl/src/nccl.h.in
+++ b/projects/rccl/src/nccl.h.in
@@ -12,7 +12,7 @@
 #if CUDART_VERSION >= 11000
 #include <cuda_bf16.h>
 #endif
-#if CUDART_VERSION >= 11080
+#if __cplusplus && CUDART_VERSION >= 11080
 #include <cuda_fp8.h>
 #endif
 
@@ -29,9 +29,10 @@ extern "C" {
 #endif
 
 #include <limits.h>
+
 /* Opaque handle to communicator */
 typedef struct ncclComm* ncclComm_t;
-typedef struct ncclWindow* ncclWindow_t;
+typedef struct ncclWindow_vidmem* ncclWindow_t;
 #define NCCL_COMM_NULL NULL
 
 #define NCCL_UNIQUE_ID_BYTES 128
@@ -57,9 +58,12 @@ typedef enum { ncclSuccess                 =  0,
 #define NCCL_WIN_DEFAULT 0x00
 #define NCCL_WIN_COLL_SYMMETRIC 0x01
 
+#define NCCL_WIN_REQUIRED_ALIGNMENT 4096
+
 /* NCCL performance policy */
 #define NCCL_CTA_POLICY_DEFAULT 0x00
 #define NCCL_CTA_POLICY_EFFICIENCY 0x01
+#define NCCL_CTA_POLICY_ZERO 0x02
 
 /* ncclCommShrink flags*/
 #define NCCL_SHRINK_DEFAULT 0x00 /* shrink the parent communicator */
@@ -67,7 +71,7 @@ typedef enum { ncclSuccess                 =  0,
 
 /* Communicator configuration. Users can assign value to attributes to specify the
  * behavior of a communicator. */
-typedef struct ncclConfig_v22700 {
+typedef struct ncclConfig_v22800 {
   /* attributes that users should never touch. */
   size_t size;
   unsigned int magic;
@@ -85,6 +89,8 @@ typedef struct ncclConfig_v22700 {
   int CTAPolicy;
   int shrinkShare;
   int nvlsCTAs;
+  int nChannelsPerNetPeer;
+  int nvlinkCentricSched;
 } ncclConfig_t;
 
 /* Config initializer must be assigned to initialize config structure when it is created.
@@ -105,6 +111,8 @@ typedef struct ncclConfig_v22700 {
   NCCL_CONFIG_UNDEF_INT,                    /* CTAPolicy */             \
   NCCL_CONFIG_UNDEF_INT,                    /* shrinkShare */           \
   NCCL_CONFIG_UNDEF_INT,                    /* nvlsCTAs */              \
+  NCCL_CONFIG_UNDEF_INT,                    /* nChannelsPerNetPeer */   \
+  NCCL_CONFIG_UNDEF_INT,                    /* nvlinkCentricSched */    \
 }
 
 /* This struct will be used by ncclGroupSimulateEnd() API to query information about simulation. */
@@ -220,7 +228,9 @@ const char*  ncclGetLastError(ncclComm_t comm);
 const char* pncclGetLastError(ncclComm_t comm);
 
 /* Reload environment variables that determine logging. */
+__attribute__ ((deprecated("ncclResetDebugInit is not supported as part of the NCCL API and will be removed in the future")))
 void  ncclResetDebugInit();
+__attribute__ ((deprecated("pncclResetDebugInit is not supported as part of the NCCL API and will be removed in the future")))
 void pncclResetDebugInit();
 
 /* Checks whether the comm has encountered any asynchronous errors */
@@ -427,6 +437,49 @@ ncclResult_t  ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcou
 ncclResult_t pncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
     ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);
 
+/*
+ * All-to-All
+ *
+ * Each device sends count values to all other devices and receives count values
+ * from all other devices. Data to send to destination rank j is taken from
+ * sendbuff+j*count and data received from source rank i is placed at
+ * recvbuff+i*count.
+ */
+ncclResult_t  ncclAlltoAll(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);
+ncclResult_t pncclAlltoAll(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream);
+
+/*
+ * Gather
+ *
+ * Each rank sends count elements from sendbuff to the root rank.
+ * On the root rank, data from rank i is placed at recvbuff + i*count.
+ * On non-root ranks, recvbuff is not used.
+ * root is the rank where data will be gathered.
+ *
+ * In-place operations will happen if sendbuff == recvbuff + root * count.
+ */
+ncclResult_t  ncclGather(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
+ncclResult_t pncclGather(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
+
+/*
+ * Scatter
+ *
+ * On the root rank, count elements from sendbuff+i*count are sent to rank i.
+ * On non-root ranks, sendbuff is not used.
+ * Each rank receives count elements into recvbuff.
+ * root is the rank that will distribute the data.
+ *
+ * In-place operations will happen if recvbuff == sendbuff + root * count.
+ */
+ncclResult_t  ncclScatter(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
+ncclResult_t pncclScatter(const void* sendbuff, void* recvbuff, size_t count,
+    ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream);
+
 /*
  * Send
  *
diff --git a/projects/rccl/src/nccl_device/CMakeLists.txt b/projects/rccl/src/nccl_device/CMakeLists.txt
new file mode 100644
index 0000000000..9d0c3d1006
--- /dev/null
+++ b/projects/rccl/src/nccl_device/CMakeLists.txt
@@ -0,0 +1,9 @@
+# Register sources
+set(SYM_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/core.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/ll_a2a.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/mem_barrier.cc
+)
+
+# Add register sources to parent scope
+set(SYM_SOURCES ${SYM_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/nccl_device/core.cc b/projects/rccl/src/nccl_device/core.cc
new file mode 100644
index 0000000000..bae6b39bfa
--- /dev/null
+++ b/projects/rccl/src/nccl_device/core.cc
@@ -0,0 +1,57 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "core.h"
+#include "comm.h"
+#include "nccl_device/impl/core__funcs.h"
+
+NCCL_API(ncclTeam_t, ncclTeamWorld, ncclComm_t comm);
+ncclTeam_t ncclTeamWorld(ncclComm_t comm) {
+  ncclTeam_t ans;
+  ans.nRanks = comm->nRanks;
+  ans.rank = comm->rank;
+  ans.stride = 1;
+  return ans;
+}
+
+NCCL_API(ncclTeam_t, ncclTeamLsa, ncclComm_t comm);
+ncclTeam_t ncclTeamLsa(ncclComm_t comm) {
+  // Ignoring errors since if it fails ncclDevrInitOnce will try again.
+  // The returned team will be junk and the next "interesting" API call that
+  // needs ncclDevrInitOnce will report the error.
+  if (ncclSuccess != ncclDevrInitOnce(comm)) return ncclTeam_t{};
+
+  ncclTeam_t ans;
+  ans.nRanks = comm->devrState.lsaSize;
+  ans.rank = comm->devrState.lsaSelf;
+  ans.stride = 1;
+  return ans;
+}
+
+NCCL_API(ncclTeam_t, ncclTeamRail, ncclComm_t comm);
+ncclTeam_t ncclTeamRail(ncclComm_t comm) {
+  // Ignoring errors as above.
+  if (ncclSuccess != ncclDevrInitOnce(comm)) return ncclTeam_t{};
+
+  ncclTeam_t ans;
+  ans.nRanks = comm->nRanks/comm->devrState.lsaSize;
+  ans.rank = comm->rank/comm->devrState.lsaSize;
+  ans.stride = comm->devrState.lsaSize;
+  return ans;
+}
+
+NCCL_API(int, ncclTeamRankToWorld, ncclComm_t comm, ncclTeam_t team, int rank);
+int ncclTeamRankToWorld(ncclComm_t comm, ncclTeam_t team, int rank) {
+  return comm->rank + (rank - team.rank)*team.stride;
+}
+
+NCCL_API(int, ncclTeamRankToLsa, ncclComm_t comm, ncclTeam_t team, int rank);
+int ncclTeamRankToLsa(ncclComm_t comm, ncclTeam_t team, int rank) {
+  // Ignoring errors as above.
+  if (ncclSuccess != ncclDevrInitOnce(comm)) return -1;
+
+  return comm->devrState.lsaSelf + (rank - team.rank)*team.stride;
+}
diff --git a/projects/rccl/src/nccl_device/ll_a2a.cc b/projects/rccl/src/nccl_device/ll_a2a.cc
new file mode 100644
index 0000000000..6a51d0f2b3
--- /dev/null
+++ b/projects/rccl/src/nccl_device/ll_a2a.cc
@@ -0,0 +1,26 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "core.h"
+#include "nccl_device/impl/ll_a2a__funcs.h"
+
+NCCL_API(int, ncclLLA2ACalcSlots, int maxElts, int maxEltSize);
+int ncclLLA2ACalcSlots(int maxElts, int maxEltSize) {
+  return maxElts*divUp(maxEltSize, 8);
+}
+
+NCCL_API(ncclResult_t, ncclLLA2ACreateRequirement, int nBlocks, int nSlots, ncclLLA2AHandle_t* outHandle, ncclDevResourceRequirements_t* outReq);
+ncclResult_t ncclLLA2ACreateRequirement(
+    int nBlocks, int nSlots, ncclLLA2AHandle_t* outHandle,
+    ncclDevResourceRequirements_t* outReq
+  ) {
+  outHandle->nSlots = nSlots;
+  memset(outReq, 0, sizeof(*outReq));
+  outReq->bufferSize = nBlocks*(1 + 2*nSlots)*16;
+  outReq->bufferAlign = 16;
+  outReq->outBufferHandle = &outHandle->bufHandle;
+  return ncclSuccess;
+}
diff --git a/projects/rccl/src/nccl_device/mem_barrier.cc b/projects/rccl/src/nccl_device/mem_barrier.cc
new file mode 100644
index 0000000000..b6c400fa4e
--- /dev/null
+++ b/projects/rccl/src/nccl_device/mem_barrier.cc
@@ -0,0 +1,21 @@
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "core.h"
+#include "nccl_device/impl/mem_barrier__funcs.h"
+
+NCCL_API(ncclResult_t, ncclLsaBarrierCreateRequirement, ncclTeam_t team, int nBarriers, ncclLsaBarrierHandle_t* outHandle, ncclDevResourceRequirements_t* outReq);
+ncclResult_t ncclLsaBarrierCreateRequirement(
+    ncclTeam_t team, int nBarriers, ncclLsaBarrierHandle_t* outHandle,
+    ncclDevResourceRequirements_t* outReq
+  ) {
+  memset(outReq, 0, sizeof(*outReq));
+  outHandle->nBarriers = nBarriers;
+  outReq->bufferSize = (3*nBarriers + nBarriers*team.nRanks)*sizeof(uint32_t);
+  outReq->bufferAlign = alignof(uint32_t);
+  outReq->outBufferHandle = &outHandle->bufHandle;
+  return ncclSuccess;
+}
diff --git a/projects/rccl/src/plugin/CMakeLists.txt b/projects/rccl/src/plugin/CMakeLists.txt
new file mode 100644
index 0000000000..2ef9282f6d
--- /dev/null
+++ b/projects/rccl/src/plugin/CMakeLists.txt
@@ -0,0 +1,18 @@
+# Add plugin subdirectories
+add_subdirectory(profiler)
+add_subdirectory(net)
+add_subdirectory(tuner)
+
+# Plugin sources
+set(PLUGIN_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/net.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/plugin_open.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/tuner.cc
+    ${PLUGIN_NET_SOURCES}
+    ${PLUGIN_PROFILER_SOURCES}
+    ${PLUGIN_TUNER_SOURCES}
+)
+
+# Add plugin sources to parent scope
+set(PLUGIN_SOURCES ${PLUGIN_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/plugin/net.cc b/projects/rccl/src/plugin/net.cc
index aa80c12aba..6abd0804db 100644
--- a/projects/rccl/src/plugin/net.cc
+++ b/projects/rccl/src/plugin/net.cc
@@ -12,6 +12,7 @@
 
 #include <string.h>
 #include <errno.h>
+#include <mutex>
 //#include <sys/types.h>
 //#include <sys/stat.h>
 //#include <unistd.h>
@@ -24,17 +25,19 @@ extern getNcclNet_t getNcclNet_v7;
 extern getNcclNet_t getNcclNet_v8;
 extern getNcclNet_t getNcclNet_v9;
 extern getNcclNet_t getNcclNet_v10;
+extern getNcclNet_t getNcclNet_v11;
 extern getNcclCollNet_t getNcclCollNet_v6;
 extern getNcclCollNet_t getNcclCollNet_v7;
 extern getNcclCollNet_t getNcclCollNet_v8;
 extern getNcclCollNet_t getNcclCollNet_v9;
 extern getNcclCollNet_t getNcclCollNet_v10;
+extern getNcclCollNet_t getNcclCollNet_v11;
 
-NCCL_PARAM(NetPluginRefCount, "NET_PLUGIN_REF_COUNT", 1);
-#define NCCL_NET_VERSION_COUNT 5
-int ncclNetVersion[NCCL_NET_VERSION_COUNT] = {10, 9, 8, 7, 6};
-getNcclNet_t* getNcclNet[NCCL_NET_VERSION_COUNT] = {getNcclNet_v10, getNcclNet_v9, getNcclNet_v8, getNcclNet_v7, getNcclNet_v6};
-getNcclCollNet_t* getNcclCollNet[NCCL_NET_VERSION_COUNT] = {getNcclCollNet_v10, getNcclCollNet_v9, getNcclCollNet_v8, getNcclCollNet_v7,  getNcclCollNet_v6};
+NCCL_PARAM(NetPluginRefCount, "NET_PLUGIN_REF_COUNT", 0);
+#define NCCL_NET_VERSION_COUNT 6
+int ncclNetVersion[NCCL_NET_VERSION_COUNT] = {11, 10, 9, 8, 7, 6};
+getNcclNet_t* getNcclNet[NCCL_NET_VERSION_COUNT] = {getNcclNet_v11, getNcclNet_v10, getNcclNet_v9, getNcclNet_v8, getNcclNet_v7, getNcclNet_v6};
+getNcclCollNet_t* getNcclCollNet[NCCL_NET_VERSION_COUNT] = {getNcclCollNet_v11, getNcclCollNet_v10, getNcclCollNet_v9, getNcclCollNet_v8, getNcclCollNet_v7, getNcclCollNet_v6};
 
 #define NCCL_NET_NUM_INTERNAL_PLUGINS 2
 
@@ -56,19 +59,27 @@ typedef struct netPluginLib {
   ncclNetPluginState_t ncclNetPluginState;      // State of the nccl net plugin
   ncclNetPluginState_t ncclCollNetPluginState;  // State of the nccl coll net plugin
   int ncclNetPluginRefCount;                    // Reference count for the nccl net plugin
+  int netPhysDevs;                              // ncclNet - number of physical devices
+  int netVirtDevs;                              // ncclNet - number of virtual devices
+  int collNetPhysDevs;                          // ncclCollNet - number of physical devices
+  int collNetVirtDevs;                          // ncclCollNet - number of virtual devices
 } netPluginLib_t;
 
 int pluginCount = 0;
 bool netPluginLibsInitialized = false;
 netPluginLib_t netPluginLibs[NCCL_NET_MAX_PLUGINS] = { 0 };
-static pthread_mutex_t netPluginLock = PTHREAD_MUTEX_INITIALIZER;
-static pthread_once_t initPluginLibsOnceControl = PTHREAD_ONCE_INIT;
+static std::mutex netPluginMutex;
+static std::once_flag initPluginLibsOnceFlag;
 
 static ncclResult_t ncclNetPluginUnload(netPluginLib_t* pluginLib) {
   if ((pluginLib->dlHandle) && ((pluginLib->ncclNetPluginRefCount) == 0)) {
     INFO(NCCL_INIT|NCCL_NET, "Unloading plugin %s", pluginLib->name);
     NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle, ncclPluginTypeNet));
+    // memset will reset the status to ncclNetPluginStateLoadReady
     memset(pluginLib, 0, sizeof(netPluginLib_t));
+    // reset the count of devices to UNDEF_DEV_COUNT
+    pluginLib->netPhysDevs = pluginLib->netVirtDevs = NCCL_UNDEF_DEV_COUNT;
+    pluginLib->collNetPhysDevs = pluginLib->collNetVirtDevs = NCCL_UNDEF_DEV_COUNT;
   }
   return ncclSuccess;
 }
@@ -85,11 +96,15 @@ static ncclResult_t ncclNetPluginLoad(netPluginLib_t* pluginLib) {
   }
 
   // if we fail to find a net, exit
-  if (pluginLib->ncclNet == nullptr) goto fail;
+  if (pluginLib->ncclNet == nullptr) {
+    INFO(NCCL_INIT|NCCL_NET, "External network plugin %s is unsupported",
+         (ncclPluginLibPaths[ncclPluginTypeNet] ? ncclPluginLibPaths[ncclPluginTypeNet] : pluginLib->name));
+    goto fail;
+  }
 
   pluginLib->ncclNetPluginState = ncclNetPluginStateInitReady;
 
-  // load ncclColNet
+  // load ncclCollNet
   for (int i = 0; i < NCCL_NET_VERSION_COUNT; i++) {
     pluginLib->ncclCollNet = getNcclCollNet[i](pluginLib->dlHandle);
     if (pluginLib->ncclCollNet) break;
@@ -100,7 +115,8 @@ static ncclResult_t ncclNetPluginLoad(netPluginLib_t* pluginLib) {
   else
     pluginLib->ncclCollNetPluginState = ncclNetPluginStateInitReady;
 
-  INFO(NCCL_INIT|NCCL_NET, "Successfully loaded external plugin %s", pluginLib->name);
+  INFO(NCCL_INIT|NCCL_NET, "Successfully loaded external network plugin %s",
+       (ncclPluginLibPaths[ncclPluginTypeNet] ? ncclPluginLibPaths[ncclPluginTypeNet] : pluginLib->name));
 exit:
   return ncclSuccess;
 fail:
@@ -137,25 +153,35 @@ ncclResult_t ncclNetCheckDeviceVersion(struct ncclComm* comm, ncclNet_t* net, in
   return ncclSuccess;
 }
 
-static ncclResult_t ncclNetPluginInit(netPluginLib_t* pluginLib) {
+static ncclResult_t ncclNetPluginInit(struct ncclComm* comm, netPluginLib_t* pluginLib) {
   int ndev;
-  if (pluginLib->ncclNetPluginState == ncclNetPluginStateInitReady && pluginLib->ncclNet) {
-    if (pluginLib->ncclNet->init(ncclDebugLog, ncclProfilerCallback) != ncclSuccess) goto fail;
+  if (pluginLib->ncclNetPluginState >= ncclNetPluginStateInitReady && pluginLib->ncclNet) {
+    ncclNetCommConfig_t commConfig = {};
+    commConfig.trafficClass = comm->config.trafficClass == NCCL_CONFIG_UNDEF_INT ? NCCL_NET_TRAFFIC_CLASS_UNDEF : comm->config.trafficClass;
+    if (pluginLib->ncclNet->init(&comm->netContext, comm->commHash, &commConfig, ncclDebugLog, ncclProfilerCallback) != ncclSuccess) goto fail;
     if (pluginLib->ncclNet->devices(&ndev) != ncclSuccess || ndev <= 0) goto fail;
+    pluginLib->netPhysDevs = ndev;
+    pluginLib->netVirtDevs = NCCL_UNDEF_DEV_COUNT;
   }
   pluginLib->ncclNetPluginState = ncclNetPluginStateEnabled;
   INFO(NCCL_INIT|NCCL_NET, "Initialized NET plugin %s", pluginLib->ncclNet->name);
 
-  if (pluginLib->ncclCollNetPluginState == ncclNetPluginStateInitReady && pluginLib->ncclCollNet) {
-    if (pluginLib->ncclCollNet->init(ncclDebugLog) != ncclSuccess) pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
+  if (pluginLib->ncclCollNetPluginState >= ncclNetPluginStateInitReady && pluginLib->ncclCollNet) {
+    if (pluginLib->ncclCollNet->init(&comm->collNetContext, comm->commHash, ncclDebugLog) != ncclSuccess) pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
     else if (pluginLib->ncclCollNet->devices(&ndev) != ncclSuccess || ndev <= 0) pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
     else {
+      pluginLib->collNetPhysDevs = ndev;
+      pluginLib->collNetVirtDevs = NCCL_UNDEF_DEV_COUNT;
       pluginLib->ncclCollNetPluginState = ncclNetPluginStateEnabled;
     }
   }
 exit:
   return ncclSuccess;
 fail:
+  INFO(NCCL_INIT|NCCL_NET, "Failed to initialize NET plugin %s", pluginLib->ncclNet->name);
+  pluginLib->ncclNet->finalize(comm->netContext);
+  pluginLib->netPhysDevs = pluginLib->netVirtDevs = NCCL_UNDEF_DEV_COUNT;
+  pluginLib->collNetPhysDevs = pluginLib->collNetVirtDevs = NCCL_UNDEF_DEV_COUNT;
   pluginLib->ncclNetPluginState = ncclNetPluginStateDisabled;
   pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
   goto exit;
@@ -214,6 +240,9 @@ static void initPluginLibsOnceFunc() {
   memset(netPluginLibs, 0, NCCL_NET_MAX_PLUGINS * sizeof(netPluginLib_t));
   envNetPlugin = ncclGetEnv("NCCL_NET_PLUGIN");
   if (envNetPlugin) {
+    INFO(NCCL_ENV|NCCL_NET, "NCCL_NET_PLUGIN set by environment to %s", envNetPlugin);
+    if (strcasecmp(envNetPlugin, "none") == 0)
+      envNetPlugin = "";
     envNetPluginList = strdup(envNetPlugin);
     // Iterate over list until the list is empty
     netPluginName = strtok_r(envNetPluginList, ",", &savePtr);
@@ -221,7 +250,7 @@ static void initPluginLibsOnceFunc() {
       // We have 2 internal plugins (ib and socket)
       // So, we can have at most( NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS)) in the NCCL_NET_PLUGIN list
       if (pluginCounter >= (NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS))) {
-        INFO(NCCL_NET|NCCL_INIT,"NCCL_NET_PLUGIN list contains more than %d plugins, ignoring the rest", (NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS + 1)));
+        INFO(NCCL_NET|NCCL_ENV,"NCCL_NET_PLUGIN list contains more than %d plugins, ignoring the rest", (NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS + 1)));
         break;
       }
       // need to leave space for the name + "\n"
@@ -231,7 +260,7 @@ static void initPluginLibsOnceFunc() {
         strcpy(netPluginLibs[pluginCounter].name, netPluginName);
         pluginCounter++;
       } else {
-        INFO(NCCL_NET|NCCL_INIT,"NCCL_NET_PLUGIN list contains a plugin name %s longer than %d characters, ignoring it.", netPluginName, MAX_STR_LEN);
+        INFO(NCCL_NET|NCCL_ENV,"NCCL_NET_PLUGIN list contains a plugin name %s longer than %d characters, ignoring it.", netPluginName, MAX_STR_LEN);
       }
       netPluginName = strtok_r(nullptr, ",", &savePtr);
     }
@@ -253,14 +282,14 @@ static void initPluginLibsOnceFunc() {
 
 ncclResult_t ncclNetInit(struct ncclComm* comm) {
   bool ncclNetPluginInitialized = false;
-  pthread_once(&initPluginLibsOnceControl, initPluginLibsOnceFunc);
-  pthread_mutex_lock(&netPluginLock);
+  std::call_once(initPluginLibsOnceFlag, initPluginLibsOnceFunc);
+  std::lock_guard<std::mutex> lock(netPluginMutex);
   for (int pluginIndex = 0; pluginIndex < pluginCount; pluginIndex++) {
     if ((pluginIndex < (pluginCount - NCCL_NET_NUM_INTERNAL_PLUGINS)) && (netPluginLibs[pluginIndex].ncclNetPluginState == ncclNetPluginStateLoadReady)) {
       NCCLCHECK(ncclNetPluginLoad(&netPluginLibs[pluginIndex]));
     }
-    if (netPluginLibs[pluginIndex].ncclNetPluginState == ncclNetPluginStateInitReady) {
-      NCCLCHECK(ncclNetPluginInit(&netPluginLibs[pluginIndex]));
+    if (netPluginLibs[pluginIndex].ncclNetPluginState >= ncclNetPluginStateInitReady) {
+      NCCLCHECK(ncclNetPluginInit(comm, &netPluginLibs[pluginIndex]));
     }
     if (netPluginLibs[pluginIndex].ncclNetPluginState == ncclNetPluginStateEnabled) {
       bool isAssigned = false;
@@ -273,7 +302,6 @@ ncclResult_t ncclNetInit(struct ncclComm* comm) {
       }
     }
   }
-  pthread_mutex_unlock(&netPluginLock);
   if (ncclNetPluginInitialized) return ncclSuccess;
   WARN("Failed to initialize any NET plugin");
   return ncclInvalidUsage;
@@ -281,15 +309,60 @@ ncclResult_t ncclNetInit(struct ncclComm* comm) {
 
 ncclResult_t ncclNetFinalize(struct ncclComm* comm) {
   int pluginIndex = comm->netPluginIndex;
-  pthread_mutex_lock(&netPluginLock);
+  std::lock_guard<std::mutex> lock(netPluginMutex);
+  NCCLCHECK(comm->ncclNet->finalize(comm->netContext));
+  if (comm->collNetContext) NCCLCHECK(comm->ncclCollNet->finalize(comm->collNetContext));
   netPluginLibs[pluginIndex].ncclNetPluginRefCount--;
   for (int i = 0; i < (pluginCount - NCCL_NET_NUM_INTERNAL_PLUGINS); i++) {
     NCCLCHECK(ncclNetPluginUnload(&netPluginLibs[i]));
   }
-  pthread_mutex_unlock(&netPluginLock);
   return ncclSuccess;
 }
 
+ncclResult_t ncclNetGetDevCount(int netPluginIndex, int* nPhysDevs, int* nVirtDevs) {
+  if (netPluginLibs[netPluginIndex].ncclNetPluginState != ncclNetPluginStateEnabled ||
+     netPluginLibs[netPluginIndex].netPhysDevs == NCCL_UNDEF_DEV_COUNT) goto fail;
+  // lock not needed as it's called within a lock already in ncclTopoGetSystem
+  *nPhysDevs = netPluginLibs[netPluginIndex].netPhysDevs;
+  *nVirtDevs = netPluginLibs[netPluginIndex].netVirtDevs;
+  return ncclSuccess;
+fail:
+  WARN("%s: trying to access the number of devices of an uninitialized netPlugin[%d]", __func__, netPluginIndex);
+  return ncclInternalError;
+}
+
+ncclResult_t ncclCollNetGetDevCount(int netPluginIndex, int* nPhysDevs, int* nVirtDevs) {
+  if (netPluginLibs[netPluginIndex].ncclCollNetPluginState != ncclNetPluginStateEnabled ||
+     netPluginLibs[netPluginIndex].collNetPhysDevs == NCCL_UNDEF_DEV_COUNT) goto fail;
+  // lock not needed as it's called within a lock already in ncclTopoGetSystem
+  *nPhysDevs = netPluginLibs[netPluginIndex].collNetPhysDevs;
+  *nVirtDevs = netPluginLibs[netPluginIndex].collNetVirtDevs;
+  return ncclSuccess;
+fail:
+  WARN("%s: trying to access the number of devices of an uninitialized netPlugin[%d]", __func__, netPluginIndex);
+  return ncclInternalError;
+}
+
+ncclResult_t ncclNetSetVirtDevCount(int netPluginIndex, int nVirtDevs) {
+  if (netPluginLibs[netPluginIndex].ncclNetPluginState != ncclNetPluginStateEnabled || nVirtDevs < 0) goto fail;
+  // lock not needed as it's called within a lock already in ncclTopoGetSystem
+  netPluginLibs[netPluginIndex].netVirtDevs = nVirtDevs;
+  return ncclSuccess;
+fail:
+  WARN("%s: failed to set the number of devices for netPlugin[%d] to %d", __func__, netPluginIndex, nVirtDevs);
+  return ncclInternalError;
+}
+
+ncclResult_t ncclCollNetSetVirtDevCount(int netPluginIndex, int nVirtDevs) {
+  if (netPluginLibs[netPluginIndex].ncclCollNetPluginState != ncclNetPluginStateEnabled || nVirtDevs < 0) goto fail;
+  // lock not needed as it's called within a lock already in ncclTopoGetSystem
+  netPluginLibs[netPluginIndex].collNetVirtDevs = nVirtDevs;
+  return ncclSuccess;
+fail:
+  WARN("%s: failed to set the number of devices for netPlugin[%d] to %d", __func__, netPluginIndex, nVirtDevs);
+  return ncclInternalError;
+}
+
 ncclResult_t ncclGpuGdrSupport(struct ncclComm* comm, int* gdrSupport) {
   constexpr int GPU_BUF_SIZE = 2*1024*1024;
 #if CUDART_VERSION >= 11030
@@ -324,7 +397,7 @@ ncclResult_t ncclGpuGdrSupport(struct ncclComm* comm, int* gdrSupport) {
     void* mHandle = NULL;
     ncclResult_t ret;
     ncclDebugNoWarn = NCCL_NET;
-    NCCLCHECKGOTO(comm->ncclNet->listen(dev, &handle, &lComm), ret, cleanup1);
+    NCCLCHECKGOTO(comm->ncclNet->listen(comm->netContext, dev, &handle, &lComm), ret, cleanup1);
 
     bool connected;
     connected = false;
@@ -336,7 +409,7 @@ ncclResult_t ncclGpuGdrSupport(struct ncclComm* comm, int* gdrSupport) {
       }
 
       if (sComm == NULL)
-        NCCLCHECKGOTO(comm->ncclNet->connect(dev, NULL, &handle, &sComm, NULL), ret, cleanup2);
+        NCCLCHECKGOTO(comm->ncclNet->connect(comm->netContext, dev, &handle, &sComm, NULL), ret, cleanup2);
 
       if (rComm == NULL)
         NCCLCHECKGOTO(comm->ncclNet->accept(lComm, &rComm, NULL), ret, cleanup2);
diff --git a/projects/rccl/src/plugin/net/CMakeLists.txt b/projects/rccl/src/plugin/net/CMakeLists.txt
new file mode 100644
index 0000000000..0a6fcb2372
--- /dev/null
+++ b/projects/rccl/src/plugin/net/CMakeLists.txt
@@ -0,0 +1,12 @@
+# Net plugin sources
+set(PLUGIN_NET_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_v9.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_v6.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_v7.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_v8.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_v10.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_v11.cc
+)
+
+# Add net plugin sources to parent scope
+set(PLUGIN_NET_SOURCES ${PLUGIN_NET_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/plugin/net/net_v10.cc b/projects/rccl/src/plugin/net/net_v10.cc
index 682f239f7f..591a57ac08 100644
--- a/projects/rccl/src/plugin/net/net_v10.cc
+++ b/projects/rccl/src/plugin/net/net_v10.cc
@@ -7,26 +7,203 @@
 #include "nccl_net.h"
 #include "net_device.h"
 #include "proxy.h"
+#include "checks.h"
+#include <dlfcn.h>
 
+static ncclNet_t ncclNet;
+static ncclCollNet_t ncclCollNet;
 static ncclNet_v10_t* ncclNet_v10;
 static ncclCollNet_v10_t* ncclCollNet_v10;
 
+#define NET_INDEX 0
+#define COLLNET_INDEX 1
+#define INDEX_NUMS 2
+static int refCount[INDEX_NUMS];
+
+static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
+  ncclNetProperties_v10_t props_v10;
+  NCCLCHECK(ncclNet_v10->getProperties(dev, &props_v10));
+  props->name = props_v10.name;
+  props->pciPath = props_v10.pciPath;
+  props->guid = props_v10.guid;
+  props->ptrSupport = props_v10.ptrSupport;
+  props->regIsGlobal = props_v10.regIsGlobal;
+  props->forceFlush = props_v10.forceFlush;
+  props->speed = props_v10.speed;
+  props->port = props_v10.port;
+  props->latency = props_v10.latency;
+  props->maxComms = props_v10.maxComms;
+  props->maxRecvs = props_v10.maxRecvs;
+  props->netDeviceType = props_v10.netDeviceType;
+  props->netDeviceVersion = props_v10.netDeviceVersion;
+  props->vProps.ndevs = props_v10.vProps.ndevs;
+  for (int i = 0; i < props->vProps.ndevs; i++) {
+    props->vProps.devs[i] = props_v10.vProps.devs[i];
+  }
+  props->maxP2pBytes = props_v10.maxP2pBytes;
+  props->maxCollBytes = props_v10.maxCollBytes;
+  props->maxMultiRequestSize = 1;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclNet_listen(void* ctx __attribute__((unused)),
+    int dev, void* handle, void** listenComm) {
+  return ncclNet_v10->listen(dev, handle, listenComm);
+}
+
+static ncclResult_t ncclNet_connect(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
+  return ncclNet_v10->connect(dev, (ncclNetCommConfig_v10_t *)ctx, handle, sendComm, sendDevComm);
+}
+
+static ncclResult_t ncclNet_makeVDevice(int* d, ncclNetVDeviceProps_v11_t* props) {
+  return ncclNet_v10->makeVDevice(d, (ncclNetVDeviceProps_v10_t *)props);
+}
+
+static ncclResult_t ncclNet_finalize(void* ctx) {
+  refCount[NET_INDEX]--;
+  free(ctx);
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclNet_init(void** ctx, uint64_t commId __attribute__((unused)),
+    ncclNetCommConfig_t* config, ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn) {
+  // since ncclNet_v11, the ncclNetCommConfig_t has been moved from connect to init. Since the config is per comm,
+  // this allows the config to be passed only once, instead of multiple times (once per connect). To preserve the
+  // ncclNet_v10 behavior, in the compat layer, we store the config in the context pointer and pass it to the connect
+  // function.
+  ncclNetCommConfig_v10_t* config_v10 = nullptr;
+  NCCLCHECK(ncclCalloc(&config_v10, 1));
+  config_v10->trafficClass = config->trafficClass;
+  *ctx = config_v10;
+  // before ncclNet_v11 the net plugin was initialized only once. With ncclNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclNet_v10 behavior using a refCount to track the number of times the plugin
+  // is initialized, and avoid initializing it multiple times.
+  if (refCount[NET_INDEX]++) return ncclSuccess;
+  NCCLCHECK(ncclNet_v10->init(logfn, proffn));
+  ncclNet.devices = ncclNet_v10->devices;
+  ncclNet.getProperties = ncclNet_getProperties;
+  ncclNet.listen = ncclNet_listen;
+  ncclNet.connect = ncclNet_connect;
+  ncclNet.accept = ncclNet_v10->accept;
+  ncclNet.regMr = ncclNet_v10->regMr;
+  ncclNet.regMrDmaBuf = ncclNet_v10->regMrDmaBuf;
+  ncclNet.deregMr = ncclNet_v10->deregMr;
+  ncclNet.isend = ncclNet_v10->isend;
+  ncclNet.irecv = ncclNet_v10->irecv;
+  ncclNet.iflush = ncclNet_v10->iflush;
+  ncclNet.test = ncclNet_v10->test;
+  ncclNet.closeSend = ncclNet_v10->closeSend;
+  ncclNet.closeRecv = ncclNet_v10->closeRecv;
+  ncclNet.closeListen = ncclNet_v10->closeListen;
+  ncclNet.getDeviceMr = ncclNet_v10->getDeviceMr;
+  ncclNet.irecvConsumed = ncclNet_v10->irecvConsumed;
+  ncclNet.makeVDevice = (ncclNet_v10->makeVDevice) ? ncclNet_makeVDevice : nullptr;
+  ncclNet.finalize = ncclNet_finalize;
+  ncclNet.setNetAttr = nullptr;
+  return ncclSuccess;
+}
+
 ncclNet_t* getNcclNet_v10(void* lib) {
   ncclNet_v10 = (ncclNet_v10_t*)dlsym(lib, "ncclNetPlugin_v10");
   if (ncclNet_v10) {
+    ncclNet.name = ncclNet_v10->name;
+    ncclNet.init = ncclNet_init;
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded net plugin %s (v10)", ncclNet_v10->name);
-    return ncclNet_v10;
+    return &ncclNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclNetPlugin_v10 symbol.");
   return nullptr;
 }
 
+static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* props) {
+  ncclNetProperties_v10_t props_v10;
+  NCCLCHECK(ncclCollNet_v10->getProperties(dev, &props_v10));
+  props->name = props_v10.name;
+  props->pciPath = props_v10.pciPath;
+  props->guid = props_v10.guid;
+  props->ptrSupport = props_v10.ptrSupport;
+  props->regIsGlobal = props_v10.regIsGlobal;
+  props->forceFlush = props_v10.forceFlush;
+  props->speed = props_v10.speed;
+  props->port = props_v10.port;
+  props->latency = props_v10.latency;
+  props->maxComms = props_v10.maxComms;
+  props->maxRecvs = props_v10.maxRecvs;
+  props->netDeviceType = props_v10.netDeviceType;
+  props->netDeviceVersion = props_v10.netDeviceVersion;
+  props->vProps.ndevs = props_v10.vProps.ndevs;
+  for (int i = 0; i < props->vProps.ndevs; i++) {
+    props->vProps.devs[i] = props_v10.vProps.devs[i];
+  }
+  props->maxP2pBytes = props_v10.maxP2pBytes;
+  props->maxCollBytes = props_v10.maxCollBytes;
+  props->maxMultiRequestSize = 1;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclCollNet_listen(void* ctx __attribute__((unused)),
+    int dev, void* handle, void** listenComm) {
+  return ncclCollNet_v10->listen(dev, handle, listenComm);
+}
+
+static ncclResult_t ncclCollNet_iallgather(void* collComm, void* sendData, int nRecvParts, ncclNetSGE_t* recvParts,
+                             size_t bytesPerRank, size_t windowOffset, size_t windowBytes,
+                             void* sendMhandle, void** request) {
+  return ncclCollNet_v10->iallgather(collComm, sendData, nRecvParts, (ncclNetSGE_v10_t*)recvParts, bytesPerRank,
+                             windowOffset, windowBytes, sendMhandle, request);
+}
+
+static ncclResult_t ncclCollNet_ireducescatter(void* collComm, int nSendParts, ncclNetSGE_t* sendParts, void* recvData,
+                                 size_t bytesPerRank, size_t windowOffset, size_t windowBytes,
+                                 ncclDataType_t dataType, ncclRedOp_t redOp,
+                                 void* recvMhandle, void** request) {
+  return ncclCollNet_v10->ireducescatter(collComm, nSendParts, (ncclNetSGE_v10_t*)sendParts, recvData, bytesPerRank,
+                                 windowOffset, windowBytes, dataType, redOp, recvMhandle, request);
+}
+
+static ncclResult_t ncclCollNet_makeVDevice(int* d, ncclNetVDeviceProps_t* props) {
+  return ncclCollNet_v10->makeVDevice(d, (ncclNetVDeviceProps_v10_t *)props);
+}
+
+static ncclResult_t ncclCollNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[COLLNET_INDEX]--;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclCollNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclDebugLogger_t logfn) {
+  // Before ncclCollNet_v11 the collnet plugin was initialized only once. With ncclCollNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclCollNet_v10 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[COLLNET_INDEX]++) return ncclSuccess;
+  NCCLCHECK(ncclCollNet_v10->init(logfn));
+  ncclCollNet.devices = ncclCollNet_v10->devices;
+  ncclCollNet.getProperties = ncclCollNet_getProperties;
+  ncclCollNet.listen = ncclCollNet_listen;
+  ncclCollNet.connect = ncclCollNet_v10->connect;
+  ncclCollNet.reduceSupport = ncclCollNet_v10->reduceSupport;
+  ncclCollNet.regMr = ncclCollNet_v10->regMr;
+  ncclCollNet.regMrDmaBuf = ncclCollNet_v10->regMrDmaBuf;
+  ncclCollNet.deregMr = ncclCollNet_v10->deregMr;
+  ncclCollNet.iallreduce = ncclCollNet_v10->iallreduce;
+  ncclCollNet.iallgather = ncclCollNet_iallgather;
+  ncclCollNet.ireducescatter = ncclCollNet_ireducescatter;
+  ncclCollNet.iflush = ncclCollNet_v10->iflush;
+  ncclCollNet.test = ncclCollNet_v10->test;
+  ncclCollNet.closeColl = ncclCollNet_v10->closeColl;
+  ncclCollNet.closeListen = ncclCollNet_v10->closeListen;
+  ncclCollNet.makeVDevice = (ncclCollNet_v10->makeVDevice) ? ncclCollNet_makeVDevice : nullptr;
+  ncclCollNet.finalize = ncclCollNet_finalize;
+  return ncclSuccess;
+}
+
 ncclCollNet_t* getNcclCollNet_v10(void* lib) {
   ncclCollNet_v10 = (ncclCollNet_v10_t*)dlsym(lib, "ncclCollNetPlugin_v10");
   if (ncclCollNet_v10) {
-    INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v10)", ncclNet_v10->name);
-    return ncclCollNet_v10;
+    ncclCollNet.name = ncclCollNet_v10->name;
+    ncclCollNet.init = ncclCollNet_init;
+    INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v10)", ncclCollNet_v10->name);
+    return &ncclCollNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclCollNetPlugin_v10 symbol.");
   return nullptr;
 }
diff --git a/projects/rccl/src/plugin/net/net_v11.cc b/projects/rccl/src/plugin/net/net_v11.cc
new file mode 100644
index 0000000000..b13a0efb9d
--- /dev/null
+++ b/projects/rccl/src/plugin/net/net_v11.cc
@@ -0,0 +1,31 @@
+/*************************************************************************
+ * Copyright (c) 2022-2023, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "nccl_net.h"
+#include "net_device.h"
+#include "proxy.h"
+#include <dlfcn.h>
+
+static ncclNet_v11_t* ncclNet_v11;
+static ncclCollNet_v11_t* ncclCollNet_v11;
+
+ncclNet_t* getNcclNet_v11(void* lib) {
+  ncclNet_v11 = (ncclNet_v11_t*)dlsym(lib, "ncclNetPlugin_v11");
+  if (ncclNet_v11) {
+    INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded net plugin %s (v11)", ncclNet_v11->name);
+    return ncclNet_v11;
+  }
+  return nullptr;
+}
+
+ncclCollNet_t* getNcclCollNet_v11(void* lib) {
+  ncclCollNet_v11 = (ncclCollNet_v11_t*)dlsym(lib, "ncclCollNetPlugin_v11");
+  if (ncclCollNet_v11) {
+    INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v11)", ncclCollNet_v11->name);
+    return ncclCollNet_v11;
+  }
+  return nullptr;
+}
diff --git a/projects/rccl/src/plugin/net/net_v6.cc b/projects/rccl/src/plugin/net/net_v6.cc
index baff67935e..73eb8614d3 100644
--- a/projects/rccl/src/plugin/net/net_v6.cc
+++ b/projects/rccl/src/plugin/net/net_v6.cc
@@ -8,12 +8,18 @@
 #include "net_device.h"
 #include "proxy.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclNet_t ncclNet;
 static ncclCollNet_t ncclCollNet;
 static ncclNet_v6_t* ncclNet_v6;
 static ncclCollNet_v6_t* ncclCollNet_v6;
 
+#define NET_INDEX 0
+#define COLLNET_INDEX 1
+#define INDEX_NUMS 2
+static int refCount[INDEX_NUMS];
+
 static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
   ncclNetProperties_v6_t p6;
   ncclResult_t ans = ncclNet_v6->getProperties(dev, &p6);
@@ -35,6 +41,7 @@ static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
   props->vProps.devs[0] = dev;
   props->maxP2pBytes = MAX_NET_SIZE;
   props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
@@ -43,7 +50,14 @@ static ncclResult_t ncclNet_regMr(void* comm, void* data, size_t size, int type,
   return ncclNet_v6->regMr(comm, data, (int) size, type, mhandle);
 }
 
-static ncclResult_t ncclNet_connect(int dev, ncclNetCommConfig_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
+static ncclResult_t ncclNet_listen(void* ctx __attribute__((unused)),
+    int d, void* handle, void** listenComm) {
+  return ncclNet_v6->listen(d, handle, listenComm);
+}
+
+static ncclResult_t ncclNet_connect(void* ctx __attribute__((unused)),
+    int dev,
+    void* handle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
   return ncclNet_v6->connect(dev, handle, sendComm);
 }
 
@@ -51,7 +65,9 @@ static ncclResult_t ncclNet_accept(void* listenComm, void** recvComm, ncclNetDev
   return ncclNet_v6->accept(listenComm, recvComm);
 }
 
-static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* pHandle, void** request) {
+static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle,
+    void* pHandle __attribute__((unused)),
+    void** request) {
   int sizeInt;
   if (size > MAX_NET_SIZE) return ncclInternalError;
   sizeInt = (int)size;
@@ -59,7 +75,9 @@ static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int t
   return ans;
 }
 
-static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** pHandles, void** request) {
+static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles,
+    void** pHandles __attribute__((unused)),
+    void** request) {
   int sizesInt[NCCL_PROXY_MAX_SUBS];
   //reset to nullptr if optional receive completion is set
   if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) *request = nullptr;
@@ -71,6 +89,11 @@ static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* si
   return ans;
 }
 
+static ncclResult_t ncclNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[NET_INDEX]--;
+  return ncclSuccess;
+}
+
 static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* props) {
   ncclNetProperties_v6_t p6;
   ncclResult_t ans = ncclCollNet_v6->getProperties(dev, &p6);
@@ -92,9 +115,15 @@ static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* prop
   props->vProps.devs[0] = dev;
   props->maxP2pBytes = MAX_NET_SIZE;
   props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
+static ncclResult_t ncclCollNet_listen(void* ctx __attribute__((unused)),
+    int d, void* handle, void** listenComm) {
+  return ncclCollNet_v6->listen(d, handle, listenComm);
+}
+
 static ncclResult_t ncclCollNet_regMr(void* comm, void* data, size_t size, int type, void** mhandle) {
   if (size >= 1UL<<31) return ncclInternalError;
   return ncclCollNet_v6->regMr(comm, data, (int) size, type, mhandle);
@@ -110,11 +139,24 @@ static ncclResult_t ncclCollNet_iallreduce(void* collComm, void* sendData, void*
   return ans;
 }
 
-static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn) {
+static ncclResult_t ncclCollNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[COLLNET_INDEX]--;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclNetCommConfig_t* config __attribute__((unused)),
+    ncclDebugLogger_t logfn,
+    ncclProfilerCallback_t proffn __attribute__((unused))) {
+  // Before ncclNet_v11 the net plugin was initialized only once. With ncclNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclNet_v6 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[NET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclNet_v6->init(logfn));
   ncclNet.devices = ncclNet_v6->devices;
   ncclNet.getProperties = ncclNet_getProperties;
-  ncclNet.listen = ncclNet_v6->listen;
+  ncclNet.listen = ncclNet_listen;
   ncclNet.connect = ncclNet_connect;
   ncclNet.accept =  ncclNet_accept;
   ncclNet.regMr = ncclNet_regMr;
@@ -130,6 +172,8 @@ static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t
   ncclNet.getDeviceMr = NULL;
   ncclNet.irecvConsumed = NULL;
   ncclNet.makeVDevice  = NULL;
+  ncclNet.finalize = ncclNet_finalize;
+  ncclNet.setNetAttr = nullptr;
   return ncclSuccess;
 }
 
@@ -141,15 +185,20 @@ ncclNet_t* getNcclNet_v6(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded net plugin %s (v6)", ncclNet_v6->name);
     return &ncclNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.");
   return nullptr;
 }
 
-static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
+static ncclResult_t ncclCollNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclDebugLogger_t logfn) {
+  // Before ncclCollNet_v11 the collnet plugin was initialized only once. With ncclCollNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclCollNet_v6 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[COLLNET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclCollNet_v6->init(logfn));
   ncclCollNet.devices = ncclCollNet_v6->devices;
   ncclCollNet.getProperties = ncclCollNet_getProperties;
-  ncclCollNet.listen = ncclCollNet_v6->listen;
+  ncclCollNet.listen = ncclCollNet_listen;
   ncclCollNet.connect = ncclCollNet_v6->connect;
   ncclCollNet.reduceSupport = ncclCollNet_v6->reduceSupport;
   ncclCollNet.regMr = ncclCollNet_regMr;
@@ -162,6 +211,8 @@ static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
   ncclCollNet.test = ncclCollNet_v6->test;
   ncclCollNet.closeColl = ncclCollNet_v6->closeColl;
   ncclCollNet.closeListen = ncclCollNet_v6->closeListen;
+  ncclCollNet.makeVDevice  = NULL;
+  ncclCollNet.finalize = ncclCollNet_finalize;
   return ncclSuccess;
 }
 
@@ -173,6 +224,5 @@ ncclCollNet_t* getNcclCollNet_v6(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v6)", ncclCollNet_v6->name);
     return &ncclCollNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.");
   return nullptr;
 }
diff --git a/projects/rccl/src/plugin/net/net_v7.cc b/projects/rccl/src/plugin/net/net_v7.cc
index 4bad5ec264..a137172943 100644
--- a/projects/rccl/src/plugin/net/net_v7.cc
+++ b/projects/rccl/src/plugin/net/net_v7.cc
@@ -8,12 +8,18 @@
 #include "net_device.h"
 #include "proxy.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclNet_t ncclNet;
 static ncclCollNet_t ncclCollNet;
 static ncclNet_v7_t* ncclNet_v7;
 static ncclCollNet_v7_t* ncclCollNet_v7;
 
+#define NET_INDEX 0
+#define COLLNET_INDEX 1
+#define INDEX_NUMS 2
+static int refCount[INDEX_NUMS];
+
 static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
   ncclNetProperties_v7_t p7;
   ncclResult_t ans = ncclNet_v7->getProperties(dev, &p7);
@@ -35,10 +41,18 @@ static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
   props->vProps.devs[0] = dev;
   props->maxP2pBytes = MAX_NET_SIZE;
   props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
-static ncclResult_t ncclNet_connect(int dev, ncclNetCommConfig_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
+static ncclResult_t ncclNet_listen(void* ctx __attribute__((unused)),
+    int dev, void* handle, void** listenComm) {
+  return ncclNet_v7->listen(dev, handle, listenComm);
+}
+
+static ncclResult_t ncclNet_connect(void* ctx __attribute__((unused)),
+    int dev,
+    void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
   return ncclNet_v7->connect(dev, handle, sendComm, sendDevComm);
 }
 
@@ -47,7 +61,9 @@ static ncclResult_t ncclNet_regMr(void* comm, void* data, size_t size, int type,
   return ncclNet_v7->regMr(comm, data, (int) size, type, mhandle);
 }
 
-static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* pHandle, void** request) {
+static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle,
+    void* pHandle __attribute__((unused)),
+    void** request) {
   int sizeInt;
   if (size > MAX_NET_SIZE) return ncclInternalError;
   sizeInt = (int)size;
@@ -55,7 +71,9 @@ static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int t
   return ans;
 }
 
-static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** pHandles, void** request) {
+static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles,
+    void** pHandles __attribute__((unused)),
+    void** request) {
   int sizesInt[NCCL_PROXY_MAX_SUBS];
   //reset to nullptr if optional receive completion is set
   if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) *request = nullptr;
@@ -67,6 +85,11 @@ static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* si
   return ans;
 }
 
+static ncclResult_t ncclNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[NET_INDEX]--;
+  return ncclSuccess;
+}
+
 static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* props) {
   ncclNetProperties_v7_t p7;
   ncclResult_t ans = ncclCollNet_v7->getProperties(dev, &p7);
@@ -88,9 +111,15 @@ static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* prop
   props->vProps.devs[0] = dev;
   props->maxP2pBytes = MAX_NET_SIZE;
   props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
+static ncclResult_t ncclCollNet_listen(void* ctx __attribute__((unused)),
+    int d, void* handle, void** listenComm) {
+  return ncclCollNet_v7->listen(d, handle, listenComm);
+}
+
 static ncclResult_t ncclCollNet_regMr(void* comm, void* data, size_t size, int type, void** mhandle) {
   if (size >= 1UL<<31) return ncclInternalError;
   return ncclCollNet_v7->regMr(comm, data, (int) size, type, mhandle);
@@ -106,11 +135,24 @@ static ncclResult_t ncclCollNet_iallreduce(void* collComm, void* sendData, void*
   return ans;
 }
 
-static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn) {
+static ncclResult_t ncclCollNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[COLLNET_INDEX]--;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclNetCommConfig_t* config __attribute__((unused)),
+    ncclDebugLogger_t logfn,
+    ncclProfilerCallback_t proffn __attribute__((unused))) {
+  // Before ncclNet_v11 the net plugin was initialized only once. With ncclNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclNet_v7 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[NET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclNet_v7->init(logfn));
   ncclNet.devices = ncclNet_v7->devices;
   ncclNet.getProperties = ncclNet_getProperties; // ncclNet_v5->getProperties;
-  ncclNet.listen = ncclNet_v7->listen;
+  ncclNet.listen = ncclNet_listen;
   ncclNet.connect = ncclNet_connect;
   ncclNet.accept =  ncclNet_v7->accept;
   ncclNet.regMr = ncclNet_regMr;
@@ -126,6 +168,8 @@ static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t
   ncclNet.getDeviceMr = ncclNet_v7->getDeviceMr;
   ncclNet.irecvConsumed = ncclNet_v7->irecvConsumed;
   ncclNet.makeVDevice  = NULL;
+  ncclNet.finalize = ncclNet_finalize;
+  ncclNet.setNetAttr = nullptr;
   return ncclSuccess;
 }
 
@@ -137,15 +181,20 @@ ncclNet_t* getNcclNet_v7(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded net plugin %s (v7)", ncclNet_v7->name);
     return &ncclNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.");
   return nullptr;
 }
 
-static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
+static ncclResult_t ncclCollNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclDebugLogger_t logfn) {
+  // Before ncclCollNet_v11 the collnet plugin was initialized only once. With ncclCollNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclCollNet_v7 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[COLLNET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclCollNet_v7->init(logfn));
   ncclCollNet.devices = ncclCollNet_v7->devices;
   ncclCollNet.getProperties = ncclCollNet_getProperties;
-  ncclCollNet.listen = ncclCollNet_v7->listen;
+  ncclCollNet.listen = ncclCollNet_listen;
   ncclCollNet.connect = ncclCollNet_v7->connect;
   ncclCollNet.reduceSupport = ncclCollNet_v7->reduceSupport;
   ncclCollNet.regMr = ncclCollNet_regMr;
@@ -158,6 +207,7 @@ static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
   ncclCollNet.test = ncclCollNet_v7->test;
   ncclCollNet.closeColl = ncclCollNet_v7->closeColl;
   ncclCollNet.closeListen = ncclCollNet_v7->closeListen;
+  ncclCollNet.finalize = ncclCollNet_finalize;
   return ncclSuccess;
 }
 
@@ -169,6 +219,5 @@ ncclCollNet_t* getNcclCollNet_v7(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v7)", ncclCollNet_v7->name);
     return &ncclCollNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.");
   return nullptr;
 }
diff --git a/projects/rccl/src/plugin/net/net_v8.cc b/projects/rccl/src/plugin/net/net_v8.cc
index b43bb895e0..d241d5dc5d 100644
--- a/projects/rccl/src/plugin/net/net_v8.cc
+++ b/projects/rccl/src/plugin/net/net_v8.cc
@@ -8,12 +8,18 @@
 #include "net_device.h"
 #include "proxy.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclNet_t ncclNet;
 static ncclCollNet_t ncclCollNet;
 static ncclNet_v8_t* ncclNet_v8;
 static ncclCollNet_v8_t* ncclCollNet_v8;
 
+#define NET_INDEX 0
+#define COLLNET_INDEX 1
+#define INDEX_NUMS 2
+static int refCount[INDEX_NUMS];
+
 static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
   ncclNetProperties_v8_t p8;
   ncclResult_t ans = ncclNet_v8->getProperties(dev, &p8);
@@ -35,14 +41,24 @@ static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
   props->vProps.devs[0] = dev;
   props->maxP2pBytes = MAX_NET_SIZE;
   props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
-static ncclResult_t ncclNet_connect(int dev, ncclNetCommConfig_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
+static ncclResult_t ncclNet_listen(void* ctx __attribute__((unused)),
+    int dev, void* handle, void** listenComm) {
+  return ncclNet_v8->listen(dev, handle, listenComm);
+}
+
+static ncclResult_t ncclNet_connect(void* ctx __attribute__((unused)),
+    int dev,
+    void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
   return ncclNet_v8->connect(dev, handle, sendComm, sendDevComm);
 }
 
-static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* pHandle, void** request) {
+static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle,
+    void* pHandle __attribute__((unused)),
+    void** request) {
   int sizeInt;
   if (size > MAX_NET_SIZE) return ncclInternalError;
   sizeInt = (int)size;
@@ -50,7 +66,9 @@ static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int t
   return ans;
 }
 
-static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** pHandles, void** request) {
+static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles,
+    void** pHandles __attribute__((unused)),
+    void** request) {
   int sizesInt[NCCL_PROXY_MAX_SUBS];
   //reset to nullptr if optional receive completion is set
   if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) *request = nullptr;
@@ -62,6 +80,11 @@ static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* si
   return ans;
 }
 
+static ncclResult_t ncclNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[NET_INDEX]--;
+  return ncclSuccess;
+}
+
 static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* props) {
   ncclNetProperties_v8_t p8;
   ncclResult_t ans = ncclCollNet_v8->getProperties(dev, &p8);
@@ -83,9 +106,15 @@ static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* prop
   props->vProps.devs[0] = dev;
   props->maxP2pBytes = MAX_NET_SIZE;
   props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
+static ncclResult_t ncclCollNet_listen(void* ctx __attribute__((unused)),
+    int dev, void* handle, void** listenComm) {
+  return ncclCollNet_v8->listen(dev, handle, listenComm);
+}
+
 static ncclResult_t ncclCollNet_iallreduce(void* collComm, void* sendData, void* recvData, size_t count,
       ncclDataType_t dataType, ncclRedOp_t redOp, void* sendMhandle, void* recvMhandle, void** request) {
   int countInt;
@@ -128,11 +157,23 @@ static ncclResult_t ncclCollNet_ireducescatter(void* collComm, int nSendParts, n
   return ans;
 }
 
-static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn) {
+static ncclResult_t ncclCollNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[COLLNET_INDEX]--;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclNetCommConfig_t* config __attribute__((unused)),
+    ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn __attribute__((unused))) {
+  // Before ncclNet_v11 the net plugin was initialized only once. With ncclNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclNet_v8 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[NET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclNet_v8->init(logfn));
   ncclNet.devices = ncclNet_v8->devices;
   ncclNet.getProperties = ncclNet_getProperties;
-  ncclNet.listen = ncclNet_v8->listen;
+  ncclNet.listen = ncclNet_listen;
   ncclNet.connect = ncclNet_connect;
   ncclNet.accept =  ncclNet_v8->accept;
   ncclNet.regMr = ncclNet_v8->regMr;
@@ -148,6 +189,8 @@ static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t
   ncclNet.getDeviceMr = ncclNet_v8->getDeviceMr;
   ncclNet.irecvConsumed = ncclNet_v8->irecvConsumed;
   ncclNet.makeVDevice   = NULL;
+  ncclNet.finalize = ncclNet_finalize;
+  ncclNet.setNetAttr = nullptr;
   return ncclSuccess;
 }
 
@@ -159,15 +202,20 @@ ncclNet_t* getNcclNet_v8(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded net plugin %s (v8)", ncclNet_v8->name);
     return &ncclNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.");
   return nullptr;
 }
 
-static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
+static ncclResult_t ncclCollNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclDebugLogger_t logfn) {
+  // Before ncclCollNet_v11 the collnet plugin was initialized only once. With ncclCollNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclCollNet_v8 behavior by using a refCount to track how many times the plugin
+  // has been initialized, so that the underlying plugin is not initialized more than once.
+  if (refCount[COLLNET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclCollNet_v8->init(logfn));
   ncclCollNet.devices = ncclCollNet_v8->devices;
   ncclCollNet.getProperties = ncclCollNet_getProperties;
-  ncclCollNet.listen = ncclCollNet_v8->listen;
+  ncclCollNet.listen = ncclCollNet_listen;
   ncclCollNet.connect = ncclCollNet_v8->connect;
   ncclCollNet.reduceSupport = ncclCollNet_v8->reduceSupport;
   ncclCollNet.regMr = ncclCollNet_v8->regMr;
@@ -180,6 +228,8 @@ static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
   ncclCollNet.test = ncclCollNet_v8->test;
   ncclCollNet.closeColl = ncclCollNet_v8->closeColl;
   ncclCollNet.closeListen = ncclCollNet_v8->closeListen;
+  ncclCollNet.makeVDevice = nullptr;
+  ncclCollNet.finalize = ncclCollNet_finalize;
   return ncclSuccess;
 }
 
@@ -191,6 +241,5 @@ ncclCollNet_t* getNcclCollNet_v8(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v8)", ncclCollNet_v8->name);
     return &ncclCollNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.");
   return nullptr;
 }
diff --git a/projects/rccl/src/plugin/net/net_v9.cc b/projects/rccl/src/plugin/net/net_v9.cc
index 34e0393320..12011aa8c2 100644
--- a/projects/rccl/src/plugin/net/net_v9.cc
+++ b/projects/rccl/src/plugin/net/net_v9.cc
@@ -8,25 +8,64 @@
 #include "net_device.h"
 #include "proxy.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclNet_t ncclNet;
 static ncclCollNet_t ncclCollNet;
 static ncclNet_v9_t* ncclNet_v9;
 static ncclCollNet_v9_t* ncclCollNet_v9;
 
+#define NET_INDEX 0
+#define COLLNET_INDEX 1
+#define INDEX_NUMS 2
+static int refCount[INDEX_NUMS];
+
 static ncclResult_t ncclNet_getProperties(int dev, ncclNetProperties_t* props) {
-  return ncclNet_v9->getProperties(dev, (ncclNetProperties_v9_t *)props);
+  ncclNetProperties_v9_t props_v9;
+  NCCLCHECK(ncclNet_v9->getProperties(dev, &props_v9));
+  props->name = props_v9.name;
+  props->pciPath = props_v9.pciPath;
+  props->guid = props_v9.guid;
+  props->ptrSupport = props_v9.ptrSupport;
+  props->regIsGlobal = props_v9.regIsGlobal;
+  props->forceFlush = props_v9.forceFlush;
+  props->speed = props_v9.speed;
+  props->port = props_v9.port;
+  props->latency = props_v9.latency;
+  props->maxComms = props_v9.maxComms;
+  props->maxRecvs = props_v9.maxRecvs;
+  props->netDeviceType = props_v9.netDeviceType;
+  props->netDeviceVersion = props_v9.netDeviceVersion;
+  props->vProps.ndevs = props_v9.vProps.ndevs;
+  for (int i = 0; i < props->vProps.ndevs; i++) {
+    props->vProps.devs[i] = props_v9.vProps.devs[i];
+  }
+  props->maxP2pBytes = props_v9.maxP2pBytes;
+  props->maxCollBytes = props_v9.maxCollBytes;
+  props->maxMultiRequestSize = 1;
+  return ncclSuccess;
 }
 
-static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* pHandle, void** request) {
+static ncclResult_t ncclNet_isend(void* sendComm, void* data, size_t size, int tag, void* mhandle,
+    void* pHandle __attribute__((unused)),
+    void** request) {
   return ncclNet_v9->isend(sendComm, data, size, tag, mhandle, request);
 }
 
-static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** pHandles, void** request) {
+static ncclResult_t ncclNet_irecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles,
+    void** pHandles __attribute__((unused)),
+    void** request) {
   return ncclNet_v9->irecv(recvComm, n, data, sizes, tags, mhandles, request);
 }
 
-static ncclResult_t ncclNet_connect(int dev, ncclNetCommConfig_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
+static ncclResult_t ncclNet_listen(void* ctx __attribute__((unused)),
+    int dev, void* handle, void** listenComm) {
+  return ncclNet_v9->listen(dev, handle, listenComm);
+}
+
+static ncclResult_t ncclNet_connect(void* ctx __attribute__((unused)),
+    int dev,
+    void* handle, void** sendComm, ncclNetDeviceHandle_t** sendDevComm) {
   return ncclNet_v9->connect(dev, handle, sendComm, sendDevComm);
 }
 
@@ -34,8 +73,40 @@ static ncclResult_t ncclNet_makeVDevice(int* d, ncclNetVDeviceProps_t* props) {
   return ncclNet_v9->makeVDevice(d, (ncclNetVDeviceProps_v9_t*)props);
 }
 
+static ncclResult_t ncclNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[NET_INDEX]--;
+  return ncclSuccess;
+}
+
 static ncclResult_t ncclCollNet_getProperties(int dev, ncclNetProperties_t* props) {
-  return ncclCollNet_v9->getProperties(dev, (ncclNetProperties_v9_t *)props);
+  ncclNetProperties_v9_t props_v9;
+  NCCLCHECK(ncclCollNet_v9->getProperties(dev, &props_v9));
+  props->name = props_v9.name;
+  props->pciPath = props_v9.pciPath;
+  props->guid = props_v9.guid;
+  props->ptrSupport = props_v9.ptrSupport;
+  props->regIsGlobal = props_v9.regIsGlobal;
+  props->forceFlush = props_v9.forceFlush;
+  props->speed = props_v9.speed;
+  props->port = props_v9.port;
+  props->latency = props_v9.latency;
+  props->maxComms = props_v9.maxComms;
+  props->maxRecvs = props_v9.maxRecvs;
+  props->netDeviceType = props_v9.netDeviceType;
+  props->netDeviceVersion = props_v9.netDeviceVersion;
+  props->vProps.ndevs = props_v9.vProps.ndevs;
+  for (int i = 0; i < props->vProps.ndevs; i++) {
+    props->vProps.devs[i] = props_v9.vProps.devs[i];
+  }
+  props->maxP2pBytes = props_v9.maxP2pBytes;
+  props->maxCollBytes = props_v9.maxCollBytes;
+  props->maxMultiRequestSize = 1;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclCollNet_listen(void* ctx __attribute__((unused)),
+    int d, void* handle, void** listenComm) {
+  return ncclCollNet_v9->listen(d, handle, listenComm);
 }
 
 static ncclResult_t ncclCollNet_iallgather(void* collComm, void* sendData, int nRecvParts, ncclNetSGE_t* recvParts,
@@ -53,11 +124,27 @@ static ncclResult_t ncclCollNet_ireducescatter(void* collComm, int nSendParts, n
                                  windowOffset, windowBytes, dataType, redOp, recvMhandle, request);
 }
 
-static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn) {
+static ncclResult_t ncclCollNet_makeVDevice(int* d, ncclNetVDeviceProps_t* props) {
+  return ncclCollNet_v9->makeVDevice(d, (ncclNetVDeviceProps_v9_t *)props);
+}
+
+static ncclResult_t ncclCollNet_finalize(void* ctx __attribute__((unused))) {
+  refCount[COLLNET_INDEX]--;
+  return ncclSuccess;
+}
+
+static ncclResult_t ncclNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclNetCommConfig_t* config __attribute__((unused)),
+    ncclDebugLogger_t logfn, ncclProfilerCallback_t proffn) {
+  // Before ncclNet_v11 the net plugin was initialized only once. With ncclNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclNet_v9 behavior by using a refCount to track the number of times the plugin
+  // is initialized, so that the underlying v9 plugin is initialized only once.
+  if (refCount[NET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclNet_v9->init(logfn));
   ncclNet.devices = ncclNet_v9->devices;
   ncclNet.getProperties = ncclNet_getProperties;
-  ncclNet.listen = ncclNet_v9->listen;
+  ncclNet.listen = ncclNet_listen;
   ncclNet.connect = ncclNet_connect;
   ncclNet.accept = ncclNet_v9->accept;
   ncclNet.regMr = ncclNet_v9->regMr;
@@ -73,6 +160,8 @@ static ncclResult_t ncclNet_init(ncclDebugLogger_t logfn, ncclProfilerCallback_t
   ncclNet.getDeviceMr = ncclNet_v9->getDeviceMr;
   ncclNet.irecvConsumed = ncclNet_v9->irecvConsumed;
   ncclNet.makeVDevice = (ncclNet_v9->makeVDevice) ? ncclNet_makeVDevice : nullptr;
+  ncclNet.finalize = ncclNet_finalize;
+  ncclNet.setNetAttr = nullptr;
   return ncclSuccess;
 }
 
@@ -84,15 +173,20 @@ ncclNet_t* getNcclNet_v9(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded net plugin %s (v9)", ncclNet_v9->name);
     return &ncclNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.");
   return nullptr;
 }
 
-static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
+static ncclResult_t ncclCollNet_init(void** ctx __attribute__((unused)),
+    uint64_t commId __attribute__((unused)),
+    ncclDebugLogger_t logfn) {
+  // Before ncclCollNet_v11 the collnet plugin was initialized only once. With ncclCollNet_v11 this is no longer the case.
+  // The compat layer preserves the ncclCollNet_v9 behavior by using a refCount to track the number of times the plugin
+  // is initialized, so that the underlying v9 plugin is initialized only once.
+  if (refCount[COLLNET_INDEX]++) return ncclSuccess;
   NCCLCHECK(ncclCollNet_v9->init(logfn));
   ncclCollNet.devices = ncclCollNet_v9->devices;
   ncclCollNet.getProperties = ncclCollNet_getProperties;
-  ncclCollNet.listen = ncclCollNet_v9->listen;
+  ncclCollNet.listen = ncclCollNet_listen;
   ncclCollNet.connect = ncclCollNet_v9->connect;
   ncclCollNet.reduceSupport = ncclCollNet_v9->reduceSupport;
   ncclCollNet.regMr = ncclCollNet_v9->regMr;
@@ -105,6 +199,8 @@ static ncclResult_t ncclCollNet_init(ncclDebugLogger_t logfn) {
   ncclCollNet.test = ncclCollNet_v9->test;
   ncclCollNet.closeColl = ncclCollNet_v9->closeColl;
   ncclCollNet.closeListen = ncclCollNet_v9->closeListen;
+  ncclCollNet.makeVDevice = (ncclCollNet_v9->makeVDevice) ? ncclCollNet_makeVDevice : nullptr;
+  ncclCollNet.finalize = ncclCollNet_finalize;
   return ncclSuccess;
 }
 
@@ -116,6 +212,5 @@ ncclCollNet_t* getNcclCollNet_v9(void* lib) {
     INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Loaded collnet plugin %s (v9)", ncclCollNet_v9->name);
     return &ncclCollNet;
   }
-  INFO(NCCL_INIT|NCCL_NET, "NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.");
   return nullptr;
 }
diff --git a/projects/rccl/src/plugin/plugin_open.cc b/projects/rccl/src/plugin/plugin_open.cc
index f80321c81c..740f220653 100644
--- a/projects/rccl/src/plugin/plugin_open.cc
+++ b/projects/rccl/src/plugin/plugin_open.cc
@@ -7,6 +7,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <errno.h>
+#include <link.h>
 #include <dlfcn.h>
 
 #include "debug.h"
@@ -16,6 +17,7 @@
 
 #define NUM_LIBS 3
 static char* libNames[NUM_LIBS];
+char* ncclPluginLibPaths[NUM_LIBS];
 static void *libHandles[NUM_LIBS];
 static const char *pluginNames[NUM_LIBS] = { "NET", "TUNER", "PROFILER" };
 static const char *pluginPrefix[NUM_LIBS] = { "libnccl-net", "libnccl-tuner", "libnccl-profiler" };
@@ -50,6 +52,14 @@ static void appendNameToList(char* nameList, int *leftChars, char* name) {
   *leftChars -= strlen(name) + 1;
 }
 
+static char* getLibPath(void* handle) {
+  struct link_map* lm;
+  if (dlinfo(handle, RTLD_DI_LINKMAP, &lm) != 0)
+    return nullptr;
+  else
+    return strdup(lm->l_name);
+}
+
 static void* openPluginLib(enum ncclPluginType type, const char* libName) {
   int openErr, len = PATH_MAX;
   char libName_[MAX_STR_LEN] = { 0 };
@@ -58,49 +68,44 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
 
   if (libName && strlen(libName)) {
     snprintf(libName_, MAX_STR_LEN, "%s", libName);
-    libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
-    if (libHandles[type]) {
-      INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
-      libNames[type] = strdup(libName_);
-      return libHandles[type];
-    }
-    if (openErr == ENOENT) {
-      appendNameToList(eNoEntNameList, &len, libName_);
-    } else {
-      INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
-    }
-
-    // libName can't be a relative or absolute path (start with '.' or contain any '/'). It can't be a library name either (start with 'lib' or end with '.so')
-    if (strchr(libName, '/') == nullptr && (strncmp(libName, "lib", strlen("lib")) || strlen(libName) < strlen(".so") || strncmp(libName + strlen(libName) - strlen(".so"), ".so", strlen(".so")))) {
-      snprintf(libName_, MAX_STR_LEN, "%s-%s.so", pluginPrefix[type], libName);
-      libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
-      if (libHandles[type]) {
-        INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
-        libNames[type] = strdup(libName_);
-        return libHandles[type];
-      }
-      if (openErr == ENOENT) {
-        appendNameToList(eNoEntNameList, &len, libName_);
-      } else {
-        INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
-      }
-    }
   } else {
     snprintf(libName_, MAX_STR_LEN, "%s.so", pluginPrefix[type]);
+  }
+
+  libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
+  if (libHandles[type]) {
+    libNames[type] = strdup(libName_);
+    ncclPluginLibPaths[type] = getLibPath(libHandles[type]);
+    return libHandles[type];
+  }
+  if (openErr == ENOENT) {
+    appendNameToList(eNoEntNameList, &len, libName_);
+  } else {
+    INFO(subsys[type], "%s/Plugin: %s: %s", pluginNames[type], libName_, openErrStr);
+  }
+
+  // The prefixed name is tried only if libName is not a relative or absolute path (contains no '/') and is not already a full library name (starts with 'lib' and ends with '.so')
+  if (libName && strlen(libName) && strchr(libName, '/') == nullptr &&
+      (strncmp(libName, "lib", strlen("lib")) || strlen(libName) < strlen(".so") ||
+       strncmp(libName + strlen(libName) - strlen(".so"), ".so", strlen(".so")))) {
+    snprintf(libName_, MAX_STR_LEN, "%s-%s.so", pluginPrefix[type], libName);
+
     libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
     if (libHandles[type]) {
       libNames[type] = strdup(libName_);
+      ncclPluginLibPaths[type] = getLibPath(libHandles[type]);
       return libHandles[type];
     }
     if (openErr == ENOENT) {
       appendNameToList(eNoEntNameList, &len, libName_);
     } else {
-      INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
+      INFO(subsys[type], "%s/Plugin: %s: %s", pluginNames[type], libName_, openErrStr);
     }
   }
 
   if (strlen(eNoEntNameList)) {
-    INFO(subsys[type], "%s/Plugin: Could not find:%s. %s", pluginNames[type], eNoEntNameList, pluginFallback[type]);
+    INFO(subsys[type], "%s/Plugin: Could not find:%s%s%s", pluginNames[type], eNoEntNameList,
+         (strlen(pluginFallback[type]) > 0 ? ". " : ""), pluginFallback[type]);
   } else if (strlen(pluginFallback[type])) {
     INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], pluginFallback[type]);
   }
@@ -123,6 +128,7 @@ void* ncclGetNetPluginLib(enum ncclPluginType type) {
   if (libNames[ncclPluginTypeNet]) {
     // increment the reference counter of the net library
     libNames[type] = strdup(libNames[ncclPluginTypeNet]);
+    ncclPluginLibPaths[type] = ncclPluginLibPaths[ncclPluginTypeNet] ? strdup(ncclPluginLibPaths[ncclPluginTypeNet]) : nullptr;
     libHandles[type] = dlopen(libNames[ncclPluginTypeNet], RTLD_NOW | RTLD_LOCAL);
   }
   return libHandles[type];
@@ -132,6 +138,8 @@ ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type) {
   if (handle && libHandles[type] == handle) {
     dlclose(handle);
     libHandles[type] = nullptr;
+    free(ncclPluginLibPaths[type]);
+    ncclPluginLibPaths[type] = nullptr;
     free(libNames[type]);
     libNames[type] = nullptr;
   }
diff --git a/projects/rccl/src/plugin/profiler.cc b/projects/rccl/src/plugin/profiler.cc
index 15c3f2bc2b..8514db17c3 100644
--- a/projects/rccl/src/plugin/profiler.cc
+++ b/projects/rccl/src/plugin/profiler.cc
@@ -13,17 +13,22 @@
 #include "profiler.h"
 #include "transport.h"
 #include "plugin.h"
+#include <mutex>
 
 extern ncclProfiler_t* getNcclProfiler_v1(void* lib);
 extern ncclProfiler_t* getNcclProfiler_v2(void* lib);
 extern ncclProfiler_t* getNcclProfiler_v3(void* lib);
 extern ncclProfiler_t* getNcclProfiler_v4(void* lib);
+extern ncclProfiler_t* getNcclProfiler_v5(void* lib);
 
-static pthread_mutex_t profilerLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex profilerMutex;
 static int profilerPluginRefCount;
 static void* profilerPluginLib;
 static ncclProfiler_t* ncclProfiler;
 
+extern __thread int ncclGroupDepth;
+__thread ncclProfilerApiState_t ncclProfilerApiState;
+
 #define MAX_STR_LEN 256
 
 enum {
@@ -35,22 +40,37 @@ static int profilerPluginStatus = profilerPluginLoadReady;
 static pid_t pid;
 
 static ncclResult_t ncclProfilerPluginLoad(void) {
+  const char* profilerName;
   if (profilerPluginLoadFailed == profilerPluginStatus) {
     return ncclSuccess;
   }
 
-  pthread_mutex_lock(&profilerLock);
+  std::lock_guard<std::mutex> lock(profilerMutex);
   if (profilerPluginLoadSuccess == profilerPluginStatus) {
     ++profilerPluginRefCount;
     goto exit;
   }
 
-  profilerPluginLib = ncclOpenProfilerPluginLib(ncclGetEnv("NCCL_PROFILER_PLUGIN"));
+  if ((profilerName = ncclGetEnv("NCCL_PROFILER_PLUGIN")) != nullptr) {
+    INFO(NCCL_ENV, "NCCL_PROFILER_PLUGIN set by environment to %s", profilerName);
+    if (strcasecmp(profilerName, "none") == 0)
+      goto fail;
+  }
+  profilerPluginLib = ncclOpenProfilerPluginLib(profilerName);
   if (profilerPluginLib == nullptr) {
-    goto fail;
+    profilerPluginLib = ncclGetNetPluginLib(ncclPluginTypeProfiler);
+    if (nullptr == profilerPluginLib) {
+      goto fail;
+    }
+    profilerName = nullptr;
+  } else if (ncclPluginLibPaths[ncclPluginTypeProfiler]) {
+    profilerName = ncclPluginLibPaths[ncclPluginTypeProfiler];
   }
 
-  ncclProfiler = getNcclProfiler_v4(profilerPluginLib);
+  ncclProfiler = getNcclProfiler_v5(profilerPluginLib);
+  if (ncclProfiler == nullptr) {
+    ncclProfiler = getNcclProfiler_v4(profilerPluginLib);
+  }
   if (ncclProfiler == nullptr) {
     ncclProfiler = getNcclProfiler_v3(profilerPluginLib);
   }
@@ -61,8 +81,10 @@ static ncclResult_t ncclProfilerPluginLoad(void) {
     ncclProfiler = getNcclProfiler_v1(profilerPluginLib);
   }
   if (ncclProfiler == NULL) {
+    if (profilerName) INFO(NCCL_INIT, "External profiler plugin %s is unsupported", profilerName);
     goto fail;
   }
+  if (profilerName) INFO(NCCL_INIT, "Successfully loaded external profiler plugin %s", profilerName);
 
   ++profilerPluginRefCount;
   profilerPluginStatus = profilerPluginLoadSuccess;
@@ -74,7 +96,6 @@ static ncclResult_t ncclProfilerPluginLoad(void) {
   pid = getpid();
 
 exit:
-  pthread_mutex_unlock(&profilerLock);
   return ncclSuccess;
 fail:
   if (profilerPluginLib) NCCLCHECK(ncclClosePluginLib(profilerPluginLib, ncclPluginTypeProfiler));
@@ -84,15 +105,16 @@ fail:
 }
 
 static ncclResult_t ncclProfilerPluginUnload(void) {
-  pthread_mutex_lock(&profilerLock);
+  std::lock_guard<std::mutex> lock(profilerMutex);
   if (0 == (--profilerPluginRefCount)) {
-    INFO(NCCL_ENV, "PROFILER/Plugin: Closing profiler plugin %s", ncclProfiler->name);
+    if (__builtin_expect(ncclProfiler != NULL, 0)) {
+      INFO(NCCL_INIT, "PROFILER/Plugin: Closing profiler plugin %s", ncclProfiler->name);
+    }
     NCCLCHECK(ncclClosePluginLib(profilerPluginLib, ncclPluginTypeProfiler));
     profilerPluginLib = nullptr;
     ncclProfiler = nullptr;
     profilerPluginStatus = profilerPluginLoadReady;
   }
-  pthread_mutex_unlock(&profilerLock);
   return ncclSuccess;
 }
 
@@ -167,10 +189,9 @@ ncclResult_t ncclProfilerPluginInit(struct ncclComm* comm) {
   TIME_START_EVENT(init);
   ncclProfilerPluginLoad();
   if (__builtin_expect(ncclProfiler != NULL, 0)) {
-    int err = ncclProfiler->init(&comm->profilerContext, &ncclProfilerEventMask, comm->config.commName, comm->commHash, comm->nNodes, comm->nRanks, comm->rank, ncclDebugLog);
+    int err = ncclProfiler->init(&comm->profilerContext, comm->commHash, &ncclProfilerEventMask, comm->config.commName, comm->nNodes, comm->nRanks, comm->rank, ncclDebugLog);
     if (err) {
-      WARN("Profiler init failed with error (%d). Continue without profiler.", err);
-      ncclProfiler = NULL;
+      INFO(NCCL_INIT, "Profiler init failed with error '%d': %s. Continue without profiler.", err, strerror(errno));
     }
   }
   TIME_STOP_EVENT(init);
@@ -179,7 +200,7 @@ ncclResult_t ncclProfilerPluginInit(struct ncclComm* comm) {
 
 ncclResult_t ncclProfilerPluginFinalize(struct ncclComm* comm) {
   TIME_START_EVENT(finalize);
-  if (__builtin_expect(ncclProfiler != NULL, 0)) {
+  if (__builtin_expect(ncclProfiler != NULL, 0) && comm->profilerContext) {
     ncclProfiler->finalize(comm->profilerContext);
   }
   ncclProfilerPluginUnload();
@@ -189,6 +210,143 @@ ncclResult_t ncclProfilerPluginFinalize(struct ncclComm* comm) {
   return ncclSuccess;
 }
 
+ncclResult_t ncclProfilerStartGroupApiEvent(struct ncclInfo* info, bool isGraphCaptured) {
+  ncclProfilerEventDescr_t eDescr = { 0 };
+  eDescr.type = ncclProfileGroupApi;
+  eDescr.groupApi.graphCaptured = isGraphCaptured;
+
+  ncclProfilerApiState.eActivationMask = __atomic_load_n(&ncclProfilerEventMask, __ATOMIC_RELAXED);
+  int groupApiMask = ncclProfileGroupApi | ncclProfileP2pApi | ncclProfileCollApi | ncclProfileKernelLaunch | ncclProfileGroup | ncclProfileColl | ncclProfileP2p | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh | ncclProfileNetPlugin;
+  // Only count outermost groups when emitting group API events
+  if (__builtin_expect(ncclProfiler != NULL, 0) && (ncclProfilerApiState.eActivationMask & groupApiMask)) {
+    if (ncclProfilerApiState.profilerGroupDepth == 0) {
+      eDescr.groupApi.groupDepth = ncclGroupDepth;
+      ncclProfiler->startEvent(info->comm->profilerContext, &ncclProfilerApiState.groupApiEventHandle, &eDescr);
+      ncclProfilerApiState.profilerGroupDepth = ncclGroupDepth;
+      ncclProfilerApiState.state = ncclProfilerGroupApiStartStateStarted;
+    }
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStopGroupApiEvent() {
+  void* groupApiEventHandle = ncclProfilerApiState.groupApiEventHandle;
+  if (__builtin_expect(ncclProfiler != NULL, 0) && groupApiEventHandle && ncclProfilerApiState.profilerGroupDepth == 0) {
+    ncclProfiler->stopEvent(groupApiEventHandle);
+    ncclProfilerApiState.groupApiEventHandle = nullptr;
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerRecordGroupApiEventState(ncclProfilerEventState_t eState) {
+  void* groupApiEventHandle = ncclProfilerApiState.groupApiEventHandle;
+  bool shouldRecord = false;
+  if (eState == ncclProfilerGroupStartApiStop && ncclProfilerApiState.state == ncclProfilerGroupApiStartStateStarted) {
+    ncclProfilerApiState.state = ncclProfilerGroupApiStartStateStopped;
+    shouldRecord = true;
+  } else if (eState == ncclProfilerGroupEndApiStart && ncclProfilerApiState.state == ncclProfilerGroupApiStartStateStopped) {
+    ncclProfilerApiState.state = ncclProfilerGroupApiStartStateReset;
+    shouldRecord = true;
+  }
+
+  if (__builtin_expect(ncclProfiler != NULL, 0) && groupApiEventHandle && shouldRecord) {
+    ncclProfiler->recordEventState(groupApiEventHandle, eState, NULL);
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStartP2pApiEvent(struct ncclInfo *info, bool isGraphCaptured) {
+  ncclProfilerEventDescr_t eDescr = { 0 };
+  eDescr.type = ncclProfileP2pApi;
+  eDescr.parentObj = ncclProfilerApiState.groupApiEventHandle;
+  eDescr.p2pApi.func = ncclFuncToString(info->coll);
+  eDescr.p2pApi.count = info->count;
+  eDescr.p2pApi.datatype = ncclDatatypeToString(info->datatype);
+  eDescr.p2pApi.stream = (void *) info->stream;
+  eDescr.p2pApi.graphCaptured = isGraphCaptured;
+  int p2pApiMask = ncclProfileP2pApi | ncclProfileP2p | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh | ncclProfileNetPlugin;
+  if (__builtin_expect(ncclProfiler != NULL, 0) && (ncclProfilerApiState.eActivationMask & p2pApiMask)) {
+    ncclProfiler->startEvent(info->comm->profilerContext, &ncclProfilerApiState.p2pApiEventHandle, &eDescr);
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStopP2pApiEvent() {
+  if (__builtin_expect(ncclProfiler != NULL, 0) && ncclProfilerApiState.p2pApiEventHandle) {
+    ncclProfiler->stopEvent(ncclProfilerApiState.p2pApiEventHandle);
+    ncclProfilerApiState.p2pApiEventHandle = nullptr;
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStartCollApiEvent(struct ncclInfo *info, bool isGraphCaptured) {
+  ncclProfilerEventDescr_t eDescr = { 0 };
+  eDescr.type = ncclProfileCollApi;
+  eDescr.parentObj = ncclProfilerApiState.groupApiEventHandle;
+  eDescr.collApi.func = ncclFuncToString(info->coll);
+  eDescr.collApi.count = info->count;
+  eDescr.collApi.datatype = ncclDatatypeToString(info->datatype);
+  eDescr.collApi.stream = (void *) info->stream;
+  eDescr.collApi.root = info->root;
+  eDescr.collApi.graphCaptured = isGraphCaptured;
+  int collApiMask = ncclProfileCollApi | ncclProfileColl | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh | ncclProfileNetPlugin;
+  if (__builtin_expect(ncclProfiler != NULL, 0) && (ncclProfilerApiState.eActivationMask & collApiMask)) {
+    ncclProfiler->startEvent(info->comm->profilerContext, &ncclProfilerApiState.collApiEventHandle, &eDescr);
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStopCollApiEvent() {
+  if (__builtin_expect(ncclProfiler != NULL, 0) && ncclProfilerApiState.collApiEventHandle) {
+    ncclProfiler->stopEvent(ncclProfilerApiState.collApiEventHandle);
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStartKernelLaunchEvent(struct ncclKernelPlan* plan, cudaStream_t stream) {
+  ncclProfilerEventDescr_t eDescr = { 0 };
+  if (__builtin_expect(ncclProfiler != NULL, 0)) {
+    void* groupApiEventHandle = NULL;
+    // Check if any collective in the plan has a set event activation mask
+    struct ncclTaskColl* ct = ncclIntruQueueHead(&plan->collTaskQueue);
+    struct ncclTaskP2p* pt = ncclIntruQueueHead(&plan->p2pTaskQueue);
+    int eActivationMask_ = 0;
+    while (ct) {
+      if (ct->eActivationMask) {
+        eActivationMask_ = ct->eActivationMask;
+        groupApiEventHandle = ct->groupApiEventHandle;
+        goto startKernelLaunchEvent;
+      }
+      ct = ct->next;
+    }
+    // Check if any point-to-point task in the plan has a set event activation mask
+    while (pt) {
+      if (pt->eActivationMask) {
+        eActivationMask_ = pt->eActivationMask;
+        groupApiEventHandle = pt->groupApiEventHandle;
+        goto startKernelLaunchEvent;
+      }
+      pt = pt->next;
+    }
+
+  startKernelLaunchEvent:
+    if (eActivationMask_ & ncclProfileKernelLaunch) {
+      eDescr.type = ncclProfileKernelLaunch;
+      eDescr.parentObj = groupApiEventHandle;
+      eDescr.kernelLaunch.stream = (void *) stream;
+      ncclProfiler->startEvent(plan->comm->profilerContext, &plan->kernelLaunchEventHandle, &eDescr);
+    }
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclProfilerStopKernelLaunchEvent(struct ncclKernelPlan* plan) {
+  if (__builtin_expect(ncclProfiler != NULL, 0) && plan->kernelLaunchEventHandle) {
+    ncclProfiler->stopEvent(plan->kernelLaunchEventHandle);
+  }
+  return ncclSuccess;
+}
+
 ncclResult_t ncclProfilerStartGroupEvent(struct ncclKernelPlan* plan) {
   TIME_START_EVENT(groupStart);
   if (__builtin_expect(ncclProfiler != NULL, 0)) {
@@ -237,26 +395,25 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
   struct ncclTaskColl* ct = ncclIntruQueueHead(&plan->collTaskQueue);
   while (ct) {
     if (__builtin_expect(ncclProfiler != NULL, 0)) {
-      if (plan->groupEventHandle) {
-        int enable = ct->eActivationMask & (ncclProfileColl | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh | ncclProfileNetPlugin);
-        if (enable) {
-          ncclProfilerEventDescr_t eDescr = { 0 };
-          eDescr.type = ncclProfileColl;
-          eDescr.parentObj = plan->groupEventHandle;
-          eDescr.rank = plan->comm->rank;
-          eDescr.coll.seqNumber = plan->comm->seqNumber[ct->func];
-          eDescr.coll.func = ncclFuncToString(ct->func);
-          eDescr.coll.sendBuff = ct->sendbuff;
-          eDescr.coll.recvBuff = ct->recvbuff;
-          eDescr.coll.count = ct->count;
-          eDescr.coll.root = ct->root;
-          eDescr.coll.datatype = ncclDatatypeToString(ct->datatype);
-          eDescr.coll.nChannels = ct->nChannels;
-          eDescr.coll.nWarps = ct->nWarps;
-          eDescr.coll.algo = ncclAlgoToString(ct->algorithm);
-          eDescr.coll.proto = ncclProtoToString(ct->protocol);
-          ncclProfiler->startEvent(plan->comm->profilerContext, &ct->eventHandle, &eDescr);
-        }
+      int enable = ct->eActivationMask & (ncclProfileColl | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh | ncclProfileNetPlugin);
+      if (enable) {
+        ncclProfilerEventDescr_t eDescr = { 0 };
+        eDescr.type = ncclProfileColl;
+        eDescr.coll.parentGroup = plan->groupEventHandle;
+        eDescr.parentObj = ct->collApiEventHandle;
+        eDescr.rank = plan->comm->rank;
+        eDescr.coll.seqNumber = plan->comm->seqNumber[ct->func];
+        eDescr.coll.func = ncclFuncToString(ct->func);
+        eDescr.coll.sendBuff = ct->sendbuff;
+        eDescr.coll.recvBuff = ct->recvbuff;
+        eDescr.coll.count = ct->count;
+        eDescr.coll.root = ct->root;
+        eDescr.coll.datatype = ncclDatatypeToString(ct->datatype);
+        eDescr.coll.nChannels = ct->nChannels;
+        eDescr.coll.nWarps = ct->nWarps;
+        eDescr.coll.algo = ncclAlgoToString(ct->algorithm);
+        eDescr.coll.proto = ncclProtoToString(ct->protocol);
+        ncclProfiler->startEvent(plan->comm->profilerContext, &ct->eventHandle, &eDescr);
       }
     }
     // comm->seqNumber values are updated even if the plugin is not active, since they are used by RAS as well.
@@ -265,31 +422,30 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
     // reports from RAS.  Instead, we choose not to include graph-captured collectives in our counts.  An exception is
     // made if ncclProfileKernelCh profiler events are active, as they result in proxy events always being added, which
     // gives the consistency.
-    if (!plan->persistent || (__builtin_expect(ncclProfiler != NULL, 0) && plan->groupEventHandle &&
+    if (!plan->persistent || (__builtin_expect(ncclProfiler != NULL, 0) && (plan->groupEventHandle || ct->collApiEventHandle) &&
                               (ct->eActivationMask & ncclProfileKernelCh)))
       __atomic_fetch_add(&plan->comm->seqNumber[ct->func], 1, __ATOMIC_RELAXED);
     ct = ct->next;
   }
   if (__builtin_expect(ncclProfiler != NULL, 0)) {
-    if (plan->groupEventHandle) {
-      struct ncclTaskP2p* pt = ncclIntruQueueHead(&plan->p2pTaskQueue);
-      while (pt) {
-        int enable = pt->eActivationMask & (ncclProfileP2p | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh);
-        if (enable) {
-          ncclProfilerEventDescr_t eDescr = { 0 };
-          eDescr.type = ncclProfileP2p;
-          eDescr.parentObj = plan->groupEventHandle;
-          eDescr.rank = plan->comm->rank;
-          eDescr.p2p.func = ncclFuncToString(pt->func);
-          eDescr.p2p.buff = pt->buff;
-          eDescr.p2p.count = pt->count;
-          eDescr.p2p.datatype = ncclDatatypeToString(pt->datatype);
-          eDescr.p2p.peer = pt->root;
-          eDescr.p2p.nChannels = pt->nChannels;
-          ncclProfiler->startEvent(plan->comm->profilerContext, &pt->eventHandle, &eDescr);
-        }
-        pt = pt->next;
+    struct ncclTaskP2p* pt = ncclIntruQueueHead(&plan->p2pTaskQueue);
+    while (pt) {
+      int enable = pt->eActivationMask & (ncclProfileP2p | ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileKernelCh | ncclProfileNetPlugin);
+      if (enable) {
+        ncclProfilerEventDescr_t eDescr = { 0 };
+        eDescr.type = ncclProfileP2p;
+        eDescr.p2p.parentGroup = plan->groupEventHandle;
+        eDescr.parentObj = pt->p2pApiEventHandle;
+        eDescr.rank = plan->comm->rank;
+        eDescr.p2p.func = ncclFuncToString(pt->func);
+        eDescr.p2p.buff = pt->buff;
+        eDescr.p2p.count = pt->count;
+        eDescr.p2p.datatype = ncclDatatypeToString(pt->datatype);
+        eDescr.p2p.peer = pt->root;
+        eDescr.p2p.nChannels = pt->nChannels;
+        ncclProfiler->startEvent(plan->comm->profilerContext, &pt->eventHandle, &eDescr);
       }
+      pt = pt->next;
     }
   }
   TIME_STOP_EVENT(taskStart);
@@ -299,17 +455,15 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
 ncclResult_t ncclProfilerStopTaskEvents(struct ncclKernelPlan* plan) {
   TIME_START_EVENT(taskStop);
   if (__builtin_expect(ncclProfiler != NULL, 0)) {
-    if (plan->groupEventHandle) {
-      struct ncclTaskColl* ct = ncclIntruQueueHead(&plan->collTaskQueue);
-      while (ct) {
-        if (ct->eventHandle) ncclProfiler->stopEvent(ct->eventHandle);
-        ct = ct->next;
-      }
-      struct ncclTaskP2p* pt = ncclIntruQueueHead(&plan->p2pTaskQueue);
-      while (pt) {
-        if (pt->eventHandle) ncclProfiler->stopEvent(pt->eventHandle);
-        pt = pt->next;
-      }
+    struct ncclTaskColl* ct = ncclIntruQueueHead(&plan->collTaskQueue);
+    while (ct) {
+      if (ct->eventHandle) ncclProfiler->stopEvent(ct->eventHandle);
+      ct = ct->next;
+    }
+    struct ncclTaskP2p* pt = ncclIntruQueueHead(&plan->p2pTaskQueue);
+    while (pt) {
+      if (pt->eventHandle) ncclProfiler->stopEvent(pt->eventHandle);
+      pt = pt->next;
     }
   }
   TIME_STOP_EVENT(taskStop);
@@ -357,18 +511,18 @@ ncclResult_t ncclProfilerStopProxyOpEvent(int s, struct ncclProxyArgs* args) {
 ncclResult_t ncclProfilerStartSendProxyStepEvent(int s, struct ncclProxyArgs* args, int stepId) {
   TIME_START_EVENT(proxyStepStart);
   struct ncclProxySubArgs* sub = &args->subs[s];
+  int step_ = DIVUP(stepId, args->sliceSteps);
   if (__builtin_expect(ncclProfiler != NULL, 0)) {
-    if (sub->opEventHandle && (sub->eActivationMask & (ncclProfileProxyStep | ncclProfileNetPlugin))) {
-      int step_ = DIVUP(stepId, args->sliceSteps);
+    if (sub->eActivationMask & (ncclProfileProxyStep | ncclProfileNetPlugin)) {
       ncclProfilerEventDescr_t eDescr = { 0 };
       eDescr.type = ncclProfileProxyStep;
       eDescr.parentObj = sub->opEventHandle;
       eDescr.rank = sub->rank;
       eDescr.proxyStep.step = step_;
       ncclProfiler->startEvent(sub->profilerContext, &sub->pHandles[step_%NCCL_STEPS].stepEventHandle, &eDescr);
-      sub->pHandles[step_%NCCL_STEPS].subArgPtr = sub;
     }
   }
+  sub->pHandles[step_%NCCL_STEPS].subArgPtr = sub;
   TIME_STOP_EVENT(proxyStepStart);
   return ncclSuccess;
 }
@@ -376,18 +530,18 @@ ncclResult_t ncclProfilerStartSendProxyStepEvent(int s, struct ncclProxyArgs* ar
 ncclResult_t ncclProfilerStartRecvProxyStepEvent(int s, struct ncclProxyArgs* args, int stepId) {
   TIME_START_EVENT(proxyStepStart);
   struct ncclProxySubArgs* sub = &args->subs[s];
+  int step_ = DIVUP(stepId, args->sliceSteps);
   if (__builtin_expect(ncclProfiler != NULL, 0)) {
-    if (sub->opEventHandle && (sub->eActivationMask & (ncclProfileProxyStep | ncclProfileNetPlugin))) {
-      int step_ = DIVUP(stepId, args->sliceSteps);
+    if (sub->eActivationMask & (ncclProfileProxyStep | ncclProfileNetPlugin)) {
       ncclProfilerEventDescr_t eDescr = { 0 };
       eDescr.type = ncclProfileProxyStep;
       eDescr.parentObj = sub->opEventHandle;
       eDescr.rank = sub->rank;
       eDescr.proxyStep.step = step_;
       ncclProfiler->startEvent(sub->profilerContext, &sub->pHandles[step_%NCCL_STEPS].stepEventHandle, &eDescr);
-      sub->pHandles[step_%NCCL_STEPS].subArgPtr = sub;
     }
   }
+  sub->pHandles[step_%NCCL_STEPS].subArgPtr = sub;
   TIME_STOP_EVENT(proxyStepStart);
   return ncclSuccess;
 }
@@ -503,11 +657,11 @@ ncclResult_t ncclProfilerAddPidToProxyOp(struct ncclProxyOp* op) {
   return ncclSuccess;
 }
 
-static pthread_mutex_t proxyProfilerConnectLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex proxyProfilerConnectMutex;
 
 static ncclResult_t proxyProfilerConnect(struct ncclComm* comm, struct ncclProxyOp* op) {
   ncclResult_t ret = ncclSuccess;
-  pthread_mutex_lock(&proxyProfilerConnectLock);
+  std::lock_guard<std::mutex> lock(proxyProfilerConnectMutex);
   if (comm->profiler.initialized) goto exit;
   for (int c = 0; c < MAXCHANNELS; c++) {
     NCCLCHECKGOTO(ncclProxyConnect(comm, TRANSPORT_PROFILER, 0, comm->rank, &comm->profiler.sendProxyConn[c]), ret, exit);
@@ -517,7 +671,6 @@ static ncclResult_t proxyProfilerConnect(struct ncclComm* comm, struct ncclProxy
   }
   comm->profiler.initialized = true;
 exit:
-  pthread_mutex_unlock(&proxyProfilerConnectLock);
   return ret;
 }
 
diff --git a/projects/rccl/src/plugin/profiler/CMakeLists.txt b/projects/rccl/src/plugin/profiler/CMakeLists.txt
new file mode 100644
index 0000000000..1a5cc9a30c
--- /dev/null
+++ b/projects/rccl/src/plugin/profiler/CMakeLists.txt
@@ -0,0 +1,11 @@
+# Profiler plugin sources
+set(PLUGIN_PROFILER_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler_v3.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler_v4.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler_v1.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler_v2.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler_v5.cc
+)
+
+# Add profiler plugin sources to parent scope
+set(PLUGIN_PROFILER_SOURCES ${PLUGIN_PROFILER_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/plugin/profiler/profiler_v1.cc b/projects/rccl/src/plugin/profiler/profiler_v1.cc
index 2126afc681..ef8ef6b5d8 100644
--- a/projects/rccl/src/plugin/profiler/profiler_v1.cc
+++ b/projects/rccl/src/plugin/profiler/profiler_v1.cc
@@ -7,6 +7,7 @@
 #include "comm.h"
 #include "nccl_profiler.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclProfiler_t ncclProfiler;
 static ncclProfiler_v1_t* ncclProfiler_v1;
@@ -63,6 +64,7 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
     case ncclProfileColl: {
       eDescr_v1.coll.name = nullptr; // removed in v4
       eDescr_v1.coll.commHash = 0; // removed in v4
+      eDescr_v1.parentObj = eDescr->coll.parentGroup; // Hierarchy changed in v5
       eDescr_v1.coll.seqNumber = eDescr->coll.seqNumber;
       eDescr_v1.coll.func = ncclStringToFunc(eDescr->coll.func);
       eDescr_v1.coll.sendBuff = eDescr->coll.sendBuff;
@@ -80,6 +82,7 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
     case ncclProfileP2p: {
       eDescr_v1.p2p.name = nullptr; // removed in v4
       eDescr_v1.p2p.commHash = 0; // removed in v4
+      eDescr_v1.parentObj = eDescr->p2p.parentGroup; // Hierarchy changed in v5
       eDescr_v1.p2p.func = ncclStringToFunc(eDescr->p2p.func);
       eDescr_v1.p2p.buff = eDescr->p2p.buff;
       eDescr_v1.p2p.count = eDescr->p2p.count;
@@ -125,7 +128,15 @@ static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEve
   return ncclProfiler_v1->recordEventState(eHandle, eState, &args);
 }
 
-static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
+static ncclResult_t ncclProfiler_init(void** context,
+    uint64_t commId __attribute__((unused)),
+    int* eActivationMask __attribute__((unused)),
+    const char* commName __attribute__((unused)),
+    int nNodes __attribute__((unused)),
+    int nranks __attribute__((unused)),
+    int rank __attribute__((unused)),
+    ncclDebugLogger_t logfn __attribute__((unused))
+  ) {
   NCCLCHECK(ncclProfiler_v1->init(context, eActivationMask));
   ncclProfiler.startEvent = ncclProfiler_startEvent;
   ncclProfiler.stopEvent = ncclProfiler_v1->stopEvent;
@@ -139,9 +150,8 @@ ncclProfiler_t* getNcclProfiler_v1(void* lib) {
   if (ncclProfiler_v1) {
     ncclProfiler.name = ncclProfiler_v1->name;
     ncclProfiler.init = ncclProfiler_init;
-    INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: loaded %s", ncclProfiler_v1->name);
+    INFO(NCCL_INIT, "PROFILER/Plugin: Loaded %s (v1)", ncclProfiler_v1->name);
     return &ncclProfiler;
   }
-  INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: failed to find ncclProfiler_v1.");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/profiler/profiler_v2.cc b/projects/rccl/src/plugin/profiler/profiler_v2.cc
index 11e521e902..d1c83cf1a4 100644
--- a/projects/rccl/src/plugin/profiler/profiler_v2.cc
+++ b/projects/rccl/src/plugin/profiler/profiler_v2.cc
@@ -7,6 +7,7 @@
 #include "comm.h"
 #include "nccl_profiler.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclProfiler_t ncclProfiler;
 static ncclProfiler_v2_t* ncclProfiler_v2;
@@ -20,6 +21,7 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
   switch(eDescr->type) {
     case ncclProfileGroup: break;
     case ncclProfileColl: {
+      eDescr_v2.parentObj = eDescr->coll.parentGroup; // Hierarchy changed in v5
       eDescr_v2.coll.name = nullptr; // removed in v4
       eDescr_v2.coll.commHash = 0; // removed in v4
       eDescr_v2.coll.seqNumber = eDescr->coll.seqNumber;
@@ -38,6 +40,7 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
     case ncclProfileP2p: {
       eDescr_v2.p2p.name = nullptr; // removed in v4
       eDescr_v2.p2p.commHash = 0; // removed in v4
+      eDescr_v2.parentObj = eDescr->p2p.parentGroup; // Hierarchy changed in v5
       eDescr_v2.p2p.func = eDescr->p2p.func;
       eDescr_v2.p2p.buff = eDescr->p2p.buff;
       eDescr_v2.p2p.count = eDescr->p2p.count;
@@ -83,7 +86,15 @@ static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEve
   return ncclProfiler_v2->recordEventState(eHandle, eState, &args);
 }
 
-static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
+static ncclResult_t ncclProfiler_init(void** context,
+    uint64_t commId __attribute__((unused)),
+    int* eActivationMask __attribute__((unused)),
+    const char* commName __attribute__((unused)),
+    int nNodes __attribute__((unused)),
+    int nranks __attribute__((unused)),
+    int rank __attribute__((unused)),
+    ncclDebugLogger_t logfn __attribute__((unused))
+  ) {
   NCCLCHECK(ncclProfiler_v2->init(context, eActivationMask));
   ncclProfiler.startEvent = ncclProfiler_startEvent;
   ncclProfiler.stopEvent = ncclProfiler_v2->stopEvent;
@@ -97,9 +108,8 @@ ncclProfiler_t* getNcclProfiler_v2(void* lib) {
   if (ncclProfiler_v2) {
     ncclProfiler.name = ncclProfiler_v2->name;
     ncclProfiler.init = ncclProfiler_init;
-    INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: loaded %s", ncclProfiler_v2->name);
+    INFO(NCCL_INIT, "PROFILER/Plugin: Loaded %s (v2)", ncclProfiler_v2->name);
     return &ncclProfiler;
   }
-  INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: failed to find ncclProfiler_v2");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/profiler/profiler_v3.cc b/projects/rccl/src/plugin/profiler/profiler_v3.cc
index 3dba3231ab..84ec1468e5 100644
--- a/projects/rccl/src/plugin/profiler/profiler_v3.cc
+++ b/projects/rccl/src/plugin/profiler/profiler_v3.cc
@@ -7,6 +7,7 @@
 #include "comm.h"
 #include "nccl_profiler.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclProfiler_t ncclProfiler;
 static ncclProfiler_v3_t* ncclProfiler_v3;
@@ -22,6 +23,7 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
     case ncclProfileColl: {
       eDescr_v3.coll.name = nullptr; // removed in v4
       eDescr_v3.coll.commHash = 0; // removed in v4
+      eDescr_v3.parentObj = eDescr->coll.parentGroup; // Hierarchy changed in v5
       eDescr_v3.coll.seqNumber = eDescr->coll.seqNumber;
       eDescr_v3.coll.func = eDescr->coll.func;
       eDescr_v3.coll.sendBuff = eDescr->coll.sendBuff;
@@ -37,6 +39,7 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
     case ncclProfileP2p: {
       eDescr_v3.p2p.name = nullptr; // removed in v4
       eDescr_v3.p2p.commHash = 0; // removed in v4
+      eDescr_v3.parentObj = eDescr->p2p.parentGroup; // Hierarchy changed in v5
       eDescr_v3.p2p.func = eDescr->p2p.func;
       eDescr_v3.p2p.buff = eDescr->p2p.buff;
       eDescr_v3.p2p.count = eDescr->p2p.count;
@@ -89,7 +92,15 @@ static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEve
   return ncclProfiler_v3->recordEventState(eHandle, eState, &args);
 }
 
-static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
+static ncclResult_t ncclProfiler_init(void** context,
+    uint64_t commId __attribute__((unused)),
+    int* eActivationMask __attribute__((unused)),
+    const char* commName __attribute__((unused)),
+    int nNodes __attribute__((unused)),
+    int nranks __attribute__((unused)),
+    int rank __attribute__((unused)),
+    ncclDebugLogger_t logfn __attribute__((unused))
+  ) {
   NCCLCHECK(ncclProfiler_v3->init(context, eActivationMask));
   ncclProfiler.startEvent = ncclProfiler_startEvent;
   ncclProfiler.stopEvent = ncclProfiler_v3->stopEvent;
@@ -103,9 +114,8 @@ ncclProfiler_t* getNcclProfiler_v3(void* lib) {
   if (ncclProfiler_v3) {
     ncclProfiler.name = ncclProfiler_v3->name;
     ncclProfiler.init = ncclProfiler_init;
-    INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: loaded %s", ncclProfiler_v3->name);
+    INFO(NCCL_INIT, "PROFILER/Plugin: Loaded %s (v3)", ncclProfiler_v3->name);
     return &ncclProfiler;
   }
-  INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: failed to find ncclProfiler_v3");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/profiler/profiler_v4.cc b/projects/rccl/src/plugin/profiler/profiler_v4.cc
index 11bed891a8..53b57ce806 100644
--- a/projects/rccl/src/plugin/profiler/profiler_v4.cc
+++ b/projects/rccl/src/plugin/profiler/profiler_v4.cc
@@ -7,15 +7,113 @@
 #include "comm.h"
 #include "nccl_profiler.h"
 #include "checks.h"
+#include <dlfcn.h>
 
 static ncclProfiler_v4_t* ncclProfiler_v4;
+static ncclProfiler_t ncclProfiler;
+
+static ncclResult_t ncclProfiler_startEvent(void* ctx, void** eHandle, ncclProfilerEventDescr_t* eDescr) {
+  ncclProfilerEventDescr_v4_t eDescr_v4;
+  eDescr_v4.type = eDescr->type;
+  eDescr_v4.parentObj = eDescr->parentObj;
+  eDescr_v4.rank = eDescr->rank;
+  switch(eDescr->type) {
+    case ncclProfileGroup: break;
+    case ncclProfileColl: {
+      eDescr_v4.coll.seqNumber = eDescr->coll.seqNumber;
+      eDescr_v4.coll.func = eDescr->coll.func;
+      eDescr_v4.coll.sendBuff = eDescr->coll.sendBuff;
+      eDescr_v4.coll.recvBuff = eDescr->coll.recvBuff;
+      eDescr_v4.coll.count = eDescr->coll.count;
+      eDescr_v4.coll.root = eDescr->coll.root;
+      eDescr_v4.coll.datatype = eDescr->coll.datatype;
+      eDescr_v4.coll.nChannels = eDescr->coll.nChannels;
+      eDescr_v4.coll.nWarps = eDescr->coll.nWarps;
+      eDescr_v4.coll.algo = eDescr->coll.algo;
+      eDescr_v4.coll.proto = eDescr->coll.proto;
+      eDescr_v4.parentObj = eDescr->coll.parentGroup;
+    } break;
+    case ncclProfileP2p: {
+      eDescr_v4.p2p.func = eDescr->p2p.func;
+      eDescr_v4.p2p.buff = eDescr->p2p.buff;
+      eDescr_v4.p2p.count = eDescr->p2p.count;
+      eDescr_v4.p2p.datatype = eDescr->p2p.datatype;
+      eDescr_v4.p2p.peer = eDescr->p2p.peer;
+      eDescr_v4.parentObj = eDescr->p2p.parentGroup;
+    } break;
+    case ncclProfileProxyOp: {
+      eDescr_v4.proxyOp.pid = eDescr->proxyOp.pid;
+      eDescr_v4.proxyOp.channelId = eDescr->proxyOp.channelId;
+      eDescr_v4.proxyOp.peer = eDescr->proxyOp.peer;
+      eDescr_v4.proxyOp.nSteps = eDescr->proxyOp.nSteps;
+      eDescr_v4.proxyOp.chunkSize = eDescr->proxyOp.chunkSize;
+      eDescr_v4.proxyOp.isSend = eDescr->proxyOp.isSend;
+    } break;
+    case ncclProfileProxyStep: {
+      eDescr_v4.proxyStep.step = eDescr->proxyStep.step;
+    } break;
+    case ncclProfileProxyCtrl: break;
+    case ncclProfileKernelCh: {
+      eDescr_v4.kernelCh.channelId = eDescr->kernelCh.channelId;
+      eDescr_v4.kernelCh.pTimer = eDescr->kernelCh.pTimer;
+    } break;
+    case ncclProfileNetPlugin: {
+      eDescr_v4.netPlugin.id = eDescr->netPlugin.id;
+      eDescr_v4.netPlugin.data = eDescr->netPlugin.data;
+    } break;
+    default: return ncclSuccess;
+  }
+  return ncclProfiler_v4->startEvent(ctx, eHandle, &eDescr_v4);
+}
+
+static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEventState_t eState, ncclProfilerEventStateArgs_t* eStateArgs) {
+  ncclProfilerEventStateArgs_v4_t eStateArgs_v4;
+  switch(eState) {
+    case ncclProfilerProxyOpInProgress_v4:
+      break;
+    case ncclProfilerProxyStepSendGPUWait:
+    case ncclProfilerProxyStepSendPeerWait_v4:
+    case ncclProfilerProxyStepSendWait:
+    case ncclProfilerProxyStepRecvWait:
+    case ncclProfilerProxyStepRecvFlushWait:
+    case ncclProfilerProxyStepRecvGPUWait:
+      eStateArgs_v4.proxyStep.transSize = eStateArgs->proxyStep.transSize;
+      break;
+    case ncclProfilerNetPluginUpdate:
+      eStateArgs_v4.netPlugin.data = eStateArgs->netPlugin.data;
+      break;
+    case ncclProfilerKernelChStop:
+      eStateArgs_v4.kernelCh.pTimer = eStateArgs->kernelCh.pTimer;
+      break;
+    case ncclProfilerProxyCtrlIdle:
+    case ncclProfilerProxyCtrlActive:
+    case ncclProfilerProxyCtrlSleep:
+    case ncclProfilerProxyCtrlWakeup:
+    case ncclProfilerProxyCtrlAppend:
+    case ncclProfilerProxyCtrlAppendEnd:
+      eStateArgs_v4.proxyCtrl.appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
+      break;
+    default: return ncclSuccess;
+  }
+  return ncclProfiler_v4->recordEventState(eHandle, (ncclProfilerEventState_v4_t)eState, &eStateArgs_v4);
+}
+
+static ncclResult_t ncclProfiler_init(void** ctx, uint64_t commId, int* eActivationMask, const char* commName, int nNodes, int nRanks, int rank, ncclDebugLogger_t logfn) {
+  NCCLCHECK(ncclProfiler_v4->init(ctx, eActivationMask, commName, commId, nNodes, nRanks, rank, logfn));
+  ncclProfiler.startEvent = ncclProfiler_startEvent;
+  ncclProfiler.recordEventState = ncclProfiler_recordEventState;
+  ncclProfiler.stopEvent = ncclProfiler_v4->stopEvent;
+  ncclProfiler.finalize = ncclProfiler_v4->finalize;
+  return ncclSuccess;
+}
 
 ncclProfiler_t* getNcclProfiler_v4(void* lib) {
   ncclProfiler_v4 = (ncclProfiler_v4_t*)dlsym(lib, "ncclProfiler_v4");
   if (ncclProfiler_v4) {
-    INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: loaded %s", ncclProfiler_v4->name);
-    return ncclProfiler_v4;
+    ncclProfiler.name = ncclProfiler_v4->name;
+    ncclProfiler.init = ncclProfiler_init;
+    INFO(NCCL_INIT, "PROFILER/Plugin: Loaded %s (v4)", ncclProfiler_v4->name);
+    return &ncclProfiler;
   }
-  INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: failed to find ncclProfiler_v4");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/profiler/profiler_v5.cc b/projects/rccl/src/plugin/profiler/profiler_v5.cc
new file mode 100644
index 0000000000..01d73db053
--- /dev/null
+++ b/projects/rccl/src/plugin/profiler/profiler_v5.cc
@@ -0,0 +1,21 @@
+/*************************************************************************
+ * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "comm.h"
+#include "nccl_profiler.h"
+#include "checks.h"
+#include <dlfcn.h>
+
+static ncclProfiler_v5_t* ncclProfiler_v5;
+
+ncclProfiler_t* getNcclProfiler_v5(void* lib) {
+  ncclProfiler_v5 = (ncclProfiler_v5_t*)dlsym(lib, "ncclProfiler_v5");
+  if (ncclProfiler_v5) {
+    INFO(NCCL_INIT, "PROFILER/Plugin: Loaded %s (v5)", ncclProfiler_v5->name);
+    return ncclProfiler_v5;
+  }
+  return NULL;
+}
diff --git a/projects/rccl/src/plugin/tuner.cc b/projects/rccl/src/plugin/tuner.cc
index 24a59de2e2..dfa21ae7ee 100644
--- a/projects/rccl/src/plugin/tuner.cc
+++ b/projects/rccl/src/plugin/tuner.cc
@@ -7,6 +7,7 @@
 
 #include <errno.h>
 #include <stdlib.h>
+#include <mutex>
 
 #include "checks.h"
 #include "debug.h"
@@ -16,8 +17,9 @@
 extern ncclTuner_t* getNcclTuner_v2(void* lib);
 extern ncclTuner_t* getNcclTuner_v3(void* lib);
 extern ncclTuner_t* getNcclTuner_v4(void* lib);
+extern ncclTuner_t* getNcclTuner_v5(void* lib);
 
-pthread_mutex_t tunerPluginLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex tunerPluginMutex;
 static int tunerPluginRefCount;
 static void* tunerPluginLib = nullptr;
 static ncclTuner_t* tunerSymbol = nullptr;
@@ -33,13 +35,14 @@ enum {
 static int status = tunerPluginLoadReady;
 
 ncclResult_t ncclTunerPluginLoad(struct ncclComm* comm) {
+  const char* tunerName;
   // Initialize to nullptr by default if plugin tuner cannot be loaded.
   comm->tuner = nullptr;
   if (tunerPluginLoadFailed == status) {
     return ncclSuccess;
   }
 
-  pthread_mutex_lock(&tunerPluginLock);
+  std::lock_guard<std::mutex> lock(tunerPluginMutex);
   if (tunerPluginLoadFailed == status) {
     goto exit;
   }
@@ -50,15 +53,26 @@ ncclResult_t ncclTunerPluginLoad(struct ncclComm* comm) {
     goto exit;
   }
 
-  tunerPluginLib = ncclOpenTunerPluginLib(ncclGetEnv("NCCL_TUNER_PLUGIN"));
+  if ((tunerName = ncclGetEnv("NCCL_TUNER_PLUGIN")) != nullptr) {
+    INFO(NCCL_ENV|NCCL_TUNING, "NCCL_TUNER_PLUGIN set by environment to %s", tunerName);
+    if (strcasecmp(tunerName, "none") == 0)
+      goto fail;
+  }
+  tunerPluginLib = ncclOpenTunerPluginLib(tunerName);
   if (nullptr == tunerPluginLib) {
     tunerPluginLib = ncclGetNetPluginLib(ncclPluginTypeTuner);
     if (nullptr == tunerPluginLib) {
       goto fail;
     }
+    tunerName = nullptr;
+  } else if (ncclPluginLibPaths[ncclPluginTypeTuner]) {
+    tunerName = ncclPluginLibPaths[ncclPluginTypeTuner];
   }
 
-  tunerSymbol = getNcclTuner_v4(tunerPluginLib);
+  tunerSymbol = getNcclTuner_v5(tunerPluginLib);
+  if (tunerSymbol == NULL) {
+    tunerSymbol = getNcclTuner_v4(tunerPluginLib);
+  }
   if (tunerSymbol == NULL) {
     tunerSymbol = getNcclTuner_v3(tunerPluginLib);
   }
@@ -66,8 +80,10 @@ ncclResult_t ncclTunerPluginLoad(struct ncclComm* comm) {
     tunerSymbol = getNcclTuner_v2(tunerPluginLib);
   }
   if (tunerSymbol == NULL) {
+    if (tunerName) INFO(NCCL_INIT|NCCL_TUNING, "External tuner plugin %s is unsupported", tunerName);
     goto fail;
   }
+  if (tunerName) INFO(NCCL_INIT|NCCL_TUNING, "Successfully loaded external tuner plugin %s", tunerName);
 
   comm->tuner = tunerSymbol;
   ++tunerPluginRefCount;
@@ -75,7 +91,6 @@ ncclResult_t ncclTunerPluginLoad(struct ncclComm* comm) {
   comm->tunerPluginLoaded = 1;
 
 exit:
-  pthread_mutex_unlock(&tunerPluginLock);
   return ncclSuccess;
 fail:
   if (tunerPluginLib) NCCLCHECK(ncclClosePluginLib(tunerPluginLib, ncclPluginTypeTuner));
@@ -85,9 +100,9 @@ fail:
 }
 
 ncclResult_t ncclTunerPluginUnload(struct ncclComm* comm) {
-  pthread_mutex_lock(&tunerPluginLock);
+  std::lock_guard<std::mutex> lock(tunerPluginMutex);
   if (comm->tunerPluginLoaded && 0 == (--tunerPluginRefCount)) {
-    INFO(NCCL_TUNING, "TUNER/Plugin: Closing tuner: '%s'", tunerSymbol->name);
+    INFO(NCCL_INIT|NCCL_TUNING, "TUNER/Plugin: Closing tuner: '%s'", tunerSymbol->name);
     NCCLCHECK(ncclClosePluginLib(tunerPluginLib, ncclPluginTypeTuner));
     tunerPluginLib = nullptr;
     tunerSymbol = nullptr;
@@ -95,6 +110,5 @@ ncclResult_t ncclTunerPluginUnload(struct ncclComm* comm) {
     status = tunerPluginLoadReady;
     comm->tunerPluginLoaded = 0;
   }
-  pthread_mutex_unlock(&tunerPluginLock);
   return ncclSuccess;
 }
diff --git a/projects/rccl/src/plugin/tuner/CMakeLists.txt b/projects/rccl/src/plugin/tuner/CMakeLists.txt
new file mode 100644
index 0000000000..71f4498ad3
--- /dev/null
+++ b/projects/rccl/src/plugin/tuner/CMakeLists.txt
@@ -0,0 +1,10 @@
+# Tuner plugin sources
+set(PLUGIN_TUNER_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/tuner_v2.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/tuner_v3.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/tuner_v4.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/tuner_v5.cc
+)
+
+# Add tuner plugin sources to parent scope
+set(PLUGIN_TUNER_SOURCES ${PLUGIN_TUNER_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/plugin/tuner/tuner_v2.cc b/projects/rccl/src/plugin/tuner/tuner_v2.cc
index 005638f013..9deefc1fdd 100644
--- a/projects/rccl/src/plugin/tuner/tuner_v2.cc
+++ b/projects/rccl/src/plugin/tuner/tuner_v2.cc
@@ -46,10 +46,15 @@ static ncclResult_t ncclTuner_getCollInfo(void* context, ncclFunc_t collType, si
   return ncclSuccess;
 }
 
-static ncclResult_t ncclTuner_init(size_t nRanks, size_t nNodes, ncclDebugLogger_t logfn, void** context) {
-  NCCLCHECK(ncclTuner_v2->init(nRanks, nNodes, logfn, context));
+static ncclResult_t ncclTuner_finalize(void* ctx) {
+  return ncclTuner_v2->destroy(ctx);
+}
+
+static ncclResult_t ncclTuner_init(void** ctx, uint64_t commId, size_t nRanks, size_t nNodes, ncclDebugLogger_t logfn,
+                                   ncclNvlDomainInfo_v5_t* nvlDomainInfo, ncclTunerConstants_t* /*constants*/) {
+  NCCLCHECK(ncclTuner_v2->init(nRanks, nNodes, logfn, ctx));
   ncclTuner.getCollInfo = ncclTuner_getCollInfo;
-  ncclTuner.destroy = ncclTuner_v2->destroy;
+  ncclTuner.finalize = ncclTuner_finalize;
   return ncclSuccess;
 }
 
@@ -58,9 +63,8 @@ ncclTuner_t* getNcclTuner_v2(void* lib) {
   if (ncclTuner_v2) {
     ncclTuner.name = ncclTuner_v2->name;
     ncclTuner.init = ncclTuner_init;
-    INFO(NCCL_ENV|NCCL_TUNING, "TUNER/Plugin: Using tuner plugin %s", ncclTuner_v2->name);
+    INFO(NCCL_INIT|NCCL_TUNING, "TUNER/Plugin: Using %s (v2)", ncclTuner_v2->name);
     return &ncclTuner;
   }
-  INFO(NCCL_ENV|NCCL_TUNING, "TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/tuner/tuner_v3.cc b/projects/rccl/src/plugin/tuner/tuner_v3.cc
index 3898243bc5..3f896a6443 100644
--- a/projects/rccl/src/plugin/tuner/tuner_v3.cc
+++ b/projects/rccl/src/plugin/tuner/tuner_v3.cc
@@ -18,10 +18,15 @@ static ncclResult_t ncclTuner_getCollInfo(void* context, ncclFunc_t collType, si
   return ncclSuccess;
 }
 
-static ncclResult_t ncclTuner_init(size_t nRanks, size_t nNodes, ncclDebugLogger_t logfn, void** context) {
+static ncclResult_t ncclTuner_finalize(void* ctx) {
+  return ncclTuner_v3->destroy(ctx);
+}
+
+static ncclResult_t ncclTuner_init(void** context, uint64_t commId, size_t nRanks, size_t nNodes, ncclDebugLogger_t logfn,
+                                   ncclNvlDomainInfo_v5_t* nvlDomainInfo, ncclTunerConstants_t* /*constants*/) {
   NCCLCHECK(ncclTuner_v3->init(nRanks, nNodes, logfn, context));
   ncclTuner.getCollInfo = ncclTuner_getCollInfo;
-  ncclTuner.destroy = ncclTuner_v3->destroy;
+  ncclTuner.finalize = ncclTuner_finalize;
   return ncclSuccess;
 }
 
@@ -30,9 +35,8 @@ ncclTuner_t* getNcclTuner_v3(void* lib) {
   if (ncclTuner_v3) {
     ncclTuner.name = ncclTuner_v3->name;
     ncclTuner.init = ncclTuner_init;
-    INFO(NCCL_ENV|NCCL_TUNING, "TUNER/Plugin: Using tuner plugin %s", ncclTuner_v3->name);
+    INFO(NCCL_INIT|NCCL_TUNING, "TUNER/Plugin: Using %s (v3)", ncclTuner_v3->name);
     return &ncclTuner;
   }
-  INFO(NCCL_ENV|NCCL_TUNING, "TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/tuner/tuner_v4.cc b/projects/rccl/src/plugin/tuner/tuner_v4.cc
index 4bfd116bb0..077ed0aea0 100644
--- a/projects/rccl/src/plugin/tuner/tuner_v4.cc
+++ b/projects/rccl/src/plugin/tuner/tuner_v4.cc
@@ -7,16 +7,32 @@
 
 #include <dlfcn.h>
 #include "debug.h"
+#include "checks.h"
 #include "nccl_tuner.h"
 
 static ncclTuner_v4_t* ncclTuner_v4;
+static ncclTuner_t ncclTuner;
+
+static ncclResult_t ncclTuner_finalize(void* ctx) {
+  return ncclTuner_v4->destroy(ctx);
+}
+
+static ncclResult_t ncclTuner_init(void** context, uint64_t commId, size_t nRanks, size_t nNodes, ncclDebugLogger_t logfn,
+                                   ncclNvlDomainInfo_v5_t* nvlDomainInfo, ncclTunerConstants_t* /*constants*/) {
+  NCCLCHECK(ncclTuner_v4->init(nRanks, nNodes, logfn, context));
+  ncclTuner.getCollInfo = ncclTuner_v4->getCollInfo;
+  ncclTuner.finalize = ncclTuner_finalize;
+  return ncclSuccess;
+}
 
 ncclTuner_t* getNcclTuner_v4(void* lib) {
   ncclTuner_v4 = (ncclTuner_v4_t*)dlsym(lib, "ncclTunerPlugin_v4");
   if (ncclTuner_v4) {
-    INFO(NCCL_ENV|NCCL_TUNING, "TUNER/Plugin: Using tuner plugin %s", ncclTuner_v4->name);
-    return ncclTuner_v4;
+    ncclTuner.name = ncclTuner_v4->name;
+    ncclTuner.init = ncclTuner_init;
+
+    INFO(NCCL_INIT|NCCL_TUNING, "TUNER/Plugin: Using %s (v4)", ncclTuner_v4->name);
+    return &ncclTuner;
   }
-  INFO(NCCL_ENV|NCCL_TUNING, "TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.");
   return NULL;
 }
diff --git a/projects/rccl/src/plugin/tuner/tuner_v5.cc b/projects/rccl/src/plugin/tuner/tuner_v5.cc
new file mode 100644
index 0000000000..22c3d4b42d
--- /dev/null
+++ b/projects/rccl/src/plugin/tuner/tuner_v5.cc
@@ -0,0 +1,21 @@
+/*************************************************************************
+ * Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2023, Meta Platforms, Inc. and affiliates.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include <dlfcn.h>
+#include "debug.h"
+#include "nccl_tuner.h"
+
+static ncclTuner_v5_t* ncclTuner_v5;
+
+ncclTuner_t* getNcclTuner_v5(void* lib) {
+  ncclTuner_v5 = (ncclTuner_v5_t*)dlsym(lib, "ncclTunerPlugin_v5");
+  if (ncclTuner_v5) {
+    INFO(NCCL_INIT|NCCL_TUNING, "TUNER/Plugin: Using %s (v5)", ncclTuner_v5->name);
+    return ncclTuner_v5;
+  }
+  return NULL;
+}
diff --git a/projects/rccl/src/proxy.cc b/projects/rccl/src/proxy.cc
index 74ec70f0ec..25a14cd640 100644
--- a/projects/rccl/src/proxy.cc
+++ b/projects/rccl/src/proxy.cc
@@ -13,12 +13,14 @@
 #include "timer.h"
 #include "profiler.h"
 #include "transport.h"
+#include "cpuset.h"
 
 #include <sys/syscall.h>
 #include <assert.h>
 #include <unistd.h>
 #include <sys/time.h>
 #include <sched.h>
+#include <algorithm>
 
 #define NCCL_MAX_PROXY_CONNECTIONS (NCCL_MAX_LOCAL_RANKS+1)
 
@@ -385,6 +387,8 @@ static ncclResult_t ncclProxyOpToArgs(struct ncclProxyOp* op, struct ncclProxyAr
   sub->workCounter = op->workCounter;
   args->nsubs = subIndex+1;
   if (subIndex) {
+    args->nChannels = std::min(args->nChannels, op->nChannels);
+    args->nPeers = std::min(args->nPeers, op->nPeers);
     if ((args->sliceSteps != op->sliceSteps) ||
         (args->chunkSteps != op->chunkSteps) ||
         (args->protocol != op->protocol) ||
@@ -398,7 +402,7 @@ static ncclResult_t ncclProxyOpToArgs(struct ncclProxyOp* op, struct ncclProxyAr
       WARN("Proxy append on running operation");
       return ncclInternalError;
     }
-    return ncclSuccess;
+    goto exit;
   }
   //memset(&args->progress, 0, sizeof(struct ncclProxyArgs)-offsetof(struct ncclProxyArgs, progress));
   args->done = 0;
@@ -411,11 +415,15 @@ static ncclResult_t ncclProxyOpToArgs(struct ncclProxyOp* op, struct ncclProxyAr
   args->pattern = op->pattern;
   args->protocol = op->protocol;
   args->coll = op->coll;
+  args->collAPI = op->collAPI;
   args->algorithm = op->algorithm;
+  args->nChannels = op->nChannels;
+  args->nPeers = op->nPeers;
   args->specifics = op->specifics;
   args->state = ncclProxyOpReady;
   args->progress = op->connection->tcomm->proxyProgress;
   args->proxyAppendPtr = op->connection->proxyAppendPtr;
+exit:
   if (args->pattern != ncclPatternProfiler) ncclProfilerStartProxyOpEvent(subIndex, args);
   return ncclSuccess;
 }
@@ -744,6 +752,7 @@ static ncclResult_t removeOp(struct ncclProxyProgressState* state, struct ncclPr
 static ncclResult_t progressOps(struct ncclProxyState* proxyState, struct ncclProxyProgressState* state, struct ncclProxyArgs* opStart, int* idle) {
   struct ncclProxyArgs* prevOp = NULL;
   struct ncclProxyArgs* op = opStart;
+  ncclResult_t status = ncclSuccess;
   while (op) {
     if (op->state == ncclProxyOpNone) return ncclInternalError;
     TIME_START(0); TIME_START(1);
@@ -751,6 +760,8 @@ static ncclResult_t progressOps(struct ncclProxyState* proxyState, struct ncclPr
     if (op->idle) { TIME_STOP(1); TIME_CANCEL(0); } else { TIME_CANCEL(1); TIME_STOP(0); }
     *idle &= op->idle;
     if (op->state == ncclProxyOpNone || ret != ncclSuccess) {
+      // track the first error that occurred
+      if (ret != ncclSuccess && status == ncclSuccess) status = ret;
       TIME_START(2);
       NCCLCHECK(removeOp(state, &op, &prevOp));
       TIME_STOP(2);
@@ -759,7 +770,7 @@ static ncclResult_t progressOps(struct ncclProxyState* proxyState, struct ncclPr
       op = op->next;
     }
   }
-  return ncclSuccess;
+  return status;
 }
 
 NCCL_PARAM(ProxyAppendBatchSize, "PROXY_APPEND_BATCH_SIZE", 16);
@@ -899,16 +910,43 @@ exit:
 NCCL_PARAM(ProxyDumpSignal, "PROXY_DUMP_SIGNAL", -1);
 NCCL_PARAM(ProgressAppendOpFreq, "PROGRESS_APPENDOP_FREQ", 8);
 
+static cpu_set_t proxyCpuset;
+static pthread_once_t proxyCpusetOnce = PTHREAD_ONCE_INIT;
+void proxyCpusetOnceFunc() {
+  const char* setEnv = ncclGetEnv("NCCL_PROXY_CPUSET");
+  if (setEnv) {
+    ncclResult_t res = ncclStrListToCpuset(setEnv, &proxyCpuset);
+    if (res != ncclSuccess) {
+      INFO(NCCL_ENV, "failed to decode NCCL_PROXY_CPUSET=%s. Ignoring", setEnv);
+      goto fail;
+    }
+    // debug info
+    char msg[1024] = {0};
+    cpu_set_t currSet;
+    sched_getaffinity(0, sizeof(cpu_set_t), &currSet);
+    (void)ncclCpusetToStrList(&currSet, msg, sizeof(msg));
+    snprintf(msg + strlen(msg), sizeof(msg) - strlen(msg), " changed to ");
+    (void)ncclCpusetToStrList(&proxyCpuset, msg + strlen(msg), sizeof(msg) - strlen(msg));
+    INFO(NCCL_ENV, "NCCL_PROXY_CPUSET = %s: %s", setEnv, msg);
+    return;
+  }
+  // if we arrive here we have either no env or we have failed to decode it
+fail:
+  CPU_ZERO(&proxyCpuset);
+  return;
+}
+
 void* ncclProxyProgress(void *proxyState_) {
   struct ncclProxyState* proxyState = (struct ncclProxyState*)proxyState_;
+
+  // This thread is created by proxyService, therefore setting the affinity is not needed.
+  INFO(NCCL_INIT, "[Proxy Progress] Device %d CPU core %d", proxyState->cudaDev, sched_getcpu());
+
   if (setProxyThreadContext(proxyState)) {
     INFO(NCCL_INIT, "[Proxy Progress] Set CUDA context on device %d", proxyState->cudaDev);
   } else if (cudaSetDevice(proxyState->cudaDev) != cudaSuccess) {
     WARN("[Proxy Progress] Failed to set CUDA device %d", proxyState->cudaDev);
   }
-  // if (CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
-
-  INFO(NCCL_INIT, "[Proxy Progress] Device %d CPU core %d", proxyState->cudaDev, sched_getcpu());
 
   struct ncclProxyProgressState* state = &proxyState->progressState;
   state->nextOps = -1;
@@ -1567,15 +1605,17 @@ enum {
 
 void* ncclProxyService(void* _args) {
   struct ncclProxyState* proxyState =  (struct ncclProxyState*) _args;
-  // if (CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
+
+  // set the thread affinity before setting the cuda context
+  pthread_once(&proxyCpusetOnce,proxyCpusetOnceFunc);
+  if (CPU_COUNT(&proxyCpuset)) sched_setaffinity(0, sizeof(cpu_set_t), &proxyCpuset);
+  INFO(NCCL_INIT, "[Proxy Service] Device %d CPU core %d", proxyState->cudaDev, sched_getcpu());
+
   if (setProxyThreadContext(proxyState)) {
     INFO(NCCL_INIT, "[Proxy Service] Created CUDA context on device %d", proxyState->cudaDev);
   } else if (cudaSetDevice(proxyState->cudaDev) != cudaSuccess) {
     WARN("[Proxy Service] Failed to set CUDA device %d", proxyState->cudaDev);
   }
-  // if (CPU_COUNT(&comm->cpuAffinity)) sched_setaffinity(0, sizeof(cpu_set_t), &comm->cpuAffinity);
-
-  INFO(NCCL_INIT, "[Proxy Service] Device %d CPU core %d", proxyState->cudaDev, sched_getcpu());
 
   // Prepare poll descriptor
   struct ncclProxyConnectionPool connectionPool;
@@ -1760,14 +1800,17 @@ void* ncclProxyServiceUDS(void* _args) {
   struct ncclProxyState* proxyState =  (struct ncclProxyState*) _args;
   struct pollfd pollfds[1];
 
+  // set the thread affinity before setting the cuda context
+  pthread_once(&proxyCpusetOnce,proxyCpusetOnceFunc);
+  if (CPU_COUNT(&proxyCpuset)) sched_setaffinity(0, sizeof(cpu_set_t), &proxyCpuset);
+  INFO(NCCL_INIT, "[Proxy Service UDS] Device %d CPU core %d", proxyState->cudaDev, sched_getcpu());
+
   if (setProxyThreadContext(proxyState)) {
     INFO(NCCL_INIT, "[Proxy Service UDS] Set CUDA context on device %d", proxyState->cudaDev);
   } else if (cudaSetDevice(proxyState->cudaDev) != cudaSuccess) {
     WARN("[Proxy Service UDS] Failed to set CUDA device %d", proxyState->cudaDev);
   }
 
-  INFO(NCCL_INIT, "[Proxy Service UDS] Device %d CPU core %d", proxyState->cudaDev, sched_getcpu());
-
   if (ncclIpcSocketGetFd(&proxyState->ipcSock, &pollfds[0].fd) != ncclSuccess) {
     WARN("[Proxy Service UDS] Get listenSock fd fails");
     return NULL;
@@ -1807,6 +1850,7 @@ ncclResult_t ncclProxyInit(struct ncclComm* comm, struct ncclSocket* sock, union
   comm->proxyState->listenSock = sock;
   comm->proxyState->peerAddresses = peerAddresses;
   comm->proxyState->peerAddressesUDS = peerAddressesUDS;
+  comm->proxyState->netAttr = NCCL_NET_ATTR_INIT;
 
   // UDS support
   NCCLCHECK(ncclIpcSocketInit(&comm->proxyState->ipcSock, comm->rank, peerAddressesUDS[comm->rank], comm->abortFlag));
@@ -1831,6 +1875,8 @@ ncclResult_t ncclProxyCreate(struct ncclComm* comm) {
     proxyState->dmaBufSupport = comm->dmaBufSupport;
     proxyState->ncclNet = comm->ncclNet;
     proxyState->ncclCollNet = comm->ncclCollNet;
+    proxyState->netContext = comm->netContext;
+    proxyState->collNetContext = comm->collNetContext;
     proxyState->profilerContext = comm->profilerContext;
     proxyState->directMode = comm->directMode;
     memcpy(proxyState->buffSizes, comm->buffSizes, sizeof(comm->buffSizes));
diff --git a/projects/rccl/src/ras/CMakeLists.txt b/projects/rccl/src/ras/CMakeLists.txt
new file mode 100644
index 0000000000..2c08b8f999
--- /dev/null
+++ b/projects/rccl/src/ras/CMakeLists.txt
@@ -0,0 +1,11 @@
+# RAS sources
+set(RAS_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/collectives.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/rasnet.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/peers.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/ras.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/client_support.cc
+)
+
+# Add RAS sources to parent scope
+set(RAS_SOURCES ${RAS_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/ras/ras.cc b/projects/rccl/src/ras/ras.cc
index 8ef551c64a..948e264461 100644
--- a/projects/rccl/src/ras/ras.cc
+++ b/projects/rccl/src/ras/ras.cc
@@ -4,10 +4,6 @@
  * See LICENSE.txt for license information
  ************************************************************************/
 
-// Workaround for libstdc++ trying to force public visibility of std:: symbols.  We don't want to do that in libnccl.so.
-#include <bits/c++config.h>
-#undef _GLIBCXX_VISIBILITY
-#define _GLIBCXX_VISIBILITY(V)
 #include <cstddef>
 #include <mutex>
 #include <poll.h>
@@ -76,7 +72,7 @@ static ncclResult_t rasNetSendNack(struct rasSocket* sock);
 
 static void* rasThreadMain(void*);
 
-static void rasTerminate() __attribute__((destructor));
+static void rasTerminate();
 
 NCCL_PARAM(RasTimeoutFactor, "RAS_TIMEOUT_FACTOR", 1);
 
@@ -111,6 +107,8 @@ ncclResult_t ncclRasCommInit(struct ncclComm* comm, struct rasRankInit* myRank)
       ncclSetThreadName(rasThread, "NCCL RAS");
 
       rasInitialized = true;
+
+      atexit(rasTerminate);
     }
   }
   ncclAtomicRefCountIncrement(&rasInitRefCount);
diff --git a/projects/rccl/src/register/CMakeLists.txt b/projects/rccl/src/register/CMakeLists.txt
new file mode 100644
index 0000000000..b3b35bfe63
--- /dev/null
+++ b/projects/rccl/src/register/CMakeLists.txt
@@ -0,0 +1,9 @@
+# Register sources
+set(REGISTER_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/register.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/coll_reg.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/sendrecv_reg.cc
+)
+
+# Add register sources to parent scope
+set(REGISTER_SOURCES ${REGISTER_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/register/coll_reg.cc b/projects/rccl/src/register/coll_reg.cc
index d9d9fb436c..6cbb0c75fe 100644
--- a/projects/rccl/src/register/coll_reg.cc
+++ b/projects/rccl/src/register/coll_reg.cc
@@ -1,3 +1,9 @@
+/*************************************************************************
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #include "register.h"
 #include "transport.h"
 #include "enqueue.h"
@@ -176,7 +182,8 @@ ncclResult_t ncclRegisterCollBuffers(
     // IPC buffer registration
     if (info->func == ncclFuncReduceScatter && info->algorithm != NCCL_ALGO_COLLNET_DIRECT) goto exit;
     if (info->algorithm == NCCL_ALGO_RING && ((info->func == ncclFuncAllReduce && info->sendbuff == info->recvbuff) || info->func == ncclFuncReduce)) goto exit;
-    if ((info->algorithm == NCCL_ALGO_TREE || info->algorithm == NCCL_ALGO_COLLNET_CHAIN) && info->sendbuff == info->recvbuff) goto exit;
+    if (info->algorithm == NCCL_ALGO_TREE && info->sendbuff == info->recvbuff) goto exit;
+    if (info->algorithm == NCCL_ALGO_COLLNET_CHAIN && info->sendbuff == info->recvbuff && comm->maxLocalRanks > 1) goto exit;
     if (info->func == ncclFuncAllGather && info->algorithm == NCCL_ALGO_PAT) goto exit;
 
     int peerRanks[NCCL_MAX_LOCAL_RANKS];
diff --git a/projects/rccl/src/register/register.cc b/projects/rccl/src/register/register.cc
index 59928f57e3..b118a4cc4d 100644
--- a/projects/rccl/src/register/register.cc
+++ b/projects/rccl/src/register/register.cc
@@ -14,18 +14,6 @@
 
 NCCL_PARAM(LocalRegister, "LOCAL_REGISTER", 1);
 
-static ncclResult_t regFindHandleFromSymAddr(struct ncclComm* comm, void* baseSymPtr, struct ncclReg** handle) {
-  struct ncclRegCache* cache = &comm->regCache;
-  *handle = NULL;
-  for (int slot = 0; slot < cache->population; slot++) {
-    if (baseSymPtr == cache->slots[slot]->baseSymPtr) {
-      *handle = cache->slots[slot];
-      break;
-    }
-  }
-  return ncclSuccess;
-}
-
 ncclResult_t ncclRegLocalIsValid(struct ncclReg *reg, bool *isValid) {
   if (reg && isValid) {
     if (reg->localRefs)
@@ -174,104 +162,3 @@ ncclResult_t ncclCommGraphDeregister(const ncclComm_t comm, struct ncclReg *hand
   NCCLCHECK(commDeregister(comm, true, handle));
   return ncclSuccess;
 }
-
-ncclResult_t ncclCommSymmetricRegisterInternal(struct ncclComm* comm, void* buff, size_t baseSize, size_t alignment, CUmemGenericAllocationHandle memHandle, struct ncclReg* regHandle) {
-  ncclResult_t ret = ncclSuccess;
-  void* regSymAddr = NULL;
-  ALIGN_SIZE(comm->symAllocHead, alignment);
-  NCCLCHECKGOTO(ncclIpcSymmetricMap(comm, comm->symAllocHead, baseSize, memHandle, &regSymAddr), ret, fail);
-  NCCLCHECKGOTO(ncclNvlsSymmetricMap(comm, comm->symAllocHead, baseSize, regSymAddr), ret, fail);
-  NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
-  comm->symAllocHead += baseSize;
-  regHandle->baseSymPtr = regSymAddr;
-  regHandle->symSize = baseSize;
-exit:
-  return ret;
-fail:
-  regHandle->baseSymPtr = NULL;
-  regHandle->symSize = 0;
-  goto exit;
-}
-
-NCCL_API(ncclResult_t, ncclCommWindowRegister, ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags);
-ncclResult_t ncclCommWindowRegister(ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags) {
-  ncclResult_t ret = ncclSuccess;
-  CUmemGenericAllocationHandle memHandle;
-  size_t baseSize;
-  void* baseAddr = NULL;
-  struct ncclReg* regHandle = NULL;
-  int saveDev;
-
-  *win = NULL;
-
-  CUDACHECK(cudaGetDevice(&saveDev));
-  NCCLCHECK(ncclGroupStartInternal());
-  if (!ncclParamLocalRegister() || !ncclCuMemEnable()) {
-    goto exit;
-  }
-
-  NCCLCHECKGOTO(ncclCommEnsureReady(comm), ret, fail);
-
-  CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
-  if (comm && buff && size && win) {
-    size_t alignment = 0;
-    CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr*)&baseAddr, &baseSize, (CUdeviceptr)buff), ret, fail);
-    // size and alignment check
-    if (!((uintptr_t)baseAddr % NCCL_REC_PAGE_SIZE == 0 && baseSize % NCCL_REC_PAGE_SIZE == 0 && (uintptr_t)buff + size <= (uintptr_t)baseAddr + baseSize)) {
-      WARN("buffer %p (baseAddr %p align %d) size %zu (baseSize %ld align %d) does not satisfy symmetric registration requirements", buff, baseAddr, (uintptr_t)baseAddr % NCCL_REC_PAGE_SIZE == 0, size, baseSize, baseSize % NCCL_REC_PAGE_SIZE == 0);
-      goto fail;
-    }
-    NCCLCHECKGOTO(ncclRegister(comm, baseAddr, baseSize, false, (void**)&regHandle), ret, fail);
-    NCCLCHECKGOTO(ncclCalloc(win, 1), ret, fail);
-    (*win)->handle = regHandle;
-    regHandle->winFlags = winFlags;
-    if (regHandle->baseSymPtr == NULL && comm->symmetricSupport) {
-      struct ncclSymRegTask* task;
-      CUCHECKGOTO(cuMemRetainAllocationHandle(&memHandle, baseAddr), ret, fail);
-      CUCHECKGOTO(cuMemRelease(memHandle), ret, fail);
-      alignment = baseSize >= NCCL_REC_PAGE_SIZE * 72L ? NCCL_MAX_PAGE_SIZE : NCCL_REC_PAGE_SIZE;
-      NCCLCHECKGOTO(ncclCalloc(&task, 1), ret, fail);
-      task->buff = buff;
-      task->baseSize = baseSize;
-      task->memHandle = memHandle;
-      task->regHandle = regHandle;
-      task->alignment = alignment;
-      ncclIntruQueueEnqueue(&comm->symRegTaskQueue, task);
-      ncclGroupCommJoin(comm, ncclGroupTaskTypeSymRegister);
-    }
-  }
-
-exit:
-  ncclGroupErrCheck(ret);
-  NCCLCHECK(ret = ncclGroupEndInternal());
-  cudaSetDevice(saveDev);
-  return ret;
-fail:
-  free(*win);
-  *win = NULL;
-  goto exit;
-}
-
-NCCL_API(ncclResult_t, ncclCommWindowDeregister, ncclComm_t comm, ncclWindow_t win);
-ncclResult_t ncclCommWindowDeregister(ncclComm_t comm, ncclWindow_t win) {
-  ncclResult_t ret = ncclSuccess;
-  int saveDev;
-  struct ncclReg* regHandle;
-  CUDACHECK(cudaGetDevice(&saveDev));
-  if (win == NULL) goto exit;
-  regHandle = win->handle;
-  if (regHandle && ncclParamLocalRegister() && ncclCuMemEnable()) {
-    if (regHandle->baseSymPtr) {
-      CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
-      NCCLCHECKGOTO(ncclNvlsSymmetricFree(comm, regHandle->symSize, regHandle->baseSymPtr), ret, fail);
-      NCCLCHECKGOTO(ncclIpcSymmetricFree(comm, regHandle->symSize, regHandle->baseSymPtr), ret, fail);
-    }
-    NCCLCHECKGOTO(commDeregister(comm, false, regHandle), ret, fail);
-  }
-  free(win);
-exit:
-  CUDACHECK(cudaSetDevice(saveDev));
-  return ret;
-fail:
-  goto exit;
-}
diff --git a/projects/rccl/src/register/sendrecv_reg.cc b/projects/rccl/src/register/sendrecv_reg.cc
index f82fbd7142..9114fab01d 100644
--- a/projects/rccl/src/register/sendrecv_reg.cc
+++ b/projects/rccl/src/register/sendrecv_reg.cc
@@ -1,3 +1,9 @@
+/*************************************************************************
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #include "register.h"
 #include "transport.h"
 
diff --git a/projects/rccl/src/scheduler/CMakeLists.txt b/projects/rccl/src/scheduler/CMakeLists.txt
new file mode 100644
index 0000000000..f6583bd4db
--- /dev/null
+++ b/projects/rccl/src/scheduler/CMakeLists.txt
@@ -0,0 +1,7 @@
+# Scheduler sources
+set(SCHEDULER_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/symmetric_sched.cc
+)
+
+# Add scheduler sources to parent scope
+set(SCHEDULER_SOURCES ${SCHEDULER_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/scheduler/symmetric_sched.cc b/projects/rccl/src/scheduler/symmetric_sched.cc
new file mode 100644
index 0000000000..440b6061be
--- /dev/null
+++ b/projects/rccl/src/scheduler/symmetric_sched.cc
@@ -0,0 +1,235 @@
+/*************************************************************************
+ * Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#ifndef NCCL_SYMMETRIC_SCHED_H_
+#define NCCL_SYMMETRIC_SCHED_H_
+
+#include "scheduler.h"
+
+ncclResult_t ncclMakeSymmetricTaskList(struct ncclComm* comm, struct ncclTaskColl* task, struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next>* symTaskQueue, struct ncclTaskColl** remainTasksHead) {
+  ncclResult_t ret = ncclSuccess;
+  int fnOpTySymCount = 0;
+  struct ncclTaskColl* tasksSymByFnOpTy[ncclNumFuncs * ncclNumDevRedOps * ncclNumTypes];
+  int fnOpTySymIndices[ncclNumFuncs * ncclNumDevRedOps * ncclNumTypes];
+  struct ncclKernelPlanner* planner = &comm->planner;
+  struct ncclTaskColl* remainTasksTail = nullptr;
+
+  memset(tasksSymByFnOpTy, 0, sizeof(tasksSymByFnOpTy));
+  *remainTasksHead = nullptr;
+  while (task != nullptr) {
+    int index = ((int)task->func*ncclNumDevRedOps + (int)task->opDev.op)*ncclNumTypes + (int)task->datatype;
+    struct ncclTaskColl* next = task->next;
+    NCCLCHECK(ncclDevrFindWindow(comm, task->sendbuff, &task->sendWin));
+    NCCLCHECK(ncclDevrFindWindow(comm, task->recvbuff, &task->recvWin));
+    bool symAvailable = ncclSymkAvailable(comm, task->func, task->opDev.op, task->datatype, task->count);
+
+    if (task->sendWin && task->recvWin && (task->sendWin->winFlags & task->recvWin->winFlags & NCCL_WIN_COLL_SYMMETRIC) && symAvailable) {
+      if (tasksSymByFnOpTy[index] == nullptr) fnOpTySymIndices[fnOpTySymCount++] = index;
+      task->next = tasksSymByFnOpTy[index];
+      tasksSymByFnOpTy[index] = task;
+      planner->nTasksColl--;
+    } else {
+      if (*remainTasksHead) {
+        remainTasksTail->next = task;
+        remainTasksTail = task;
+      } else {
+        *remainTasksHead = remainTasksTail = task;
+      }
+    }
+    task = next;
+  }
+  if (remainTasksTail) remainTasksTail->next = nullptr;
+
+  // make sure kernel args space can hold at least a single work
+  assert(comm->workArgsBytes >= ncclSymkDevWorkArgs::calcArgsSize(MAXCHANNELS, 1));
+
+  // Determine symmetric tasks kernels
+  for (int cursor = 0; cursor < fnOpTySymCount; cursor++) {
+    struct ncclTaskColl* task = tasksSymByFnOpTy[fnOpTySymIndices[cursor]];
+    while (task != NULL) {
+      ncclSymkKernelId kernelId = ncclSymkKernelId_Count;
+      int nChannels = MAXCHANNELS;
+      int nWarps = 0;
+      int nWorks = 0;
+      float estTimeUs = 1.e18;
+      size_t countTotal = 0, countMax = 0;
+      struct ncclTaskColl* headTask = task;
+      size_t cellCount = NCCL_SYM_KERNEL_CELL_SIZE / ncclTypeSize(headTask->datatype);
+      // For now we assume higher kernel id means a kernel for larger data size
+      while (task != nullptr) {
+        size_t count;
+        nWorks++;
+        count = alignUp(task->count, cellCount);
+        countTotal += count;
+        if (count > countMax) countMax = count;
+        if (ncclSymkDevWorkArgs::calcArgsSize(MAXCHANNELS, nWorks + 1) > comm->workArgsBytes || task->next == nullptr) {
+          task->isSymLast = 1;
+          break;
+        }
+        task = task->next;
+      }
+      NCCLCHECK(ncclSymkPickKernel(comm, headTask->func, headTask->opDev.op, headTask->datatype,
+                                   countTotal, countMax, nWorks,
+                                   &estTimeUs, &kernelId, &nChannels, &nWarps));
+      if (kernelId == ncclSymkKernelId_Count) {
+        char const* name = ncclGetEnv("NCCL_SYM_KERNEL");
+        WARN("Error: no symmetric kernel available for function %s.%s%s",
+             ncclFuncToString(headTask->func), (name ? " NCCL_SYM_KERNEL was set to " : ""), (name ? name: ""));
+        ret = (name ? ncclInvalidUsage : ncclInternalError);
+        goto fail;
+      }
+      // set all symmetric tasks to the same kernel
+      task = headTask;
+      while (task != nullptr) {
+        struct ncclTaskColl* next = task->next;
+        int isSymLast = task->isSymLast;
+        task->devFuncId = (uint32_t)kernelId;
+        task->nMaxChannels = nChannels;
+        task->nWarps = nWarps;
+        ncclIntruQueueEnqueue(&planner->collSymTaskQueue, task);
+        task = next;
+        if (isSymLast) break;
+      }
+    }
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+ncclResult_t ncclSymmetricTaskScheduler(struct ncclComm* comm, struct ncclIntruQueue<struct ncclTaskColl, &ncclTaskColl::next>* symTaskQueue, struct ncclKernelPlan* plan) {
+  struct ncclTaskColl* headTask = ncclIntruQueueHead(symTaskQueue);
+  int devFuncId = headTask->devFuncId;
+  struct ncclTaskColl* task = NULL;
+  ssize_t totalCount = 0;  // element count, each task aligned up to cellCount
+  ssize_t logCount = 0;
+  ssize_t remainCell = 0;
+  ssize_t cellPerChannel = 0;
+  int workCount = 0, workIndex = 0;
+  size_t cellCount = NCCL_SYM_KERNEL_CELL_SIZE / ncclTypeSize(headTask->datatype); // minimal cell size
+  ncclResult_t ret = ncclSuccess;
+  int curChannel = 0;
+  int curChannelWork = 0;
+  int nMaxChannels = headTask->nMaxChannels;
+  struct ncclSymkDevWork* workBufPtr = NULL;
+  struct ncclSymkChannelWorkRange* workRangePtr = NULL;
+  const char* funcName = ncclFuncToString(headTask->func);
+  const char* kernelName = ncclSymkKernelIdToString(headTask->devFuncId);
+  struct ncclSymkDevWorkArgs* argsBuf = NULL;
+
+  plan->isSymColl = true;
+  plan->threadPerBlock = headTask->nWarps * WARP_SIZE;
+  plan->hasProxyOps = false;
+  plan->kernelFn = ncclSymkGetKernelPtr((ncclSymkKernelId)headTask->devFuncId, headTask->opDev.op, headTask->datatype);
+  task = headTask;
+  while (task != nullptr && task->devFuncId == devFuncId) {
+    workCount++;
+    totalCount += alignUp(task->count, cellCount);
+    logCount += task->count;
+    if (task->isSymLast == 1) break;
+    task = task->next;
+  }
+
+  plan->kernelArgsSize = ncclSymkDevWorkArgs::calcArgsSize(nMaxChannels, workCount);
+  argsBuf = (struct ncclSymkDevWorkArgs*)calloc(1, plan->kernelArgsSize);
+
+  remainCell = cellPerChannel = DIVUP(DIVUP(totalCount, nMaxChannels), cellCount);
+  workRangePtr = argsBuf->getWorkRange();
+  workBufPtr = argsBuf->getWorks(nMaxChannels);
+  argsBuf->nMaxChannels = nMaxChannels;
+
+  while (!ncclIntruQueueEmpty(symTaskQueue)) {
+    struct ncclSymkDevWork devWork = {};
+    size_t cellLeft = 0, taskCell = 0;
+    uint8_t isSymLast = 0;
+
+    if (ncclIntruQueueHead(symTaskQueue)->devFuncId != devFuncId) break; // scheduling is done
+
+    task = ncclIntruQueueDequeue(symTaskQueue);
+    isSymLast = task->isSymLast;
+
+    NCCLCHECKGOTO(ncclSymkMakeDevWork(comm, task, &devWork), ret, fail);
+
+    cellLeft = taskCell = DIVUP(task->count, cellCount);
+    for (;curChannel < nMaxChannels;) {
+      workRangePtr[curChannel].workHi = workIndex;
+      if (curChannelWork == 0) {
+        if (devWork.nChannels == 0) {
+          devWork.sChannelId = curChannel;
+          devWork.nChannels = 1;
+        } else if (cellLeft <= remainCell) {
+          // the last segment of the task
+          assert(devWork.nChannels > 0);
+          // if the cells left over after this segment total at most 1024 bytes, or no tasks remain, we can fuse the last channel
+          if ((remainCell - cellLeft) * NCCL_SYM_KERNEL_CELL_SIZE <= (1 << 10) || ncclIntruQueueEmpty(symTaskQueue)) devWork.nChannels++;
+        } else {
+          // middle segment of the task
+          devWork.nChannels++;
+        }
+      } else {
+        assert(cellLeft == taskCell);
+        if (taskCell <= remainCell) {
+          // the first segment of the task is fully scheduled onto the channel
+          devWork.sChannelId = curChannel;
+          devWork.nChannels = 1;
+        }
+      }
+      if (cellLeft < remainCell) {
+        workRangePtr[curChannel].fracHi = uint16_t(0x10000UL - 1);
+        remainCell -= cellLeft;
+        curChannelWork++;
+        break;
+      } else if (cellLeft == remainCell) {
+        workRangePtr[curChannel].fracHi = uint16_t(0x10000UL - 1);
+        remainCell = cellPerChannel;
+        curChannel++;
+        curChannelWork = 0;
+        break;
+      } else {
+        // cellLeft > remainCell; the task is partially scheduled onto the channel
+        cellLeft -= remainCell;
+        workRangePtr[curChannel].fracHi = uint16_t(DIVUP(0x10000L * (taskCell - cellLeft), taskCell) - 1);
+        remainCell = cellPerChannel;
+        curChannel++;
+        curChannelWork = 0;
+      }
+    }
+    memcpy(workBufPtr + workIndex, &devWork, sizeof(struct ncclSymkDevWork));
+    workIndex++;
+
+    // Profiler
+    plan->groupApiEventHandle = task->groupApiEventHandle;
+
+    ncclMemoryPoolFree<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, task);
+    if (isSymLast == 1) break;
+    if (curChannel == nMaxChannels) {
+      WARN("ncclSymmetricTaskScheduler ran out of channel space (nMaxChannels=%d, workCount=%d, workIndex=%d)", nMaxChannels, workCount, workIndex);
+      ret = ncclInternalError;
+      goto fail;
+    }
+  }
+  if (remainCell < cellPerChannel) curChannel++;
+
+  memcpy(&argsBuf->kcomm, &comm->symkState.kcomm, sizeof(comm->symkState.kcomm));
+  plan->workBytes = totalCount * ncclTypeSize(headTask->datatype);
+  plan->channelMask = uint64_t(-1) >> (64 - curChannel);
+  plan->kernelSymArgs = (void*)argsBuf;
+  plan->workStorageType = ncclDevWorkStorageTypeArgs;
+
+  if (comm->rank == 0) {
+    INFO(NCCL_TUNING, "%s [Symmetric]: %ld Bytes -> Kernel %s nchannels %d nthreads %d nWorks %d", funcName,
+         logCount * ncclTypeSize(headTask->datatype), kernelName, curChannel, plan->threadPerBlock, workCount);
+  }
+
+exit:
+  return ret;
+fail:
+  goto exit;
+}
+
+#endif // NCCL_SYMMETRIC_SCHED_H_
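Reviewer note: the channel-splitting loop in ncclSymmetricTaskScheduler above can be hard to follow from the diff alone. The sketch below is a simplified, standalone model of it, not the code in this patch. `ChannelRange` and `partitionCells` are hypothetical names, task sizes are assumed to be already expressed in cells, and the `nChannels`/`sChannelId` bookkeeping on each work item is left out. It shows only how each channel records the last work item it touches (`workHi`) and a 16-bit fixed-point fraction of that item completed by the end of the channel (`fracHi`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for ncclSymkChannelWorkRange.
struct ChannelRange {
  int workHi;       // index of the last work item this channel touches
  uint16_t fracHi;  // fraction (x/65536, minus 1) of that item done when the channel ends
};

static size_t divUp(size_t a, size_t b) { return (a + b - 1) / b; }

// Assumption: every task size is already given in cells.
std::vector<ChannelRange> partitionCells(const std::vector<size_t>& taskCells, int nChannels) {
  assert(nChannels > 0);
  size_t total = 0;
  for (size_t c : taskCells) total += c;
  const size_t cellPerChannel = divUp(total, (size_t)nChannels);

  std::vector<ChannelRange> ranges((size_t)nChannels, ChannelRange{0, 0});
  int ch = 0;
  size_t remain = cellPerChannel;  // cells still free in the current channel

  for (int w = 0; w < (int)taskCells.size(); w++) {
    const size_t taskCell = taskCells[w];
    size_t left = taskCell;  // cells of this task not yet placed
    while (ch < nChannels) {
      ranges[ch].workHi = w;
      if (left < remain) {
        // Task ends inside this channel; the channel stays open for the next task.
        ranges[ch].fracHi = 0xFFFF;
        remain -= left;
        break;
      } else if (left == remain) {
        // Task ends exactly at the channel boundary.
        ranges[ch].fracHi = 0xFFFF;
        remain = cellPerChannel;
        ch++;
        break;
      } else {
        // Task spills over; record how much of it is done at this boundary.
        left -= remain;
        ranges[ch].fracHi = uint16_t(divUp(size_t(0x10000) * (taskCell - left), taskCell) - 1);
        remain = cellPerChannel;
        ch++;
      }
    }
  }
  // Count a partially filled trailing channel as used.
  const size_t used = (size_t)ch + ((ch < nChannels && remain < cellPerChannel) ? 1 : 0);
  ranges.resize(used);
  return ranges;
}
```

For two tasks of 10 and 6 cells over 4 channels, each channel receives 4 cells. Channel 2 finishes task 0 and then starts task 1, so its `workHi` ends up as 1 with a partial `fracHi`.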
diff --git a/projects/rccl/src/symmetric.cc b/projects/rccl/src/sym_kernels.cc
similarity index 52%
rename from projects/rccl/src/symmetric.cc
rename to projects/rccl/src/sym_kernels.cc
index f5b1e6c22d..df4965d569 100644
--- a/projects/rccl/src/symmetric.cc
+++ b/projects/rccl/src/sym_kernels.cc
@@ -1,14 +1,22 @@
-#include "symmetric.h"
+/*************************************************************************
+ * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
+#include "sym_kernels.h"
 #include "comm.h"
 #include "device.h"
+#include "transport.h"
 #include <cmath>
 
 constexpr char const* kernelName[] = {
-  // Must align with enum ncclSymKernelId definition in src/include/symmetric.h
+  // Must align with enum ncclSymkKernelId definition in src/include/sym_kernels.h
   "AllReduce_AGxLL_R",
   "AllReduce_AGxLLMC_R",
   "AllReduce_RSxLD_AGxST",
   "AllReduce_RSxLDMC_AGxSTMC",
+  "AllReduce_RSxNet_ARxMC_AGxNet",
   "AllGather_LL",
   "AllGather_LLMC",
   "AllGather_ST",
@@ -18,34 +26,34 @@ constexpr char const* kernelName[] = {
   "ReduceScatter_LDMC"
 };
 
-constexpr uint32_t kernelMask_STMC = 1<<ncclSymKernelId_AllGather_LLMC |
-                                     1<<ncclSymKernelId_AllGather_STMC |
-                                     1<<ncclSymKernelId_AllReduce_AGxLLMC_R |
-                                     1<<ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC |
-                                     1<<ncclSymKernelId_ReduceScatter_LDMC;
+constexpr uint32_t kernelMask_STMC = 1<<ncclSymkKernelId_AllGather_LLMC |
+                                     1<<ncclSymkKernelId_AllGather_STMC |
+                                     1<<ncclSymkKernelId_AllReduce_AGxLLMC_R |
+                                     1<<ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC |
+                                     1<<ncclSymkKernelId_ReduceScatter_LDMC;
 
-constexpr uint32_t kernelMask_LDMC = 1<<ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC |
-                                     1<<ncclSymKernelId_ReduceScatter_LDMC;
+constexpr uint32_t kernelMask_LDMC = 1<<ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC |
+                                     1<<ncclSymkKernelId_ReduceScatter_LDMC;
 
-constexpr uint32_t kernelMask_LL = 1<<ncclSymKernelId_AllReduce_AGxLL_R |
-                                   1<<ncclSymKernelId_AllReduce_AGxLLMC_R |
-                                   1<<ncclSymKernelId_AllGather_LL |
-                                   1<<ncclSymKernelId_AllGather_LLMC |
-                                   1<<ncclSymKernelId_ReduceScatter_LL;
+constexpr uint32_t kernelMask_LL = 1<<ncclSymkKernelId_AllReduce_AGxLL_R |
+                                   1<<ncclSymkKernelId_AllReduce_AGxLLMC_R |
+                                   1<<ncclSymkKernelId_AllGather_LL |
+                                   1<<ncclSymkKernelId_AllGather_LLMC |
+                                   1<<ncclSymkKernelId_ReduceScatter_LL;
 
-constexpr uint32_t kernelMask_AG = 1<<ncclSymKernelId_AllGather_LL |
-                                   1<<ncclSymKernelId_AllGather_LLMC |
-                                   1<<ncclSymKernelId_AllGather_ST |
-                                   1<<ncclSymKernelId_AllGather_STMC;
+constexpr uint32_t kernelMask_AG = 1<<ncclSymkKernelId_AllGather_LL |
+                                   1<<ncclSymkKernelId_AllGather_LLMC |
+                                   1<<ncclSymkKernelId_AllGather_ST |
+                                   1<<ncclSymkKernelId_AllGather_STMC;
 
-constexpr uint32_t kernelMask_AR = 1<<ncclSymKernelId_AllReduce_AGxLLMC_R |
-                                   1<<ncclSymKernelId_AllReduce_AGxLL_R |
-                                   1<<ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC |
-                                   1<<ncclSymKernelId_AllReduce_RSxLD_AGxST;
+constexpr uint32_t kernelMask_AR = 1<<ncclSymkKernelId_AllReduce_AGxLLMC_R |
+                                   1<<ncclSymkKernelId_AllReduce_AGxLL_R |
+                                   1<<ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC |
+                                   1<<ncclSymkKernelId_AllReduce_RSxLD_AGxST;
 
-constexpr uint32_t kernelMask_RS = 1<<ncclSymKernelId_ReduceScatter_LD |
-                                   1<<ncclSymKernelId_ReduceScatter_LDMC |
-                                   1<<ncclSymKernelId_ReduceScatter_LL;
+constexpr uint32_t kernelMask_RS = 1<<ncclSymkKernelId_ReduceScatter_LD |
+                                   1<<ncclSymkKernelId_ReduceScatter_LDMC |
+                                   1<<ncclSymkKernelId_ReduceScatter_LL;
 
 static uint32_t kernelMask_coll(ncclFunc_t coll) {
   switch (coll) {
@@ -64,11 +72,11 @@ static uint32_t kernelMask_user() {
     // the parseList() used by NCCL_ALGO/PROTO.
     char const* name = ncclGetEnv("NCCL_SYM_KERNEL");
     if (name == nullptr || strcmp(name, "^") == 0) {
-      static_assert((int)ncclSymKernelId_Count < 32, "Use more than 32 bits");
-      got = (1<<(int)ncclSymKernelId_Count)-1;
+      static_assert((int)ncclSymkKernelId_Count < 32, "Use more than 32 bits");
+      got = (1<<(int)ncclSymkKernelId_Count)-1;
     } else {
       got = 0;
-      for (int k=0; k < (int)ncclSymKernelId_Count; k++) {
+      for (int k=0; k < (int)ncclSymkKernelId_Count; k++) {
         if (strcmp(kernelName[k], name) == 0) {
           __atomic_store_n(&cache, 1<<k, __ATOMIC_RELAXED);
           got = 1<<k;
@@ -102,11 +110,11 @@ static double model(double busBytes, double baseLat, int nSMs, double smBw, doub
 // Given the kernel and bytes, return the minimum number of blocks to run on such that
 // perf is 99% of running at max blocks, and return the estimate runtime for that
 // block count.
-static void queryModel(struct ncclComm* comm, ncclSymKernelId k, size_t nBytes, float* timeUs, int* nBlocks) {
+static void queryModel(struct ncclComm* comm, ncclSymkKernelId k, size_t nBytes, float* timeUs, int* nBlocks) {
   constexpr double LL_BusFactor = 9; // 2X the bytes, plus some processing, plus no unrolling
 
   int nRanks = comm->nRanks;
-  int nMaxBlocks = ncclSymMaxBlocks;
+  int nMaxBlocks = ncclSymkMaxBlocks;
   int nMaxBlocksNvls = divUp((comm->cudaArch < 1000 ? 16 : 32), nRanks);
   size_t busBytes; // max(bytes sent, bytes received)
   double busMultiplier = 1;
@@ -116,45 +124,45 @@ static void queryModel(struct ncclComm* comm, ncclSymKernelId k, size_t nBytes,
     busBytes = size_t(1)<<50;
     break;
 
-  case ncclSymKernelId_AllReduce_AGxLL_R:
+  case ncclSymkKernelId_AllReduce_AGxLL_R:
     busBytes = nRanks*nBytes*LL_BusFactor;
     break;
-  case ncclSymKernelId_AllReduce_AGxLLMC_R:
+  case ncclSymkKernelId_AllReduce_AGxLLMC_R:
     busBytes = nRanks*nBytes*LL_BusFactor;
     busMultiplier = 1.1; // To beat non-MC LL
     break;
-  case ncclSymKernelId_AllReduce_RSxLD_AGxST:
+  case ncclSymkKernelId_AllReduce_RSxLD_AGxST:
     busBytes = 2*nBytes*(nRanks-1)/nRanks;
     break;
-  case ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC:
+  case ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC:
     busBytes = nBytes/nRanks + nBytes;
     busMultiplier = nRanks;
     nMaxBlocks = nMaxBlocksNvls;
     break;
 
-  case ncclSymKernelId_AllGather_LL:
+  case ncclSymkKernelId_AllGather_LL:
     busBytes = nRanks*nBytes*LL_BusFactor;
     break;
-  case ncclSymKernelId_AllGather_LLMC:
+  case ncclSymkKernelId_AllGather_LLMC:
     busBytes = nRanks*nBytes*LL_BusFactor;
     busMultiplier = 1.1; // To beat non-MC LL
     break;
-  case ncclSymKernelId_AllGather_ST:
+  case ncclSymkKernelId_AllGather_ST:
     busBytes = (nRanks-1)*nBytes;
     break;
-  case ncclSymKernelId_AllGather_STMC:
+  case ncclSymkKernelId_AllGather_STMC:
     busBytes = (nRanks-1)*nBytes; // Wrong. Should be nRanks*nBytes but we want to beat non-MC.
     busMultiplier = 0.55*nRanks;
     nMaxBlocks = nMaxBlocksNvls;
     break;
 
-  case ncclSymKernelId_ReduceScatter_LL:
+  case ncclSymkKernelId_ReduceScatter_LL:
     busBytes = nRanks*nBytes*LL_BusFactor;
     break;
-  case ncclSymKernelId_ReduceScatter_LD:
+  case ncclSymkKernelId_ReduceScatter_LD:
     busBytes = (nRanks-1)*nBytes;
     break;
-  case ncclSymKernelId_ReduceScatter_LDMC:
+  case ncclSymkKernelId_ReduceScatter_LDMC:
     busBytes = (nRanks-1)*nBytes; // Wrong. Should be nRanks*nBytes but we want to beat non-MC.
     busMultiplier = 0.55*nRanks;
     nMaxBlocks = nMaxBlocksNvls;
@@ -164,7 +172,7 @@ static void queryModel(struct ncclComm* comm, ncclSymKernelId k, size_t nBytes,
   nMaxBlocks = std::min<int>(nMaxBlocks, comm->config.maxCTAs);
   int nMinBlocks = comm->config.minCTAs;
 
-  int nUserCTAs = std::min<int>(ncclSymMaxBlocks, ncclParamSymCTAs());
+  int nUserCTAs = std::min<int>(ncclSymkMaxBlocks, ncclParamSymCTAs());
   if (nUserCTAs > 0) nMinBlocks = nMaxBlocks = nUserCTAs;
 
   bool isLL = kernelMask_LL>>k & 1;
@@ -175,11 +183,11 @@ static void queryModel(struct ncclComm* comm, ncclSymKernelId k, size_t nBytes,
   if (comm->cudaArch < 1000) {
     baseLat = isLL ? 4.5 : 7.8;
     smBw = isAR ? 65*GBps : 44*GBps;
-    peakBw = k == ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC ? 480*GBps : 320*GBps;
+    peakBw = k == ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC ? 480*GBps : 320*GBps;
   } else {
     baseLat = isLL ? (isAG ? 8.5 : 11) : (isAR ? 19.5 : 13.0);
     smBw = 55*GBps;
-    peakBw = k == ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC ? 1000*GBps : 600*GBps;
+    peakBw = k == ncclSymkKernelId_AllReduce_RSxLDMC_AGxSTMC ? 1000*GBps : 600*GBps;
   }
   *nBlocks = nMaxBlocks;
   *timeUs = model(busBytes, baseLat, nMaxBlocks, smBw, busMultiplier, peakBw);
@@ -194,7 +202,36 @@ static void queryModel(struct ncclComm* comm, ncclSymKernelId k, size_t nBytes,
   }
 }
 
-bool ncclSymImplemented(ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty) {
+ncclResult_t ncclSymkInitOnce(struct ncclComm* comm) {
+  struct ncclSymkState* symk = &comm->symkState;
+  if (!symk->initialized) {
+    symk->initialized = true;
+    struct ncclDevCommRequirements reqs = {};
+    reqs.lsaMultimem = comm->nvlsSupport;
+    reqs.lsaBarrierCount = ncclSymkMaxBlocks;
+
+    struct ncclDevResourceRequirements lla2aReq;
+    ncclLLA2ACreateRequirement(
+      ncclSymkMaxBlocks, ncclLLA2ACalcSlots(ncclTeamLsa(comm).nRanks*ncclSymkMaxThreads, ncclSymkLLMaxEltSize),
+      &symk->kcomm.lsaLLA2A, &lla2aReq
+    );
+    lla2aReq.next = reqs.resourceRequirementsList;
+    reqs.resourceRequirementsList = &lla2aReq;
+
+    NCCLCHECK(ncclDevrCommCreateInternal(comm, &reqs, &symk->kcomm.devComm));
+  }
+  return ncclSuccess;
+}
+
+ncclResult_t ncclSymkFinalize(struct ncclComm* comm) {
+  struct ncclSymkState* symk = &comm->symkState;
+  if (symk->initialized) {
+    NCCLCHECK(ncclDevCommDestroy(comm, &symk->kcomm.devComm));
+  }
+  return ncclSuccess;
+}
+
+static bool ncclSymkImplemented(ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty) {
   bool isFloat;
   switch (ty) {
   case ncclFloat64:
@@ -221,10 +258,7 @@ bool ncclSymImplemented(ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType
   }
 }
 
-ncclResult_t ncclSymPickKernel(
-    struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts,
-    float* estTimeUs, ncclSymKernelId* kernelId, int* nBlocks, int* nWarps
-  ) {
+static uint32_t ncclSymkMask(struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts) {
   uint32_t kmask = kernelMask_coll(coll);
   kmask &= kernelMask_user();
 
@@ -263,14 +297,37 @@ ncclResult_t ncclSymPickKernel(
   // to be at least 32 bytes per chunk)
   if (nBusBytes >= 32*(size_t(2)<<30)) kmask = 0;
 
-  ncclSymKernelId bestKernel = ncclSymKernelId_Count;
+  return kmask;
+}
+
+bool ncclSymkAvailable(struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red,
+                       ncclDataType_t ty, size_t nElts) {
+  if (!ncclSymkImplemented(coll, red, ty))
+    return false;
+
+  return (ncclSymkMask(comm, coll, red, ty, nElts) != 0);
+}
+
+ncclResult_t ncclSymkPickKernel(
+    struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty,
+    size_t nEltsTotal, size_t nEltsMax, int nWorks,
+    float* estTimeUs, ncclSymkKernelId* kernelId, int* nBlocks, int* nWarps
+  ) {
+  uint32_t kmask = ncclSymkMask(comm, coll, red, ty, nEltsMax);
+
+  // We currently don't support grouping for LL kernels.
+  if (nWorks > 1)
+    kmask &= ~kernelMask_LL;
+
+  ncclSymkKernelId bestKernel = ncclSymkKernelId_Count;
   float bestTime = 1.e30f;
   int bestBlocks = 999;
+  size_t nBytes = nEltsTotal*ncclTypeSize(ty);
 
   constexpr float smPenalty = .025f; // 2.5% percent increase in time per SM
   uint32_t kmaskRemain = kmask;
   while (kmaskRemain != 0) {
-    ncclSymKernelId k = (ncclSymKernelId)popFirstOneBit(&kmaskRemain);
+    ncclSymkKernelId k = (ncclSymkKernelId)popFirstOneBit(&kmaskRemain);
     float kTime;
     int kBlocks;
     queryModel(comm, k, nBytes, &kTime, &kBlocks);
@@ -282,15 +339,29 @@ ncclResult_t ncclSymPickKernel(
   }
 
   *kernelId = bestKernel;
-  *estTimeUs = kmask==0 || kernelMask_user() == (1<<ncclSymKernelId_Count)-1 ? bestTime : 0.0f;
+  *estTimeUs = kmask==0 || kernelMask_user() == (1<<ncclSymkKernelId_Count)-1 ? bestTime : 0.0f;
   *nBlocks = bestBlocks;
   *nWarps = 16;
   return ncclSuccess;
 }
 
-const char* ncclSymKernelIdToString(int kernelId) {
-  if (kernelId < 0 || kernelId >= ncclSymKernelId_Count) {
+const char* ncclSymkKernelIdToString(int kernelId) {
+  if (kernelId < 0 || kernelId >= ncclSymkKernelId_Count) {
     return "Unknown";
   }
   return kernelName[kernelId];
 }
+
+/* Fills in all fields of outDevWork except nextWorkOffset. */
+ncclResult_t ncclSymkMakeDevWork(struct ncclComm* comm, struct ncclTaskColl* task, struct ncclSymkDevWork* outDevWork) {
+  outDevWork->rootRank = task->root;
+  outDevWork->redOpArg = task->opDev.scalarArg;
+  outDevWork->nElts = task->count;
+  outDevWork->inputWin = task->sendWin->vidmem;
+  outDevWork->inputOff = (uint8_t*)task->sendbuff - (uint8_t*)task->sendWin->userPtr;
+  outDevWork->outputWin = task->recvWin->vidmem;
+  outDevWork->outputOff = (uint8_t*)task->recvbuff - (uint8_t*)task->recvWin->userPtr;
+  outDevWork->sChannelId = 0xffff;
+  outDevWork->nChannels = 0;
+  return ncclSuccess;
+}
diff --git a/projects/rccl/src/transport/CMakeLists.txt b/projects/rccl/src/transport/CMakeLists.txt
new file mode 100644
index 0000000000..0485008c06
--- /dev/null
+++ b/projects/rccl/src/transport/CMakeLists.txt
@@ -0,0 +1,15 @@
+# Transport sources
+set(TRANSPORT_SOURCES
+    ${CMAKE_CURRENT_SOURCE_DIR}/nvls.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/profiler.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_socket.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/p2p.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/net_ib.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/coll_net.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/shm.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/generic.cc
+)
+
+# Add transport sources to parent scope
+set(TRANSPORT_SOURCES ${TRANSPORT_SOURCES} PARENT_SCOPE)
diff --git a/projects/rccl/src/transport/coll_net.cc b/projects/rccl/src/transport/coll_net.cc
index 386865e21c..6cf6b18c72 100644
--- a/projects/rccl/src/transport/coll_net.cc
+++ b/projects/rccl/src/transport/coll_net.cc
@@ -355,7 +355,7 @@ static ncclResult_t sharedListen(struct ncclProxyState* proxyState, int netDev,
     collNet->resources = resources;
   }
   if (resources->collNetComms[netDev] == NULL)
-    NCCLCHECK(proxyState->ncclCollNet->listen(netDev, collNetHandle, resources->collNetListenComms + netDev));
+    NCCLCHECK(proxyState->ncclCollNet->listen(proxyState->collNetContext, netDev, collNetHandle, resources->collNetListenComms + netDev));
   return ncclSuccess;
 }
 
@@ -1223,14 +1223,19 @@ ncclResult_t ncclCollnetLocalRegisterBuffer(struct ncclComm* comm, const void* u
   ncclResult_t ret = ncclSuccess;
   struct ncclReg *regRecord = NULL;
   bool isValid = false;
+  void *base = NULL;
+  size_t baseSize = 0;
 
   *outRegBufFlag = 0;
   *outHandle = NULL;
   if (comm && userbuff && buffSize > 0) {
     NCCLCHECKGOTO(ncclRegFind(comm, userbuff, buffSize, &regRecord), ret, fail);
     NCCLCHECKGOTO(ncclRegLocalIsValid(regRecord, &isValid), ret, fail);
-    if (isValid)
-      NCCLCHECKGOTO(collnetRegisterBuffer(comm, userbuff, buffSize, type, regRecord, outRegBufFlag, outHandle), ret, fail);
+    if (isValid) {
+      CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&base, &baseSize, (CUdeviceptr)userbuff), ret, fail);
+      if ((uint64_t)base + baseSize < (uint64_t)userbuff + buffSize) goto exit;
+    }
+    NCCLCHECKGOTO(collnetRegisterBuffer(comm, userbuff, buffSize, type, regRecord, outRegBufFlag, outHandle), ret, fail);
   }
 exit:
   return ret;
@@ -1256,13 +1261,14 @@ ncclResult_t ncclCollnetGraphRegisterBuffer(struct ncclComm* comm, const void* u
   ncclResult_t ret = ncclSuccess;
   struct ncclCollnetCleanupCallback* record = NULL;
   struct ncclReg *regRecord = NULL;
-  void *baseSend = NULL;
-  size_t baseSendSize = 0;
+  void *base = NULL;
+  size_t baseSize = 0;
 
   *outRegBufFlag = 0;
   if (comm && userbuff && buffSize > 0) {
-    CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&baseSend, &baseSendSize, (CUdeviceptr)userbuff), ret, fail);
-    NCCLCHECKGOTO(ncclCommGraphRegister(comm, baseSend, baseSendSize, (void**)&regRecord), ret, fail);
+    CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&base, &baseSize, (CUdeviceptr)userbuff), ret, fail);
+    if ((uint64_t)base + baseSize < (uint64_t)userbuff + buffSize) goto exit;
+    NCCLCHECKGOTO(ncclCommGraphRegister(comm, base, baseSize, (void**)&regRecord), ret, fail);
     NCCLCHECKGOTO(collnetRegisterBuffer(comm, userbuff, buffSize, type, regRecord, outRegBufFlag, outHandle), ret, fail);
 
     if (*outRegBufFlag) {
@@ -1473,11 +1479,7 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
   ncclResult_t ret = ncclSuccess;
   int rank = comm->rank;
   int collNetSetupFail = 0;
-  // Find all head ranks
-  int nHeadsUnique = 0;
-  int* headsUnique = NULL;
   bool share;
-  struct ncclTopoGraph* directGraph = graphs[NCCL_ALGO_COLLNET_DIRECT];
 
   struct collnetShareInfo {
     int headPosition;
@@ -1485,20 +1487,30 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
   };
   struct collnetShareInfo* infos = NULL;
 
-  NCCLCHECKGOTO(ncclCalloc(&headsUnique, directGraph->nChannels), ret, fail);
-  { uint64_t mask = 0;
+  struct ncclTopoGraph* collNetGraph;
+
+  if (!comm->nvlsSupport) {
+    collNetGraph = graphs[NCCL_ALGO_COLLNET_DIRECT];
+    NCCLCHECKGOTO(ncclCalloc(&comm->collNetHeads, collNetGraph->nChannels), ret, fail);
+    uint64_t mask = 0;
     // Head GPU index is always 0
-    for (int c = 0; c < directGraph->nChannels; c++) {
-      int head = directGraph->intra[c * comm->localRanks + 0];
+    for (int c = 0; c < collNetGraph->nChannels; c++) {
+      int head = collNetGraph->intra[c * comm->localRanks + 0];
       assert(comm->rankToNode[head] == comm->node);
       uint64_t mask0 = mask;
       mask |= 1ull<<comm->rankToLocalRank[head];
-      if (mask != mask0) headsUnique[nHeadsUnique++] = head;
+      if (mask != mask0) comm->collNetHeads[comm->collNetHeadsNum++] = head;
     }
+  } else {
+    // Use the NVLS graph to get the head ranks for collnet setup. comm->nvlsHeads already has unique heads.
+    // nHeads is the same on all channels; see the connectNvls function.
+    collNetGraph = graphs[NCCL_ALGO_NVLS];
+    NCCLCHECKGOTO(ncclCalloc(&comm->collNetHeads, collNetGraph->nChannels), ret, fail);
+    comm->collNetHeadsNum = comm->channels[0].nvls.nHeads;
+    // Copy comm->nvlsHeads into comm->collNetHeads (instead of aliasing) since the two arrays are freed in different places.
+    memcpy(comm->collNetHeads, comm->nvlsHeads, comm->collNetHeadsNum * sizeof(int));
   }
 
-  comm->collNetHeads = headsUnique;
-  comm->collNetHeadsNum = nHeadsUnique;
   if (parent && parent->config.collnetEnable && parent->nNodes == comm->nNodes) {
     if (!parent->shareResources) {
       collNetSetupFail = 1;
@@ -1508,7 +1520,7 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
     /* check whether child can share collnet resources of parent. Since parent builds each collnet communicator
      * based on heads with the same head position in each node, as long as the collnet heads of child comm
      * can match parent's heads, we can let child communicator share parent's collnet resources. */
-    for (int h = 0; h < nHeadsUnique; ++h) {
+    for (int h = 0; h < comm->collNetHeadsNum; ++h) {
       int prev = INT_MIN;
       struct collnetShareInfo* myinfo;
 
@@ -1516,7 +1528,7 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
       myinfo = infos + comm->rank;
       memset(myinfo, 0, sizeof(struct collnetShareInfo));
       /* find the child head position in parent collnet heads. */
-      if (headsUnique[h] == comm->rank) {
+      if (comm->collNetHeads[h] == comm->rank) {
         myinfo->headPosition = -1;
         myinfo->isMaster = 1;
         for (int th = 0; th < parent->collNetHeadsNum; ++th)
@@ -1567,11 +1579,11 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
     for (int c = 0; c < comm->nChannels; c++) {
       struct ncclChannel* channel = comm->channels + c;
       NCCLCHECKGOTO(initCollnetChannel(comm, c, parent, false), ret, fail);
-      for (int h = 0; h < nHeadsUnique; h++) {
-        const int head = headsUnique[h];
+      for (int h = 0; h < comm->collNetHeadsNum; h++) {
+        const int head = comm->collNetHeads[h];
         ncclConnect connect;
-        collNetSetupFail |= ncclTransportCollNetSetup(comm, directGraph, channel, head, head, h, collNetRecv, &connect);
-        if (!collNetSetupFail) collNetSetupFail |= ncclTransportCollNetSetup(comm, directGraph, channel, head, head, h, collNetSend, &connect);
+        collNetSetupFail |= ncclTransportCollNetSetup(comm, collNetGraph, channel, head, head, h, collNetRecv, &connect);
+        if (!collNetSetupFail) collNetSetupFail |= ncclTransportCollNetSetup(comm, collNetGraph, channel, head, head, h, collNetSend, &connect);
       }
       // Verify CollNet setup across ranks after trying the first channel
       if (c == 0) {
@@ -1592,7 +1604,7 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
       bool isHead = false;
       matrix = nullptr;
       NCCLCHECKGOTO(ncclCalloc(&matrix, comm->nRanks), ret, matrix_end);
-      for (int h = 0; h < nHeadsUnique; h++) isHead |= (headsUnique[h] == comm->rank);
+      for (int h = 0; h < comm->collNetHeadsNum; h++) isHead |= (comm->collNetHeads[h] == comm->rank);
       if (isHead) {
         for (int ty=0; ty < ncclNumTypes; ty++) {
           for (int op=0; op < 4; op++) {
diff --git a/projects/rccl/src/transport/generic.cc b/projects/rccl/src/transport/generic.cc
index 47b023667d..a42418773f 100644
--- a/projects/rccl/src/transport/generic.cc
+++ b/projects/rccl/src/transport/generic.cc
@@ -1,3 +1,9 @@
+/*************************************************************************
+ * Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
+ *
+ * See LICENSE.txt for license information
+ ************************************************************************/
+
 #include "comm.h"
 #include "transport.h"
 #include "bootstrap.h"
diff --git a/projects/rccl/src/transport/net.cc b/projects/rccl/src/transport/net.cc
index c0cd20d6e0..5d1a5601c4 100644
--- a/projects/rccl/src/transport/net.cc
+++ b/projects/rccl/src/transport/net.cc
@@ -178,6 +178,101 @@ struct setupReq {
 NCCL_PARAM(NetOptionalRecvCompletion, "NET_OPTIONAL_RECV_COMPLETION", 1);
 
 static_assert(sizeof(ncclNetHandle_t) + sizeof(int) <= CONNECT_SIZE, "Not large enough ncclConnect to hold ncclNetHandle_t and useGdr flag");
+
+// Common function to initialize network attributes from a ncclComm
+static void populateCommNetAttrs(struct ncclComm* comm, struct ncclConnector* conn, ncclNetAttr_t* netAttr) {
+  *netAttr = NCCL_NET_ATTR_INIT;
+  netAttr->sendCommAttr.minConcurrentPeers = 1;
+  netAttr->sendCommAttr.minFlowsPerPeer = 1;
+
+  netAttr->recvCommAttr.minConcurrentPeers = 1;
+  netAttr->recvCommAttr.minFlowsPerPeer = 1;
+
+  if (conn->p2pOnly) {
+    size_t maxConcPeers = comm->p2pnChannels * NCCL_MAX_DEV_WORK_P2P_PER_BATCH;
+    if (comm->nRanks < maxConcPeers) maxConcPeers = comm->nRanks;
+
+    netAttr->sendCommAttr.maxConcurrentPeers = maxConcPeers;
+    netAttr->sendCommAttr.maxFlowsPerPeer = comm->p2pnChannelsPerPeer;
+    netAttr->recvCommAttr.maxConcurrentPeers = maxConcPeers;
+    netAttr->recvCommAttr.maxFlowsPerPeer = comm->p2pnChannelsPerPeer;
+    netAttr->op = BIT(ncclFuncSend) | BIT(ncclFuncRecv) |
+                  BIT(ncclFuncAlltoAll) | BIT(ncclFuncScatter) | BIT(ncclFuncGather);
+  } else {
+    size_t maxConcPeers = (NCCL_MAX_TREE_ARITY - 1) * 2;
+    if (comm->nRanks < maxConcPeers) maxConcPeers = comm->nRanks;
+    netAttr->sendCommAttr.maxConcurrentPeers = maxConcPeers;
+    netAttr->sendCommAttr.maxFlowsPerPeer = comm->nChannels;
+    netAttr->recvCommAttr.maxConcurrentPeers = maxConcPeers;
+    netAttr->recvCommAttr.maxFlowsPerPeer = comm->nChannels;
+  }
+}
+
+// Apply the netAttr to the net plugin context and cache it in proxyState
+void setNetAttrs(struct ncclProxyState* proxyState, ncclNetAttr_t* netAttr)
+{
+  if (proxyState->ncclNet->setNetAttr) {
+    proxyState->ncclNet->setNetAttr(proxyState->netContext, netAttr);
+    proxyState->netAttr = *netAttr;
+  }
+}
+
+void printNetAttrs(ncclNetAttr_t* netAttr, const char *task)
+{
+  const int opBufLen = ncclNumFuncs*32;
+  char opBuf[opBufLen] = "";
+  const int algoBufLen = NCCL_NUM_ALGORITHMS*32;
+  char algoBuf[algoBufLen] = "";
+  const int protoBufLen = NCCL_NUM_PROTOCOLS*32;
+  char protoBuf[protoBufLen] = "";
+
+  ncclBitsToString(netAttr->op, MASK(ncclNumFuncs), (const char* (*)(int))ncclFuncToString, opBuf, opBufLen, "*");
+  ncclBitsToString(netAttr->algo, MASK(NCCL_NUM_ALGORITHMS), ncclAlgoToString, algoBuf, algoBufLen, "*");
+  ncclBitsToString(netAttr->proto, MASK(NCCL_NUM_PROTOCOLS), ncclProtoToString, protoBuf, protoBufLen, "*");
+
+  TRACE(NCCL_NET, "%s hints, send peers/flows: [%d-%d][%d-%d] recv peers/flows: [%d-%d][%d-%d] op: %s algo: %s proto: %s",
+        task, netAttr->sendCommAttr.minConcurrentPeers, netAttr->sendCommAttr.maxConcurrentPeers,
+        netAttr->sendCommAttr.minFlowsPerPeer, netAttr->sendCommAttr.maxFlowsPerPeer,
+        netAttr->recvCommAttr.minConcurrentPeers, netAttr->recvCommAttr.maxConcurrentPeers,
+        netAttr->recvCommAttr.minFlowsPerPeer, netAttr->recvCommAttr.maxFlowsPerPeer,
+        opBuf, algoBuf, protoBuf);
+}
+
+// Set the netAttr for a transfer operation
+void setXferNetAttrs(struct ncclProxyState* proxyState, struct ncclProxyArgs* args, int send)
+{
+  ncclNetAttr_t netAttr;
+
+  if (!proxyState->ncclNet->setNetAttr)
+    return;
+
+  netAttr = proxyState->netAttr;
+
+  if (send) {
+    netAttr.sendCommAttr.maxConcurrentPeers = args->nPeers;
+    netAttr.sendCommAttr.minConcurrentPeers = args->nPeers;
+    netAttr.sendCommAttr.maxFlowsPerPeer = args->nChannels;
+    netAttr.sendCommAttr.minFlowsPerPeer = args->nChannels;
+  } else {
+    netAttr.recvCommAttr.maxConcurrentPeers = args->nPeers;
+    netAttr.recvCommAttr.minConcurrentPeers = args->nPeers;
+    netAttr.recvCommAttr.maxFlowsPerPeer = args->nChannels;
+    netAttr.recvCommAttr.minFlowsPerPeer = args->nChannels;
+  }
+
+  netAttr.op = BIT(args->collAPI);
+  // algo/proto are undefined for p2p
+  if (args->collAPI < NCCL_NUM_FUNCTIONS) {
+    netAttr.algo = BIT(args->algorithm);
+    netAttr.proto = BIT(args->protocol);
+  }
+
+  if (memcmp(&proxyState->netAttr, &netAttr, sizeof(netAttr))) {
+    setNetAttrs(proxyState, &netAttr);
+    printNetAttrs(&netAttr, send ? "send" : "recv");
+  }
+}
+
 // Forward declaration
 static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct ncclProxyArgs* args);
 
@@ -307,11 +402,12 @@ static ncclResult_t netDumpMap(struct connectMap* map) {
 
 struct netSendConnectArgs {
   ncclNetHandle_t handle;
-  int trafficClass;
+  ncclNetAttr_t netAttr;
 };
 
 struct netRecvConnectArgs {
   int proxyRank;
+  ncclNetAttr_t netAttr;
 };
 
 static ncclResult_t sendConnect(struct ncclComm* comm, struct ncclConnect* connectInfo, int nranks, int rank, struct ncclConnector* send) {
@@ -331,7 +427,9 @@ static ncclResult_t sendConnect(struct ncclComm* comm, struct ncclConnect* conne
     INFO(NCCL_PROXY, "sendConnect ncclProxyCallAsync opId=%p", opId);
     netSendConnectArgs args = {0};
     memcpy(&args.handle, connectInfo, sizeof(ncclNetHandle_t));
-    args.trafficClass = comm->config.trafficClass;
+
+    populateCommNetAttrs(comm, send, &args.netAttr);
+
     NCCLCHECK(ncclProxyCallAsync(comm, &send->proxyConn, ncclProxyMsgConnect, &args, sizeof(netSendConnectArgs), sizeof(struct connectMap), opId));
   } else {
     opId =  send;
@@ -442,6 +540,9 @@ static ncclResult_t recvConnect(struct ncclComm* comm, struct ncclConnect* conne
        opId, &recv->proxyConn, connectInfo);
     netRecvConnectArgs args = {0};
     args.proxyRank = *((int*)connectInfo);
+
+    populateCommNetAttrs(comm, recv, &args.netAttr);
+
     NCCLCHECK(ncclProxyCallAsync(comm, &recv->proxyConn, ncclProxyMsgConnect, &args, sizeof(netRecvConnectArgs), sizeof(struct connectMap), opId));
   } else {
     opId = recv;
@@ -677,7 +778,7 @@ static ncclResult_t recvProxySetup(struct ncclProxyConnection* connection, struc
   }
 
   if (respSize != sizeof(ncclNetHandle_t)) return ncclInternalError;
-  NCCLCHECK(proxyState->ncclNet->listen(req->netDev, respBuff, &resources->netListenComm));
+  NCCLCHECK(proxyState->ncclNet->listen(proxyState->netContext, req->netDev, respBuff, &resources->netListenComm));
   *done = 1;
 
   return ncclSuccess;
@@ -707,11 +808,12 @@ static ncclResult_t ncclNetGetDeviceHandle(ncclNetDeviceType type, int version,
 
 static ncclResult_t sendProxyConnect(struct ncclProxyConnection* connection, struct ncclProxyState* proxyState, void* reqBuff, int reqSize, void* respBuff, int respSize, int* done) {
   struct sendNetResources* resources = (struct sendNetResources*)(connection->transportResources);
-  ncclNetCommConfig_t commConfig = {0};
   if (reqSize != sizeof(netSendConnectArgs)) return ncclInternalError;
   ncclResult_t ret = ncclSuccess;
   netSendConnectArgs* req = (netSendConnectArgs*) reqBuff;
-  commConfig.trafficClass = req->trafficClass == NCCL_CONFIG_UNDEF_INT ? NCCL_NET_TRAFFIC_CLASS_UNDEF : req->trafficClass;
+
+  setNetAttrs(proxyState, &req->netAttr);
+
   NCCLCHECK(ncclNetGetDeviceHandle(resources->netDeviceType, resources->netDeviceVersion, false /*isRecv*/, &resources->netDeviceHandle));
   if (resources->shared) {
     // Shared buffers
@@ -736,25 +838,29 @@ static ncclResult_t sendProxyConnect(struct ncclProxyConnection* connection, str
         comms->activeConnect[resources->channelId] = (resources->tpLocalRank + 1);
       if (comms->sendComm[resources->channelId] == NULL
           && comms->activeConnect[resources->channelId] == (resources->tpLocalRank + 1)) {
-        ret = proxyState->ncclNet->connect(resources->netDev, &commConfig, req->handle,
+        ret = proxyState->ncclNet->connect(proxyState->netContext, resources->netDev, req->handle,
             comms->sendComm + resources->channelId, &resources->netDeviceHandle);
       }
       resources->netSendComm = comms->sendComm[resources->channelId];
       if (comms->sendComm[resources->channelId]) comms->sendRefCount[resources->channelId]++;
     } else {
-      ret = proxyState->ncclNet->connect(resources->netDev, &commConfig, req->handle, &resources->netSendComm, &resources->netDeviceHandle);
+      ret = proxyState->ncclNet->connect(proxyState->netContext, resources->netDev, req->handle, &resources->netSendComm, &resources->netDeviceHandle);
     }
   } else {
     // Connect to remote peer
-    ret = proxyState->ncclNet->connect(resources->netDev, &commConfig, req->handle, &resources->netSendComm, &resources->netDeviceHandle);
+    ret = proxyState->ncclNet->connect(proxyState->netContext, resources->netDev, req->handle, &resources->netSendComm, &resources->netDeviceHandle);
     connection->proxyAppendPtr = &connection->proxyAppend;
   }
 
-  NCCLCHECK(ret);
+  if (ret != ncclSuccess) {
+    if (resources->netSendComm) proxyState->ncclNet->closeSend(resources->netSendComm);
+    NCCLCHECK(ret);
+  }
   if (resources->netSendComm == NULL) {
     *done = 0;
     return ncclInProgress;
   }
+  printNetAttrs(&req->netAttr, "send connect");
   *done = 1;
 
   if (resources->netDeviceHandle) {
@@ -872,6 +978,8 @@ static ncclResult_t recvProxyConnect(struct ncclProxyConnection* connection, str
   resources->tpRemoteProxyRank = req->proxyRank;
   ncclResult_t ret = ncclSuccess;
 
+  setNetAttrs(proxyState, &req->netAttr);
+
   NCCLCHECK(ncclNetGetDeviceHandle(resources->netDeviceType, resources->netDeviceVersion, true /*isRecv*/, &resources->netDeviceHandle));
   // Finish connection establishment from remote peer
   if (resources->shared) {
@@ -917,6 +1025,7 @@ static ncclResult_t recvProxyConnect(struct ncclProxyConnection* connection, str
     *done = 0;
     return ncclInProgress;
   }
+  printNetAttrs(&req->netAttr, "recv connect");
   *done = 1;
 
   if (resources->netDeviceHandle) {
@@ -1106,6 +1215,7 @@ static ncclResult_t recvProxyFree(struct ncclProxyConnection* connection, struct
 static_assert(NCCL_STEPS <= NCCL_NET_MAX_REQUESTS, "Not enough net requests to cover for steps");
 
 static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct ncclProxyArgs* args) {
+  int checkedNetAttr = 0;
   if (args->state == ncclProxyOpReady) {
     for (int s=0; s<args->nsubs; s++) {
       struct ncclProxySubArgs* sub = args->subs+s;
@@ -1207,6 +1317,8 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
             // since size is a plain integer.
             // coverity[use_invalid:FALSE]
             void* phandle = &sub->pHandles[DIVUP(transmittedStepId, args->sliceSteps)%NCCL_STEPS];
+            if (!checkedNetAttr++)
+              setXferNetAttrs(proxyState, args, 1);
             NCCLCHECK(proxyState->ncclNet->isend(resources->netSendComm, buff, size, resources->tpRank, sub->sendMhandle, phandle, sub->requests+buffSlot));
             if (sub->requests[buffSlot] != NULL) {
               TRACE(NCCL_NET, "sendProxy [%ld/%d/%d] Isend posted, req %p, buff %p, size %d, proto %d, myRank %d, channelId %d, mhandle %p", sub->transmitted, buffSlot, sub->nsteps, sub->requests[buffSlot], buff, size, p, proxyState->tpRank, sub->channelId, sub->sendMhandle);
@@ -1258,6 +1370,7 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
 }
 
 static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct ncclProxyArgs* args) {
+  int checkedNetAttr = 0;
   if (args->state == ncclProxyOpReady) {
     // Initialize subs and group them by same recvComm.
     void* recvComm;
@@ -1363,6 +1476,8 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
         struct recvNetResources* resources = (struct recvNetResources*) (subGroup->connection->transportResources);
         void** requestPtr = subGroup->requests+(step%NCCL_STEPS);
         bool ignoreCompletion = ncclParamNetOptionalRecvCompletion() && ((args->protocol == NCCL_PROTO_LL128) || (args->protocol == NCCL_PROTO_LL)) && (subCount == 1);
+        if (!checkedNetAttr++)
+          setXferNetAttrs(proxyState, args, 0);
         if (ignoreCompletion) *requestPtr = (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION;
         NCCLCHECK(proxyState->ncclNet->irecv(resources->netRecvComm, subCount, ptrs, sizes, tags, mhandles, phandles, requestPtr));
         if (*requestPtr) {
@@ -1594,13 +1709,18 @@ ncclResult_t ncclNetLocalRegisterBuffer(ncclComm* comm, const void* userbuff, si
   ncclResult_t ret = ncclSuccess;
   struct ncclReg *regRecord = NULL;
   bool isValid = false;
+  void *base = NULL;
+  size_t baseSize = 0;
 
   *outRegBufFlag = 0;
   if (comm && userbuff && buffSize > 0 && nPeers > 0) {
     NCCLCHECKGOTO(ncclRegFind(comm, userbuff, buffSize, &regRecord), ret, fail);
     NCCLCHECKGOTO(ncclRegLocalIsValid(regRecord, &isValid), ret, fail);
-    if (isValid)
+    if (isValid) {
+      CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&base, &baseSize, (CUdeviceptr)userbuff), ret, fail);
+      if ((uint64_t)base + baseSize < (uint64_t)userbuff + buffSize) goto exit;
       NCCLCHECKGOTO(netRegisterBuffer(comm, userbuff, buffSize, peerConns, nPeers, regRecord, outRegBufFlag, outHandle), ret, fail);
+    }
   }
 
 exit:
@@ -1627,13 +1747,14 @@ ncclResult_t ncclNetGraphRegisterBuffer(ncclComm* comm, const void* userbuff, si
   ncclResult_t ret = ncclSuccess;
   struct ncclNetCleanupCallback *record = NULL;
   struct ncclReg *regRecord = NULL;
-  void *baseSend;
-  size_t baseSendSize;
+  void *base = NULL;
+  size_t baseSize = 0;
 
   *outRegBufFlag = 0;
   if (comm && userbuff && buffSize > 0 && nPeers > 0) {
-    CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&baseSend, &baseSendSize, (CUdeviceptr)userbuff), ret, fail);
-    NCCLCHECKGOTO(ncclCommGraphRegister(comm, baseSend, baseSendSize, (void**)&regRecord), ret, fail);
+    CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&base, &baseSize, (CUdeviceptr)userbuff), ret, fail);
+    if ((uint64_t)base + baseSize < (uint64_t)userbuff + buffSize) goto exit;
+    NCCLCHECKGOTO(ncclCommGraphRegister(comm, base, baseSize, (void**)&regRecord), ret, fail);
     NCCLCHECKGOTO(netRegisterBuffer(comm, userbuff, buffSize, peerConns, nPeers, regRecord, outRegBufFlag, outHandle), ret, fail);
     if (*outRegBufFlag) {
       NCCLCHECKGOTO(ncclCalloc(&record, 1), ret, fail);
diff --git a/projects/rccl/src/transport/net_ib.cc b/projects/rccl/src/transport/net_ib.cc
index 709e7ad400..3614dec61d 100644
--- a/projects/rccl/src/transport/net_ib.cc
+++ b/projects/rccl/src/transport/net_ib.cc
@@ -21,6 +21,7 @@
 #include <poll.h>
 #include <sys/types.h>
 #include <unistd.h>
+#include <mutex>
 #define ENABLE_TIMER 0
 #include "timer.h"
 
@@ -32,6 +33,8 @@
 static char ncclIbIfName[MAX_IF_NAME_SIZE+1];
 static union ncclSocketAddress ncclIbIfAddr;
 
+static ncclNetCommConfig_t ibContext;
+
 struct ncclIbMr {
   uintptr_t addr;
   size_t pages;
@@ -70,7 +73,7 @@ const char* ibProviderName[] = {
 
 static int ncclNIbDevs = -1;
 struct alignas(64) ncclIbDev {
-  pthread_mutex_t lock;
+  std::mutex mutex;
   int device;
   uint64_t guid;
   uint8_t portNum;
@@ -102,9 +105,16 @@ struct alignas(64) ncclIbDev {
 #define MAX_IB_VDEVS MAX_IB_DEVS*8
 struct ncclIbMergedDev ncclIbMergedDevs[MAX_IB_VDEVS];
 struct ncclIbDev ncclIbDevs[MAX_IB_DEVS];
-pthread_mutex_t ncclIbLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex ncclIbMutex;
 static int ncclIbRelaxedOrderingEnabled = 0;
 
+// With ncclNet_v11_t the NCCL core initializes the network plugin per-communicator
+// rather than once for all communicators. However, the internal plugin implementation
+// still assumes the plugin is initialized only once across all communicators. The ref
+// counter ensures the plugin initializes its internal state only once. Once per-communicator
+// context support is added to the plugin, the ref counter can be removed.
+static int netRefCount;
+
 #define NCCL_IB_LLSTR(ll) (((ll) == IBV_LINK_LAYER_INFINIBAND) ? "IB" : (((ll) == IBV_LINK_LAYER_ETHERNET) ? "RoCE" : "UNSPECIFIED"))
 
 #define NCCL_IB_SL_DEFAULT 0
@@ -184,6 +194,9 @@ static void* ncclIbAsyncThreadMain(void* args) {
       // SRQ are not used in NCCL
       WARN("NET/IB : %s:%d async fatal event on SRQ, unused for now (%p): %s", dev->devName, dev->portNum, srq, str);
       break;
+    case IBV_EVENT_GID_CHANGE:
+      WARN("NET/IB : %s:%d GID table changed", dev->devName, dev->portNum);
+      break;
     case IBV_EVENT_PATH_MIG_ERR:
     case IBV_EVENT_PORT_ERR:
     case IBV_EVENT_PATH_MIG:
@@ -597,15 +610,22 @@ ncclResult_t ncclIbMakeVDeviceInternal(int* d, ncclNetVDeviceProps_t* props) {
 }
 
 ncclResult_t ncclIbMakeVDevice(int* d, ncclNetVDeviceProps_t* props) {
-  pthread_mutex_lock(&ncclIbLock);
+  std::lock_guard<std::mutex> lock(ncclIbMutex);
   ncclResult_t res = ncclIbMakeVDeviceInternal(d, props);
-  pthread_mutex_unlock(&ncclIbLock);
   return res;
+
+}
+
+ncclResult_t ncclIbSetNetAttr(void *ctx, ncclNetAttr_t *netAttr) {
+  (void)ctx;
+  (void)netAttr;
+  return ncclSuccess;
 }
 
 static ncclProfilerCallback_t ncclProfilerFunction;
 
-ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) {
+ncclResult_t ncclIbInit(void** ctx, uint64_t commId, ncclNetCommConfig_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) {
+  if (netRefCount++) return ncclSuccess;
   ncclResult_t ret = ncclSuccess;
   ncclProfilerFunction = profFunction;
   if (ncclParamIbDisable()) return ncclInternalError;
@@ -614,7 +634,7 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
   if(wrap_mlx5dv_symbols() != ncclSuccess) { INFO(NCCL_NET, "NET/IB : Failed to open mlx5dv symbols. Advance features like CX-8 Direct-NIC will be disabled."); }
 
   if (ncclNIbDevs == -1) {
-    pthread_mutex_lock(&ncclIbLock);
+    std::lock_guard<std::mutex> lock(ncclIbMutex);
     wrap_ibv_fork_init();
     if (ncclNIbDevs == -1) {
       int nIpIfs = 0;
@@ -644,25 +664,15 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
       if (ncclSuccess != wrap_ibv_get_device_list(&devices, &nIbDevs)) { ret = ncclInternalError; goto fail; }
 
       for (int d=0; d<nIbDevs && ncclNIbDevs<MAX_IB_DEVS; d++) {
-        struct ibv_context * context;
+        struct ibv_context * context = NULL;
         if (ncclSuccess != wrap_ibv_open_device(&context, devices[d]) || context == NULL) {
           WARN("NET/IB : Unable to open device %s", devices[d]->name);
           continue;
         }
-        enum ncclIbProvider ibProvider = IB_PROVIDER_NONE;
-        char dataDirectDevicePath[PATH_MAX];
-        int dataDirectSupported = 0;
-        int skipNetDevForDataDirect = 0;
-        if (wrap_mlx5dv_is_supported(devices[d])) {
-          ibProvider = IB_PROVIDER_MLX5;
-          snprintf(dataDirectDevicePath, PATH_MAX, "/sys");
-          if((ncclMlx5dvDmaBufCapable(context)) && (wrap_mlx5dv_get_data_direct_sysfs_path(context, dataDirectDevicePath + 4, PATH_MAX - 4) == ncclSuccess)) {
-            INFO(NCCL_INIT|NCCL_NET, "NET/IB: Data Direct DMA Interface is detected for device:%s", devices[d]->name);
-            // Now check whether Data Direct has been disabled by the user
-            if(ncclParamIbDataDirect() == 1) { dataDirectSupported = 1; skipNetDevForDataDirect = 1; }
-            if(ncclParamIbDataDirect() == 2) { dataDirectSupported = 1; skipNetDevForDataDirect = 0; }
-          }
-        }
+        char dataDirectDevicePath[PATH_MAX] = "/sys";
+        int devCount = /*undefined*/-1, devOffset = 0;
+        enum ncclIbProvider ibProvider = wrap_mlx5dv_is_supported(devices[d]) ? IB_PROVIDER_MLX5 : IB_PROVIDER_NONE;
+
         int nPorts = 0;
         struct ibv_device_attr devAttr;
         memset(&devAttr, 0, sizeof(devAttr));
@@ -672,78 +682,99 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
           continue;
         }
         for (int port_num = 1; port_num <= devAttr.phys_port_cnt; port_num++) {
-          // dataDirect = 0 exposes the devices normally, dataDirect = 1 exposes the devices through direct NIC
-          for (int dataDirect = skipNetDevForDataDirect; dataDirect < 1 + dataDirectSupported; ++dataDirect) {
             struct ibv_port_attr portAttr;
             if (ncclSuccess != wrap_ibv_query_port(context, port_num, &portAttr)) {
               WARN("NET/IB : Unable to query port_num %d", port_num);
               continue;
             }
             if (portAttr.state != IBV_PORT_ACTIVE) continue;
-            if (portAttr.link_layer != IBV_LINK_LAYER_INFINIBAND
-                && portAttr.link_layer != IBV_LINK_LAYER_ETHERNET) continue;
+            if (portAttr.link_layer != IBV_LINK_LAYER_INFINIBAND && portAttr.link_layer != IBV_LINK_LAYER_ETHERNET) continue;
 
             // check against user specified HCAs/ports
             if (! (matchIfList(devices[d]->name, port_num, userIfs, nUserIfs, searchExact) ^ searchNot)) {
               continue;
             }
-            pthread_mutex_init(&ncclIbDevs[ncclNIbDevs].lock, NULL);
-            ncclIbDevs[ncclNIbDevs].device = d;
-            ncclIbDevs[ncclNIbDevs].ibProvider = ibProvider;
-            ncclIbDevs[ncclNIbDevs].guid = devAttr.sys_image_guid;
-            ncclIbDevs[ncclNIbDevs].portAttr = portAttr;
-            ncclIbDevs[ncclNIbDevs].portNum = port_num;
-            ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
-            if (portAttr.active_speed_ex)
-              // A non-zero active_speed_ex indicates XDR rate (0x100) or higher
-              ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed_ex) * ncclIbWidth(portAttr.active_width);
-            else
-              ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
-            ncclIbDevs[ncclNIbDevs].context = context;
-            ncclIbDevs[ncclNIbDevs].pdRefs = 0;
-            ncclIbDevs[ncclNIbDevs].pd = NULL;
-            if (!dataDirect) {
-              strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
-              NCCLCHECKGOTO(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort), ret, fail);
-            } else {
-              snprintf(ncclIbDevs[ncclNIbDevs].devName, MAXNAMESIZE, "%s_dma", devices[d]->name);
-              NCCLCHECK(ncclCalloc(&ncclIbDevs[ncclNIbDevs].pciPath, PATH_MAX));
-              strncpy(ncclIbDevs[ncclNIbDevs].pciPath, dataDirectDevicePath, PATH_MAX);
-              ncclIbDevs[ncclNIbDevs].capsProvider.mlx5.dataDirect = 1;
+
+            // check for mlx5 data direct support only once for each device
+            if (devCount == -1) {
+              devCount = 1;
+              devOffset = 0;
+              if (ncclParamIbDataDirect() > 0 && ibProvider == IB_PROVIDER_MLX5 && ncclMlx5dvDmaBufCapable(context)) {
+                int pathLen = strlen(dataDirectDevicePath);
+                ncclResult_t res = wrap_mlx5dv_get_data_direct_sysfs_path(context, dataDirectDevicePath + pathLen, sizeof(dataDirectDevicePath) - pathLen);
+                if (res == ncclSuccess) {
+                  // data direct devices are exposed twice: with the C2C + PCIe link and with the data direct link
+                  devCount = 2;
+                  // by default only expose the data direct NIC (devOffset = 1), unless set to 2 by the user
+                  devOffset = (ncclParamIbDataDirect() == 2) ? 0 : 1;
+                  INFO(NCCL_INIT | NCCL_NET, "NET/IB: Data Direct DMA Interface is detected for device %s", devices[d]->name);
+                } else if (res == ncclInvalidArgument) {
+                  TRACE(NCCL_NET, "NET/IB: Device %s does not support Data Direct DMA.", devices[d]->name);
+                } else {
+                  WARN("NET/IB: Error in mlx5dv_get_data_direct_sysfs_path with device %s", devices[d]->name);
+                  return res;
+                }
+              }
             }
-            ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
-            ncclIbDevs[ncclNIbDevs].mrCache.capacity = 0;
-            ncclIbDevs[ncclNIbDevs].mrCache.population = 0;
-            ncclIbDevs[ncclNIbDevs].mrCache.slots = NULL;
-            NCCLCHECK(ncclIbStatsInit(&ncclIbDevs[ncclNIbDevs].stats));
+            for (int dev = devOffset; dev < devCount; ++dev) {
+              ncclIbDevs[ncclNIbDevs].device = d;
+              ncclIbDevs[ncclNIbDevs].ibProvider = ibProvider;
+              ncclIbDevs[ncclNIbDevs].guid = devAttr.sys_image_guid;
+              ncclIbDevs[ncclNIbDevs].portAttr = portAttr;
+              ncclIbDevs[ncclNIbDevs].portNum = port_num;
+              ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
+              if (portAttr.active_speed_ex) {
+                // A non-zero active_speed_ex indicates XDR rate (0x100) or higher
+                ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed_ex) * ncclIbWidth(portAttr.active_width);
+              } else {
+                ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
+              }
+              ncclIbDevs[ncclNIbDevs].context = context;
+              ncclIbDevs[ncclNIbDevs].pdRefs = 0;
+              ncclIbDevs[ncclNIbDevs].pd = NULL;
+              if (dev == 0) {
+                strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
+                NCCLCHECKGOTO(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort), ret, fail);
+              } else {
+                snprintf(ncclIbDevs[ncclNIbDevs].devName, MAXNAMESIZE, "%s_dma", devices[d]->name);
+                NCCLCHECK(ncclCalloc(&ncclIbDevs[ncclNIbDevs].pciPath, PATH_MAX));
+                strncpy(ncclIbDevs[ncclNIbDevs].pciPath, dataDirectDevicePath, PATH_MAX);
+                ncclIbDevs[ncclNIbDevs].capsProvider.mlx5.dataDirect = 1;
+              }
+              ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
+              ncclIbDevs[ncclNIbDevs].mrCache.capacity = 0;
+              ncclIbDevs[ncclNIbDevs].mrCache.population = 0;
+              ncclIbDevs[ncclNIbDevs].mrCache.slots = NULL;
+              NCCLCHECK(ncclIbStatsInit(&ncclIbDevs[ncclNIbDevs].stats));
 
-            // Enable ADAPTIVE_ROUTING by default on IB networks
-            // But allow it to be overloaded by an env parameter
-            ncclIbDevs[ncclNIbDevs].ar = (portAttr.link_layer == IBV_LINK_LAYER_INFINIBAND) ? 1 : 0;
-            if (ncclParamIbAdaptiveRouting() != -2) ncclIbDevs[ncclNIbDevs].ar = ncclParamIbAdaptiveRouting();
+              // Enable ADAPTIVE_ROUTING by default on IB networks
+              // But allow it to be overridden by an env parameter
+              ncclIbDevs[ncclNIbDevs].ar = (portAttr.link_layer == IBV_LINK_LAYER_INFINIBAND) ? 1 : 0;
+              if (ncclParamIbAdaptiveRouting() != -2) ncclIbDevs[ncclNIbDevs].ar = ncclParamIbAdaptiveRouting();
 
-            INFO(NCCL_NET,"NET/IB: [%d] %s:%s:%d/%s provider=%s speed=%d context=%p pciPath=%s ar=%d", d, devices[d]->name, devices[d]->dev_name, ncclIbDevs[ncclNIbDevs].portNum,
-                NCCL_IB_LLSTR(portAttr.link_layer), ibProviderName[ncclIbDevs[ncclNIbDevs].ibProvider], ncclIbDevs[ncclNIbDevs].speed, context, ncclIbDevs[ncclNIbDevs].pciPath, ncclIbDevs[ncclNIbDevs].ar);
+              INFO(NCCL_NET, "NET/IB: [%d] %s:%s:%d/%s provider=%s speed=%d context=%p pciPath=%s ar=%d", d, devices[d]->name, devices[d]->dev_name,
+                   ncclIbDevs[ncclNIbDevs].portNum, NCCL_IB_LLSTR(portAttr.link_layer), ibProviderName[ncclIbDevs[ncclNIbDevs].ibProvider], ncclIbDevs[ncclNIbDevs].speed, context,
+                   ncclIbDevs[ncclNIbDevs].pciPath, ncclIbDevs[ncclNIbDevs].ar);
 
-            PTHREADCHECKGOTO(pthread_create(&ncclIbAsyncThread, NULL, ncclIbAsyncThreadMain, ncclIbDevs + ncclNIbDevs), "pthread_create", ret, fail);
-            ncclSetThreadName(ncclIbAsyncThread, "NCCL IbAsync %2d", ncclNIbDevs);
-            PTHREADCHECKGOTO(pthread_detach(ncclIbAsyncThread), "pthread_detach", ret, fail); // will not be pthread_join()'d
+              PTHREADCHECKGOTO(pthread_create(&ncclIbAsyncThread, NULL, ncclIbAsyncThreadMain, ncclIbDevs + ncclNIbDevs), "pthread_create", ret, fail);
+              ncclSetThreadName(ncclIbAsyncThread, "NCCL IbAsync %2d", ncclNIbDevs);
+              PTHREADCHECKGOTO(pthread_detach(ncclIbAsyncThread), "pthread_detach", ret, fail); // will not be pthread_join()'d
 
-            // Add this plain physical device to the list of virtual devices
-            int vDev;
-            ncclNetVDeviceProps_t vProps = {0};
-            vProps.ndevs = 1;
-            vProps.devs[0] = ncclNIbDevs;
-            NCCLCHECK(ncclIbMakeVDeviceInternal(&vDev, &vProps));
+              // Add this plain physical device to the list of virtual devices
+              int vDev;
+              ncclNetVDeviceProps_t vProps = {0};
+              vProps.ndevs = 1;
+              vProps.devs[0] = ncclNIbDevs;
+              NCCLCHECK(ncclIbMakeVDeviceInternal(&vDev, &vProps));
 
-            ncclNIbDevs++;
-            nPorts++;
-          }
+              ncclNIbDevs++;
+              nPorts++;
+            }
         }
         if (nPorts == 0 && ncclSuccess != wrap_ibv_close_device(context)) { ret = ncclInternalError; goto fail; }
       }
 
-      if (nIbDevs && (ncclSuccess != wrap_ibv_free_device_list(devices))) { ret = ncclInternalError; goto fail; };
+      if (nIbDevs && (ncclSuccess != wrap_ibv_free_device_list(devices))) { ret = ncclInternalError; goto fail; }
     }
     if (ncclNIbDevs == 0) {
       INFO(NCCL_INIT|NCCL_NET, "NET/IB : No device found.");
@@ -762,12 +793,12 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
     INFO(NCCL_INIT|NCCL_NET, "NET/IB : Using%s %s; OOB %s:%s", line, ncclIbRelaxedOrderingEnabled ? "[RO]" : "",
           ncclIbIfName, ncclSocketToString(&ncclIbIfAddr, addrline));
 
-    pthread_mutex_unlock(&ncclIbLock);
   }
 exit:
+  ibContext.trafficClass = config->trafficClass;
+  *ctx = &ibContext;
   return ret;
 fail:
-  pthread_mutex_unlock(&ncclIbLock);
   goto exit;
 }
 
@@ -789,8 +820,8 @@ static void ibGdrSupportInitOnce() {
                           KNL_MODULE_LOADED("/sys/module/nvidia_peermem/version");
 }
 ncclResult_t ncclIbGdrSupport() {
-  static pthread_once_t once = PTHREAD_ONCE_INIT;
-  pthread_once(&once, ibGdrSupportInitOnce);
+  static std::once_flag once;
+  std::call_once(once, ibGdrSupportInitOnce);
   if (!ncclIbGdrModuleLoaded)
     return ncclSystemError;
   return ncclSuccess;
@@ -825,13 +856,10 @@ failure:
 // ncclSuccess : DMA-BUF support is available
 // ncclSystemError : DMA-BUF is not supported by the kernel
 ncclResult_t ncclIbDmaBufSupport(int dev) {
-  struct oncewrap {
-    pthread_once_t once = PTHREAD_ONCE_INIT;
-  };
-  static oncewrap onces[MAX_IB_DEVS];
+  static std::once_flag onces[MAX_IB_DEVS];
   // init the device only once
   ibDmaSupportInitDev = dev;
-  pthread_once(&onces[dev].once, ibDmaBufSupportInitOnce);
+  std::call_once(onces[dev], ibDmaBufSupportInitOnce);
   ncclIbMergedDev* mergedDev = ncclIbMergedDevs + ibDmaSupportInitDev;
   ncclIbDev* ibDev = ncclIbDevs + mergedDev->vProps.devs[0];
   int dmaBufSupported = ibDev->dmaBufSupported;
@@ -843,7 +871,7 @@ ncclResult_t ncclIbDmaBufSupport(int dev) {
 
 ncclResult_t ncclIbGetPhysProperties(int dev, ncclNetProperties_t* props) {
   struct ncclIbDev* ibDev = ncclIbDevs + dev;
-  pthread_mutex_lock(&ibDev->lock);
+  std::lock_guard<std::mutex> lock(ibDev->mutex);
   props->name = ibDev->devName;
   props->speed = ibDev->speed;
   props->pciPath = ibDev->pciPath;
@@ -867,7 +895,8 @@ ncclResult_t ncclIbGetPhysProperties(int dev, ncclNetProperties_t* props) {
   props->netDeviceType    = NCCL_NET_DEVICE_HOST;
   props->netDeviceVersion = NCCL_NET_DEVICE_INVALID_VERSION;
   props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES;
-  pthread_mutex_unlock(&ibDev->lock);
+  props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
@@ -1132,18 +1161,13 @@ static void ncclIbAddEvent(struct ncclIbRequest* req, int devIndex, struct ncclI
 ncclResult_t ncclIbInitCommDevBase(int ibDevN, struct ncclIbNetCommDevBase* base, void* cq_context) {
   base->ibDevN = ibDevN;
   ncclIbDev* ibDev = ncclIbDevs + ibDevN;
-  pthread_mutex_lock(&ibDev->lock);
-  if (0 == ibDev->pdRefs++) {
-    ncclResult_t res;
-    NCCLCHECKGOTO(wrap_ibv_alloc_pd(&ibDev->pd, ibDev->context), res, failure);
-    if (0) {
-    failure:
-      pthread_mutex_unlock(&ibDev->lock);
-      return res;
+  {
+    std::lock_guard<std::mutex> lock(ibDev->mutex);
+    if (0 == ibDev->pdRefs++) {
+      NCCLCHECK(wrap_ibv_alloc_pd(&ibDev->pd, ibDev->context));
     }
+    base->pd = ibDev->pd;
   }
-  base->pd = ibDev->pd;
-  pthread_mutex_unlock(&ibDev->lock);
 
   // Recv requests can generate 2 completions (one for the post FIFO, one for the Recv).
   NCCLCHECK(wrap_ibv_create_cq(&base->cq, ibDev->context, 2*MAX_REQUESTS*ncclParamIbQpsPerConn(), cq_context, NULL, 0));
@@ -1152,17 +1176,13 @@ ncclResult_t ncclIbInitCommDevBase(int ibDevN, struct ncclIbNetCommDevBase* base
 }
 
 ncclResult_t ncclIbDestroyBase(struct ncclIbNetCommDevBase* base) {
-  ncclResult_t res;
   NCCLCHECK(wrap_ibv_destroy_cq(base->cq));
 
-  pthread_mutex_lock(&ncclIbDevs[base->ibDevN].lock);
+  std::lock_guard<std::mutex> lock(ncclIbDevs[base->ibDevN].mutex);
   if (0 == --ncclIbDevs[base->ibDevN].pdRefs) {
-    NCCLCHECKGOTO(wrap_ibv_dealloc_pd(ncclIbDevs[base->ibDevN].pd), res, returning);
+    NCCLCHECK(wrap_ibv_dealloc_pd(ncclIbDevs[base->ibDevN].pd));
   }
-  res = ncclSuccess;
-returning:
-  pthread_mutex_unlock(&ncclIbDevs[base->ibDevN].lock);
-  return res;
+  return ncclSuccess;
 }
 
 ncclResult_t ncclIbCreateQp(uint8_t ib_port, struct ncclIbNetCommDevBase* base, int access_flags, void* qp_context, struct ncclIbQp* qp) {
@@ -1250,7 +1270,7 @@ ncclResult_t ncclIbRtsQp(struct ibv_qp* qp) {
   return ncclSuccess;
 }
 
-ncclResult_t ncclIbListen(int dev, void* opaqueHandle, void** listenComm) {
+ncclResult_t ncclIbListen(void* ctx, int dev, void* opaqueHandle, void** listenComm) {
   ncclResult_t ret = ncclSuccess;
   struct ncclIbListenComm* comm;
   NCCLCHECK(ncclCalloc(&comm, 1));
@@ -1271,7 +1291,7 @@ fail:
   goto exit;
 }
 
-ncclResult_t ncclIbConnect(int dev, ncclNetCommConfig_t* config, void* opaqueHandle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
+ncclResult_t ncclIbConnect(void* ctx, int dev, void* opaqueHandle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
   ncclResult_t ret = ncclSuccess;
   struct ncclIbHandle* handle = (struct ncclIbHandle*) opaqueHandle;
   struct ncclIbCommStage* stage = &handle->stage;
@@ -1333,6 +1353,7 @@ ib_recv_dev_list:
   if (stage->offset != sizeof(ncclNetVDeviceProps_t)) return ncclSuccess;
   stage->offset = 0;
   ncclNetVDeviceProps_t remoteVProps;
+  ncclNetCommConfig_t* config;
   memcpy(&remoteVProps, stage->buffer, sizeof(ncclNetVDeviceProps_t));
   mergedDev = ncclIbMergedDevs + dev;
   comm->base.vProps = mergedDev->vProps;
@@ -1425,6 +1446,7 @@ ib_recv_dev_list:
       return ncclInternalError;
     }
   }
+  config = (ncclNetCommConfig_t*)ctx;
   meta.fifoAddr = (uint64_t)comm->fifo;
   meta.sl = (ncclParamIbSl() != -1) ? ncclParamIbSl() : (config && config->trafficClass != NCCL_NET_TRAFFIC_CLASS_UNDEF) ? config->trafficClass : NCCL_IB_SL_DEFAULT;
   meta.tc = (ncclParamIbTc() != -1) ? ncclParamIbTc() : (config && config->trafficClass != NCCL_NET_TRAFFIC_CLASS_UNDEF) ? config->trafficClass : NCCL_IB_TC_DEFAULT;
@@ -1856,13 +1878,12 @@ ncclResult_t ncclIbRegMrDmaBufInternal(ncclIbNetCommDevBase* base, void* data, s
   struct ncclIbMrCache* cache = &ncclIbDevs[base->ibDevN].mrCache;
   uintptr_t addr = (uintptr_t)data & -pageSize;
   size_t pages = ((uintptr_t)data + size - addr + pageSize-1)/pageSize;
-  ncclResult_t res;
-  pthread_mutex_lock(&ncclIbDevs[base->ibDevN].lock);
+  std::lock_guard<std::mutex> lock(ncclIbDevs[base->ibDevN].mutex);
   for (int slot=0; /*true*/; slot++) {
     if (slot == cache->population || addr < cache->slots[slot].addr) { // didn't find in cache
       if (cache->population == cache->capacity) { // must grow cache
         cache->capacity = cache->capacity < 32 ? 32 : 2*cache->capacity;
-        NCCLCHECKGOTO(ncclRealloc(&cache->slots, cache->population, cache->capacity), res, returning);
+        NCCLCHECK(ncclRealloc(&cache->slots, cache->population, cache->capacity));
       }
       // Deregister / register
       struct ibv_mr* mr;
@@ -1871,17 +1892,17 @@ ncclResult_t ncclIbRegMrDmaBufInternal(ncclIbNetCommDevBase* base, void* data, s
       if (fd != -1) {
         /* DMA-BUF support */
         if (!ncclIbDevs[base->ibDevN].capsProvider.mlx5.dataDirect) {
-          NCCLCHECKGOTO(wrap_ibv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags), res, returning);
+          NCCLCHECK(wrap_ibv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags));
         } else {
-          NCCLCHECKGOTO(wrap_mlx5dv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags, MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT), res, returning);
+          NCCLCHECK(wrap_mlx5dv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags, MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT));
         }
       } else {
         if (ncclIbRelaxedOrderingEnabled) {
           // Use IBVERBS_1.8 API - needed for IBV_ACCESS_RELAXED_ORDERING support
-          NCCLCHECKGOTO(wrap_ibv_reg_mr_iova2(&mr, base->pd, (void*)addr, pages*pageSize, addr, flags), res, returning);
+          NCCLCHECK(wrap_ibv_reg_mr_iova2(&mr, base->pd, (void*)addr, pages*pageSize, addr, flags));
         }
         else {
-          NCCLCHECKGOTO(wrap_ibv_reg_mr(&mr, base->pd, (void*)addr, pages*pageSize, flags), res, returning);
+          NCCLCHECK(wrap_ibv_reg_mr(&mr, base->pd, (void*)addr, pages*pageSize, flags));
         }
       }
       TRACE(NCCL_INIT|NCCL_NET,"regAddr=0x%lx size=%lld rkey=0x%x lkey=0x%x fd=%d", (unsigned long)addr, (long long)pages*pageSize, mr->rkey, mr->lkey, fd);
@@ -1892,19 +1913,15 @@ ncclResult_t ncclIbRegMrDmaBufInternal(ncclIbNetCommDevBase* base, void* data, s
       cache->slots[slot].mr = mr;
       cache->population += 1;
       *mhandle = mr;
-      res = ncclSuccess;
-      goto returning;
+      return ncclSuccess;
     } else if ((addr >= cache->slots[slot].addr) &&
         ((addr-cache->slots[slot].addr)/pageSize+pages) <= cache->slots[slot].pages) {
       cache->slots[slot].refs += 1;
       *mhandle = cache->slots[slot].mr;
-      res = ncclSuccess;
-      goto returning;
+      return ncclSuccess;
     }
   }
-returning:
-  pthread_mutex_unlock(&ncclIbDevs[base->ibDevN].lock);
-  return res;
+  return ncclSuccess;
 }
 
 struct ncclIbNetCommDevBase* ncclIbGetNetCommDevBase(ncclIbNetCommBase* base, int devIndex) {
@@ -1942,8 +1959,7 @@ ncclResult_t ncclIbRegMr(void* comm, void* data, size_t size, int type, void** m
 
 ncclResult_t ncclIbDeregMrInternal(ncclIbNetCommDevBase* base, ibv_mr* mhandle) {
   struct ncclIbMrCache* cache = &ncclIbDevs[base->ibDevN].mrCache;
-  ncclResult_t res;
-  pthread_mutex_lock(&ncclIbDevs[base->ibDevN].lock);
+  std::lock_guard<std::mutex> lock(ncclIbDevs[base->ibDevN].mutex);
   for (int i=0; i < cache->population; i++) {
     if (mhandle == cache->slots[i].mr) {
       if (0 == --cache->slots[i].refs) {
@@ -1953,17 +1969,13 @@ ncclResult_t ncclIbDeregMrInternal(ncclIbNetCommDevBase* base, ibv_mr* mhandle)
           cache->slots = NULL;
           cache->capacity = 0;
         }
-        NCCLCHECKGOTO(wrap_ibv_dereg_mr(mhandle), res, returning);
+        NCCLCHECK(wrap_ibv_dereg_mr(mhandle));
       }
-      res = ncclSuccess;
-      goto returning;
+      return ncclSuccess;
     }
   }
   WARN("NET/IB: could not find mr %p inside cache of %d entries", mhandle, cache->population);
-  res = ncclInternalError;
-returning:
-  pthread_mutex_unlock(&ncclIbDevs[base->ibDevN].lock);
-  return res;
+  return ncclInternalError;
 }
 
 ncclResult_t ncclIbDeregMr(void* comm, void* mhandle) {
@@ -2567,6 +2579,11 @@ ncclResult_t ncclIbCloseListen(void* listenComm) {
   return ncclSuccess;
 }
 
+ncclResult_t ncclIbFinalize(void* ctx) {
+  netRefCount--;
+  return ncclSuccess;
+}
+
 ncclNet_t ncclNetIb = {
   "IB",
   ncclIbInit,
@@ -2587,7 +2604,9 @@ ncclNet_t ncclNetIb = {
   ncclIbCloseListen,
   NULL /* getDeviceMr */,
   NULL /* irecvConsumed */,
-  ncclIbMakeVDevice
+  ncclIbMakeVDevice,
+  ncclIbFinalize,
+  ncclIbSetNetAttr,
 };
 
 /*
diff --git a/projects/rccl/src/transport/net_socket.cc b/projects/rccl/src/transport/net_socket.cc
index 985810c47f..fa331aae2d 100644
--- a/projects/rccl/src/transport/net_socket.cc
+++ b/projects/rccl/src/transport/net_socket.cc
@@ -16,6 +16,8 @@
 #include <poll.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mutex>
+#include <condition_variable>
 
 /* Init functions */
 static int ncclNetIfs = -1;
@@ -26,7 +28,7 @@ struct ncclNetSocketDev {
 };
 static struct ncclNetSocketDev ncclNetSocketDevs[MAX_IFS];
 
-pthread_mutex_t ncclNetSocketLock = PTHREAD_MUTEX_INITIALIZER;
+static std::mutex ncclNetSocketMutex;
 
 static ncclResult_t ncclNetSocketGetPciPath(char* devName, char** pciPath) {
   char devicePath[PATH_MAX];
@@ -38,17 +40,24 @@ static ncclResult_t ncclNetSocketGetPciPath(char* devName, char** pciPath) {
 
 static ncclProfilerCallback_t ncclProfilerFunction;
 
-ncclResult_t ncclNetSocketInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) {
+// With ncclNet_v11_t the NCCL core initializes the network plugin per-communicator
+// rather than once for all communicators. However, the internal plugin implementation
+// still assumes the plugin is initialized only once across all communicators. The ref
+// counter makes sure the plugin internally initializes only once. When per communicator
+// context support is added to the plugin the ref counter can be removed.
+static int netRefCount;
+
+ncclResult_t ncclNetSocketInit(void** ctx, uint64_t commId, ncclNetCommConfig_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction) {
+  if (netRefCount++) return ncclSuccess;
   ncclProfilerFunction = profFunction;
   if (ncclNetIfs == -1) {
-    pthread_mutex_lock(&ncclNetSocketLock);
+    std::lock_guard<std::mutex> lock(ncclNetSocketMutex);
     if (ncclNetIfs == -1) {
       char names[MAX_IF_NAME_SIZE*MAX_IFS];
       union ncclSocketAddress addrs[MAX_IFS];
       NCCLCHECK(ncclFindInterfaces(names, addrs, MAX_IF_NAME_SIZE, MAX_IFS, &ncclNetIfs));
       if (ncclNetIfs <= 0) {
         WARN("NET/Socket : no interface found");
-        pthread_mutex_unlock(&ncclNetSocketLock);
         return ncclInternalError;
       } else {
         #define MAX_LINE_LEN (2047)
@@ -67,7 +76,6 @@ ncclResult_t ncclNetSocketInit(ncclDebugLogger_t logFunction, ncclProfilerCallba
         INFO(NCCL_INIT|NCCL_NET,"NET/Socket : Using%s", line);
       }
     }
-    pthread_mutex_unlock(&ncclNetSocketLock);
   }
   return ncclSuccess;
 }
@@ -116,6 +124,8 @@ ncclResult_t ncclNetSocketGetProperties(int dev, ncclNetProperties_t* props) {
   props->netDeviceType    = NCCL_NET_DEVICE_HOST;
   props->netDeviceVersion = NCCL_NET_DEVICE_INVALID_VERSION;
   props->maxP2pBytes = NCCL_MAX_NET_SIZE_BYTES;
+  props->maxCollBytes = MAX_COLLNET_SIZE;
+  props->maxMultiRequestSize = 1;
   return ncclSuccess;
 }
 
@@ -193,8 +203,8 @@ struct ncclNetSocketThreadResources {
   int stop;
   struct ncclNetSocketComm* comm;
   struct ncclProfilerInfo* pInfo;
-  pthread_mutex_t threadLock;
-  pthread_cond_t  threadCond;
+  std::mutex threadMutex;
+  std::condition_variable threadCond;
 };
 
 struct ncclNetSocketListenComm {
@@ -269,11 +279,8 @@ void* persistentSocketThread(void *args_) {
       } while (repeat);
     }
     if (idle) {
-      pthread_mutex_lock(&resource->threadLock);
-      while (mark == myQueue->next && resource->stop == 0) { // no new tasks, wait
-        pthread_cond_wait(&resource->threadCond, &resource->threadLock);
-      }
-      pthread_mutex_unlock(&resource->threadLock);
+      std::unique_lock<std::mutex> lock(resource->threadMutex);
+      resource->threadCond.wait(lock, [&] { return mark != myQueue->next || resource->stop; });
     }
     if (resource->stop) return NULL;
   }
@@ -335,7 +342,7 @@ fail:
   goto exit;
 }
 
-ncclResult_t ncclNetSocketListen(int dev, void* opaqueHandle, void** listenComm) {
+ncclResult_t ncclNetSocketListen(void* ctx, int dev, void* opaqueHandle, void** listenComm) {
   if (dev < 0 || dev >= ncclNetIfs) { // data transfer socket is based on specified dev
     WARN("NET/Socket : ncclNetSocketListen dev=%d ncclNetIfs=%d", dev, ncclNetIfs);
     return ncclInternalError;
@@ -364,7 +371,7 @@ fail:
 }
 
 #define SOCKET_CTRL_SIZE (sizeof(int))
-ncclResult_t ncclNetSocketConnect(int dev, ncclNetCommConfig_t* config, void* opaqueHandle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
+ncclResult_t ncclNetSocketConnect(void* ctx, int dev, void* opaqueHandle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
   if (dev < 0 || dev >= ncclNetIfs) { // data transfer socket is based on specified dev
     return ncclInternalError;
   }
@@ -380,7 +387,7 @@ ncclResult_t ncclNetSocketConnect(int dev, ncclNetCommConfig_t* config, void* op
   if (stage->state == ncclNetSocketCommStateConnect) goto socket_connect_check;
   if (stage->state == ncclNetSocketCommStateSend) goto socket_send;
 
-  NCCLCHECK(ncclCalloc(&comm, 1));
+  comm = new ncclNetSocketComm();
   stage->comm = comm;
   comm->nSocks = handle->nSocks;
   comm->nThreads = handle->nThreads;
@@ -422,7 +429,7 @@ ncclResult_t ncclNetSocketAccept(void* listenComm, void** recvComm, ncclNetDevic
   if (stage->state == ncclNetSocketCommStateAccept) goto socket_accept_check;
   if (stage->state == ncclNetSocketCommStateRecv) goto socket_recv;
 
-  NCCLCHECK(ncclCalloc(&rComm, 1));
+  rComm = new ncclNetSocketComm();
   stage->comm = rComm;
   rComm->nSocks = lComm->nSocks;
   rComm->nThreads = lComm->nThreads;
@@ -449,9 +456,9 @@ socket_recv:
     if (done == 0) return ncclSuccess;
 
     if (sendSockIdx == rComm->nSocks)
-      memcpy(&rComm->ctrlSock, sock, sizeof(struct ncclSocket));
+      rComm->ctrlSock = *sock;
     else
-      memcpy(rComm->socks+sendSockIdx, sock, sizeof(struct ncclSocket));
+      rComm->socks[sendSockIdx] = *sock;
     free(sock);
   }
   NCCLCHECK(ncclCalloc(&rComm->inlineData, MAX_REQUESTS * (SOCKET_CTRL_SIZE + ncclParamSocketInlineSize())));
@@ -501,8 +508,6 @@ ncclResult_t ncclNetSocketGetTask(struct ncclNetSocketComm* comm, struct ncclPro
 #ifdef NCCL_ENABLE_NET_PROFILING
     res->pInfo = pInfo;
 #endif
-    pthread_mutex_init(&res->threadLock, NULL);
-    pthread_cond_init(&res->threadCond, NULL);
     PTHREADCHECK(pthread_create(comm->helperThread+tid, NULL, persistentSocketThread, res), "pthread_create");
     ncclSetThreadName(comm->helperThread[tid], "NCCL Sock%c%1u%2u%2u", op == NCCL_SOCKET_SEND ? 'S' : 'R', comm->dev, tid, comm->cudaDev);
   }
@@ -517,10 +522,9 @@ ncclResult_t ncclNetSocketGetTask(struct ncclNetSocketComm* comm, struct ncclPro
     comm->nextSock = (comm->nextSock + 1) % comm->nSocks;
     r->used = 1;
     *req = r;
-    pthread_mutex_lock(&res->threadLock);
+    std::lock_guard<std::mutex> lock(res->threadMutex);
     queue->next = (queue->next+1)%queue->len;
-    pthread_cond_signal(&res->threadCond);
-    pthread_mutex_unlock(&res->threadLock);
+    res->threadCond.notify_one();
     return ncclSuccess;
   }
   WARN("NET/Socket : unable to allocate subtasks");
@@ -686,10 +690,11 @@ ncclResult_t ncclNetSocketClose(void* opaqueComm) {
     for (int i=0; i<comm->nThreads; i++) {
       struct ncclNetSocketThreadResources* res = comm->threadResources+i;
       if (comm->helperThread[i]) {
-        pthread_mutex_lock(&res->threadLock);
-        res->stop = 1;
-        pthread_cond_signal(&res->threadCond);
-        pthread_mutex_unlock(&res->threadLock);
+        {
+          std::lock_guard<std::mutex> lock(res->threadMutex);
+          res->stop = 1;
+          res->threadCond.notify_one();
+        }
         PTHREADCHECK(pthread_join(comm->helperThread[i], NULL), "pthread_join");
       }
       free(res->threadTaskQueue.tasks);
@@ -702,11 +707,16 @@ ncclResult_t ncclNetSocketClose(void* opaqueComm) {
       if (ready) NCCLCHECK(ncclSocketClose(&comm->socks[i]));
     }
     if(comm->inlineData) free(comm->inlineData);
-    free(comm);
+    delete comm;
   }
   return ncclSuccess;
 }
 
+ncclResult_t ncclNetSocketFinalize(void* ctx) {
+  netRefCount--;
+  return ncclSuccess;
+}
+
 ncclNet_t ncclNetSocket = {
   "Socket",
   ncclNetSocketInit,
@@ -727,5 +737,7 @@ ncclNet_t ncclNetSocket = {
   ncclNetSocketCloseListen,
   NULL /* getDeviceMr */,
   NULL /* irecvConsumed */,
-  NULL /* mergeDevices */
+  NULL /* mergeDevices */,
+  ncclNetSocketFinalize,
+  NULL /* setNetAttr */,
 };
diff --git a/projects/rccl/src/transport/nvls.cc b/projects/rccl/src/transport/nvls.cc
index da8d263f11..1f13bb01b7 100644
--- a/projects/rccl/src/transport/nvls.cc
+++ b/projects/rccl/src/transport/nvls.cc
@@ -48,7 +48,7 @@ struct ncclTransport nvlsTransport = {
   { NULL, NULL, nvlsRecvFree, NULL, NULL, NULL, NULL, NULL }
 };
 
-ncclResult_t nvlsGroupCreate(struct ncclComm *comm, CUmulticastObjectProp *prop, int rank, unsigned int nranks, CUmemGenericAllocationHandle *mcHandle, char *shareableHandle) {
+ncclResult_t ncclNvlsGroupCreate(struct ncclComm *comm, CUmulticastObjectProp *prop, int rank, unsigned int nranks, CUmemGenericAllocationHandle *mcHandle, char *shareableHandle) {
   CUmemAllocationHandleType type = ncclCuMemHandleType;
   size_t size = prop->size;
 
@@ -70,7 +70,7 @@ ncclResult_t nvlsGroupCreate(struct ncclComm *comm, CUmulticastObjectProp *prop,
   return ncclSuccess;
 }
 
-ncclResult_t nvlsGroupConnect(struct ncclComm *comm, char *shareableHandle, int rank, CUmemGenericAllocationHandle *mcHandle) {
+ncclResult_t ncclNvlsGroupConnect(struct ncclComm *comm, char *shareableHandle, int rank, CUmemGenericAllocationHandle *mcHandle) {
   CUmemAllocationHandleType type = ncclCuMemHandleType;
   int fd = -1;
   ncclResult_t ret = ncclSuccess;
@@ -205,7 +205,7 @@ ncclResult_t ncclNvlsInit(struct ncclComm* comm) {
 ncclResult_t ncclNvlsTreeConnect(struct ncclComm* comm) {
   ncclResult_t ret = ncclSuccess;
   if (comm && comm->nvlsSupport && comm->nNodes > 1) {
-    for (int c = 0; c < comm->nChannels; c++) {
+    for (int c = 0; c < comm->nvlsChannels; c++) {
       struct ncclChannel* channel = comm->channels + c;
       NCCLCHECKGOTO(ncclTransportP2pConnect(comm, c, NCCL_MAX_NVLS_TREE_ARITY, channel->nvls.treeDown, 1, &channel->nvls.treeUp, 0), ret, fail);
       NCCLCHECKGOTO(ncclTransportP2pConnect(comm, c, 1, &channel->nvls.treeUp, NCCL_MAX_NVLS_TREE_ARITY, channel->nvls.treeDown, 0), ret, fail);
@@ -242,12 +242,12 @@ static ncclResult_t nvlsAllocateMem(struct ncclComm* comm, const CUmemAccessDesc
   mcprop.size = mcsize;
 
   if (comm->localRank == 0) {
-    NCCLCHECKGOTO(nvlsGroupCreate(comm, &mcprop, comm->localRank, comm->localRanks, mcHandle, shareableHandle), ret, fail);
+    NCCLCHECKGOTO(ncclNvlsGroupCreate(comm, &mcprop, comm->localRank, comm->localRanks, mcHandle, shareableHandle), ret, fail);
     allocMcHandle = 1;
     NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
   } else {
     NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
-    NCCLCHECKGOTO(nvlsGroupConnect(comm, shareableHandle, comm->localRankToRank[0], mcHandle), ret, fail);
+    NCCLCHECKGOTO(ncclNvlsGroupConnect(comm, shareableHandle, comm->localRankToRank[0], mcHandle), ret, fail);
     allocMcHandle = 1;
   }
 
@@ -330,7 +330,7 @@ ncclResult_t ncclNvlsBufferSetup(struct ncclComm* comm) {
   nHeads = comm->channels[0].nvls.nHeads;
   headRank = comm->channels[0].nvls.headRank;
   resources = comm->nvlsResources;
-  nChannels = comm->nvlsResources->nChannels;
+  nChannels = comm->nvlsChannels;
   nvlsStepSize = comm->nvlsChunkSize;
   buffSize = nvlsStepSize * NCCL_STEPS;
   nvlsPerRankSize = nChannels * 2 * buffSize;
@@ -391,7 +391,7 @@ ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent) {
   if (nvlsShare) {
     /* reuse NVLS resources */
     comm->nvlsChannels = std::min(comm->nvlsChannels, parent->nvlsResources->nChannels);
-    for (int c = 0; c < comm->nChannels; c++) {
+    for (int c = 0; c < comm->nvlsChannels; c++) {
       NCCLCHECKGOTO(initNvlsChannel(comm, c, parent, true), res, fail);
     }
 
@@ -400,7 +400,7 @@ ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent) {
   } else {
     struct ncclNvlsSharedRes* resources = NULL;
     int nHeads = comm->channels[0].nvls.nHeads;
-    int nChannels = comm->nChannels;
+    int nChannels = comm->nvlsChannels;
     size_t memSize = 64;
     size_t creditSize = nChannels * 2 * memSize * nHeads;
     int nvlsStepSize = comm->nvlsChunkSize;
@@ -420,7 +420,7 @@ ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent) {
     }
     comm->nvlsResources->nChannels = comm->nvlsChannels;
 
-    for (int c = 0; c < comm->nChannels; c++) {
+    for (int c = 0; c < nChannels; c++) {
       NCCLCHECKGOTO(initNvlsChannel(comm, c, NULL, false), res, fail);
     }
 
@@ -486,21 +486,21 @@ ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent) {
   // MNNVL does not support NVLS buffer registration
   if (!comm->MNNVL && comm->nvlsResources->nvlsShmemHandle == NULL) {
     /* create shared memory for fast NVLS buffer registration */
-    typeSize = sizeof(struct localRegData) << 1;
+    typeSize = DIVUP(sizeof(struct localRegData) << 1, CACHE_LINE_SIZE) * CACHE_LINE_SIZE;
 
     if (comm->localRank == 0) {
       shmPath[0] = '\0';
-      NCCLCHECKGOTO(ncclShmOpen(shmPath, sizeof(shmPath), (sizeof(size_t) + typeSize * comm->localRanks) * 2, (void**)&nvlsShmem, NULL, comm->localRanks - 1, &comm->nvlsResources->nvlsShmemHandle), res, fail);
+      NCCLCHECKGOTO(ncclShmOpen(shmPath, sizeof(shmPath), (CACHE_LINE_SIZE * comm->localRanks + typeSize * comm->localRanks) * 2, (void**)&nvlsShmem, NULL, comm->localRanks - 1, &comm->nvlsResources->nvlsShmemHandle), res, fail);
       NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shmPath, sizeof(shmPath)), res, fail);
     } else {
       NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shmPath, sizeof(shmPath)), res, fail);
-      NCCLCHECKGOTO(ncclShmOpen(shmPath, sizeof(shmPath), (sizeof(size_t) + typeSize * comm->localRanks) * 2, (void**)&nvlsShmem, NULL, -1, &comm->nvlsResources->nvlsShmemHandle), res, fail);
+      NCCLCHECKGOTO(ncclShmOpen(shmPath, sizeof(shmPath), (CACHE_LINE_SIZE * comm->localRanks + typeSize * comm->localRanks) * 2, (void**)&nvlsShmem, NULL, -1, &comm->nvlsResources->nvlsShmemHandle), res, fail);
     }
     /* need 2 pools and a shared counter for shmem-based collectives */
     comm->nvlsResources->nvlsShmem.cnt[0] = (size_t*)nvlsShmem;
-    comm->nvlsResources->nvlsShmem.ptr[0] = (void*)((char*)comm->nvlsResources->nvlsShmem.cnt[0] + sizeof(size_t));
+    comm->nvlsResources->nvlsShmem.ptr[0] = (void*)((char*)comm->nvlsResources->nvlsShmem.cnt[0] + CACHE_LINE_SIZE * comm->localRanks);
     comm->nvlsResources->nvlsShmem.cnt[1] = (size_t*)((char*)comm->nvlsResources->nvlsShmem.ptr[0] + typeSize * comm->localRanks);
-    comm->nvlsResources->nvlsShmem.ptr[1] = (void*)((char*)comm->nvlsResources->nvlsShmem.cnt[1] + sizeof(size_t));
+    comm->nvlsResources->nvlsShmem.ptr[1] = (void*)((char*)comm->nvlsResources->nvlsShmem.cnt[1] + CACHE_LINE_SIZE * comm->localRanks);
     comm->nvlsResources->nvlsShmem.round = 0;
     comm->nvlsResources->nvlsShmem.maxTypeSize = typeSize;
   }
@@ -607,11 +607,11 @@ ncclResult_t tryRegisterBuffer(struct ncclComm *comm, uintptr_t userBuff, size_t
   mcprop.size = mcsize;
 
   if (comm->localRank == 0) {
-    NCCLCHECKGOTO(nvlsGroupCreate(comm, &mcprop, comm->localRank, comm->localRanks, &mcHandle, shareableHandle), ret, fail);
+    NCCLCHECKGOTO(ncclNvlsGroupCreate(comm, &mcprop, comm->localRank, comm->localRanks, &mcHandle, shareableHandle), ret, fail);
     NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
   } else {
     NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
-    NCCLCHECKGOTO(nvlsGroupConnect(comm, shareableHandle, comm->localRankToRank[0], &mcHandle), ret, fail);
+    NCCLCHECKGOTO(ncclNvlsGroupConnect(comm, shareableHandle, comm->localRankToRank[0], &mcHandle), ret, fail);
   }
 
   CUCHECKGOTO(cuMulticastAddDevice(mcHandle, comm->nvlsResources->dev), ret, fail);
@@ -751,17 +751,36 @@ ncclResult_t ncclNvlsLocalRegisterBuffer(struct ncclComm *comm, const void *send
   struct ncclReg *recvRegRecord = NULL;
   bool sendIsValid = false;
   bool recvIsValid = false;
+  void *baseSend = NULL;
+  void *baseRecv = NULL;
+  size_t baseSendSize = 0;
+  size_t baseRecvSize = 0;
 
   *outRegBufUsed = 0;
   if (sendbuff) {
     NCCLCHECK(ncclRegFind(comm, sendbuff, sendbuffSize, &sendRegRecord));
     NCCLCHECK(ncclRegLocalIsValid(sendRegRecord, &sendIsValid));
+    if (sendIsValid) {
+      CUCHECK(cuMemGetAddressRange((CUdeviceptr *)&baseSend, &baseSendSize, (CUdeviceptr)sendbuff));
+      if ((uint64_t)baseSend + baseSendSize < (uint64_t)sendbuff + sendbuffSize) {
+        // The buffer extends past the end of its base allocation, so it spans multiple physical memory regions; fall back to the non-UB path.
+        goto exit;
+      }
+    }
   } else {
     sendIsValid = true;
   }
+
   if (recvbuff) {
     NCCLCHECK(ncclRegFind(comm, recvbuff, recvbuffSize, &recvRegRecord));
     NCCLCHECK(ncclRegLocalIsValid(recvRegRecord, &recvIsValid));
+    if (recvIsValid) {
+      CUCHECK(cuMemGetAddressRange((CUdeviceptr *)&baseRecv, &baseRecvSize, (CUdeviceptr)recvbuff));
+      if ((uint64_t)baseRecv + baseRecvSize < (uint64_t)recvbuff + recvbuffSize) {
+        // The buffer extends past the end of its base allocation, so it spans multiple physical memory regions; fall back to the non-UB path.
+        goto exit;
+      }
+    }
   } else {
     recvIsValid = true;
   }
@@ -769,6 +788,7 @@ ncclResult_t ncclNvlsLocalRegisterBuffer(struct ncclComm *comm, const void *send
   if (sendIsValid && recvIsValid)
     NCCLCHECK(nvlsRegisterBuffer(comm, sendbuff, recvbuff, sendbuffSize, recvbuffSize, sendRegRecord, recvRegRecord, outRegBufUsed, outRegBufSend, outRegBufRecv));
 
+exit:
   return ncclSuccess;
 }
 
@@ -802,11 +822,19 @@ ncclResult_t ncclNvlsGraphRegisterBuffer(
   *outRegBufUsed = 0;
   if (sendbuff) {
     CUCHECK(cuMemGetAddressRange((CUdeviceptr *)&baseSend, &baseSendSize, (CUdeviceptr)sendbuff));
+    if ((uint64_t)baseSend + baseSendSize < (uint64_t)sendbuff + sendbuffSize) {
+      // The buffer extends past the end of its base allocation, so it spans multiple physical memory regions; fall back to the non-UB path.
+      goto exit;
+    }
     NCCLCHECK(ncclCommGraphRegister(comm, baseSend, baseSendSize, (void**)&sendRegRecord));
   }
 
   if (recvbuff) {
     CUCHECK(cuMemGetAddressRange((CUdeviceptr *)&baseRecv, &baseRecvSize, (CUdeviceptr)recvbuff));
+    if ((uint64_t)baseRecv + baseRecvSize < (uint64_t)recvbuff + recvbuffSize) {
+      // The buffer extends past the end of its base allocation, so it spans multiple physical memory regions; fall back to the non-UB path.
+      goto exit;
+    }
     NCCLCHECK(ncclCommGraphRegister(comm, baseRecv, baseRecvSize, (void**)&recvRegRecord));
   }
 
@@ -835,84 +863,10 @@ ncclResult_t ncclNvlsGraphRegisterBuffer(
     if (recvbuff) NCCLCHECK(ncclCommGraphDeregister(comm, recvRegRecord));
   }
 
+exit:
   return ncclSuccess;
 }
 
-ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm) {
-  ncclResult_t ret = ncclSuccess;
-  if (comm && comm->nvlsSupport) {
-    CUmulticastObjectProp mcprop = {};
-    CUmemGenericAllocationHandle mcHandle;
-    char shareableHandle[NVLS_HANDLE_SIZE];
-    CUmemAccessDesc accessDesc = {};
-
-    mcprop.numDevices = comm->localRanks;
-    mcprop.handleTypes = ncclCuMemHandleType;
-    mcprop.flags = 0;
-    mcprop.size = comm->baseStride;
-
-    if (comm->localRank == 0) {
-      NCCLCHECKGOTO(nvlsGroupCreate(comm, &mcprop, comm->localRank, comm->localRanks, &mcHandle, shareableHandle), ret, fail);
-      NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
-    } else {
-      NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
-      NCCLCHECKGOTO(nvlsGroupConnect(comm, shareableHandle, comm->localRankToRank[0], &mcHandle), ret, fail);
-    }
-
-    CUCHECKGOTO(cuMulticastAddDevice(mcHandle, comm->cudaDev), ret, fail);
-    CUCHECKGOTO(cuMemAddressReserve((CUdeviceptr*)&comm->baseMCSymPtr, comm->baseStride, NCCL_MAX_PAGE_SIZE, 0, 0), ret, fail);
-    CUCHECKGOTO(cuMemMap((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride, 0, mcHandle, 0), ret, fail);
-    accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
-    accessDesc.location.id = comm->cudaDev;
-    accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
-    CUCHECKGOTO(cuMemSetAccess((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride, &accessDesc, 1), ret, fail);
-    comm->symMCHandle = mcHandle;
-  }
-exit:
-  return ret;
-fail:
-  goto exit;
-}
-
-ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm) {
-  ncclResult_t ret = ncclSuccess;
-  if (comm && comm->nvlsSupport && comm->baseMCSymPtr) {
-    CUCHECKGOTO(cuMemUnmap((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride), ret, fail);
-    CUCHECKGOTO(cuMemAddressFree((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride), ret, fail);
-    CUCHECKGOTO(cuMemRelease(comm->symMCHandle), ret, fail);
-  }
-exit:
-  return ret;
-fail:
-  goto exit;
-}
-
-ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr) {
-  ncclResult_t ret = ncclSuccess;
-  assert((uintptr_t)ucaddr % NCCL_REC_PAGE_SIZE == 0 && ucsize % NCCL_REC_PAGE_SIZE == 0);
-  if (comm && comm->nvlsSupport && ucaddr && ucsize > 0) {
-    CUCHECKGOTO(cuMulticastBindAddr(comm->symMCHandle, offset, (CUdeviceptr)ucaddr, ucsize, 0), ret, fail);
-    INFO(NCCL_ALLOC, "NVLS symmetric alloc mc buffer ptr %p offset %ld UC addr %p UC size %ld symAllocHead %ld", comm->baseMCSymPtr + offset, offset, ucaddr, ucsize, comm->symAllocHead);
-  }
-
-exit:
-  return ret;
-fail:
-  goto exit;
-}
-
-ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr) {
-  ncclResult_t ret = ncclSuccess;
-  if (comm && comm->nvlsSupport && ucaddr && ucsize > 0) {
-    size_t offset = (size_t)ucaddr - ((size_t)comm->baseUCSymPtr + comm->localRank * comm->baseStride);
-    CUCHECKGOTO(cuMulticastUnbind(comm->symMCHandle, comm->cudaDev, offset, ucsize), ret, fail);
-  }
-exit:
-  return ret;
-fail:
-  goto exit;
-}
-
 ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels) {
   int factor;
   ncclResult_t ret = ncclSuccess;
diff --git a/projects/rccl/src/transport/p2p.cc b/projects/rccl/src/transport/p2p.cc
index d263dda3a5..d9fd01da0a 100644
--- a/projects/rccl/src/transport/p2p.cc
+++ b/projects/rccl/src/transport/p2p.cc
@@ -955,6 +955,8 @@ ncclResult_t ncclIpcLocalRegisterBuffer(ncclComm* comm, const void* userbuff, si
   ncclResult_t ret = ncclSuccess;
   struct ncclReg *regRecord = NULL;
   bool isValid = false;
+  void *baseAddr = NULL;
+  size_t baseSize = 0;
 
   *regBufFlag = 0;
   *offsetOut = 0;
@@ -962,8 +964,11 @@ ncclResult_t ncclIpcLocalRegisterBuffer(ncclComm* comm, const void* userbuff, si
   if (comm && userbuff && buffSize > 0 && nPeers > 0) {
     NCCLCHECKGOTO(ncclRegFind(comm, userbuff, buffSize, &regRecord), ret, fail);
     NCCLCHECKGOTO(ncclRegLocalIsValid(regRecord, &isValid), ret, fail);
-    if (isValid)
+    if (isValid) {
+      CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr *)&baseAddr, &baseSize, (CUdeviceptr)userbuff), ret, fail);
+      if ((uint64_t)baseAddr + baseSize < (uint64_t)userbuff + buffSize) goto exit;
       NCCLCHECKGOTO(ipcRegisterBuffer(comm, userbuff, buffSize, peerRanks, nPeers, type, regRecord, regBufFlag, offsetOut, peerRmtAddrsOut, NULL), ret, fail);
+    }
   }
 
 exit:
@@ -1001,6 +1006,7 @@ ncclResult_t ncclIpcGraphRegisterBuffer(ncclComm* comm, const void* userbuff, si
   *peerRmtAddrsOut = NULL;
   if (comm && userbuff && buffSize > 0 && nPeers > 0) {
     CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr*)&baseAddr, &baseSize, (CUdeviceptr)userbuff), ret, fail);
+    if ((uint64_t)baseAddr + baseSize < (uint64_t)userbuff + buffSize) goto exit;
     NCCLCHECKGOTO(ncclCommGraphRegister(comm, baseAddr, baseSize, (void**)&regRecord), ret, fail);
     NCCLCHECKGOTO(ipcRegisterBuffer(comm, userbuff, buffSize, peerRanks, nPeers, type, regRecord, regBufFlag, offsetOut, peerRmtAddrsOut, &isLegacyIpc), ret, fail);
     if (*regBufFlag) {
@@ -1118,88 +1124,6 @@ fail:
   goto exit;
 }
 
-ncclResult_t ncclIpcSymmetricInit(struct ncclComm* comm) {
-  CUCHECK(cuMemAddressReserve((CUdeviceptr*)&comm->baseUCSymPtr, comm->baseStride * comm->localRanks, NCCL_MAX_PAGE_SIZE, 0, 0));
-  return ncclSuccess;
-}
-
-ncclResult_t ncclIpcSymmetricFinalize(struct ncclComm* comm) {
-  if (comm->baseUCSymPtr) {
-    CUCHECK(cuMemAddressFree((CUdeviceptr)comm->baseUCSymPtr, comm->baseStride * comm->localRanks));
-  }
-  return ncclSuccess;
-}
-
-ncclResult_t ncclIpcSymmetricMap(struct ncclComm* comm, size_t offset, size_t size, CUmemGenericAllocationHandle memHandle, void** symPtr) {
-  ncclResult_t ret = ncclSuccess;
-  CUmemGenericAllocationHandle impHandle;
-  int impFd = -1;
-  ncclCuDesc* desc = NULL;
-  CUmemAccessDesc accessDesc = {};
-
-  assert(offset % NCCL_REC_PAGE_SIZE == 0 && size % NCCL_REC_PAGE_SIZE == 0);
-  NCCLCHECKGOTO(ncclCalloc(&desc, comm->localRanks), ret, fail);
-  if (ncclCuMemHandleType == CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) {
-    memcpy(&desc[comm->localRank].data, &memHandle, sizeof(CUmemGenericAllocationHandle));
-  } else {
-    CUCHECKGOTO(cuMemExportToShareableHandle(&desc[comm->localRank].handle, memHandle, ncclCuMemHandleType, 0), ret, fail);
-  }
-
-  NCCLCHECKGOTO(bootstrapIntraNodeAllGather(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, desc, sizeof(ncclCuDesc)), ret, fail);
-
-  // start mapping
-  accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
-  accessDesc.location.id = comm->cudaDev;
-  accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
-  for (int r = 0; r < comm->localRanks; ++r) {
-    CUdeviceptr maddr;
-    if (r == comm->localRank) {
-      impHandle = memHandle;
-    } else {
-      if (ncclCuMemHandleType == CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) {
-        impFd = -1;
-        NCCLCHECKGOTO(ncclProxyClientGetFdBlocking(comm, comm->localRankToRank[r], &desc[r].data, &impFd), ret, fail);
-        CUCHECKGOTO(cuMemImportFromShareableHandle(&impHandle, (void*)(uintptr_t)impFd, ncclCuMemHandleType), ret, fail);
-        SYSCHECKGOTO(close(impFd), "close", ret, fail);
-      } else {
-        CUCHECKGOTO(cuMemImportFromShareableHandle(&impHandle, (void*)&desc[r].handle, ncclCuMemHandleType), ret, fail);
-      }
-    }
-    maddr = (CUdeviceptr)(comm->baseUCSymPtr + (size_t)r * comm->baseStride + offset);
-    CUCHECKGOTO(cuMemMap(maddr, size, 0, impHandle, 0), ret, fail);
-    CUCHECKGOTO(cuMemSetAccess(maddr, size, &accessDesc, 1), ret, fail);
-
-    if (r == comm->localRank) {
-      *symPtr = (void*)maddr;
-    } else {
-      CUCHECKGOTO(cuMemRelease(impHandle), ret, fail);
-    }
-  }
-
-  INFO(NCCL_ALLOC, "IPC symmetric alloc buffer %p offset %ld size %ld symAllocHead %ld", *symPtr, offset, size, comm->symAllocHead);
-
-exit:
-  free(desc);
-  return ret;
-fail:
-  goto exit;
-}
-
-ncclResult_t ncclIpcSymmetricFree(struct ncclComm* comm, size_t size, void* symPtr) {
-  ncclResult_t ret = ncclSuccess;
-  if (comm && symPtr && size > 0) {
-    size_t offset = (size_t)symPtr - ((size_t)comm->baseUCSymPtr + comm->localRank * comm->baseStride);
-    for (int r = 0; r < comm->localRanks; ++r) {
-      CUdeviceptr peerAddr = (CUdeviceptr)(comm->baseUCSymPtr + r * comm->baseStride + offset);
-      CUCHECKGOTO(cuMemUnmap(peerAddr, size), ret, fail);
-    }
-  }
-exit:
-  return ret;
-fail:
-  goto exit;
-}
-
 struct ncclTransport p2pTransport = {
   "P2P",
   p2pCanConnect,
diff --git a/projects/rccl/src/transport/profiler.cc b/projects/rccl/src/transport/profiler.cc
index 6e7b33c168..354fa57bba 100644
--- a/projects/rccl/src/transport/profiler.cc
+++ b/projects/rccl/src/transport/profiler.cc
@@ -10,7 +10,7 @@
 
 static ncclResult_t profilerProxyConnect(struct ncclProxyConnection* connection, struct ncclProxyState* proxyState, void* reqBuff, int reqSize, void* respBuff, int respSize, int* done) {
   connection->proxyAppendPtr = &connection->proxyAppend;
-  connection->shared = 1;
+  connection->shared = 0;
   return ncclSuccess;
 }