Device API (Experimental)
* Introduces device-side APIs to integrate NCCL communication directly into application kernels.
* Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
* Supports Multimem for hardware multicast using NVLink SHARP.
* Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
* Introduces device communicators created using ncclDevCommCreate.
* Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
* Experimental APIs: signatures and functionality may evolve in future releases.
* No ABI compatibility is guaranteed; applications must be recompiled with each new NCCL release.

Symmetric memory improvements
* Support for aggregating symmetric operations using ncclGroupStart/End APIs.
* Reimplement symmetric kernels using device API.

New Host APIs
* Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.

CE (Copy Engine) Collectives
* Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
* Free up SM capacity for the application to do computation at the same time.
* To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t.

NCCL Inspector Plugin
* Introduces an Inspector plugin for always-on performance monitoring.
* Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
* Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
* Lightweight to enable via environment variables NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE.

CMake support (Experimental)
* Adds a CMake build system as an alternative to existing Makefiles.
* Known issues: pkg.build and Device API currently do not work with CMake.
* The known issues will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
* SM overhead is decreased by 50% with this improvement.
* This may cause some perf drop on Blackwell because of the reduced SM usage.
* If the extra SM capacity is not desired, two options are available to restore the previous behavior: 1) setting the NCCL_MIN_CTAS=32 NCCL_MAX_CTAS=32 environment variables; 2) setting the communicator config to override the max CTA count to 32.
* Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.

Plugins
* Network
  * App-aware Network plugin. NCCL passes information about communication operations to be executed on the network end point. This allows for better tuning of network end points and their use in the plugins.
  * Improve handling of physical and virtual network devices and load/unload.
  * Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
  * Add Multi-Request Net API. Using this will help NCCL to anticipate multiple send/recv requests and optimize for it. See maxMultiRequestSize field in ncclNetProperties_v11_t.
* Profiler
  * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
  * Add Inspector Profiler Plugin (see section above).
  * Add a hook to Google’s CoMMA profiler on GitHub.
* Tuner
  * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
  * Add NVL Domain Information API.
* Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:
* Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
* Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
* Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as an opt-in.
* Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
* Enable PxN over C2C by default
  * PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
  * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
* Allow FP8 support for non-reductive operations on pre-sm90 devices. (See https://github.com/pytorch/pytorch/pull/151594#discussion_r2135777776)
* Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
* Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (https://github.com/NVIDIA/nccl/issues/1798)
* Modernize mutex management. Convert to std::mutex and std::lock_guard.
* Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
* Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
* NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
* Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
* Move libvirt bridge interfaces to the end of the list of possible interfaces so that they are considered last. These interfaces are usually virtual bridges that relay traffic to containers running on the host; they cannot carry traffic to a remote node and are therefore unsuitable.

NCCL Example Tuner Plugin
This example plugin demonstrates a CSV file-based tuning approach that selectively overrides tuning parameters based on all tuning inputs, without recompiling.
Features
- File-based Configuration: Read tuning parameters from a CSV configuration file
- Size-based Tuning: Specify different configurations based on message size ranges
- Dimension-aware Tuning: Match configurations based on number of nodes and ranks
- Optional Channels Configuration: Set specific channel counts or use -1 to keep NCCL's default
- Environment Variable Support: Specify config file location via NCCL_TUNER_CONFIG_FILE
- Fallback Behavior: Gracefully handles missing config files and invalid entries
Building
make
This will create libnccl-tuner-example.so that can be loaded by NCCL.
Configuration File Format
The configuration file uses CSV (Comma-Separated Values) format with one configuration per line:
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
Parameters
- collective_type: The collective operation type
  - broadcast, reduce, allgather, reducescatter, allreduce
- min_bytes/max_bytes: The message size range (in bytes) for which this config applies
  - Use 0 for minimum and 4294967295 for maximum (covers all sizes)
- algorithm: The NCCL algorithm to use
  - tree, ring, collnet_direct, collnet_chain, nvls, nvls_tree, pat
- protocol: The NCCL protocol to use
  - ll, ll128, simple
- channels: Number of channels (SMs) to use
  - Use a positive integer to specify exact channel count
  - Use -1 to keep NCCL's default channel selection
- nNodes: Number of nodes to match
  - Use a positive integer to match specific node count
  - Use -1 to match any number of nodes
- nRanks: Number of ranks to match
  - Use a positive integer to match specific rank count
  - Use -1 to match any number of ranks
- numPipeOps: Number of pipeline operations to match (optional)
  - Use a positive integer to match specific pipeline operation count
  - Use -1 to match any number of pipeline operations
  - If omitted, configuration will match any numPipeOps value
- regBuff: Whether user buffer can be registered (optional)
  - Use 0 to match only non-registered buffers
  - Use 1 to match only registered buffers
  - Use -1 to match either registered or non-registered buffers
  - If omitted, configuration will match any regBuff value
Example Configuration
# Single-node, small allreduce: use tree algorithm, registered buffers only
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# 4-node, 32-rank setup: medium allreduce, single pipeline op, non-registered buffers
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# Any topology: large allreduce with LL128, exactly 4 pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Single-node broadcast: prefer tree, any pipeOps, any buffer type (8-field backward-compatible form)
broadcast,0,32768,tree,simple,-1,1,-1
# Any-topology broadcast (nNodes=-1): optimized for non-registered buffers, single pipeline op
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
Comments start with # and empty lines are ignored. The CSV format makes it easy to edit configurations in spreadsheet applications like Excel, Google Sheets, or LibreOffice Calc.
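The comment and blank-line rules above can be sketched in Python as follows. This is only an illustration of the file format, not the plugin's own parser; the function name read_config_rows is hypothetical.

```python
from typing import List


def read_config_rows(text: str) -> List[List[str]]:
    """Split tuner config text into rows of fields.

    Lines starting with '#' and empty lines are skipped, as described above.
    """
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(line.split(","))
    return rows


sample = """\
# Single-node, small allreduce
allreduce,0,65536,tree,simple,2,1,-1,-1,1

broadcast,0,32768,tree,simple,-1,1,-1
"""
rows = read_config_rows(sample)
print(len(rows), len(rows[1]))  # prints: 2 8
```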
Backward Compatibility
Configurations without the numPipeOps and/or regBuff parameters are fully supported:
- 8 fields: matches any numPipeOps and regBuff values
- 9 fields: matches any regBuff value
- 10 fields: full parameter specification
This ensures existing configuration files continue to work without modification.
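One way to picture the 8/9/10-field rule is a parser that fills the omitted trailing fields with the wildcard value -1. The sketch below assumes that an omitted field behaves exactly like -1; TunerEntry and parse_entry are hypothetical names, not part of the plugin.

```python
from dataclasses import dataclass


@dataclass
class TunerEntry:
    collective: str
    min_bytes: int
    max_bytes: int
    algorithm: str
    protocol: str
    channels: int
    n_nodes: int
    n_ranks: int
    num_pipe_ops: int = -1  # assumption: omitted field == match any
    reg_buff: int = -1      # assumption: omitted field == match any


def parse_entry(line: str) -> TunerEntry:
    """Parse one CSV line with 8, 9, or 10 fields."""
    fields = line.split(",")
    if not 8 <= len(fields) <= 10:
        raise ValueError(f"expected 8 to 10 fields, got {len(fields)}")
    collective, min_bytes, max_bytes, algorithm, protocol = fields[:5]
    # Remaining fields are integers: channels, nNodes, nRanks,
    # then optionally numPipeOps and regBuff.
    ints = [int(f) for f in fields[5:]]
    return TunerEntry(collective, int(min_bytes), int(max_bytes),
                      algorithm, protocol, *ints)


legacy = parse_entry("broadcast,0,32768,tree,simple,-1,1,-1")
full = parse_entry("allreduce,65537,1048576,ring,simple,4,4,32,1,0")
print(legacy.num_pipe_ops, legacy.reg_buff, full.num_pipe_ops, full.reg_buff)
```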
Usage
Method 1: Default Config File
Place your configuration in nccl_tuner.conf in the current working directory.
Method 2: Environment Variable
Set the NCCL_TUNER_CONFIG_FILE environment variable to specify the config file path:
export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
mpirun -np 4 your_nccl_application
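The lookup order implied by the two methods can be sketched as below. It assumes the environment variable takes precedence over the default file and that an empty value is treated as unset; both are assumptions about the plugin, and resolve_config_path is a hypothetical helper.

```python
import os

DEFAULT_CONFIG = "nccl_tuner.conf"  # looked up in the current working directory


def resolve_config_path(environ=os.environ) -> str:
    """Return the config path: NCCL_TUNER_CONFIG_FILE if set, else the default."""
    return environ.get("NCCL_TUNER_CONFIG_FILE") or DEFAULT_CONFIG


print(resolve_config_path({}))
print(resolve_config_path({"NCCL_TUNER_CONFIG_FILE": "/path/to/your/tuner.conf"}))
```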
Editing Configuration Files
Generating Configuration Files from Raw Data
A Python script, optimize_config.py, is provided to generate valid CSV configs.
Spreadsheet Tips:
- Use column headers: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
- Save as CSV format (not Excel format) for the plugin to read
- Use data validation to prevent typos in algorithm/protocol names
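Outside a spreadsheet, the same typo check can be done with a short script before deploying a config. The sketch below validates names against the values listed under Parameters; check_row is a hypothetical helper and is unrelated to optimize_config.py.

```python
VALID_COLLECTIVES = {"broadcast", "reduce", "allgather", "reducescatter", "allreduce"}
VALID_ALGORITHMS = {"tree", "ring", "collnet_direct", "collnet_chain",
                    "nvls", "nvls_tree", "pat"}
VALID_PROTOCOLS = {"ll", "ll128", "simple"}


def check_row(fields):
    """Return a list of human-readable problems found in one CSV row."""
    if not 8 <= len(fields) <= 10:
        return [f"expected 8 to 10 fields, got {len(fields)}"]
    problems = []
    collective, _min_bytes, _max_bytes, algorithm, protocol = fields[:5]
    if collective not in VALID_COLLECTIVES:
        problems.append(f"unknown collective_type: {collective}")
    if algorithm not in VALID_ALGORITHMS:
        problems.append(f"unknown algorithm: {algorithm}")
    if protocol not in VALID_PROTOCOLS:
        problems.append(f"unknown protocol: {protocol}")
    return problems


good = check_row("allreduce,0,65536,tree,simple,2,1,-1,-1,1".split(","))
bad = check_row("allreduce,0,65536,rng,simple,2,1,-1".split(","))
print(good, bad)
```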
Logging
The plugin uses NCCL's logging system. To see tuner-related messages:
export NCCL_DEBUG=INFO
This will show when configurations are loaded and applied, including the topology information.
For detailed debugging output during tuning decisions:
export NCCL_DEBUG=TRACE
This will show verbose information about which configurations are being evaluated and matched.
Dimension Matching
Configurations are only applied when the topology matches:
- Exact Match: Configuration specifies nNodes=4,nRanks=32, only applied when communicator has exactly 4 nodes and 32 ranks
- Wildcard Nodes: Configuration specifies nNodes=-1,nRanks=8, applied to any topology with exactly 8 ranks
- Wildcard Ranks: Configuration specifies nNodes=2,nRanks=-1, applied to any 2-node topology regardless of ranks per node
- Wildcard Both: Configuration specifies nNodes=-1,nRanks=-1, applied to any topology
This allows you to create specialized configurations for different cluster setups while maintaining flexibility.
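The four cases above reduce to one rule: a dimension matches when the configured value is -1 or equals the communicator's value. A minimal sketch, with hypothetical function names:

```python
def dim_matches(configured: int, actual: int) -> bool:
    """A configured value of -1 is a wildcard; otherwise it must match exactly."""
    return configured == -1 or configured == actual


def topology_matches(cfg_nodes: int, cfg_ranks: int,
                     n_nodes: int, n_ranks: int) -> bool:
    """Both nNodes and nRanks must match for a configuration to apply."""
    return dim_matches(cfg_nodes, n_nodes) and dim_matches(cfg_ranks, n_ranks)


# Communicator with 4 nodes and 32 ranks.
print(topology_matches(4, 32, 4, 32))    # exact match
print(topology_matches(-1, 8, 4, 32))    # wildcard nodes, but ranks differ
print(topology_matches(-1, -1, 4, 32))   # wildcard both
```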
Default Behavior
If no configuration file is found or no matching configuration exists for a collective operation, the plugin falls back to preferring the ring algorithm with simple protocol. All configured algorithm/protocol combinations are given a low cost (0.0) to make them preferred by NCCL's selection logic.
When channels is set to -1, NCCL's default channel selection logic is preserved, allowing the system to automatically determine the optimal number of channels based on hardware and message size.
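The fallback can be pictured with the simplified selection below. It is a sketch under several assumptions that the text does not confirm: the first matching entry wins, size bounds are inclusive, and topology, numPipeOps, and regBuff matching are left out for brevity. It does not model NCCL's cost table; Entry and choose are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Entry(NamedTuple):
    collective: str
    min_bytes: int
    max_bytes: int
    algorithm: str
    protocol: str
    channels: int


FALLBACK = ("ring", "simple")  # used when nothing matches


def choose(entries: Sequence[Entry], collective: str,
           n_bytes: int) -> Tuple[str, str, Optional[int]]:
    """Return (algorithm, protocol, channels); channels=None keeps NCCL's default."""
    for e in entries:
        if e.collective == collective and e.min_bytes <= n_bytes <= e.max_bytes:
            channels = None if e.channels == -1 else e.channels
            return e.algorithm, e.protocol, channels
    return (*FALLBACK, None)


entries = [
    Entry("allreduce", 0, 65536, "tree", "simple", 2),
    Entry("allreduce", 65537, 4294967295, "ring", "ll128", -1),
]
print(choose(entries, "allreduce", 1024))
print(choose(entries, "allreduce", 1 << 20))
print(choose(entries, "broadcast", 8))
```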
Troubleshooting
- Config file not found: Check the file path and permissions
- Configurations not applied: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
- Plugin not loaded: Ensure LD_LIBRARY_PATH includes the plugin directory and that NCCL_TUNER_PLUGIN specifies either the plugin name or an absolute path to the plugin shared library.
- No effect on performance: Check that NCCL is actually using the tuner plugin with NCCL_DEBUG=INFO
- Topology mismatch: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
- CSV parsing errors: Ensure no spaces after commas, or quote fields containing spaces
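A quick way to spot the spacing problem mentioned in the last item is to scan the config for whitespace next to a comma. The helper below is a hypothetical pre-flight check, not part of the plugin; it ignores comment lines and does not account for quoted fields.

```python
import re


def lines_with_spaced_commas(text: str):
    """Return 1-based line numbers of data lines with whitespace beside a comma."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if re.search(r"\s,|,\s", stripped):
            flagged.append(number)
    return flagged


config = (
    "allreduce,0,65536,tree,simple,2,1,-1\n"
    "# comments may contain commas, with spaces\n"
    "broadcast, 0,32768,tree,simple,-1,1,-1\n"
)
print(lines_with_spaced_commas(config))  # prints: [3]
```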