Mark Santesson f1308997d0 NCCL 2.28.3-1

Device API (Experimental)
* Introduces device-side APIs to integrate NCCL communication directly into application kernels.
* Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
* Supports Multimem for hardware multicast using NVLink SHARP.
* Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
* Introduces device communicators created using ncclDevCommCreate.
* Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
* Experimental APIs - signatures and functionality may evolve in future releases.
* No ABI compatibility is guaranteed — applications must be recompiled with each new NCCL release.

Symmetric memory improvements
* Support for aggregating symmetric operations using ncclGroupStart/End APIs.
* Reimplement symmetric kernels using device API.

New Host APIs
* Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.

CE (Copy Engine) Collectives
* Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
* Free up SM capacity for the application to do computation at the same time.
* To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t.

NCCL Inspector Plugin
* Introduces an Inspector plugin for always-on performance monitoring.
* Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
* Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
* Lightweight, and enabled via the environment variables NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE.

CMake support (Experimental)
* Adds a CMake build system as an alternative to existing Makefiles.
* Known issues: pkg.build and Device API currently do not work with CMake.
* The known issues will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
* SM overhead is decreased by 50% with this improvement.
* This may cause some perf drop on Blackwell because of the reduced SM usage.
* If the extra SM capacity is not desired, two options are available to restore the previous behavior: 1) set the NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32 environment variables; 2) set the communicator config to override the max CTA count to 32.
* Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.

Plugins
* Network
  * App-aware Network plugin. NCCL passes information about communication operations to be executed on the network end point. This allows for better tuning of network end points and their use in the plugins.
  * Improve handling of physical and virtual network devices and load/unload.
  * Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
  * Add Multi-Request Net API. Using this helps NCCL anticipate multiple send/recv requests and optimize for them. See the maxMultiRequestSize field in ncclNetProperties_v11_t.
* Profiler
  * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
  * Add Inspector Profiler Plugin (see section above).
  * Add a hook to Google’s CoMMA profiler on GitHub.
* Tuner
  * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
  * Add NVL Domain Information API.
* Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:
* Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
* Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
* Add the NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as an opt-in.
* Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
* Enable PxN over C2C by default
  * PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
  * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
* Allow FP8 support for non-reductive operations on pre sm90 devices. (See https://github.com/pytorch/pytorch/pull/151594#discussion_r2135777776)
* Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
* Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (https://github.com/NVIDIA/nccl/issues/1798)
* Modernize mutex management. Convert to std::mutex and std::lock_guard.
* Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
* Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
* NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
* Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
* Move libvirt bridge interfaces to the end of the list of candidate interfaces so that they are considered last. These interfaces are usually virtual bridges that relay traffic to containers running on the host; they cannot carry traffic to a remote node and are therefore unsuitable.

2025-09-02 13:53:34 -07:00


README.md

NCCL Example Tuner Plugin

This example plugin demonstrates a CSV file-based tuning approach that allows selective overrides of tuning parameters, based on all tuning inputs, without recompiling.

Features

File-based Configuration: Read tuning parameters from a CSV configuration file
Size-based Tuning: Specify different configurations based on message size ranges
Dimension-aware Tuning: Match configurations based on number of nodes and ranks
Optional Channels Configuration: Set specific channel counts or use -1 to keep NCCL's default
Environment Variable Support: Specify config file location via NCCL_TUNER_CONFIG_FILE
Fallback Behavior: Gracefully handles missing config files and invalid entries

Building

make

This will create libnccl-tuner-example.so that can be loaded by NCCL.

Configuration File Format

The configuration file uses CSV (Comma-Separated Values) format with one configuration per line:

collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff

Parameters

collective_type: The collective operation type
- broadcast, reduce, allgather, reducescatter, allreduce
min_bytes/max_bytes: The message size range (in bytes) for which this config applies
- Use 0 for minimum and 4294967295 for maximum (covers all sizes)
algorithm: The NCCL algorithm to use
- tree, ring, collnet_direct, collnet_chain, nvls, nvls_tree, pat
protocol: The NCCL protocol to use
- ll, ll128, simple
channels: Number of channels (SMs) to use
- Use a positive integer to specify exact channel count
- Use -1 to keep NCCL's default channel selection
nNodes: Number of nodes to match
- Use a positive integer to match specific node count
- Use -1 to match any number of nodes
nRanks: Number of ranks to match
- Use a positive integer to match specific rank count
- Use -1 to match any number of ranks
numPipeOps: Number of pipeline operations to match (optional)
- Use a positive integer to match specific pipeline operation count
- Use -1 to match any number of pipeline operations
- If omitted, configuration will match any numPipeOps value
regBuff: Whether user buffer can be registered (optional)
- Use 0 to match only non-registered buffers
- Use 1 to match only registered buffers
- Use -1 to match either registered or non-registered buffers
- If omitted, configuration will match any regBuff value
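The field rules above can be checked before a file is handed to the plugin. The following Python sketch is only an illustration: it is not the plugin's own parser (the plugin is C code), the function name `validate_line` and its messages are hypothetical, and the accepted names are copied from the parameter list above.

```python
COLLECTIVES = {"broadcast", "reduce", "allgather", "reducescatter", "allreduce"}
ALGORITHMS = {"tree", "ring", "collnet_direct", "collnet_chain",
              "nvls", "nvls_tree", "pat"}
PROTOCOLS = {"ll", "ll128", "simple"}


def validate_line(line):
    """Return a list of problems found in one CSV config line (empty if none)."""
    fields = line.split(",")
    if not 8 <= len(fields) <= 10:
        return [f"expected 8 to 10 fields, got {len(fields)}"]
    problems = []
    coll, min_b, max_b, algo, proto = fields[:5]
    if coll not in COLLECTIVES:
        problems.append(f"unknown collective_type: {coll!r}")
    if algo not in ALGORITHMS:
        problems.append(f"unknown algorithm: {algo!r}")
    if proto not in PROTOCOLS:
        problems.append(f"unknown protocol: {proto!r}")
    try:
        # Order: min_bytes, max_bytes, channels, nNodes, nRanks,
        # then numPipeOps and regBuff when present.
        numbers = [int(f) for f in [min_b, max_b] + fields[5:]]
    except ValueError:
        return problems + ["non-integer value in a numeric field"]
    if numbers[0] < 0 or numbers[1] < numbers[0]:
        problems.append("min_bytes/max_bytes do not form a valid range")
    # channels, nNodes, nRanks, numPipeOps: a positive integer or -1.
    names = ["channels", "nNodes", "nRanks", "numPipeOps"]
    for name, value in zip(names, numbers[2:6]):
        if value != -1 and value <= 0:
            problems.append(f"{name} must be a positive integer or -1")
    # regBuff (10th field, if present): 0, 1, or -1.
    if len(numbers) == 7 and numbers[6] not in (-1, 0, 1):
        problems.append("regBuff must be 0, 1, or -1")
    return problems
```

An empty result means the line follows the documented format; it does not guarantee that the chosen algorithm/protocol combination is usable on a given system.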

Example Configuration

# Single-node, small allreduce: use tree algorithm, registered buffers only
allreduce,0,65536,tree,simple,2,1,-1,-1,1

# 4-node, 32-rank setup: medium allreduce, single pipeline op, non-registered buffers
allreduce,65537,1048576,ring,simple,4,4,32,1,0

# Any topology: large allreduce with LL128, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1

# Single-node broadcast: prefer tree, any pipeOps, any buffer type (8-field, backward-compatible form)
broadcast,0,32768,tree,simple,-1,1,-1

# Larger broadcast, any topology: ring for non-registered buffers, single pipeline op
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0

Comments start with # and empty lines are ignored. The CSV format makes it easy to edit configurations in spreadsheet applications like Excel, Google Sheets, or LibreOffice Calc.
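The comment and blank-line rule can be mirrored when inspecting a config file from a script. This is a minimal sketch, assuming a line counts as a comment when its first non-blank character is `#`; the helper name `config_rows` is hypothetical and the plugin's C parser may differ in detail.

```python
def config_rows(text):
    """Yield (line_number, fields) for each non-comment, non-empty line."""
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and empty lines carry no configuration
        yield number, line.split(",")
```

Keeping the original line number makes it easier to report which entry in the file was malformed.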

Backward Compatibility

Configurations without the numPipeOps and/or regBuff parameters are fully supported:

8 fields: matches any numPipeOps and regBuff values
9 fields: matches any regBuff value
10 fields: full parameter specification

This ensures existing configuration files continue to work without modification.
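One way to picture the field-count rule is to pad short lines with the wildcard value. The sketch below assumes that an omitted numPipeOps or regBuff behaves exactly like -1, which is what the descriptions above suggest; `normalize` and `FIELD_NAMES` are hypothetical names used only for illustration.

```python
FIELD_NAMES = [
    "collective_type", "min_bytes", "max_bytes", "algorithm", "protocol",
    "channels", "nNodes", "nRanks", "numPipeOps", "regBuff",
]


def normalize(fields):
    """Map an 8-, 9-, or 10-field line to a dict with all ten keys."""
    if not 8 <= len(fields) <= 10:
        raise ValueError(f"expected 8 to 10 fields, got {len(fields)}")
    # Missing trailing fields (numPipeOps, regBuff) default to the wildcard.
    padded = list(fields) + ["-1"] * (10 - len(fields))
    return dict(zip(FIELD_NAMES, padded))
```

For example, `normalize("broadcast,0,32768,tree,simple,-1,1,-1".split(","))` yields an entry whose numPipeOps and regBuff are both "-1".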

Usage

Method 1: Default Config File

Place your configuration in nccl_tuner.conf in the current working directory.

Method 2: Environment Variable

Set the NCCL_TUNER_CONFIG_FILE environment variable to specify the config file path:

export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
mpirun -np 4 your_nccl_application
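The two methods amount to a simple lookup order. The Python sketch below shows that order under the assumption that the environment variable, when set, takes precedence over the default file name; `tuner_config_path` is a hypothetical helper, not part of the plugin.

```python
import os


def tuner_config_path(environ=None):
    """Return the config path: NCCL_TUNER_CONFIG_FILE if set, else the default."""
    env = os.environ if environ is None else environ
    # Default file name, resolved relative to the current working directory.
    return env.get("NCCL_TUNER_CONFIG_FILE", "nccl_tuner.conf")
```

A launcher script could call this to log which file a job is expected to pick up.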

Editing Configuration Files

Generating Configuration Files from Raw Data

A Python script that generates valid CSV configs is provided; see optimize_config.py.

Spreadsheet Tips:

Use column headers: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
Save as CSV format (not Excel format) for the plugin to read
Use data validation to prevent typos in algorithm/protocol names
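When configurations are produced by a script rather than a spreadsheet, Python's standard csv module writes rows without stray spaces. This is a hedged sketch and not the provided optimize_config.py: the function name `to_csv` is hypothetical, and writing the column names as a `#` comment line is an assumption made here so that the header is skipped like any other comment.

```python
import csv
import io

HEADER = [
    "collective_type", "min_bytes", "max_bytes", "algorithm", "protocol",
    "channels", "nNodes", "nRanks", "numPipeOps", "regBuff",
]


def to_csv(rows):
    """Render config rows as CSV text, with the column names as a comment."""
    buf = io.StringIO()
    buf.write("# " + ",".join(HEADER) + "\n")
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)  # integers are converted with str()
    return buf.getvalue()
```

The returned text can be written to nccl_tuner.conf or to the file named by NCCL_TUNER_CONFIG_FILE.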

Logging

The plugin uses NCCL's logging system. To see tuner-related messages:

export NCCL_DEBUG=INFO

This will show when configurations are loaded and applied, including the topology information.

For detailed debugging output during tuning decisions:

export NCCL_DEBUG=TRACE

This will show verbose information about which configurations are being evaluated and matched.

Dimension Matching

Configurations are only applied when the topology matches:

Exact Match: Configuration specifies nNodes=4,nRanks=32, only applied when communicator has exactly 4 nodes and 32 ranks
Wildcard Nodes: Configuration specifies nNodes=-1,nRanks=8, applied to any topology with exactly 8 ranks
Wildcard Ranks: Configuration specifies nNodes=2,nRanks=-1, applied to any 2-node topology regardless of ranks per node
Wildcard Both: Configuration specifies nNodes=-1,nRanks=-1, applied to any topology

This allows you to create specialized configurations for different cluster setups while maintaining flexibility.
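The four cases reduce to one rule: each dimension matches when the configured value is -1 or equals the communicator's value. A minimal Python sketch of that rule follows; `dims_match` is a hypothetical name, and the plugin implements the check in C.

```python
def dims_match(cfg_nodes, cfg_ranks, n_nodes, n_ranks):
    """True if a config's nNodes/nRanks accept the given topology."""
    nodes_ok = cfg_nodes in (-1, n_nodes)  # -1 is the wildcard
    ranks_ok = cfg_ranks in (-1, n_ranks)
    return nodes_ok and ranks_ok
```

With this rule, `dims_match(-1, 8, 2, 8)` is true for any node count as long as there are exactly 8 ranks, while `dims_match(4, 32, 4, 16)` is false.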

Default Behavior

If no configuration file is found or no matching configuration exists for a collective operation, the plugin falls back to preferring the ring algorithm with simple protocol. All configured algorithm/protocol combinations are given a low cost (0.0) to make them preferred by NCCL's selection logic.

When channels is set to -1, NCCL's default channel selection logic is preserved, allowing the system to automatically determine the optimal number of channels based on hardware and message size.
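The fallback can be sketched as a lookup that returns ring/simple when nothing matches. This Python illustration makes several assumptions that the text above does not confirm: entries are tried in file order and the first match wins, the size range is inclusive at both ends, and the fallback leaves the channel count at -1. `Entry` and `choose` are hypothetical names, and numPipeOps and regBuff are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    collective: str
    min_bytes: int
    max_bytes: int
    algorithm: str
    protocol: str
    channels: int = -1  # -1 keeps NCCL's default channel selection
    n_nodes: int = -1   # -1 matches any number of nodes
    n_ranks: int = -1   # -1 matches any number of ranks


def choose(entries, collective, n_bytes, n_nodes, n_ranks):
    """Return (algorithm, protocol, channels) for one collective call."""
    for e in entries:
        if (e.collective == collective
                and e.min_bytes <= n_bytes <= e.max_bytes
                and e.n_nodes in (-1, n_nodes)
                and e.n_ranks in (-1, n_ranks)):
            return e.algorithm, e.protocol, e.channels
    # No matching entry (or no config file): prefer ring with simple.
    return "ring", "simple", -1
```

In the plugin itself the preference is expressed by assigning a low cost (0.0) to the chosen combination rather than by returning a tuple; the sketch only shows which combination ends up preferred.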

Troubleshooting

Config file not found: Check the file path and permissions
Configurations not applied: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
Plugin not loaded: Ensure LD_LIBRARY_PATH includes the plugin directory and that NCCL_TUNER_PLUGIN specifies either the plugin name or an absolute path to the plugin shared library.
No effect on performance: Check that NCCL is actually using the tuner plugin with NCCL_DEBUG=INFO
Topology mismatch: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
CSV parsing errors: Ensure no spaces after commas, or quote fields containing spaces