rocm-systems/ext-tuner/README.md
Mark Santesson f1308997d0 NCCL 2.28.3-1
Device API (Experimental)
 * Introduces device-side APIs to integrate NCCL communication directly into application kernels.
 * Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
 * Supports Multimem for hardware multicast using NVLink SHARP.
 * Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
 * Introduces device communicators created using ncclDevCommCreate.
 * Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
 * Experimental APIs - signatures and functionality may evolve in future releases.
 * No ABI compatibility is guaranteed — applications must be recompiled with each new NCCL release.

Symmetric memory improvements
 * Support for aggregating symmetric operations using ncclGroupStart/End APIs.
 * Reimplement symmetric kernels using device API.

New Host APIs
 * Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.

CE (Copy Engine) Collectives
 * Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
 * Free up SM capacity for the application to do computation at the same time.
 * To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t.

NCCL Inspector Plugin
 * Introduces an Inspector plugin for always-on performance monitoring.
 * Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
 * Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
 * Lightweight to enable via environment variables NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE.

CMake support (Experimental)
 * Adds a CMake build system as an alternative to existing Makefiles.
 * Known issues: pkg.build and Device API currently do not work with CMake.
 * The known issues will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
 * SM overhead is decreased by 50% with this improvement.
 * This may cause some perf drop on Blackwell because of the reduced SM usage.
 * If the extra SM capacity is not desired, two options are available to restore the previous behavior: 1) set the NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32 environment variables; 2) set the communicator config to overwrite the max CTA count to 32.
 * Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.

Plugins
 * Network
   * App-aware Network plugin. NCCL passes information about communication operations to be executed on the network end point. This allows for better tuning of network end points and their use in the plugins.
   * Improve handling of physical and virtual network devices and load/unload.
   * Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
   * Add Multi-Request Net API. Using this will help NCCL to anticipate multiple send/recv requests and optimize for it. See maxMultiRequestSize field in ncclNetProperties_v11_t.
 * Profiler
   * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
   * Add Inspector Profiler Plugin (see section above).
   * Add a hook to Google’s CoMMA profiler on GitHub.
 * Tuner
   * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
   * Add NVL Domain Information API.
 * Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:
 * Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
 * Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
 * Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as an opt-in.
 * Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
 * Enable PxN over C2C by default
   * PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
   * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
 * Allow FP8 support for non-reductive operations on pre sm90 devices. (See https://github.com/pytorch/pytorch/pull/151594#discussion_r2135777776)
 * Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
 * Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (https://github.com/NVIDIA/nccl/issues/1798)
 * Modernize mutex management. Convert to std::mutex and std::lock_guard.
 * Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
 * Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
 * NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
 * Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
 * Move libvirt bridge interfaces to the end of the list of candidate interfaces so that they are considered last. These interfaces are usually virtual bridges that relay traffic to containers running on the host; they cannot carry traffic to a remote node and are therefore unsuitable.
2025-09-02 13:53:34 -07:00


NCCL Tuner Plugin Development

This directory contains resources and examples for developing NCCL tuner plugins. Tuner plugins allow you to customize NCCL's algorithm and protocol selection behavior to optimize performance for specific workloads and hardware configurations.

Overview

NCCL tuner plugins provide a way to influence NCCL's automatic algorithm and protocol selection by modifying the cost tables that NCCL uses to make decisions. This allows you to:

  • Override default algorithm/protocol combinations for specific collective operations
  • Customize tuning based on message size, topology, and other parameters
  • Implement sophisticated tuning strategies without recompiling NCCL
  • Optimize performance for specific hardware configurations or workloads

Tuner Plugin Interface

NCCL tuner plugins must implement the ncclTuner_t interface defined in nccl_tuner.h under nccl/src/include/plugin. Each example plugin carries a forked copy of these definitions in its own tuner.h, and plugin implementors are expected to fork the internal NCCL definitions in the same way. The current interface includes:

// Initialize the tuner plugin
ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);

// Get and modify collective operation cost information
ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
                           int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                           int regBuff, int* nChannels);

// Clean up plugin resources
ncclResult_t (*destroy)(void* context);
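
As a rough illustration of how the three entry points fit together, here is a minimal do-nothing sketch. It is written to compile on its own, so every demo* name is a hypothetical stand-in rather than an NCCL definition, and the logger type is deliberately simplified. A real plugin would include its forked tuner.h and use ncclResult_t, ncclFunc_t, and ncclDebugLogger_t instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins so the sketch compiles without NCCL headers (all hypothetical). */
typedef enum { demoSuccess = 0, demoInternalError = 1 } demoResult_t; /* for ncclResult_t */
typedef int demoFunc_t;                                                /* for ncclFunc_t */
typedef void (*demoLogger_t)(const char* msg);       /* simplified, for ncclDebugLogger_t */

/* State created in init and handed back on every later call via the context pointer. */
typedef struct {
  size_t nRanks;
  size_t nNodes;
} demoContext_t;

demoResult_t demoInit(size_t nRanks, size_t nNodes, demoLogger_t logFunction, void** context) {
  (void)logFunction;
  assert(context != NULL); /* caller contract: somewhere to store the context */
  demoContext_t* ctx = malloc(sizeof(*ctx));
  if (ctx == NULL) return demoInternalError;
  ctx->nRanks = nRanks;
  ctx->nNodes = nNodes;
  *context = ctx;
  return demoSuccess;
}

demoResult_t demoGetCollInfo(void* context, demoFunc_t collType, size_t nBytes,
                             int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                             int regBuff, int* nChannels) {
  /* Touch nothing: the cost table and *nChannels keep the values passed in. */
  (void)context; (void)collType; (void)nBytes; (void)numPipeOps;
  (void)collCostTable; (void)numAlgo; (void)numProto; (void)regBuff; (void)nChannels;
  return demoSuccess;
}

demoResult_t demoDestroy(void* context) {
  free(context); /* free(NULL) is a no-op, so a failed init is safe to destroy */
  return demoSuccess;
}
```

Starting from a skeleton that changes nothing makes it easier to confirm the plugin loads before any tuning logic is added.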

Development Guidelines

1. Plugin Structure

A typical tuner plugin should:

  • Include the necessary forked NCCL headers (tuner.h)
  • Implement all required interface functions
  • Export the plugin structure with appropriate version
  • Handle all input parameters gracefully

2. Cost Table Modification

The getCollInfo function receives a cost table that maps algorithm/protocol combinations to performance costs. Lower costs indicate preferred combinations. You can:

  • Set costs to 0.0 to make combinations highly preferred
  • Set costs to NCCL_ALGO_PROTO_IGNORE to disable combinations
  • Use relative costs to create preferences between options
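
The options above can be sketched as two small helpers. This is an illustration under stated assumptions, not the reference implementation: it assumes the table behind collCostTable is one contiguous block of numAlgo rows by numProto columns, it assumes an entry already marked as ignored should be left alone, and DEMO_ALGO_PROTO_IGNORE is a stand-in whose value should be taken from NCCL_ALGO_PROTO_IGNORE in your forked tuner.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for NCCL_ALGO_PROTO_IGNORE (assumed value; check the forked tuner.h). */
#define DEMO_ALGO_PROTO_IGNORE (-1.0f)

/* Address one entry, viewing the table as numAlgo rows of numProto floats
 * (assumed contiguous layout). Callers validate the indices first. */
static float* demoCostAt(float** collCostTable, int numProto, int algo, int proto) {
  assert(collCostTable != NULL && numProto > 0 && algo >= 0 && proto >= 0);
  float* flat = (float*)collCostTable;
  return &flat[(size_t)algo * (size_t)numProto + (size_t)proto];
}

static int demoInRange(int numAlgo, int numProto, int algo, int proto) {
  return algo >= 0 && algo < numAlgo && proto >= 0 && proto < numProto;
}

/* Make one combination highly preferred unless it is already marked as ignored.
 * Returns 1 if the table was changed, 0 otherwise. */
int demoPrefer(float** collCostTable, int numAlgo, int numProto, int algo, int proto) {
  if (collCostTable == NULL || !demoInRange(numAlgo, numProto, algo, proto)) return 0;
  float* cost = demoCostAt(collCostTable, numProto, algo, proto);
  if (*cost == DEMO_ALGO_PROTO_IGNORE) return 0;
  *cost = 0.0f;
  return 1;
}

/* Disable one combination. Returns 1 if the table was changed, 0 otherwise. */
int demoDisable(float** collCostTable, int numAlgo, int numProto, int algo, int proto) {
  if (collCostTable == NULL || !demoInRange(numAlgo, numProto, algo, proto)) return 0;
  *demoCostAt(collCostTable, numProto, algo, proto) = DEMO_ALGO_PROTO_IGNORE;
  return 1;
}
```

Out-of-range indices are rejected rather than asserted on, so a mismatch between the plugin's idea of the table size and the numAlgo/numProto values it receives degrades to "no change".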

3. Channel Management

The nChannels parameter allows you to:

  • Set a specific number of channels to use
  • Leave the incoming value unchanged to preserve NCCL's default behavior
  • Implement dynamic channel selection based on message size or topology
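
A size-based policy could look like the sketch below. The size classes, thresholds, and channel counts are hypothetical and only illustrate the shape of the logic; real values should come from measurements on the target system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size classes: messages up to maxBytes use the given channel count. */
typedef struct {
  size_t maxBytes;
  int channels;
} demoSizeClass_t;

static const demoSizeClass_t demoClasses[] = {
  { (size_t)64 << 10, 2 },  /* up to 64 KiB */
  { (size_t)16 << 20, 8 },  /* up to 16 MiB */
};

/* Overwrite *nChannels for small and medium messages; leave it alone otherwise. */
void demoPickChannels(size_t nBytes, int* nChannels) {
  if (nChannels == NULL) return;
  for (size_t i = 0; i < sizeof(demoClasses) / sizeof(demoClasses[0]); i++) {
    if (nBytes <= demoClasses[i].maxBytes) {
      assert(demoClasses[i].channels > 0); /* table entries must be usable counts */
      *nChannels = demoClasses[i].channels;
      return;
    }
  }
  /* Larger messages: keep the value as received. */
}
```

Keeping the policy in a small table makes it easy to swap the hard-coded entries for values read from a configuration file later.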

4. Error Handling

Always return appropriate ncclResult_t values:

  • ncclSuccess for successful or ignored operations
  • ncclInternalError for plugin-specific errors. Return an error only from plugin initialization and destruction, because a failed plugin call anywhere else can impose a very large overhead on users.
  • Other NCCL error codes as appropriate
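
One way to follow this guidance on the per-call path is to turn internal failures into "no opinion". The sketch below uses hypothetical stand-in types (demoResult_t for ncclResult_t, demoState_t for plugin state) and a made-up internal lookup step; it only shows the pattern.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { demoSuccess = 0, demoInternalError = 1 } demoResult_t; /* stand-in */

/* Hypothetical plugin state. */
typedef struct {
  int ready;
  int channels;
} demoState_t;

/* Hypothetical internal step that can fail, such as a lookup in a config table. */
static demoResult_t demoLookup(const demoState_t* st, size_t nBytes, int* channels) {
  assert(channels != NULL); /* internal helper: callers always pass a destination */
  if (st == NULL || !st->ready || nBytes == 0) return demoInternalError;
  *channels = st->channels;
  return demoSuccess;
}

/* Per-call path: on any internal failure, leave the outputs untouched and
 * still report success so the caller continues with its defaults. */
demoResult_t demoTune(const demoState_t* st, size_t nBytes, int* nChannels) {
  int wanted = 0;
  if (nChannels == NULL) return demoSuccess;
  if (demoLookup(st, nBytes, &wanted) != demoSuccess) return demoSuccess;
  *nChannels = wanted;
  return demoSuccess;
}
```

The output is written only after the lookup has succeeded, so a failure never leaves a half-updated value behind.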

Getting Started

Option 1: Start with the Example Plugin

If you're new to tuner plugin development, start with the example/ directory:

cd example/
make

This provides a CSV-based configuration system that you can customize or use as a template.

Option 2: Use the Basic Plugin

For more customized tuning needs, you may prefer to start from a clean baseline. In that case, base your work on the basic plugin in the basic/ directory:

cd basic/
make

Building and Testing

Build Requirements

  • GCC or compatible C compiler
  • NCCL headers (included in nccl/ subdirectories)
  • Make

Build Process

Each plugin directory contains a Makefile:

cd basic/    # or example/
make

This generates a shared library (.so file) that can be loaded by NCCL.

Loading the Plugin

Add your plugin directory to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/path/to/your/plugin:$LD_LIBRARY_PATH

Set NCCL_TUNER_PLUGIN to the plugin name, the library file name, or the absolute path to the plugin file. Any of the following works:

export NCCL_TUNER_PLUGIN=example
export NCCL_TUNER_PLUGIN=libnccl-tuner-example.so
export NCCL_TUNER_PLUGIN=/path/to/your/plugin/libnccl-tuner-example.so

NCCL will automatically discover and load the plugin based on the exported symbol names.
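
The relationship between the short name and the file name in the examples above can be written down as a small helper. This is only an illustration of the naming pattern shown in this section (example becomes libnccl-tuner-example.so); the function is hypothetical and does not reproduce NCCL's actual lookup code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expand a short plugin name into the file name pattern used in this section.
 * Returns 0 on success, -1 on bad arguments or if the buffer is too small. */
int demoPluginFileName(const char* name, char* out, size_t outLen) {
  if (name == NULL || name[0] == '\0' || out == NULL || outLen == 0) return -1;
  int n = snprintf(out, outLen, "libnccl-tuner-%s.so", name);
  if (n < 0 || (size_t)n >= outLen) return -1; /* encoding error or truncation */
  assert(strlen(out) == (size_t)n);            /* postcondition: full name written */
  return 0;
}
```

A helper like this is mainly useful in build or test scripts that need to check the expected .so file exists before launching a job.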

Advanced Topics

Plugin Versioning

NCCL supports multiple plugin interface versions. Make sure your plugin exports the correct version:

const ncclTuner_v4_t ncclTunerPlugin_v4 = {
    .name = "YourPluginName",
    .init = yourInitFunction,
    .getCollInfo = yourGetCollInfoFunction,
    .destroy = yourDestroyFunction
};

Multi-GPU and Multi-Node Considerations

Your plugin receives topology information (nRanks, nNodes) during initialization. Use this to:

  • Implement topology-aware tuning strategies
  • Handle single-node vs. multi-node optimizations differently
  • Scale channel counts based on available hardware
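
A topology-aware decision might be structured like the following sketch. The policy (keep the base channel count on a single node, double it across nodes, clamp to a ceiling) and the DEMO_MAX_CHANNELS value are hypothetical and chosen only to show how nRanks and nNodes saved at initialization can drive a later choice.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CHANNELS 32 /* hypothetical ceiling for this sketch */

/* Topology captured during init. */
typedef struct {
  size_t nRanks;
  size_t nNodes;
} demoTopo_t;

static int demoIsMultiNode(const demoTopo_t* t) {
  assert(t != NULL); /* internal helper: callers check for NULL first */
  return t->nNodes > 1;
}

/* Single node: keep baseChannels. Multi-node: double it, clamped to the ceiling.
 * Invalid input is passed through unchanged. */
int demoScaleChannels(const demoTopo_t* t, int baseChannels) {
  if (t == NULL || t->nNodes == 0 || baseChannels <= 0) return baseChannels;
  if (!demoIsMultiNode(t)) return baseChannels;
  if (baseChannels > DEMO_MAX_CHANNELS / 2) return DEMO_MAX_CHANNELS;
  return baseChannels * 2;
}
```

The clamp is checked before doubling so the multiplication cannot overflow for large inputs.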

Performance Optimization

  • Keep plugin logic lightweight to avoid impacting NCCL performance
  • Cache expensive computations when possible
  • Use the logging system for debugging but avoid excessive output in production
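
Caching can be as simple as remembering the last decision. The sketch below assumes consecutive calls often repeat the same message size and that the cache lives in a context that is not shared between threads; both are assumptions, and demoExpensiveChoice is a hypothetical stand-in for whatever costly computation a real plugin performs.

```c
#include <assert.h>
#include <stddef.h>

/* One-entry cache holding the most recent decision. Zero-initialize before use. */
typedef struct {
  int valid;
  size_t nBytes;
  int channels;
  unsigned long misses; /* how many times the expensive path ran */
} demoCache_t;

/* Stand-in for a costly computation: grows the channel count with message size. */
static int demoExpensiveChoice(size_t nBytes) {
  int ch = 1;
  for (size_t s = nBytes >> 16; s > 0 && ch < 16; s >>= 2) ch *= 2;
  return ch;
}

/* Return the cached decision when the size repeats, otherwise recompute it. */
int demoCachedChoice(demoCache_t* c, size_t nBytes) {
  assert(c != NULL); /* the cache is owned by the plugin and always present */
  if (c->valid && c->nBytes == nBytes) return c->channels;
  c->channels = demoExpensiveChoice(nBytes);
  c->nBytes = nBytes;
  c->valid = 1;
  c->misses++;
  return c->channels;
}
```

The misses counter is there only to make the cache's effect observable in tests or debug output.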

Debugging and Logging

Use NCCL's debug logging system:

export NCCL_DEBUG=INFO    # General information
export NCCL_DEBUG_SUBSYS=TUNING

Within your plugin, use the provided ncclDebugLogger_t function for consistent logging.
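
A plugin typically stores the logger it receives at initialization and calls it when it makes a decision. In the sketch below, demoLogger_t is a stand-in modelled on a printf-style callback taking a level, subsystem flags, a source location, and a format string; this shape and the DEMO_* constants are assumptions, so check ncclDebugLogger_t in your forked tuner.h for the exact definition. The capture logger exists only to make the sketch testable.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Assumed shape of the logging callback (stand-in for ncclDebugLogger_t). */
typedef void (*demoLogger_t)(int level, unsigned long flags, const char* file,
                             int line, const char* fmt, ...);

#define DEMO_LOG_INFO 3           /* hypothetical level value */
#define DEMO_SUBSYS_TUNING 0x1UL  /* hypothetical subsystem flag */

static char demoLastLog[256];

/* Test logger: records the last formatted message instead of printing it. */
void demoCaptureLogger(int level, unsigned long flags, const char* file,
                       int line, const char* fmt, ...) {
  (void)level; (void)flags; (void)file; (void)line;
  assert(fmt != NULL);
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(demoLastLog, sizeof(demoLastLog), fmt, ap);
  va_end(ap);
}

/* Plugin context holding the logger saved at init. */
typedef struct {
  demoLogger_t log;
} demoCtx_t;

/* Log one tuning decision; silently does nothing if no logger was provided. */
void demoLogDecision(const demoCtx_t* ctx, size_t nBytes, int channels) {
  if (ctx == NULL || ctx->log == NULL) return;
  ctx->log(DEMO_LOG_INFO, DEMO_SUBSYS_TUNING, __FILE__, __LINE__,
           "tuner: nBytes=%zu channels=%d", nBytes, channels);
}
```

Routing all output through the stored callback keeps plugin messages subject to the same NCCL_DEBUG filtering as the rest of the library.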

Best Practices

  1. Test thoroughly: Verify your plugin works with various message sizes and topologies
  2. Handle edge cases: Ensure your plugin behaves correctly with unusual input parameters
  3. Document your approach: Clearly document your tuning strategy and configuration options
  4. Version your plugin: Use meaningful version numbers and maintain backward compatibility
  5. Performance validation: Measure the impact of your tuning decisions on real workloads

Contributing

When developing new tuner plugins:

  • Follow the existing code style and structure
  • Include comprehensive documentation
  • Add example configurations and test cases
  • Consider contributing useful plugins back to the community

Resources

For questions and support, refer to the NCCL community resources and documentation.