NCCL 2.27.3-1

Symmetric memory API and symmetric kernels
* Redesign from the ground up, enabling major latency and bandwidth improvements.
* Add new API calls to register user-allocated memory among communicator ranks into a NCCL window: ncclCommWindowRegister() and ncclCommWindowDeregister(). The calls currently support symmetric registration for P2P and NVLS, and require VMM memory buffers (i.e., CUMEM must be operational).
* Implement specialized kernels taking advantage of symmetrically registered memory, with performance gains expected particularly for small to medium message sizes.
* The kernels support 32 bit floating point types and smaller, and sum as the reduction operator, with no more than one collective operation per group.
* Floating point summation is always done in fp32 accumulators (with the exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus, the accuracy with fp8 and fp16 data types should be much improved.
* This initial implementation supports non-network communicators only (P2P and NVLS transports).
* To explore this functionality users need to use the new memory registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag and all ranks of a communicator must pass buffers at the same offset in the same registration when invoking a collective NCCL operation.

Add support for DGX Spark.

Add support for DirectNIC (CX8) to the internal IB plugin.

Add a new ncclCommShrink() API call
* It is a non-collective call similar to ncclCommSplit(), which makes it possible to exclude some (possibly unresponsive) ranks from the parent communicator.

Add support for loading multiple network plugins
* This enables the creation of generic containers that can work across a range of providers.
* Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to load.

NVLink SHARP (NVLS) improvements
* Implement NVLS+IB SHARP support for AllGather and ReduceScatter with user buffer registration.
  This improves performance and reduces the number of CTAs needed to achieve peak bandwidth.
* Gracefully fall back by default to other transports if NVLS initialization fails (the old behavior of returning an error code from a NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1).
* Decrease the NVLS channel count to 24 on Blackwell systems with multiple NVLink domains per communicator.
* Enable fine-tuning of NCCL behavior per communicator using new "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs".

Profiler improvements
* Extend the init function by adding communicator name, comm id (hash), rank, number of ranks, number of nodes, and the NCCL log function to the argument list. This makes the name and the comm id available to all events in the communicator without explicitly passing them to each individual event. Add the communicator id and rank to the profiler trace filename. Now, the communicator name can be set via a new "ncclConfig_t" member "commName".
* Improve the accuracy of the GPU kernel events by providing GPU-generated timestamps for the start and stop of every NCCL operation.
* Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep states.
* Add support for network-defined event updates (through "recordEventState").
* Report the correct number of channels used by every collective/p2p operation (used to be set to nMaxChannels for collectives and absent for p2ps).
* Fix the logic on proxyCtrl Idle/Active events (Issue #1162).
* Fix an issue where the network proxy profiler could lose track of an event identifier (Issue #1682).
* Improve the backward compatibility with plugins older than v4.
* Ensure that the work counters are 0-initialized.
* Fix a potential race condition in the network profiler that could result in an event being linked to a wrong parent.

MNNVL improvements
* Increase to 16 the number of NICs used to communicate between MNNVL domains on GB200 systems, to optimize the performance of collective operations.
* Add support for more complex MNNVL topologies with up to 32 NICs per node.
* If the MNNVL fabric initialization was unsuccessful, NCCL will now fail by default, so as to avoid inadvertently falling back to a potentially much slower network transport. Such failures are typically due to a misconfigured IMEX support on the system. To continue without MNNVL, restart the job with NCCL_MNNVL_ENABLE=0.
* Fix a potential hang in alltoall-like communication patterns at a scale of over 80 ranks.
* Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no longer needs to be specified on MNNVL systems).
* Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL systems.
* Fix the graph search to exclude non-local NICs.
* Fix the SHM transport to use fabric handles on MNNVL systems.

NIC Fusion improvements
* Disable the creation of fused NICs for physical devices that haven't been merged.
* Flatten multiple ports to a single PCI device within the internal IB plugin and reparent dual-port NICs under the first PCI parent. If the parent is not a PCI switch, PCI devices for fused NICs won't be duplicated.
* Route traffic on GB200-CX8 systems through DirectNIC, not the host interface.

Improve support for platforms with C2C connectivity (e.g., GB200)
* Enable GPUDirect RDMA for the NICs by default.
* Add support for P2C (PXN over C2C) and the LL128 protocol.

Extend NCCL fault tolerance in multithreaded scenarios
* Support the creation of multiple nonblocking communicators within a single group and polling in parallel for the completion using multiple threads (one per communicator).

Enable ncclImplicitOrderLaunch for CUDA 12.9+
* This can potentially speed up NCCL_IMPLICIT_LAUNCH_ORDER.

Improve the netSocket transport latency and control
* Provide finer control over the size of the socket send/receive buffers, the task size, and the number of sockets that a single peer can open.
* Add support for the inlining of small messages behind the header when using multiple sockets per connection.

Improve the readability of the CPU affinity in the debug output
* Print it as a range string rather than a bitmask.

Fix a potential race condition in graph execution
* A contention could arise when mixing graph and non-graph execution.

Improve PXN connection code
* Avoid duplicate and unused connections.

RAS fixes
* Fix a memory corruption at job termination time in case of a previously failed initialization of a RAS socket connection.
* Fix a race condition leading to a crash when generating a RAS report during communicator initialization (Issues #1669, #1718).
* Fix a potential race condition when gathering data for a RAS status report.

Fix a potential memory corruption in ncclCommSplit()
* Memory could get corrupted when resource sharing was in use and the size of the NVLink domain in the new communicator was smaller than in the old one.

Fix asynchronous graph upload
* Fix a small memory leak.
* Fix oversynchronization.

Add a check for out-of-memory conditions in ncclMemAlloc()

Clean up the NCCL socket code
* accept() will retry also if just reading the magic failed (Issue #1613).
* connect() will retry also if poll() did not return a POLLOUT event (Issue #1618).
* Add error checking in a few instances (Issue #1539).
* Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574).
* Clean up the debug output, downgrading WARN messages to INFO in non-critical cases, and printing the peer's address where relevant.

Switch NCCL_DEBUG_FILE to line buffering
* This should help avoid mixed-up partial output lines in multithreaded cases.

Other minor fixes
* Improve the checks for buffer overflows in the graph code (Issue #1585).
* Extend logging and state clearing to all four events in the internal IB plugin (Issue #1650).
* Fix the error path in case IB communication is not ready (Issue #1489).
* Add ECE logging for IB fabric.
* Fix various minor issues in the graph module (Issue #1635).
* Clean up the debug output in the graph code, downgrading WARN messages to INFO in non-critical cases.
* Add a missing argument to a directSend() call (Issue #1628).
* Remove duplicate code in sendProxySetup() (Issue #1420).
* Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507).
* Fix compiler warnings with GCC 14.
* Fix a typo in a comment (Issue #1236).
2025-05-29 20:56:40 -07:00
Commit 72d2432094
@@ -49,9 +49,9 @@ of newer ones.
 The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
 from old API versions. It also provides error codes in `err.h`.

-# API (v3)
+# API (v4)

-Below is the main `ncclProfiler_v3` struct. Each function is explained in later sections.
+Below is the main `ncclProfiler_v4` struct. Each function is explained in later sections.

 ```
 typedef struct {
@@ -60,9 +60,15 @@ typedef struct {
  // init - initialize the profiler plugin
  // Input
  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  //  - commName       : user assigned communicator name
+  //  - commHash       : communicator id
+  //  - nNodes         : number of nodes in communicator
+  //  - nranks         : number of ranks in communicator
+  //  - rank           : rank identifier in communicator
+  //  - logfn          : logger function
  // Output
  //  - eActivationMask: bitmask of active events set by the plugin
-  ncclResult_t (*init)(void** context, int* eActivationMask);
+  ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);

  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
  // Input
@@ -70,7 +76,7 @@ typedef struct {
  //  - eDescr : pointer to ncclProfilerEventDescr_t object
  // Output
  //  - eHandle: return event handle for supplied event descriptor object
-  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v3_t* eDescr);
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);

  // stopEvent - stop/finalize an event inside an event set
  // Input
@@ -82,13 +88,13 @@ typedef struct {
  //  - eHandle   : handle to event object created through startEvent
  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
  //  - eState    : event state transition
-  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v3_t eState, ncclProfilerEventStateArgs_v3_t* eStateArgs);
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);

  // finalize - finalize the profiler plugin
  // Input
  //  - context: opaque profiler context object
  ncclResult_t (*finalize)(void* context);
-} ncclProfiler_v3_t;
+} ncclProfiler_v4_t;
 ```
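
As a rough illustration of the extended `init` argument list, the sketch below stores the communicator identity in a per-communicator context so that later events do not need to carry it. It is not taken from the NCCL sources: `demoResult_t`, `demoLogger_t`, `struct demoContext` and the `DEMO_PROFILE_*` bits are hypothetical stand-ins (a real plugin would use `ncclResult_t`, `ncclDebugLogger_t` and the event constants from the profiler headers), and the handling of a NULL `commName` is a defensive assumption.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical stand-ins so the sketch builds without the NCCL headers.
typedef int demoResult_t;                       // stand-in for ncclResult_t (0 = success)
typedef void (*demoLogger_t)(const char* msg);  // simplified stand-in for ncclDebugLogger_t
enum { DEMO_PROFILE_COLL = 1 << 1, DEMO_PROFILE_P2P = 1 << 2 };  // placeholder event bits

// Per-communicator state kept behind the opaque context pointer.
struct demoContext {
  char commName[64];
  uint64_t commHash;
  int nNodes, nranks, rank;
  demoLogger_t logfn;
};

// Same argument order as the v4 init entry point shown above.
demoResult_t demoInit(void** context, int* eActivationMask, const char* commName,
                      uint64_t commHash, int nNodes, int nranks, int rank,
                      demoLogger_t logfn) {
  struct demoContext* ctx = calloc(1, sizeof(*ctx));
  if (ctx == NULL) return 1;
  // Assumption: tolerate a missing name by leaving the field empty.
  if (commName != NULL) snprintf(ctx->commName, sizeof(ctx->commName), "%s", commName);
  ctx->commHash = commHash;
  ctx->nNodes = nNodes;
  ctx->nranks = nranks;
  ctx->rank = rank;
  ctx->logfn = logfn;
  *eActivationMask = DEMO_PROFILE_COLL | DEMO_PROFILE_P2P;  // events this plugin wants
  *context = ctx;
  return 0;
}

// Release the per-communicator state created by demoInit.
demoResult_t demoFinalize(void* context) {
  free(context);
  return 0;
}
```

Keeping the name and hash in the context is one way to label every event of a communicator (or its trace file) without per-event fields.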

 ## Error codes
@@ -147,8 +153,6 @@ typedef struct {
  int rank;                 // rank that generated the event
  union {
    struct {                // collective events metadata
-      const char* name;     // string containing name of the communicator
-      uint64_t commHash;    // unique hash/id for the communicator
      uint64_t seqNumber;   // sequence number of this collective operation in the communicator
      const char* func;     // string containing name of the collective
      void const* sendBuff; // address of send buffer
@@ -156,20 +160,19 @@ typedef struct {
      size_t count;         // data count
      int root;             // root rank
      const char* datatype; // string containing the name of the datatype
-      uint8_t nMaxChannels; // max number of channels for this collective
+      uint8_t nChannels;    // number of channels for this collective
      uint8_t nWarps;       // number of GPU warps for this collective
      const char* algo;     // string containing name of the algorithm for this collective
      const char* proto;    // string containing name of the protocol for this collective
    } coll;

    struct {                // point-to-point events metadata
-      const char* name;
-      uint64_t commHash;
      const char* func;
      void* buff;
      const char* datatype;
      size_t count;
      int peer;             // peer rank for this point-to-point
+      uint8_t nChannels;    // number of channels for this p2p
    } p2p;

    struct {                // proxyOp events metadata
@@ -178,7 +181,7 @@ typedef struct {
      int peer;             // peer rank
      int nSteps;           // number of network transfers/steps required by the `ncclProxyOp`
      int chunkSize;        // chunk size for this `ncclProxyOp`
-      int isSend;           // set to 1 for sends and 0 for recvs
+      int isSend;           // type of network operation
    } proxyOp;

    struct {                // proxyStep events metadata
@@ -187,6 +190,7 @@ typedef struct {

    struct {
      uint8_t channelId;    // id of the channel used by the kernel
+      uint64_t ptimer;      // kernel supplied timestamp
    } kernelCh;

    struct {
@@ -194,7 +198,7 @@ typedef struct {
      void* data;           // pointer to network plugin defined event
    } netPlugin;
  };
-} ncclProfilerEventDescr_v3_t;
+} ncclProfilerEventDescr_v4_t;
 ```

 NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
@@ -212,45 +216,57 @@ handle after `eventStop` is undefined behavior.
 Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
 and `ncclProfileP2p` cannot be updated through calls to `recordEventState`.

-`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
-calls to `recordEventState`.
+`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
+`ncclProfileProxyCtrl` can be updated through calls to `recordEventState`.

-The state of proxy generated events can be updated, along with event attributes, using
-`recordEventState`. These events can go through several states during their lifecycle.
-The list of supported states for the proxy-defined events is reported below.
+The state of these events can be updated, along with event attributes, using `recordEventState`.
+These events can go through several states during their lifecycle.
+
+The list of supported states for the updatable events is reported below.

 ```
 typedef enum {
  // ncclProfileProxyOp event states
-  ncclProfilerProxyOpSendPosted,        // state marks the posting of send buffer to GPU for given network transfer/step
-  ncclProfilerProxyOpSendRemFifoWait,   // state marks the waiting of CTS credits from peer rank
-  ncclProfilerProxyOpSendTransmitted,   // state marks the sending of network transfer/step to peer rank
-  ncclProfilerProxyOpSendDone,          // state marks the ending  of network transfer/step
-  ncclProfilerProxyOpRecvPosted,        // state marks the posting of recv to network for given network transfer/step
-  ncclProfilerProxyOpRecvReceived,      // state marks the recving of network transfer/step from peer rank
-  ncclProfilerProxyOpRecvTransmitted,   // state marks the ending  of the network transfer/step
-  ncclProfilerProxyOpRecvDone,          // state marks the consuming of data from GPU
+  ncclProfilerProxyOpSendPosted        = 0, // deprecated in v4
+  ncclProfilerProxyOpSendRemFifoWait   = 1, // deprecated in v4
+  ncclProfilerProxyOpSendTransmitted   = 2, // deprecated in v4
+  ncclProfilerProxyOpSendDone          = 3, // deprecated in v4
+  ncclProfilerProxyOpRecvPosted        = 4, // deprecated in v4
+  ncclProfilerProxyOpRecvReceived      = 5, // deprecated in v4
+  ncclProfilerProxyOpRecvTransmitted   = 6, // deprecated in v4
+  ncclProfilerProxyOpRecvDone          = 7, // deprecated in v4
+  ncclProfilerProxyOpInProgress_v4     = 19,// state marks transition of proxy op to progress

  // ncclProfileProxyStep event states
-  ncclProfilerProxyStepSendGPUWait,     // state marks the waiting of send data from GPU for given network transfer/step
-  ncclProfilerProxyStepSendWait,        // state marks the waiting of send data from network for given network transfer/step
-  ncclProfilerProxyStepRecvWait,        // state marks the waiting of recv data from network for given network transfer/step
-  ncclProfilerProxyStepRecvFlushWait,   // state marks the waiting of recv data flush to GPU for given network transfer/step
-  ncclProfilerProxyStepRecvGPUWait,     // state marks the waiting of recv data consumption from GPU for given network transfer/step
+  ncclProfilerProxyStepSendGPUWait     = 8, // state marks the waiting of send data from GPU for given network transfer/step
+  ncclProfilerProxyStepSendPeerWait_v4 = 20,// state marks the waiting of clear-to-send credits from the receiver for given network transfer/step
+  ncclProfilerProxyStepSendWait        = 9, // state marks the waiting of send data from network for given network transfer/step
+  ncclProfilerProxyStepRecvWait        = 10,// state marks the waiting of recv data from network for given network transfer/step
+  ncclProfilerProxyStepRecvFlushWait   = 11,// state marks the waiting of recv data flush to GPU for given network transfer/step
+  ncclProfilerProxyStepRecvGPUWait     = 12,// state marks the waiting of recv data consumption from GPU for given network transfer/step

  // ncclProfileProxyCtrl event states
-  ncclProfilerProxyCtrlIdle,            // state marks proxy progress thread idle
-  ncclProfilerProxyCtrlActive,          // state marks proxy progress thread active
-  ncclProfilerProxyCtrlSleep,           // state marks proxy progress thread sleeping
-  ncclProfilerProxyCtrlWakeup,          // state marks proxy progress thread waking up
-  ncclProfilerProxyCtrlAppend,          // state marks append of new network work item begin
-  ncclProfilerProxyCtrlAppendEnd,       // state marks append of new network work item end
-} ncclProfilerEventState_v3_t;
+  ncclProfilerProxyCtrlIdle            = 13,// state marks proxy progress thread idle
+  ncclProfilerProxyCtrlActive          = 14,// state marks proxy progress thread active
+  ncclProfilerProxyCtrlSleep           = 15,// state marks proxy progress thread sleeping
+  ncclProfilerProxyCtrlWakeup          = 16,// state marks proxy progress thread waking up
+  ncclProfilerProxyCtrlAppend          = 17,// state marks append of new network work item begin
+  ncclProfilerProxyCtrlAppendEnd       = 18,// state marks append of new network work item end
+
+  // ncclProfileNetPlugin event states
+  ncclProfilerNetPluginUpdate          = 21,// state marks update of network defined event
+
+  // ncclProfileKernelCh event states
+  ncclProfilerKernelChStop             = 22,// state marks stop of kernelCh event and timestamp update
+} ncclProfilerEventState_v4_t;
 ```

 `ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
 network requests for the GPU kernel. ProxyOp events are generated for every active channel and
-provide a summary of the activity of the proxy progress thread for that channel.
+provide a summary of the activity of the proxy progress thread for that channel. Most of the
+states for this event duplicated those of `ncclProfileProxyStep` events. Therefore, starting
+with version 4 of the profiler interface, these states have been deprecated. The same level of
+information can still be obtained through the `ncclProfileProxyStep` events.

 `ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
 network requests for the GPU kernel. ProxyStep events describe individual network transfer in
@@ -348,15 +364,22 @@ reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported

 ```
 typedef union {
-  struct {                // attributes to update for ncclProfileProxyOp events
-    size_t transSize;     // data transferred thus far
-    int steps;            // network transfer/steps processed thus far
-  } proxyOp;
+  struct {                // attributes to update for ncclProfileProxyStep events
+    size_t transSize;     // transfer size field for this proxy step
+  } proxyStep;

-  struct {                // attributes to update for ncclProfileProxyCtrl
+  struct {                // attributes to update for ncclProfileProxyCtrl events
    int appendedProxyOps; // number of appended proxy ops thus far
  } proxyCtrl;
-} ncclProfilerEventStateArgs_v3_t;
+
+  struct {                // attributes to update for ncclProfileNetPlugin events
+    void* data;           // network plugin opaque update data field
+  } netPlugin;
+
+  struct {                // attribute to update for ncclProfileKernelCh events
+    uint64_t pTimer;      // timestamp provided by the NCCL kernel
+  } kernelCh;
+} ncclProfilerEventStateArgs_v4_t;
 ```

 The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
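
For orientation, here is a minimal sketch of how a plugin might consume the v4 state arguments inside its `recordEventState` callback. The `demo*` types are simplified local mirrors of the enum and union above, and `struct demoEvent` is a hypothetical per-event record; none of this is copied from the example plugin. Which ProxyStep states actually carry `transSize` is not stated above, so the sketch accepts it on any of them as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

// Simplified local mirror of a few v4 event states (values as listed above).
typedef enum {
  demoProxyStepSendGPUWait  = 8,
  demoProxyStepSendWait     = 9,
  demoProxyStepRecvWait     = 10,
  demoProxyStepRecvFlushWait = 11,
  demoProxyStepRecvGPUWait  = 12,
  demoProxyCtrlAppendEnd    = 18,
  demoProxyStepSendPeerWait = 20,
  demoNetPluginUpdate       = 21,
  demoKernelChStop          = 22
} demoEventState_t;

// Simplified local mirror of the v4 state-args union.
typedef union {
  struct { size_t transSize; } proxyStep;
  struct { int appendedProxyOps; } proxyCtrl;
  struct { void* data; } netPlugin;
  struct { uint64_t pTimer; } kernelCh;
} demoEventStateArgs_t;

// Hypothetical per-event record owned by the plugin.
struct demoEvent {
  int lastState;
  size_t transSize;
  int appendedProxyOps;
  void* netData;
  uint64_t stopGpuClk;
};

// Record the transition, then copy whichever attribute belongs to that state.
int demoRecordEventState(void* eHandle, demoEventState_t eState,
                         demoEventStateArgs_t* eStateArgs) {
  struct demoEvent* ev = eHandle;
  if (ev == NULL) return 0;           // event type not tracked by this plugin
  ev->lastState = (int)eState;
  if (eStateArgs == NULL) return 0;   // the argument is optional
  switch (eState) {
    case demoProxyStepSendGPUWait:
    case demoProxyStepSendPeerWait:
    case demoProxyStepSendWait:
    case demoProxyStepRecvWait:
    case demoProxyStepRecvFlushWait:
    case demoProxyStepRecvGPUWait:
      ev->transSize = eStateArgs->proxyStep.transSize;  // assumption, see lead-in
      break;
    case demoProxyCtrlAppendEnd:
      ev->appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
      break;
    case demoNetPluginUpdate:
      ev->netData = eStateArgs->netPlugin.data;  // opaque to the profiler
      break;
    case demoKernelChStop:
      ev->stopGpuClk = eStateArgs->kernelCh.pTimer;  // GPU-side stop timestamp
      break;
    default:
      break;
  }
  return 0;
}
```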
@@ -396,12 +419,12 @@ ProxyCtrl event
 ## Profiling of collective and p2p operations

 The NCCL code is instrumented with profiler callbacks at different levels to capture start/stop of groups,
-collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
+collective and point-to-point operations, as well as proxy, kernel and network activity. Due to the asynchronous nature
 of NCCL operations, events associated with collective and point-to-point operations are not easy to delimit
 precisely. For example, without proxy and/or kernel activity it is impossible for the profiler to
 figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
-the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
-these are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
+the profiler that the collective has been enqueued. The profiler can leverage proxy and/or kernel event information, if
+these are enabled, to estimate when the collective ends. For example, the profiler can look at the `stopEvent`
 call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
 can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
 increment and decrement the reference counter, respectively.
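
The reference-counting idea can be sketched as follows. This is an illustrative outline rather than the example plugin's code: `struct demoCollEvent` and the `demo*` functions are hypothetical names, timestamps are plain doubles supplied by the caller, and the sketch assumes child events are started before the collective's last outstanding reference is dropped.

```c
#include <assert.h>
#include <stdatomic.h>

// Hypothetical collective event record with a reference counter.
struct demoCollEvent {
  atomic_int refCount;  // one reference for the collective plus one per live child
  double startTs;
  double stopTs;        // written by whoever drops the last reference
  int completed;
};

// startEvent for the collective: take the initial reference.
void demoCollStart(struct demoCollEvent* c, double now) {
  atomic_init(&c->refCount, 1);
  c->startTs = now;
  c->stopTs = 0.0;
  c->completed = 0;
}

// startEvent for a child (e.g. a ProxyOp): take a reference on the parent.
void demoChildStart(struct demoCollEvent* parent) {
  atomic_fetch_add(&parent->refCount, 1);
}

// Drop one reference; the last one to go marks the collective as complete.
static void demoRelease(struct demoCollEvent* c, double now) {
  int prev = atomic_fetch_sub(&c->refCount, 1);
  assert(prev > 0);
  if (prev == 1) {
    c->stopTs = now;
    c->completed = 1;
  }
}

// stopEvent for the collective only signals that it has been enqueued.
void demoCollStop(struct demoCollEvent* c, double now) { demoRelease(c, now); }

// stopEvent for a child: the last child to stop closes the collective.
void demoChildStop(struct demoCollEvent* parent, double now) { demoRelease(parent, now); }
```

With this layout the collective's `stopTs` ends up being the stop time of its last child, which matches the estimate described above.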
@@ -425,8 +448,14 @@ enqueue can be time stamped by the profiler (at start and stop) to reconstruct t
 collective. However, this time only represents the launch time of the collective and not the actual
 execution time. To reconstruct the execution time more accurately proxy and kernel events are provided.

+With version 3 of the profiler interface, network activity is no longer required to do intra-node profiling.
 Kernel events instrumentation leverages counters exposed by the kernel to the host and the proxy progress
 thread. Thus, the proxy progress thread infrastructure is shared between the network and the profiler. If
 the proxy is serving network requests the kernel profiling probing can be delayed, causing loss of
 accuracy. Similarly, if the CPU is under heavy load and the scheduling of the proxy progress thread is
-delayed, a similar loss of accuracy can be encountered. Keep this in mind when using kernel events.
+delayed, a similar loss of accuracy can be encountered.
+
+To mitigate this effect, with version 4 of the profiler, NCCL uses a per-channel ring buffer of 64 elements.
+Every counter is complemented by two timestamps (ptimers) supplied by the NCCL kernel (one for the start and one
+for the stop of the operation in the kernel). NCCL propagates these timestamps to the profiler plugin so that it can
+convert them to the CPU time domain.
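
One simple way a plugin could map the kernel-supplied timestamps onto its own timeline is sketched below. It is only an outline under stated assumptions: `struct demoKernelCh` and both functions are hypothetical, the host time at which the plugin observed the kernel start is used as the anchor, and the tick length (`nsPerTick`) is a caller-supplied assumption rather than something defined by the profiler interface.

```c
#include <stdint.h>

// Hypothetical record for one kernelCh event.
struct demoKernelCh {
  double   cpuStartUs;  // host time (microseconds) when the plugin saw the start event
  uint64_t gpuStart;    // start ptimer supplied by the kernel
  uint64_t gpuStop;     // stop ptimer supplied by the kernel
};

// Kernel duration in microseconds, computed purely from the GPU timestamps.
double demoKernelDurationUs(const struct demoKernelCh* k, double nsPerTick) {
  if (k->gpuStop < k->gpuStart) return 0.0;  // stop not recorded yet, or out of order
  return (double)(k->gpuStop - k->gpuStart) * nsPerTick / 1000.0;
}

// Stop time in the CPU domain: host-side anchor plus the GPU-measured duration.
double demoKernelStopCpuUs(const struct demoKernelCh* k, double nsPerTick) {
  return k->cpuStartUs + demoKernelDurationUs(k, nsPerTick);
}
```

Because the duration comes from the GPU timestamps alone, a delayed proxy progress thread shifts the anchor but does not stretch the measured kernel time.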