Symmetric memory API and symmetric kernels
 * Redesign from the ground up, enabling major latency and bandwidth
   improvements.
 * Add new API calls to register user-allocated memory among communicator
   ranks into a NCCL window: ncclCommWindowRegister() and
   ncclCommWindowDeregister(). The calls currently support symmetric
   registration for P2P and NVLS, and require VMM memory buffers (i.e.,
   CUMEM must be operational).
 * Implement specialized kernels taking advantage of symmetrically
   registered memory, with performance gains expected particularly for
   small to medium message sizes.
 * The kernels support 32 bit floating point types and smaller, and sum as
   the reduction operator, with no more than one collective operation per
   group.
 * Floating point summation is always done in fp32 accumulators (with the
   exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
   the accuracy with fp8 and fp16 data types should be much improved.
 * This initial implementation supports non-network communicators only (P2P
   and NVLS transports).
 * To explore this functionality users need to use the new memory
   registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag and all
   ranks of a communicator must pass buffers at the same offset in the same
   registration when invoking a collective NCCL operation.

Add support for DGX Spark.

Add support for DirectNIC (CX8) to the internal IB plugin.

Add a new ncclCommShrink() API call
 * It is a non-collective call similar to ncclCommSplit(), which makes it
   possible to exclude some (possibly unresponsive) ranks from the parent
   communicator.

Add support for loading multiple network plugins
 * This enables the creation of generic containers that can work across a
   range of providers.
 * Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to
   load.

NVLink SHARP (NVLS) improvements
 * Implement NVLS+IB SHARP support for AllGather and ReduceScatter with
   user buffer registration. This improves performance and reduces the
   number of CTAs needed to achieve peak bandwidth.
 * Gracefully fall back by default to other transports if NVLS
   initialization fails (the old behavior of returning an error code from a
   NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1).
 * Decrease the NVLS channel count to 24 on Blackwell systems with multiple
   NVLink domains per communicator.
 * Enable fine-tuning of NCCL behavior per communicator using new
   "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs".

Profiler improvements
 * Extend the init function by adding communicator name, comm id (hash),
   rank, number of ranks, number of nodes, and the NCCL log function to the
   argument list. This makes the name and the comm id available to all
   events in the communicator without explicitly passing them to each
   individual event. Add the communicator id and rank to the profiler trace
   filename. Now, the communicator name can be set via a new "ncclConfig_t"
   member "commName".
 * Improve the accuracy of the GPU kernel events by providing GPU-generated
   timestamps for the start and stop of every NCCL operation.
 * Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep
   states.
 * Add support for network-defined event updates (through
   "recordEventState").
 * Report the correct number of channels used by every collective/p2p
   operation (used to be set to nMaxChannels for collectives and absent for
   p2ps).
 * Fix the logic on proxyCtrl Idle/Active events (Issue #1162).
 * Fix an issue where the network proxy profiler could lose track of an
   event identifier (Issue #1682).
 * Improve the backward compatibility with plugins older than v4.
 * Ensure that the work counters are 0-initialized.
 * Fix a potential race condition in the network profiler that could result
   in an event being linked to a wrong parent.

MNNVL improvements
 * Increase to 16 the number of NICs used to communicate between MNNVL
   domains on GB200 systems, to optimize the performance of collective
   operations.
 * Add support for more complex MNNVL topologies with up to 32 NICs per
   node.
 * If the MNNVL fabric initialization was unsuccessful, NCCL will now fail
   by default, so as to avoid inadvertently falling back to a potentially
   much slower network transport. Such failures are typically due to a
   misconfigured IMEX support on the system. To continue without MNNVL,
   restart the job with NCCL_MNNVL_ENABLE=0.
 * Fix a potential hang in alltoall-like communication patterns at a scale
   of over 80 ranks.
 * Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no
   longer needs to be specified on MNNVL systems).
 * Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL
   systems.
 * Fix the graph search to exclude non-local NICs.
 * Fix the SHM transport to use fabric handles on MNNVL systems.

NIC Fusion improvements
 * Disable the creation of fused NICs for physical devices that haven't
   been merged.
 * Flatten multiple ports to a single PCI device within the internal IB
   plugin and reparent dual-port NICs under the first PCI parent. If the
   parent is not a PCI switch, PCI devices for fused NICs won't be
   duplicated.
 * Route traffic on GB200-CX8 systems through DirectNIC, not the host
   interface.

Improve support for platforms with C2C connectivity (e.g., GB200)
 * Enable GPUDirect RDMA for the NICs by default.
 * Add support for P2C (PXN over C2C) and the LL128 protocol.

Extend NCCL fault tolerance in multithreaded scenarios
 * Support the creation of multiple nonblocking communicators within a
   single group and polling in parallel for the completion using multiple
   threads (one per communicator).

Enable ncclImplicitOrderLaunch for CUDA 12.9+
 * This can potentially speed up NCCL_IMPLICIT_LAUNCH_ORDER.

Improve the netSocket transport latency and control
 * Provide finer control over the size of the socket send/receive buffers,
   the task size, and the number of sockets that a single peer can open.
 * Add support for the inlining of small messages behind the header when
   using multiple sockets per connection.

Improve the readability of the CPU affinity in the debug output
 * Print it as a range string rather than a bitmask.

Fix a potential race condition in graph execution
 * A contention could arise when mixing graph and non-graph execution.

Improve PXN connection code
 * Avoid duplicate and unused connections.

RAS fixes
 * Fix a memory corruption at job termination time in case of a previously
   failed initialization of a RAS socket connection.
 * Fix a race condition leading to a crash when generating a RAS report
   during communicator initialization (Issues #1669, #1718).
 * Fix a potential race condition when gathering data for a RAS status
   report.

Fix a potential memory corruption in ncclCommSplit()
 * Memory could get corrupted when resource sharing was in use and the size
   of the NVLink domain in the new communicator was smaller than in the old
   one.

Fix asynchronous graph upload
 * Fix a small memory leak.
 * Fix oversychronization.

Add a check for out-of-memory conditions in ncclMemAlloc()

Clean up the NCCL socket code
 * accept() will retry also if just reading the magic failed (Issue #1613).
 * connect() will retry also if poll() did not return a POLLOUT event
   (Issue #1618).
 * Add error checking in a few instances (Issue #1539).
 * Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574).
 * Clean up the debug output, downgrading WARN messages to INFO in
   non-critical cases, and printing the peer's address where relevant.

Switch NCCL_DEBUG_FILE to line buffering
 * This should help avoid mixed-up partial output lines in multithreaded
   cases.

Other minor fixes
 * Improve the checks for buffer overflows in the graph code (Issue #1585).
 * Extend logging and state clearing to all four events in the internal IB
   plugin (Issue #1650).
 * Fix the error path in case IB communication is not ready (Issue #1489).
 * Add ECE logging for IB fabric.
 * Fix various minor issues in the graph module (Issue #1635).
 * Clean up the debug output in the graph code, downgrading WARN messages
   to INFO in non-critical cases.
 * Add a missing argument to a directSend() call (Issue #1628).
 * Remove duplicate code in sendProxySetup() (Issue #1420).
 * Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507).
 * Fix compiler warnings with GCC 14.
 * Fix a typo in a comment (Issue #1236).
Этот коммит содержится в:
Kamil Iskra
2025-05-29 20:56:40 -07:00
родитель 8171af656b
Коммит 72d2432094
99 изменённых файлов: 7223 добавлений и 2029 удалений
+6
Просмотреть файл
@@ -7,9 +7,15 @@
#ifndef COMMON_H_
#define COMMON_H_
#include <stdint.h>
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
enum { ncclProfilerNetEventStart = 0, ncclProfilerNetEventStop, ncclProfilerNetEventUpdate, ncclProfilerNetEventUpdateAndStop };
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
#endif
+1 -3
Просмотреть файл
@@ -8,9 +8,9 @@
#include <stdint.h>
#include <stdlib.h>
#include "common.h"
#include "err.h"
#include "net_device.h"
#include "common.h"
#define NCCL_NET_HANDLE_MAXSIZE 128
#define NCCL_MAX_NET_SIZE_BYTES (1*1024*1024*1024*1024L) //1TB
@@ -23,8 +23,6 @@
// Maximum number of requests per comm object
#define NCCL_NET_MAX_REQUESTS 32
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
#include "net_v10.h"
#include "net_v9.h"
#include "net_v8.h"
+78 -49
Просмотреть файл
@@ -49,9 +49,9 @@ of newer ones.
The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
from old API versions. It also provides error codes in `err.h`.
# API (v3)
# API (v4)
Below is the main `ncclProfiler_v3` struct. Each function is explained in later sections.
Below is the main `ncclProfiler_v4` struct. Each function is explained in later sections.
```
typedef struct {
@@ -60,9 +60,15 @@ typedef struct {
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask);
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
@@ -70,7 +76,7 @@ typedef struct {
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v3_t* eDescr);
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside and event set
// Input
@@ -82,13 +88,13 @@ typedef struct {
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v3_t eState, ncclProfilerEventStateArgs_v3_t* eStateArgs);
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v3_t;
} ncclProfiler_v4_t;
```
## Error codes
@@ -147,8 +153,6 @@ typedef struct {
int rank; // rank that generated the event
union {
struct { // collective events metadata
const char* name; // string containing name of the communicator
uint64_t commHash; // unique hash/id for the communicator
uint64_t seqNumber; // sequence number of this collective operation in the communicator
const char* func; // string containing name of the collective
void const* sendBuff; // address of send buffer
@@ -156,20 +160,19 @@ typedef struct {
size_t count; // data count
int root; // root rank
const char* datatype; // string containing the name of the datatype
uint8_t nMaxChannels; // max number of channels for this collective
uint8_t nChannels; // number of channels for this collective
uint8_t nWarps; // number of GPU warps for this collective
const char* algo; // string containing name of the algorithm for this collective
const char* proto; // string containing name of the protocol for this collective
} coll;
struct { // point-to-point events metadata
const char* name;
uint64_t commHash;
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer; // peer rank for this point-to-point
uint8_t nChannels; // number of channels for this p2p
} p2p;
struct { // proxyOp events metadata
@@ -178,7 +181,7 @@ typedef struct {
int peer; // peer rank
int nSteps; // number of network transfers/steps required by the `ncclProxyOp`
int chunkSize; // chunk size for this `ncclProxyOp`
int isSend; // set to 1 for sends and 0 for recvs
int isSend; // type of network operation
} proxyOp;
struct { // proxyStep events metadata
@@ -187,6 +190,7 @@ typedef struct {
struct {
uint8_t channelId; // id of the channel used by the kernel
uint64_t ptimer; // kernel supplied timestamp
} kernelCh;
struct {
@@ -194,7 +198,7 @@ typedef struct {
void* data; // pointer to network plugin defined event
} netPlugin;
};
} ncclProfilerEventDescr_v3_t;
} ncclProfilerEventDescr_v4_t;
```
NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
@@ -212,45 +216,57 @@ handle after `eventStop` is undefined behavior.
Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
`ncclProfileP2p`, cannot be updated through calls to `recordEventState`.
`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
calls to `recordEventState`.
`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
`ncclProfileProxyCtrl` can be updated through calls to `recordEventState`.
The state of proxy generated events can be updated, along with event attributes, using
`recordEventState`. These events can go through several states during their lifecycle.
The list of supported states for the proxy-defined events is reported below.
The state of these events can be updated, along with event attributes, using `recordEventState`.
These events can go through several states during their lifecycle.
The list of supported states for the updatable events is reported below.
```
typedef enum {
// ncclProfileProxyOp event states
ncclProfilerProxyOpSendPosted, // state marks the posting of send buffer to GPU for given network transfer/step
ncclProfilerProxyOpSendRemFifoWait, // state marks the waiting of CTS credits from peer rank
ncclProfilerProxyOpSendTransmitted, // state marks the sending of network transfer/step to peer rank
ncclProfilerProxyOpSendDone, // state marks the ending of network transfer/step
ncclProfilerProxyOpRecvPosted, // state marks the posting of recv to network for given network transfer/step
ncclProfilerProxyOpRecvReceived, // state marks the recving of network transfer/step from peer rank
ncclProfilerProxyOpRecvTransmitted, // state marks the ending of the network transfer/step
ncclProfilerProxyOpRecvDone, // state marks the consuming of data from GPU
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,// state marks transition of proxy op to progress
// ncclProfileProxyStep event states
ncclProfilerProxyStepSendGPUWait, // state marks the waiting of send data from GPU for given network transfer/step
ncclProfilerProxyStepSendWait, // state marks the waiting of send data from network for given network transfer/step
ncclProfilerProxyStepRecvWait, // state marks the waiting of recv data from network for given network transfer/step
ncclProfilerProxyStepRecvFlushWait, // state marks the waiting of recv data flush to GPU for given network transfer/step
ncclProfilerProxyStepRecvGPUWait, // state marks the waiting of recv data consumption from GPU for given network transfer/step
ncclProfilerProxyStepSendGPUWait = 8, // state marks the waiting of send data from GPU for given network transfer/step
ncclProfilerProxyStepSendPeerWait_v4 = 20,// state marks the waiting of recv clear to send credits for given network transfer/step
ncclProfilerProxyStepSendWait = 9, // state marks the waiting of send data from network for given network transfer/step
ncclProfilerProxyStepRecvWait = 10,// state marks the waiting of recv data from network for given network transfer/step
ncclProfilerProxyStepRecvFlushWait = 11,// state marks the waiting of recv data flush to GPU for given network transfer/step
ncclProfilerProxyStepRecvGPUWait = 12,// state marks the waiting of recv data consumption from GPU for given network transfer/step
// ncclProfileProxyCtrl event states
ncclProfilerProxyCtrlIdle, // state marks proxy progress thread idle
ncclProfilerProxyCtrlActive, // state marks proxy progress thread active
ncclProfilerProxyCtrlSleep, // state marks proxy progress thread sleeping
ncclProfilerProxyCtrlWakeup, // state marks proxy progress thread waking up
ncclProfilerProxyCtrlAppend, // state marks append of new network work item begin
ncclProfilerProxyCtrlAppendEnd, // state marks append of new network work item end
} ncclProfilerEventState_v3_t;
ncclProfilerProxyCtrlIdle = 13,// state marks proxy progress thread idle
ncclProfilerProxyCtrlActive = 14,// state marks proxy progress thread active
ncclProfilerProxyCtrlSleep = 15,// state marks proxy progress thread sleeping
ncclProfilerProxyCtrlWakeup = 16,// state marks proxy progress thread waking up
ncclProfilerProxyCtrlAppend = 17,// state marks append of new network work item begin
ncclProfilerProxyCtrlAppendEnd = 18,// state marks append of new network work item end
// ncclProfileNetPlugin event states
ncclProfilerNetPluginUpdate = 21,// state marks update of network defined event
// ncclProfileKernelCh event states
ncclProfilerKernelChStop = 22,// state marks stop of kernelCh event and timestamp update
} ncclProfilerEventState_v4_t;
```
`ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyOp events are generated for every active channel and
provide a summary of the activity of the proxy progress thread for that channel.
provide a summary of the activity of the proxy progress thread for that channel. Most of the
states for this event were duplicated with `ncclProfileProxyStep` events. Therefore, starting
with version 4 of the profiler interface these states have been deprecated. The same level of
information can still be obtained through the `ncclProfileProxyStep` events.
`ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyStep events describe individual network transfer in
@@ -348,15 +364,22 @@ reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported
```
typedef union {
struct { // attributes to update for ncclProfileProxyOp events
size_t transSize; // data transferred thus far
int steps; // network transfer/steps processed thus far
} proxyOp;
struct { // attributes for update for ncclProfileProxyStep events
size_t transSize; // transfer size field for this proxy step
} proxyStep;
struct { // attributes to update for ncclProfileProxyCtrl
struct { // attributes to update for ncclProfileProxyCtrl events
int appendedProxyOps; // number of appended proxy ops thus far
} proxyCtrl;
} ncclProfilerEventStateArgs_v3_t;
struct { // attributes to update for ncclProfileNetPlugin events
void* data; // network plugin opaque update data field
} netPlugin;
struct { // attribute to update for ncclProfileKernelCh events
uint64_t pTimer; // timestamp provided by the NCCL kernel
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
```
The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
@@ -396,12 +419,12 @@ ProxyCtrl event
## Profiling of collective and p2p operations
The NCCL code is instrumented with profiler callbacks at different levels to capture start/stop of groups,
collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
collective and point-to-point operations, as well as proxy, kernel and network activity. Due to the asynchronous nature
of NCCL operations, events associated to collective and point-to-point operations are not easy to delimit
precisely. For example, without both proxy and/or kernel activity it is impossible for the profiler to
figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
these are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
the profiler that the collective has been enqueued. The profiler can leverage proxy and/or kernel event information, if
these are enabled, to estimate when the collective ends. For example, the profiler can look at the `stopEvent`
call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
increment and decrement the reference counter, respectively.
@@ -425,8 +448,14 @@ enqueue can be time stamped by the profiler (at start and stop) to reconstruct t
collective. However, this time only represents the launch time of the collective and not the actual
execution time. To reconstruct the execution time more accurately proxy and kernel events are provided.
With version 3 of the profiler interface network activity is no longer required to do intra-node profiling.
Kernel events instrumentation leverages counters exposed by the kernel to the host and the proxy progress
thread. Thus, the proxy progress thread infrastructure is shared between the network and the profiler. If
the proxy is serving network requests the kernel profiling probing can be delayed, causing loss of
accuracy. Similarly, if the CPU is under heavy load and the scheduling of the proxy progress thread is
delayed, a similar loss of accuracy can be encountered. Keep this in mind when using kernel events.
delayed, a similar loss of accuracy can be encountered.
To mitigate this effect, with version 4 of the profiler NCCL uses a per-channel ring buffer of 64 elements.
Every counter is complemented by two timestamps (ptimers) supplied by the NCCL kernel (one for start and one
for stop of the operation in the kernel). NCCL propagates these timestamps to the profiler plugin that it can
convert them to CPU time domain.
+21 -28
Просмотреть файл
@@ -15,24 +15,6 @@
#define MAX_CHANNELS 32
#define MAX_STEPS 16
#define MAX_OPS 16 // Up to 64K ranks for PAT
#define PROXY_OP_SEND_STATE_OFFSET (ncclProfilerProxyOpSendPosted)
#define PROXY_OP_RECV_STATE_OFFSET (ncclProfilerProxyOpRecvPosted)
#define PROXY_STEP_SEND_STATE_OFFSET (ncclProfilerProxyStepSendGPUWait)
#define PROXY_STEP_RECV_STATE_OFFSET (ncclProfilerProxyStepRecvWait)
#define NUM_PROXY_OP_SEND_STATES (ncclProfilerProxyOpSendDone - ncclProfilerProxyOpSendPosted + 1)
#define NUM_PROXY_OP_RECV_STATES (ncclProfilerProxyOpRecvDone - ncclProfilerProxyOpRecvPosted + 1)
#define NUM_PROXY_STEP_SEND_STATES (ncclProfilerProxyStepSendWait - ncclProfilerProxyStepSendGPUWait + 1)
#define NUM_PROXY_STEP_RECV_STATES (ncclProfilerProxyStepRecvGPUWait - ncclProfilerProxyStepRecvWait + 1)
#define PROXY_OP_SEND_STATE_IDX(state) (state - PROXY_OP_SEND_STATE_OFFSET)
#define PROXY_OP_RECV_STATE_IDX(state) (state - PROXY_OP_RECV_STATE_OFFSET)
#define PROXY_STEP_SEND_STATE_IDX(state) (state - PROXY_STEP_SEND_STATE_OFFSET)
#define PROXY_STEP_RECV_STATE_IDX(state) (state - PROXY_STEP_RECV_STATE_OFFSET)
#define MAX_PROXY_OP_STATES ((NUM_PROXY_OP_SEND_STATES > NUM_PROXY_OP_RECV_STATES ) ? NUM_PROXY_OP_SEND_STATES : NUM_PROXY_OP_RECV_STATES)
#define MAX_PROXY_STEP_STATES ((NUM_PROXY_STEP_SEND_STATES > NUM_PROXY_STEP_RECV_STATES) ? NUM_PROXY_STEP_SEND_STATES : NUM_PROXY_STEP_RECV_STATES)
#define MAX_EVENTS_PER_REQ (8)
struct proxyOp;
@@ -68,13 +50,24 @@ struct kernelCh {
struct taskEventBase* parent;
double startTs;
double stopTs;
uint64_t startGpuClk;
uint64_t stopGpuClk;
};
#define PROXY_STEP_SEND_GPU_WAIT 0
#define PROXY_STEP_SEND_PEER_WAIT 1
#define PROXY_STEP_SEND_WAIT 2
#define PROXY_STEP_RECV_WAIT 0
#define PROXY_STEP_RECV_FLUSH_WAIT 1
#define PROXY_STEP_RECV_GPU_WAIT 2
#define PROXY_STEP_MAX_STATES 3
struct proxyStep {
uint8_t type; // type of event: network transfer
int state;
int step; // network transfer id in given channel
int isSend; // send/recv channel operation
double timestamp[MAX_PROXY_STEP_STATES];
double timestamp[PROXY_STEP_MAX_STATES];
double startTs;
double stopTs;
struct proxyOp* parent;
@@ -92,11 +85,8 @@ struct proxyOp {
int chunkSize; // chunk size for this proxy operation
int isSend; // send/recv channel operation
size_t transSize; // transfer data size for this proxy operation
struct {
int steps; // completed steps for this proxy operation state
double timestamp;
} states[MAX_PROXY_OP_STATES];
double startTs;
double progrTs; // In progress state transition
double stopTs;
int stepCount; // last processed network operation for this proxy operation
struct proxyStep step[MAX_STEPS]; // array of network transfer events
@@ -119,8 +109,6 @@ struct proxyCtrl {
struct taskEventBase {
uint8_t type; // event type: collective/p2p
int rank; // rank of the operation in NCCL communicator
const char* name; // FIXME: unused
uint64_t commHash; // communicator identifier
const char* func; // ncclFunc*
int refCount; // number of references for this operation
struct group* parent; // parent event group
@@ -137,12 +125,11 @@ struct collective {
size_t count;
int root;
const char* datatype;
uint8_t nMaxChannels;
uint8_t nChannels;
const char* algo;
const char* proto;
int nWarps;
struct proxyOp send[MAX_CHANNELS][MAX_OPS];// array of send proxy operation events
struct proxyOp recv[MAX_CHANNELS][MAX_OPS];// array of recv proxy operation events
struct proxyOp op[MAX_CHANNELS][2*MAX_OPS];
int nProxyOps[MAX_CHANNELS];
struct kernelCh kernel[MAX_CHANNELS];
};
@@ -154,6 +141,7 @@ struct p2p {
size_t count;
const char* datatype;
int peer;
uint8_t nChannels;
struct proxyOp op[MAX_CHANNELS];
struct kernelCh kernel[MAX_CHANNELS];
};
@@ -172,6 +160,11 @@ struct group {
// arrays for different event objects
struct context {
const char* commName;
uint64_t commHash;
int nranks;
int rank;
int groupPoolSize;
int groupPoolBase;
int groupPoolIndex;
+32 -22
Просмотреть файл
@@ -25,42 +25,52 @@ enum {
};
typedef enum {
ncclProfilerProxyOpSendPosted,
ncclProfilerProxyOpSendRemFifoWait,
ncclProfilerProxyOpSendTransmitted,
ncclProfilerProxyOpSendDone,
ncclProfilerProxyOpRecvPosted,
ncclProfilerProxyOpRecvReceived,
ncclProfilerProxyOpRecvTransmitted,
ncclProfilerProxyOpRecvDone,
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,
/* Legacy proxy profiler states */
ncclProfilerProxyStepSendGPUWait,
ncclProfilerProxyStepSendWait,
ncclProfilerProxyStepRecvWait,
ncclProfilerProxyStepRecvFlushWait,
ncclProfilerProxyStepRecvGPUWait,
ncclProfilerProxyStepSendGPUWait = 8,
ncclProfilerProxyStepSendPeerWait_v4 = 20,
ncclProfilerProxyStepSendWait = 9,
ncclProfilerProxyStepRecvWait = 10,
ncclProfilerProxyStepRecvFlushWait = 11,
ncclProfilerProxyStepRecvGPUWait = 12,
/* Legacy proxy control states */
ncclProfilerProxyCtrlIdle,
ncclProfilerProxyCtrlActive,
ncclProfilerProxyCtrlSleep,
ncclProfilerProxyCtrlWakeup,
ncclProfilerProxyCtrlAppend,
ncclProfilerProxyCtrlAppendEnd,
ncclProfilerProxyCtrlIdle = 13,
ncclProfilerProxyCtrlActive = 14,
ncclProfilerProxyCtrlSleep = 15,
ncclProfilerProxyCtrlWakeup = 16,
ncclProfilerProxyCtrlAppend = 17,
ncclProfilerProxyCtrlAppendEnd = 18,
/* Network defined events states */
ncclProfilerNetPluginUpdate = 21,
/* Kernel event states */
ncclProfilerKernelChStop = 22,
} ncclProfilerEventState_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
#include "profiler_v4.h"
#include "profiler_v3.h"
#include "profiler_v2.h"
#include "profiler_v1.h"
#include "profiler_net.h"
typedef ncclProfiler_v3_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v4_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
#endif // end include guard
-5
Просмотреть файл
@@ -111,9 +111,4 @@ typedef struct {
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v3_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventState_v3_t ncclProfilerEventState_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v3_t ncclProfiler_t;
#endif
+123
Просмотреть файл
@@ -0,0 +1,123 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef PROFILER_V4_H_
#define PROFILER_V4_H_
typedef struct {
uint8_t type; // event type descriptor: ncclProfileColl, ...
void* parentObj; // pointer to the profiler parent object (for coll is the group)
int rank; // originating rank
union {
struct {
uint64_t seqNumber;
const char* func;
void const* sendBuff;
void* recvBuff;
size_t count;
int root;
const char* datatype;
uint8_t nChannels;
uint8_t nWarps;
const char* algo;
const char* proto;
} coll;
struct {
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer;
uint8_t nChannels;
} p2p;
struct {
pid_t pid; // pid of the originating process
uint8_t channelId; // channel id for this proxy operation
int peer; // remote rank for send/recv
int nSteps; // number of steps for this proxy operation
int chunkSize; // amount of data transferred by this proxy operation
int isSend;
} proxyOp;
struct {
int step;
} proxyStep;
struct {
uint8_t channelId;
uint64_t pTimer; // start timestamp from GPU globaltimer
} kernelCh;
struct {
int64_t id;
void* data;
} netPlugin;
};
} ncclProfilerEventDescr_v4_t;
typedef union {
struct {
size_t transSize;
} proxyStep;
struct {
int appendedProxyOps;
} proxyCtrl;
struct {
void* data;
} netPlugin;
struct {
uint64_t pTimer;
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
typedef struct {
const char* name;
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communciator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
// - context: opaque profiler context object
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside and event set
// Input
// - eHandle: handle to event object
ncclResult_t (*stopEvent)(void* eHandle);
// recordEventState - record event state transitions and event attribute updates
// Input
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v4_t;
#endif
+60 -22
Просмотреть файл
@@ -38,6 +38,9 @@ static int detachPoolIndex;
static int detachPoolDone;
static struct proxyOp* detachPool;
ncclDebugLogger_t logFn;
#define INFO(FLAGS, ...) logFn(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
static double freq = -1;
__hidden void calibrate() {
struct timeval tv;
@@ -60,7 +63,7 @@ static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t pid;
static int* eActivationMaskPtr;
__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask) {
__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
pthread_mutex_lock(&lock);
if (__atomic_fetch_add(&initialized, 1, __ATOMIC_RELAXED) == 0) {
// first thread initializes event mask, environment and detach pool
@@ -106,6 +109,13 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask)
// pre-allocate memory for event object pools in dedicated profiler context
struct context* ctx = (struct context *)calloc(1, sizeof(*ctx));
ctx->commName = commName;
ctx->commHash = commHash;
ctx->nranks = nranks;
ctx->rank = rank;
logFn = logfn;
INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d", commName ? commName : "", commHash, nranks, rank);
ctx->groupPool = (struct group *)calloc(groupPoolSize, sizeof(*ctx->groupPool));
if (ctx->groupPool == NULL) goto fail;
@@ -142,17 +152,16 @@ fail:
__hidden ncclResult_t exampleProfilerFinalize(void* context) {
FILE* fh = NULL;
char filename[PATH_MAX] = { 0 };
char hostname[64] = { 0 };
gethostname(hostname, 64);
struct context* ctx = (struct context *)context;
const char* dump = getenv("NCCL_PROFILE_DUMP_FILE");
if (dump) {
sprintf(filename, "%s-%s-%ld.txt", dump, hostname, syscall(SYS_gettid));
sprintf(filename, "%s_%lu_%d.json", dump, ctx->commHash, ctx->rank);
fh = fopen(filename, "w");
fprintf(fh, "[\n");
}
INFO(NCCL_INIT, "PROFILER/Plugin: finalize commName: %s commHash: %lu nranks: %d rank: %d", ctx->commName ? ctx->commName : "", ctx->commHash, ctx->nranks, ctx->rank);
// print last N groups/collectives/p2ps
struct context* ctx = (struct context *)context;
int start = (ctx->groupPoolIndex - groupPoolSize >= 0) ? ctx->groupPoolIndex - groupPoolSize : 0;
int end = ctx->groupPoolIndex;
for (int i = start; i < end; i++) {
@@ -243,8 +252,6 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->base.type = ncclProfileColl;
event->base.rank = eDescr->rank;
event->base.name = eDescr->coll.name;
event->base.commHash = eDescr->coll.commHash;
event->base.func = eDescr->coll.func;
event->base.startTs = gettime() - startTime;
event->base.parent = parent;
@@ -254,7 +261,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->count = eDescr->coll.count;
event->root = eDescr->coll.root;
event->datatype = eDescr->coll.datatype;
event->nMaxChannels = eDescr->coll.nMaxChannels;
event->nChannels = eDescr->coll.nChannels;
event->nWarps = eDescr->coll.nWarps;
event->algo = eDescr->coll.algo;
event->proto = eDescr->coll.proto;
@@ -281,8 +288,6 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->base.type = ncclProfileP2p;
event->base.rank = eDescr->rank;
event->base.name = eDescr->p2p.name;
event->base.commHash = eDescr->p2p.commHash;
event->base.func = eDescr->p2p.func;
event->base.next = parent->eventHead;
event->base.startTs = gettime() - startTime;
@@ -291,6 +296,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->count = eDescr->p2p.count;
event->datatype = eDescr->p2p.datatype;
event->peer = eDescr->p2p.peer;
event->nChannels = eDescr->p2p.nChannels;
*eHandle = event;
// increment the group ref counter so the event will staty open
taskEventQueueEnqueue(parent, (struct taskEventBase *)event);
@@ -331,6 +337,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->startTs = gettime() - startTime;
event->parent = NULL;
event->stepCount = 0;
*eHandle = event;
debugEvent(event, "PxnProxyOpStart");
return ncclSuccess;
@@ -339,9 +346,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
if (eventBase->type == ncclProfileColl) {
struct collective* parent = (struct collective *)eDescr->parentObj;
int channelId = eDescr->proxyOp.channelId;
struct proxyOp* event = (eDescr->proxyOp.isSend) ?
&parent->send[channelId][parent->nProxyOps[channelId]++] :
&parent->recv[channelId][parent->nProxyOps[channelId]++];
struct proxyOp* event = &parent->op[channelId][parent->nProxyOps[channelId]++];
event->type = ncclProfileProxyOp;
event->channelId = channelId;
@@ -353,6 +358,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->parent = eventBase;
event->startTs = gettime() - startTime;
event->stepCount = 0;
*eHandle = event;
__atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
debugEvent(event, "ProxyOpStart");
@@ -370,6 +376,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->parent = eventBase;
event->startTs = gettime() - startTime;
event->stepCount = 0;
*eHandle = event;
__atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
debugEvent(event, "ProxyOpStart");
@@ -382,9 +389,10 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
int s = parent->stepCount++ % MAX_STEPS;
struct proxyStep* event = &parent->step[s];
event->type = ncclProfileProxyStep;
event->state = 0;
event->step = eDescr->proxyStep.step;
event->isSend = parent->isSend;
event->parent = parent;
event->isSend = parent->isSend;
event->startTs = gettime() - startTime;
event->nNetEvents = 0;
*eHandle = event;
@@ -397,6 +405,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
event->type = ncclProfileKernelCh;
event->channelId = eDescr->kernelCh.channelId;
event->startGpuClk = eDescr->kernelCh.pTimer;
event->parent = eventBase;
event->startTs = gettime() - startTime;
*eHandle = event;
@@ -407,6 +416,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
event->type = ncclProfileKernelCh;
event->channelId = eDescr->kernelCh.channelId;
event->startGpuClk = eDescr->kernelCh.pTimer;
event->parent = eventBase;
event->startTs = gettime() - startTime;
*eHandle = event;
@@ -563,29 +573,57 @@ __hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfile
// the event handle might be null if we run out of events
if (eHandle == NULL) return ncclSuccess;
debugEvent(eHandle, "RecordEventState");
uint8_t type = *(uint8_t *)eHandle;
if (type == ncclProfileProxyOp) {
struct proxyOp* event = (struct proxyOp *)eHandle;
int steps = event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].steps;
if (eState == ncclProfilerProxyOpSendRemFifoWait && eStateArgs->proxyOp.steps == steps) return ncclSuccess;
event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].steps = eStateArgs->proxyOp.steps;
event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].timestamp = gettime() - startTime;
event->transSize = eStateArgs->proxyOp.transSize;
if (eState == ncclProfilerProxyOpInProgress_v4) {
event->progrTs = gettime() - startTime;
}
} else if (type == ncclProfileProxyStep) {
struct proxyStep* event = (struct proxyStep *)eHandle;
event->timestamp[event->isSend ? PROXY_STEP_SEND_STATE_IDX(eState) : PROXY_STEP_RECV_STATE_IDX(eState)] = gettime() - startTime;
struct proxyOp* parent = event->parent;
switch (eState) {
case ncclProfilerProxyStepSendGPUWait:
event->timestamp[PROXY_STEP_SEND_GPU_WAIT] = gettime() - startTime;
break;
case ncclProfilerProxyStepSendPeerWait_v4:
// do not update step event if in SendPeerWait
if (event->state == ncclProfilerProxyStepSendPeerWait_v4) break;
event->timestamp[PROXY_STEP_SEND_PEER_WAIT] = gettime() - startTime;
event->state = ncclProfilerProxyStepSendPeerWait_v4;
break;
case ncclProfilerProxyStepSendWait:
event->timestamp[PROXY_STEP_SEND_WAIT] = gettime() - startTime;
parent->transSize += eStateArgs->proxyStep.transSize;
break;
case ncclProfilerProxyStepRecvWait:
event->timestamp[PROXY_STEP_RECV_WAIT] = gettime() - startTime;
break;
case ncclProfilerProxyStepRecvFlushWait:
event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT] = gettime() - startTime;
parent->transSize += eStateArgs->proxyStep.transSize;
break;
case ncclProfilerProxyStepRecvGPUWait:
event->timestamp[PROXY_STEP_RECV_GPU_WAIT] = gettime() - startTime;
break;
}
} else if (type == ncclProfileProxyCtrl) {
struct proxyCtrl* event = (struct proxyCtrl *)eHandle;
if (eState == ncclProfilerProxyCtrlAppendEnd) {
event->appended = eStateArgs->proxyCtrl.appendedProxyOps;
}
event->state = eState;
} else if (type == ncclProfileKernelCh) {
struct kernelCh* event = (struct kernelCh *)eHandle;
if (eState == ncclProfilerKernelChStop) {
event->stopGpuClk = eStateArgs->kernelCh.pTimer;
}
}
debugEvent(eHandle, "RecordEventState");
return ncclSuccess;
}
ncclProfiler_t ncclProfiler_v3 = {
ncclProfiler_t ncclProfiler_v4 = {
"Example-profiler",
exampleProfilerInit,
exampleProfilerStartEvent,
+37 -40
Просмотреть файл
@@ -27,8 +27,8 @@ __hidden void printGroupEventTrailer(FILE* fh, struct group* event) {
static __thread int collId;
__hidden void printCollEventHeader(FILE* fh, struct collective* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nMaxChannels\": %d}},\n",
event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nMaxChannels);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nChannels\": %d}},\n",
event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.parent->ctx->commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nChannels);
}
__hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
@@ -38,8 +38,8 @@ __hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
static __thread int p2pId;
__hidden void printP2pEventHeader(FILE* fh, struct p2p* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\"}},\n",
event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.commHash, event->base.rank, event->peer, event->count, event->datatype);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"nChannels\": %d}},\n",
event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.parent->ctx->commHash, event->base.rank, event->peer, event->count, event->datatype, event->nChannels);
}
__hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
@@ -50,47 +50,43 @@ __hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
static __thread int proxyOpId;
__hidden void printProxyOpEventHeader(FILE* fh, struct proxyOp* event) {
if (event->isSend) {
int posted = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendPosted);
int remFifoWait = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendRemFifoWait);
int transmitted = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendTransmitted);
int done = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendDone);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu, \"POSTED\": {\"step\": %d, \"ts\": %f}, \"REM_FIFO_WAIT\": {\"step\": %d, \"ts\": %f}, \"TRANSMITTED\": {\"step\": %d, \"ts\": %f}, \"DONE\": {\"step\": %d, \"ts\": %f}}},\n",
"Send", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize, event->states[posted].steps, event->states[posted].timestamp, event->states[remFifoWait].steps, event->states[remFifoWait].timestamp, event->states[transmitted].steps, event->states[transmitted].timestamp, event->states[done].steps, event->states[done].timestamp);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ScheduleSend", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"ScheduleSend", proxyOpId, getpid(), 1, event->progrTs);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ProgressSend", proxyOpId, getpid(), 1, event->progrTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
} else {
int posted = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvPosted);
int received = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvReceived);
int transmitted = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvTransmitted);
int done = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvDone);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu, \"POSTED\": {\"step\": %d, \"ts\": %f}, \"RECEIVED\": {\"step\": %d, \"ts\": %f}, \"TRANSMITTED\": {\"step\": %d, \"ts\": %f}, \"DONE\": {\"step\": %d, \"ts\": %f}}},\n",
"Recv", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize, event->states[posted].steps, event->states[posted].timestamp, event->states[received].steps, event->states[received].timestamp, event->states[transmitted].steps, event->states[transmitted].timestamp, event->states[done].steps, event->states[done].timestamp);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ScheduleRecv", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"ScheduleRecv", proxyOpId, getpid(), 1, event->progrTs);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ProgressRecv", proxyOpId, getpid(), 1, event->progrTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
}
}
__hidden void printProxyOpEventTrailer(FILE* fh, struct proxyOp* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
event->isSend ? "Send" : "Recv", proxyOpId++, getpid(), 1, event->stopTs);
event->isSend ? "ProgressSend" : "ProgressRecv", proxyOpId++, getpid(), 1, event->stopTs);
}
static __thread int proxyStepId;
__hidden void printProxyStepEventHeader(FILE* fh, struct proxyStep* event) {
if (event->isSend) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendBufferWait", proxyStepId, getpid(), 1, event->startTs, event->step);
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_GPU_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"SendBufferWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendGPUWait)]);
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_PEER_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendGPUWait)], event->step);
"SendPeerWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_PEER_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendWait)]);
"SendPeerWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendWait)], event->step);
"SendWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_WAIT], event->step);
} else {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvBufferWait", proxyStepId, getpid(), 1, event->startTs, event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvBufferWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvWait)]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvWait)], event->step);
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_WAIT], event->step);
}
}
@@ -100,13 +96,13 @@ __hidden void printProxyStepEventTrailer(FILE* fh, struct proxyStep* event) {
"SendWait", proxyStepId++, getpid(), 1, event->stopTs);
} else {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvFlushWait)]);
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvFlushWait)], event->step);
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvGPUWait)]);
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_GPU_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvGPUWait)], event->step);
"RecvGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_GPU_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvGpuWait", proxyStepId++, getpid(), 1, event->stopTs);
}
@@ -115,8 +111,8 @@ __hidden void printProxyStepEventTrailer(FILE* fh, struct proxyStep* event) {
static __thread int kernelId;
__hidden void printKernelChEventHeader(FILE* fh, struct kernelCh* event) {
if (event->type != ncclProfileKernelCh) return;
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GPU\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d}},\n",
"KernelCh", kernelId, getpid(), 1, event->startTs, event->channelId);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GPU\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"StartGpuClk\": %lu, \"StopGpuClk\": %lu}},\n",
"KernelCh", kernelId, getpid(), 1, event->startTs, event->channelId, event->startGpuClk, event->stopGpuClk);
}
__hidden void printKernelChEventTrailer(FILE* fh, struct kernelCh* event) {
@@ -134,6 +130,8 @@ __hidden void printProxyCtrlEvent(FILE* fh, struct proxyCtrl* event) {
str = "Sleep";
} else if (event->state == ncclProfilerProxyCtrlAppend || event->state == ncclProfilerProxyCtrlAppendEnd) {
str = "Append";
} else {
return;
}
if (event->state == ncclProfilerProxyCtrlAppendEnd) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"appended\": %d}},\n",
@@ -188,9 +186,8 @@ void debugEvent(void* eHandle, const char* tag) {
fprintf(fh, "Collective event %p tag = %s {\n", event, tag);
fprintf(fh, " refCount = %d\n", __atomic_load_n(&event->base.refCount, __ATOMIC_RELAXED));
fprintf(fh, " parent = %p\n", event->base.parent);
for (int j = 0; j < MAX_OPS; j++) {
for (int i = 0; i < MAX_CHANNELS; i++) if (event->send[i][j].type == ncclProfileProxyOp) fprintf(fh, " send[%d] = %p\n", i, &event->send[i]);
for (int i = 0; i < MAX_CHANNELS; i++) if (event->recv[i][j].type == ncclProfileProxyOp) fprintf(fh, " recv[%d] = %p\n", i, &event->recv[i]);
for (int j = 0; j < 2*MAX_OPS; j++) {
for (int i = 0; i < MAX_CHANNELS; i++) if (event->op[i][j].type == ncclProfileProxyOp) fprintf(fh, " op[%d] = %p\n", i, &event->op[i]);
}
fprintf(fh, " startTs = %f\n", event->base.startTs);
fprintf(fh, " stopTs = %f\n", event->base.stopTs);
@@ -207,17 +204,18 @@ void debugEvent(void* eHandle, const char* tag) {
} else if (type == ncclProfileProxyOp) {
struct proxyOp* event = (struct proxyOp *)eHandle;
fprintf(fh, "ProxyOp event %p tag = %s {\n", event, tag);
fprintf(fh, " type = %s\n", event->isSend ? "Send" : "Recv");
fprintf(fh, " type = %s\n", event->isSend < 0 ? "Unknown" : event->isSend ? "Send" : "Recv");
fprintf(fh, " channel = %d\n", event->channelId);
fprintf(fh, " parent = %p\n", event->parent);
fprintf(fh, " rank = %d\n", event->rank);
fprintf(fh, " startTs = %f\n", event->startTs);
fprintf(fh, " progrTs = %f\n", event->progrTs);
fprintf(fh, " stopTs = %f\n", event->stopTs);
fprintf(fh, "}\n");
} else if (type == ncclProfileProxyStep) {
struct proxyStep* event = (struct proxyStep *)eHandle;
fprintf(fh, "ProxyStep event %p tag = %s {\n", event, tag);
fprintf(fh, " type = %s\n", event->isSend ? "Send" : "Recv");
fprintf(fh, " type = %s\n", event->isSend < 0 ? "Unknown" : event->isSend ? "Send" : "Recv");
fprintf(fh, " parent = %p\n", event->parent);
fprintf(fh, " startTs = %f\n", event->startTs);
fprintf(fh, " stopTs = %f\n", event->stopTs);
@@ -260,8 +258,7 @@ void printEvent(FILE* fh, void* handle) {
for (int i = 0; i < MAX_CHANNELS; i++) {
printKernelChEventHeader(fh, &c->kernel[i]);
for (int j = 0; j < c->nProxyOps[i]; j++) {
printEvent(fh, &c->send[i][j]);
printEvent(fh, &c->recv[i][j]);
printEvent(fh, &c->op[i][j]);
}
printKernelChEventTrailer(fh, &c->kernel[i]);
}
+3
Просмотреть файл
@@ -7,6 +7,9 @@
#ifndef PRINT_EVENT_H_
#define PRINT_EVENT_H_
#include "nccl/common.h"
extern ncclDebugLogger_t logFn;
void debugEvent(void* eHandle, const char* tag);
void printEvent(FILE* fh, void* handle);
+25 -6
Просмотреть файл
@@ -17,6 +17,8 @@ PROFAPI ?= 1
NVTX ?= 1
RDMA_CORE ?= 0
NET_PROFILER ?= 0
MLX5DV ?= 0
MAX_EXT_NET_PLUGINS ?= 0
NVCC = $(CUDA_HOME)/bin/nvcc
@@ -49,8 +51,10 @@ CUDA11_PTX = -gencode=arch=compute_80,code=compute_80
CUDA12_PTX = -gencode=arch=compute_90,code=compute_90
CUDA13_PTX = -gencode=arch=compute_120,code=compute_120
ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 12; echo $$?),0)
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
# Prior to SM75 is deprecated from CUDA13.0 onwards
NVCC_GENCODE ?= $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8; echo $$?),0)
# Include Blackwell support if we're using CUDA12.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 11 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 11; echo $$?),0)
@@ -66,14 +70,21 @@ else
endif
$(info NVCC_GENCODE is ${NVCC_GENCODE})
# CUDA 13.0 requires c++17
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
CXXSTD ?= -std=c++17
else
CXXSTD ?= -std=c++11
endif
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
-Wall -Wno-unused-function -Wno-sign-compare -std=c++11 -Wvla \
-I $(CUDA_INC) \
-Wall -Wno-unused-function -Wno-sign-compare $(CXXSTD) -Wvla \
-I $(CUDA_INC) -I $(CUDA_INC)/cccl \
$(CXXFLAGS)
# Maxrregcount needs to be set accordingly to NCCL_MAX_NTHREADS (otherwise it will cause kernel launch errors)
# 512 : 120, 640 : 96, 768 : 80, 1024 : 60
# We would not have to set this if we used __launch_bounds__, but this only works on kernels, not on functions.
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11 --expt-extended-lambda -Xptxas -maxrregcount=96 -Xfatbin -compress-all
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) $(CXXSTD) --expt-extended-lambda -Xptxas -maxrregcount=96 -Xfatbin -compress-all
# Use addprefix so that we can specify more than one path
NVLDFLAGS := -L${CUDA_LIB} -lcudart -lrt
@@ -136,9 +147,17 @@ CXXFLAGS += -DPROFAPI
endif
ifneq ($(RDMA_CORE), 0)
CXXFLAGS += -DNCCL_BUILD_RDMA_CORE=1
CXXFLAGS += -DNCCL_BUILD_RDMA_CORE=1 -libverbs
endif
ifneq ($(MLX5DV), 0)
CXXFLAGS += -DNCCL_BUILD_MLX5DV=1 -lmlx5
endif
ifneq ($(NET_PROFILER), 0)
CXXFLAGS += -DNCCL_ENABLE_NET_PROFILING=1
endif
ifneq ($(MAX_EXT_NET_PLUGINS), 0)
CXXFLAGS += -DNCCL_NET_MAX_PLUGINS=$(MAX_EXT_NET_PLUGINS)
endif
+2 -2
Просмотреть файл
@@ -1,6 +1,6 @@
##### version
NCCL_MAJOR := 2
NCCL_MINOR := 26
NCCL_PATCH := 6
NCCL_MINOR := 27
NCCL_PATCH := 3
NCCL_SUFFIX :=
PKG_REVISION := 1
+1 -1
Просмотреть файл
@@ -10,7 +10,7 @@ include ../makefiles/version.mk
INCEXPORTS := nccl.h
LIBSRCFILES := \
bootstrap.cc channel.cc collectives.cc debug.cc enqueue.cc group.cc \
init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc \
init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc allocator.cc symmetric.cc \
$(wildcard graph/*.cc) \
$(wildcard misc/*.cc) \
$(wildcard transport/*.cc) \
+196
Просмотреть файл
@@ -0,0 +1,196 @@
/*************************************************************************
* Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "comm.h"
#include "transport.h"
#include "group.h"
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
#if CUDART_VERSION >= 12010
size_t memGran = 0;
CUdevice currentDev;
CUmemAllocationProp memprop = {};
CUmemAccessDesc accessDesc = {};
CUmemGenericAllocationHandle handle = (CUmemGenericAllocationHandle)-1;
int cudaDev;
int flag;
int dcnt;
if (ptr == NULL || size == 0) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUDACHECK(cudaGetDevice(&cudaDev));
CUCHECK(cuDeviceGet(&currentDev, cudaDev));
if (ncclCuMemEnable()) {
size_t handleSize = size;
int requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
// Query device to see if FABRIC handle support is available
flag = 0;
(void) CUPFN(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, currentDev));
if (flag) requestedHandleTypes |= CU_MEM_HANDLE_TYPE_FABRIC;
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
memprop.location.id = currentDev;
// Query device to see if RDMA support is available
flag = 0;
CUCHECK(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, currentDev));
if (flag) memprop.allocFlags.gpuDirectRDMACapable = 1;
CUCHECK(cuMemGetAllocationGranularity(&memGran, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
CUDACHECK(cudaGetDeviceCount(&dcnt));
ALIGN_SIZE(handleSize, memGran);
if (requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) {
/* First try cuMemCreate() with FABRIC handle support and then remove if it fails */
CUresult err = CUPFN(cuMemCreate(&handle, handleSize, &memprop, 0));
if (err == CUDA_ERROR_NOT_PERMITTED || err == CUDA_ERROR_NOT_SUPPORTED) {
requestedHandleTypes &= ~CU_MEM_HANDLE_TYPE_FABRIC;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
} else if (err != CUDA_SUCCESS) {
// Catch and report any error from above
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
} else {
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
/* Reserve a virtual address range */
CUCHECK(cuMemAddressReserve((CUdeviceptr*)ptr, handleSize, memGran, 0, 0));
/* Map the virtual address range to the physical allocation */
CUCHECK(cuMemMap((CUdeviceptr)*ptr, handleSize, 0, handle, 0));
/* Now allow RW access to the newly mapped memory */
for (int i = 0; i < dcnt; ++i) {
int p2p = 0;
if (i == cudaDev || ((cudaDeviceCanAccessPeer(&p2p, i, cudaDev) == cudaSuccess) && p2p)) {
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = i;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECK(cuMemSetAccess((CUdeviceptr)*ptr, handleSize, &accessDesc, 1));
}
if (0 == p2p && i != cudaDev) INFO(NCCL_ALLOC, "P2P not supported between GPU%d and GPU%d", cudaDev, i);
}
goto exit;
}
fallback:
#endif
// Coverity is right to complain that we may pass a NULL ptr to cudaMalloc. That's deliberate though:
// we want CUDA to return an error to the caller.
// coverity[var_deref_model]
CUDACHECKGOTO(cudaMalloc(ptr, size), ret, fail);
exit:
return ret;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
int saveDevice;
CUDACHECK(cudaGetDevice(&saveDevice));
#if CUDART_VERSION >= 12010
CUdevice ptrDev = 0;
if (ptr == NULL) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUCHECKGOTO(cuPointerGetAttribute((void*)&ptrDev, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr), ret, fail);
CUDACHECKGOTO(cudaSetDevice((int)ptrDev), ret, fail);
if (ncclCuMemEnable()) {
NCCLCHECKGOTO(ncclCuMemFree(ptr), ret, fail);
goto exit;
}
fallback:
#endif
CUDACHECKGOTO(cudaFree(ptr), ret, fail);
exit:
CUDACHECK(cudaSetDevice(saveDevice));
return ret;
fail:
goto exit;
}
// This is a collective function and should be called by all ranks in the communicator
ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr) {
ncclResult_t ret = ncclSuccess;
void* regSymAddr = NULL;
size_t allocSize = size;
size_t granularity;
CUdevice cuDev;
CUmemAllocationProp memprop = {};
CUmemGenericAllocationHandle memHandle;
int bit = 0, cnt = 0;
// aligment must be power of 2 as an input
while (bit < sizeof(size_t) * 8) {
if (alignment & (1L << bit)) cnt++;
if (cnt == 2) {
WARN("rank %d alignment %ld is not power of 2", comm->rank, alignment);
goto fail;
}
bit++;
}
// temporarily align the alignment to NCCL_REC_PAGE_SIZE
ALIGN_SIZE(alignment, NCCL_REC_PAGE_SIZE);
CUCHECKGOTO(cuDeviceGet(&cuDev, comm->cudaDev), ret, fail);
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = ncclCuMemHandleType;
memprop.location.id = cuDev;
CUCHECKGOTO(cuMemGetAllocationGranularity(&granularity, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED), ret, fail);
ALIGN_SIZE(allocSize, granularity);
CUCHECKGOTO(cuMemCreate(&memHandle, allocSize, &memprop, 0), ret, fail);
ALIGN_SIZE(comm->symAllocHead, alignment);
NCCLCHECKGOTO(ncclIpcSymmetricMap(comm, comm->symAllocHead, allocSize, memHandle, &regSymAddr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricMap(comm, comm->symAllocHead, allocSize, regSymAddr), ret, fail);
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
comm->symAllocHead += allocSize;
*symPtr = regSymAddr;
exit:
return ret;
fail:
*symPtr = NULL;
goto exit;
}
ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr) {
CUmemGenericAllocationHandle handle;
size_t size = 0;
ncclResult_t ret = ncclSuccess;
int saveDev = comm->cudaDev;
CUDACHECKGOTO(cudaGetDevice(&saveDev), ret, fail);
if (ncclCuMemEnable()) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
CUCHECKGOTO(cuMemRetainAllocationHandle(&handle, symPtr), ret, fail);
CUCHECKGOTO(cuMemRelease(handle), ret, fail);
CUCHECKGOTO(cuMemGetAddressRange(NULL, &size, (CUdeviceptr)symPtr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricFree(comm, size, symPtr), ret, fail);
NCCLCHECKGOTO(ncclIpcSymmetricFree(comm, size, symPtr), ret, fail);
CUCHECKGOTO(cuMemRelease(handle), ret, fail);
}
exit:
CUDACHECK(cudaSetDevice(saveDev));
return ret;
fail:
goto exit;
}
+6 -3
Просмотреть файл
@@ -94,6 +94,7 @@ ncclResult_t bootstrapNetInit() {
pthread_mutex_lock(&bootstrapNetLock);
if (bootstrapNetInitDone == 0) {
const char* env = ncclGetEnv("NCCL_COMM_ID");
int nIfs = 0;
if (env) {
union ncclSocketAddress remoteAddr;
if (ncclSocketGetAddrFromString(&remoteAddr, env) != ncclSuccess) {
@@ -101,13 +102,15 @@ ncclResult_t bootstrapNetInit() {
pthread_mutex_unlock(&bootstrapNetLock);
return ncclInvalidArgument;
}
if (ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE, 1) <= 0) {
NCCLCHECK(ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE,
&nIfs));
if (nIfs <= 0) {
WARN("NET/Socket : No usable listening interface found");
pthread_mutex_unlock(&bootstrapNetLock);
return ncclSystemError;
}
} else {
int nIfs = ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1);
NCCLCHECK(ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1, &nIfs));
if (nIfs <= 0) {
WARN("Bootstrap : no socket interface found");
pthread_mutex_unlock(&bootstrapNetLock);
@@ -828,7 +831,7 @@ ncclResult_t bootstrapSplit(uint64_t magic, struct ncclComm* comm, struct ncclCo
NCCLCHECKGOTO(ncclCalloc(&state->peerP2pAddresses, nranks), ret, fail);
memcpy(state->peerP2pAddresses + rank, &peerSocketAddress, sizeof(union ncclSocketAddress));
if (parent->config.splitShare) {
if (parent->shareResources) {
/* map local rank to top parent local rank. */
for (int i = 0; i < nranks; ++i) {
comm->topParentRanks[i] = parent->topParentRanks[parentRanks[i]];
+1 -1
Просмотреть файл
@@ -147,7 +147,7 @@ ncclResult_t initCollnetChannel(struct ncclComm* comm, int channelId, struct ncc
ncclResult_t freeChannel(struct ncclChannel* channel, int nRanks, int collnetNRanks, int nvlsNRanks) {
int nPeers = nRanks + collnetNRanks + nvlsNRanks;
/* channel peers are only valid when async init thread completes commAlloc() and
* the channel is intialized with initChannel(); if either is not done, this channel
* the channel is initialized with initChannel(); if either is not done, this channel
* should never be free. */
if (channel->id == -1 || channel->peers == NULL) return ncclSuccess;
+42 -18
Просмотреть файл
@@ -16,6 +16,8 @@
#include <chrono>
#include "param.h"
#define NCCL_DEBUG_RESET_TRIGGERED (-2)
int ncclDebugLevel = -1;
static uint32_t ncclDebugTimestampLevels = 0; // bitmaps of levels that have timestamps turned on
static char ncclDebugTimestampFormat[256]; // with space for subseconds
@@ -26,7 +28,7 @@ static int pid = -1;
static char hostname[1024];
thread_local int ncclDebugNoWarn = 0;
char ncclLastError[1024] = ""; // Global string for the last error in human readable form
static uint64_t ncclDebugMask = NCCL_INIT | NCCL_BOOTSTRAP | NCCL_ENV; // Default debug sub-system mask is INIT and ENV
static uint64_t ncclDebugMask = 0;
FILE *ncclDebugFile = stdout;
static pthread_mutex_t ncclDebugLock = PTHREAD_MUTEX_INITIALIZER;
static std::chrono::steady_clock::time_point ncclEpoch;
@@ -34,11 +36,16 @@ static bool ncclWarnSetDebugInfo = false;
static __thread int tid = -1;
// This function must be called with ncclDebugLock locked!
static void ncclDebugInit() {
pthread_mutex_lock(&ncclDebugLock);
if (ncclDebugLevel != -1) { pthread_mutex_unlock(&ncclDebugLock); return; }
const char* nccl_debug = ncclGetEnv("NCCL_DEBUG");
int tempNcclDebugLevel = -1;
uint64_t tempNcclDebugMask = NCCL_INIT | NCCL_BOOTSTRAP | NCCL_ENV; // Default debug sub-system mask
if (ncclDebugLevel == NCCL_DEBUG_RESET_TRIGGERED && ncclDebugFile != stdout) {
// Finish the reset initiated via ncclResetDebugInit().
fclose(ncclDebugFile);
ncclDebugFile = stdout;
}
if (nccl_debug == NULL) {
tempNcclDebugLevel = NCCL_LOG_NONE;
} else if (strcasecmp(nccl_debug, "VERSION") == 0) {
@@ -61,7 +68,7 @@ static void ncclDebugInit() {
if (ncclDebugSubsysEnv != NULL) {
int invert = 0;
if (ncclDebugSubsysEnv[0] == '^') { invert = 1; ncclDebugSubsysEnv++; }
ncclDebugMask = invert ? ~0ULL : 0ULL;
tempNcclDebugMask = invert ? ~0ULL : 0ULL;
char *ncclDebugSubsys = strdup(ncclDebugSubsysEnv);
char *subsys = strtok(ncclDebugSubsys, ",");
while (subsys != NULL) {
@@ -102,7 +109,7 @@ static void ncclDebugInit() {
mask = NCCL_ALL;
}
if (mask) {
if (invert) ncclDebugMask &= ~mask; else ncclDebugMask |= mask;
if (invert) tempNcclDebugMask &= ~mask; else tempNcclDebugMask |= mask;
}
subsys = strtok(NULL, ",");
}
@@ -246,15 +253,15 @@ static void ncclDebugInit() {
if (debugFn[0] != '\0') {
FILE *file = fopen(debugFn, "w");
if (file != nullptr) {
setbuf(file, nullptr); // disable buffering
setlinebuf(file); // disable block buffering
ncclDebugFile = file;
}
}
}
ncclEpoch = std::chrono::steady_clock::now();
ncclDebugMask = tempNcclDebugMask;
__atomic_store_n(&ncclDebugLevel, tempNcclDebugLevel, __ATOMIC_RELEASE);
pthread_mutex_unlock(&ncclDebugLock);
}
/* Common logging function used by the INFO, WARN and TRACE macros
@@ -262,19 +269,38 @@ static void ncclDebugInit() {
* they can share the debugging mechanisms and output files
*/
void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *filefunc, int line, const char *fmt, ...) {
if (__atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE) == -1) ncclDebugInit();
bool locked = false; // Keeps track of the ncclDebugLock state.
int gotLevel = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
if (ncclDebugNoWarn != 0 && level == NCCL_LOG_WARN) { level = NCCL_LOG_INFO; flags = ncclDebugNoWarn; }
// Save the last error (WARN) as a human readable string
if (level == NCCL_LOG_WARN) {
pthread_mutex_lock(&ncclDebugLock);
locked = true;
va_list vargs;
va_start(vargs, fmt);
(void) vsnprintf(ncclLastError, sizeof(ncclLastError), fmt, vargs);
va_end(vargs);
pthread_mutex_unlock(&ncclDebugLock);
}
if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) return;
if (gotLevel >= 0 && (gotLevel < level || (flags & ncclDebugMask) == 0)) {
if (locked)
pthread_mutex_unlock(&ncclDebugLock);
return;
}
if (!locked) {
pthread_mutex_lock(&ncclDebugLock);
locked = true;
}
// From this point on ncclDebugLock is always locked so we don't need to check "locked" anymore.
if (ncclDebugLevel < 0)
ncclDebugInit();
if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) {
pthread_mutex_unlock(&ncclDebugLock);
return;
}
if (tid == -1) {
tid = syscall(SYS_gettid);
@@ -335,7 +361,7 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
// Add level specific formatting.
if (level == NCCL_LOG_WARN) {
len += snprintf(buffer+len, sizeof(buffer)-len, "[%d] %s:%d NCCL WARN ", cudaDev, filefunc, line);
if (ncclWarnSetDebugInfo) ncclDebugLevel = NCCL_LOG_INFO;
if (ncclWarnSetDebugInfo) __atomic_store_n(&ncclDebugLevel, NCCL_LOG_INFO, __ATOMIC_RELEASE);
} else if (level == NCCL_LOG_INFO) {
len += snprintf(buffer+len, sizeof(buffer)-len, "[%d] NCCL INFO ", cudaDev);
} else if (level == NCCL_LOG_TRACE && flags == NCCL_CALL) {
@@ -360,19 +386,17 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
// necessary since we write bytes instead of the string.
buffer[len++] = '\n';
fwrite(buffer, 1, len, ncclDebugFile);
pthread_mutex_unlock(&ncclDebugLock);
}
NCCL_API(void, ncclResetDebugInit);
void ncclResetDebugInit() {
// Cleans up from a previous ncclDebugInit() and reruns.
// Use this after changing NCCL_DEBUG and related parameters in the environment.
__atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
if (ncclDebugFile != stdout) {
fclose(ncclDebugFile);
ncclDebugFile = stdout;
}
ncclDebugLevel = -1;
ncclDebugInit();
pthread_mutex_lock(&ncclDebugLock);
// Let ncclDebugInit() know to complete the reset.
__atomic_store_n(&ncclDebugLevel, NCCL_DEBUG_RESET_TRIGGERED, __ATOMIC_RELEASE);
pthread_mutex_unlock(&ncclDebugLock);
}
NCCL_PARAM(SetThreadName, "SET_THREAD_NAME", 0);
+36 -9
Просмотреть файл
@@ -23,6 +23,9 @@ INCFLAGS = -I. -I.. -I$(BUILDDIR)/include -I../include
NVCUFLAGS += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
CXXFLAGS += $(INCFLAGS)
NVCUFLAGS_SYM := -ccbin $(CXX) $(CXXSTD) --expt-extended-lambda -Xptxas -maxrregcount=128 -Xfatbin -compress-all
NVCUFLAGS_SYM += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
SAY = @bash -c 'path="$$2"; [[ "$$(realpath "$$2")" =~ ^$(subst .,\.,$(abspath $(NCCLDIR)))/(.*)$$ ]] && path="$${BASH_REMATCH[1]}"; printf "%-15s %s\n" "$$1" "$$path"' SAY
COMPILE.cu = $(NVCC) $(NVCUFLAGS) -dc $2 -o $1
@@ -30,7 +33,22 @@ COMPILE.cc = $(CXX) $(CXXFLAGS) -c $2 -o $1
define COMPILE
@$(SAY) "Compiling" $2;\
mkdir -p $(dir $1);\
$(call COMPILE$(suffix $2),$1,$2)
$(call COMPILE$(or $3,$(suffix $2)),$1,$2)
endef
ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12080))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100a,code=sm_100a \
-gencode=arch=compute_120a,code=sm_120a
else ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12070))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100a,code=sm_100a
else
NVCC_GENCODE_LDMC_FP8 =
endif
define COMPILE_SYM
@$(SAY) "Compiling" $2;\
mkdir -p $(dir $1);\
$(NVCC) $(NVCUFLAGS_SYM) $3 -dw $2 -o $1
endef
DEPENDS.cu = $(NVCC) $(NVCUFLAGS) -M -dc $1
@@ -48,8 +66,6 @@ endef
all: $(MANIFEST)
ifeq (1,1)
# Case if the <gensrc> directory is generated on-demand:
$(OBJDIR)/gensrc: generate.py
@mkdir -p $@
(which python3 >/dev/null || \
@@ -57,22 +73,26 @@ $(OBJDIR)/gensrc: generate.py
printf "\n$${bar}\nERROR: Building NCCL requires a Python 3 installation invokable as 'python3'.\n$${bar}\n\n" 1>&2; \
exit 1)) \
&& ./generate.py $@ "$(ONLY_FUNCS)"
else
# Case if the <gensrc> directory is pre-generated and checked in the repo as ./gen:
$(OBJDIR)/gensrc:
@mkdir -p $(OBJDIR); ln -srfn ./gen $@
endif
$(OBJDIR)/gensrc/symmetric: $(OBJDIR)/gensrc symmetric/generate.py
@mkdir -p $@
./symmetric/generate.py $@
# The trailing ";" is necessary to make this an "empty recipe":
# https://www.gnu.org/software/make/manual/html_node/Empty-Recipes.html
$(OBJDIR)/gensrc/rules.mk: $(OBJDIR)/gensrc ;
$(OBJDIR)/gensrc/symmetric/rules.mk: $(OBJDIR)/gensrc/symmetric ;
-include $(OBJDIR)/gensrc/rules.mk
# "gensrc/rules.mk" populates $(LIB_OBJS_GEN)
-include $(OBJDIR)/gensrc/symmetric/rules.mk
# "gensrc/symmetric/rules.mk" populates $(LIB_OBJS_SYM_GEN)
SRCS = common.cu onerank.cu
LIB_OBJS = $(patsubst %, $(OBJDIR)/%.o, $(SRCS)) $(LIB_OBJS_GEN)
LIB_OBJS = $(patsubst %, $(OBJDIR)/%.o, $(SRCS)) $(LIB_OBJS_GEN) $(LIB_OBJS_SYM_GEN)
$(OBJDIR)/%.o: % $(OBJDIR)/%.d
$(call COMPILE,$@,$<)
@@ -80,12 +100,18 @@ $(OBJDIR)/%.o: % $(OBJDIR)/%.d
$(OBJDIR)/genobj/%.o: $(OBJDIR)/gensrc $(OBJDIR)/genobj/%.d
$(call COMPILE,$@,$(OBJDIR)/gensrc/$*)
$(OBJDIR)/genobj/symmetric/%.o: $(OBJDIR)/gensrc/symmetric $(OBJDIR)/genobj/symmetric/%.d
$(call COMPILE,$@,$(OBJDIR)/gensrc/symmetric/$*)
$(OBJDIR)/%.d: %
$(call DEPENDS,$@,$<)
$(OBJDIR)/genobj/%.d: $(OBJDIR)/gensrc/%
$(call DEPENDS,$@,$<)
$(OBJDIR)/genobj/symmetric/%.d: $(OBJDIR)/gensrc/symmetric/%
$(call DEPENDS,$@,$<)
$(DEVGLUE_OBJ): $(LIB_OBJS)
$(NVCC) $(NVCUFLAGS) -dlink $^ -o $@
@@ -94,6 +120,7 @@ $(MANIFEST): $(LIB_OBJS) $(DEVGLUE_OBJ)
-include $(wildcard $(OBJDIR)/*.d)
-include $(wildcard $(OBJDIR)/genobj/*.d)
-include $(wildcard $(OBJDIR)/genobj/symmetric/*.d)
.PHONY: clean
clean:
+204 -56
Просмотреть файл
@@ -173,73 +173,221 @@ struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_PAT, NCCL_PROTO_SIMPLE
template<typename T, typename RedOp>
struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_NVLS, NCCL_PROTO_SIMPLE> {
template<bool BcastSendNotRecv>
struct Scatterer {
struct ncclDevWorkColl* work;
ssize_t chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
) {
static_assert(SlicePerChunk==1, "require: SlicePerChunk==1");
static_assert(MaxDsts<=1 || MaxSrcs<=1, "require: MaxDsts<=1 || MaxSrcs<=1");
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
int nNodes = ncclShmem.comm.nNodes;
int nRails = nvls->nHeads;
int part = ncclShmem.channelId - work->channelLo;
char* inbuf = (char*)work->sendbuff;
char* outbuf = (char*)work->recvbuff;
ssize_t countPerRank = work->collnet.count;
bool inPlace = (inbuf == outbuf + ncclShmem.comm.rank * countPerRank);
ssize_t railAllBeg = min(railGridOffset + part * chunkSize, nNodes * countPerRank);
ssize_t railAllEnd = min(railAllBeg + chunkSize, nNodes * countPerRank);
int railAllSize = railAllEnd - railAllBeg;
int rail = 0;
int src = 0;
if (BcastSendNotRecv) {
rail = nvls->headRank;
} else {
if (work->regUsed) return;
rail = 0;
}
if (tid < nDsts) dstSizes[tid] = railAllSize;
do {
int node = railAllBeg / countPerRank;
int railAllOffset = 0;
while (railAllOffset < railAllSize) {
ssize_t railOneBeg = node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t railOneOffset = (railAllBeg + railAllOffset) - railOneBeg;
int delta = min(railAllEnd, railOneEnd) - (railAllBeg + railAllOffset);
int rank = ncclShmem.comm.collNetDenseToUserRank[node * nRails + rail];
ssize_t userOneBeg = rank * countPerRank + railOneOffset;
int outIsDst = (inPlace && rank == ncclShmem.comm.rank) || BcastSendNotRecv || work->regUsed ? 0 : 1;
if (nSrcs != 0 && outIsDst + nDsts != 0) {
reduceCopy<ncclCollUnroll(), RedOp, T,
/*MultimemSrcs,MinSrcs,MaxSrcs=*/MultimemSrcs, 1, 1,
/*MultimemDsts=*/MultimemDsts, 0 + MultimemDsts + MinDsts, 1 + MaxDsts,
/*PreOpSrcs=*/0>
(tid, tn, 0, nullptr, false,
/*nSrcs=*/1, [=]__device__(int s/*==0*/) -> void* {
return (char*)srcPtrs[src] + railAllOffset;
},
/*nDsts=*/outIsDst + nDsts, [=]__device__(int d) -> void* {
return d < outIsDst ? outbuf + userOneBeg
: work->regUsed ? (char*)dstPtrs[d - outIsDst] + userOneBeg
: (char*)dstPtrs[d - outIsDst] + railAllOffset;
}, delta);
}
railAllOffset += delta;
node += 1;
}
rail += 1;
src += 1;
} while (!BcastSendNotRecv && src < nRails);
}
};
__device__ __forceinline__ void run(int tid, int/*nthreads*/, struct ncclDevWorkColl* work) {
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
const ssize_t rank = ncclShmem.comm.rank;
size_t count, gridOffset, channelCount;
size_t chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
size_t offset;
int nelem;
const int nThreadsBcast = work->regUsed ? (NCCL_MAX_NTHREADS - WARP_SIZE) : 4 * WARP_SIZE;
const int nThreadsGather = work->regUsed ? WARP_SIZE : NCCL_MAX_NTHREADS - nThreadsBcast;
const int tidEndGather = nThreadsGather;
const int tidEndBcast = tidEndGather + nThreadsBcast;
const int nThreadsNetSend = work->oneNode ? 0 : (work->netRegUsed ? WARP_SIZE : 6 * WARP_SIZE);
const int nThreadsGather = work->regUsed ? roundUp(nvls->nHeads << 2, WARP_SIZE) : 8 * WARP_SIZE;
const int nThreadsBcast = NCCL_MAX_NTHREADS - nThreadsNetSend - nThreadsGather;
if (!work->regUsed) {
if (tid < tidEndGather) {
// Gather
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, NULL, NULL, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.gather(offset, nvls->nHeads * count, nelem, count, -1, 0);
const int tidEndGather = nThreadsGather;
const int tidEndNetSend = tidEndGather + nThreadsNetSend;
const int tidEndBcast = tidEndNetSend + nThreadsBcast;
if (work->oneNode) {
const ssize_t rank = ncclShmem.comm.rank;
size_t count, gridOffset, channelCount, offset, chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
if (!work->regUsed) {
if (tid < tidEndGather) {
// Gather
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, NULL, NULL, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.gather(offset, nvls->nHeads * count, nelem, count, -1, 0);
}
// coverity[overrun-call] => Coverity think prims.index can be greater than 1
} else if (tid < tidEndBcast) {
// Bcast through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, NULL, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.send(offset, nelem);
}
// coverity[overrun-call] => Coverity think prims.index can be greater than 1
}
// coverity[overrun-call] => Coverity think prims.index can be greater than 1
} else if (tid < tidEndBcast) {
// Bcast through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, NULL, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.send(offset, nelem);
} else {
if (tid < tidEndGather) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
/* used as sync */
prims.scatter(0, 0, 0, 0, -1, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.gather(0, 0, 0, 0, -1, 0);
}
} else if (tid < tidEndBcast) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, &nvls->down, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 1 * Proto::MaxGroupWidth, 0, 0, work);
/* used as sync */
prims.recv(0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
ssize_t inpOffset = gridOffset + elemOffset;
ssize_t outOffset = inpOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(inpOffset, outOffset, nelem);
}
}
// coverity[overrun-call] => Coverity think prims.index can be greater than 1
}
} else {
/* direct allgather */
// NVLS + IB SHARP
int nNodes = ncclShmem.comm.nNodes;
int part = ncclShmem.channelId - work->channelLo;
ssize_t countPerRank = work->collnet.count;
const int nChannels = work->channelHi - work->channelLo + 1;
ssize_t chunkCount = work->collnet.chunkCount;
if (tid < tidEndGather) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
/* used as sync */
prims.scatter(0, 0, 0, 0, -1, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.gather(0, 0, 0, 0, -1, 0);
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/1, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nullptr, nullptr, work->recvbuff,
/*redOpArg=*/0, 1 * Proto::MaxGroupWidth, 1, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/false> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/0>(scat);
}
} else if (tid < tidEndBcast) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, &nvls->down, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 1 * Proto::MaxGroupWidth, 0, 0, work);
/* used as sync */
prims.recv(0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
ssize_t inpOffset = gridOffset + elemOffset;
ssize_t outOffset = inpOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(inpOffset, outOffset, nelem);
} else {
if (work->netRegUsed) {
using ProtoSend = ProtoSimple<1, 1, COLL_UNROLL>;
using ProtoBcast = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
int maxSteps = (int)divUp(nNodes * countPerRank, nChannels * chunkCount);
int curSteps = -1;
int postThread = tid - tidEndGather == 0 ? 1 : 0;
// for UB, we need to control the send speed to avoid net congestion.
// first unroll 2 steps, then unroll the rest steps when the data is received.
if (postThread) {
curSteps = min(2, maxSteps);
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/1, ProtoSend, 0>::sendPeerNotify(nvls->out, 1, curSteps);
}
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, ProtoBcast, 0>
prims(tid - tidEndGather, nThreadsNetSend + nThreadsBcast, &nvls->out, &nvls->down, nullptr, nullptr,
/*redOpArg=*/0, 2 * ProtoBcast::MaxGroupWidth, 0, 0, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
if (postThread && curSteps < maxSteps) {
curSteps++;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/1, ProtoSend, 0>::sendPeerNotify(nvls->out, 1, 1);
}
}
} else {
if (tid < tidEndNetSend) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsNetSend, nullptr, &nvls->out, work->sendbuff, nullptr,
/*redOpArg=*/0, 0 * Proto::MaxGroupWidth, 1, 1);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
ssize_t railAllBeg = railGridOffset + part * chunkCount;
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
ssize_t railOneBeg = ncclShmem.comm.node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t beg = max(railAllBeg, railOneBeg);
ssize_t end = min(railAllEnd, railOneEnd);
prims.send(beg - railOneBeg, max(ssize_t(0), end - beg));
}
} else {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndNetSend, nThreadsBcast, &nvls->out, &nvls->down, nullptr, nullptr,
/*redOpArg=*/0, 2 * Proto::MaxGroupWidth, 0, 0);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
}
}
}
}
}
@@ -254,7 +402,7 @@ struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_COLLNET_DIRECT, NCCL_P
ssize_t chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts>
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
+1 -1
Просмотреть файл
@@ -106,7 +106,7 @@ namespace {
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(offset, nelem);
prims.directSend(offset, offset, nelem);
}
}
else {
+26 -42
Просмотреть файл
@@ -52,7 +52,6 @@ struct ncclShmemData {
uint16_t funcId;
int nWorks;
int workSize;
uint32_t workConsumed;
uint64_t workCounter;
bool profilerEnabled;
struct ncclShmemGroup groups[NCCL_MAX_GROUPS];
@@ -182,7 +181,6 @@ __device__ __forceinline__ void loadWorkBatchToShmem(
}
if (tid == 0) {
ncclShmem.workSize = workSize;
ncclShmem.workConsumed = batch.offsetBase + (64-__clzll(batch.offsetBitset))*workSize;
}
// We deliberately replicate these div and mod calculations into the case
// blocks above so that they get constant divisor optimizations by the compiler.
@@ -242,6 +240,12 @@ __device__ __forceinline__ void loadWorkBatchToShmem(
}
}
__device__ __forceinline__ unsigned long long int globaltimer() {
unsigned long long int timer;
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(timer));
return timer;
}
template<ncclFunc_t Fn, typename T, typename RedOp, int Algo, int Proto>
struct RunWorkColl {
__device__ void run(int tid, int tn, struct ncclDevWorkColl* work) {
@@ -296,40 +300,30 @@ struct RunWorkBatch {
#define STOP 1
#define FINI 2
__device__ __forceinline__ bool profilerEnabled(void) {
// Check if any of the workItems in the batch is profiled. If so, there is an equivalent
// profiler ProxyOp waiting for the counter update in the host thread. If this check was
// done only for the first workItem the profiler counter for other workItems in the batch
// could never be updated, leaving the host thread spinning forever for the counter update
// and causing a hang.
bool enabled = false;
for (int i = 0; i < ncclShmem.nWorks && !enabled; i++) {
if (ncclShmem.workType == ncclDevWorkTypeP2p)
enabled = ((struct ncclDevWorkP2p*)ncclShmem.workStorage)[i].profilerEnabled;
else
enabled = ((struct ncclDevWorkColl*)ncclShmem.workStorage)[i].profilerEnabled;
}
return enabled;
__device__ __forceinline__ bool profilerEnabled(int workItemIdx) {
return (ncclShmem.workType == ncclDevWorkTypeP2p) ?
((struct ncclDevWorkP2p*)ncclShmem.workStorage)[workItemIdx].profilerEnabled :
((struct ncclDevWorkColl*)ncclShmem.workStorage)[workItemIdx].profilerEnabled;
}
__device__ __forceinline__ void profiler(int action) {
if (action == START) {
if (threadIdx.x == 0) {
// increment workCounter regardless of the profiler being active or not
if (threadIdx.x == 0) {
int idx = 0;
uint64_t wc = ncclShmem.channel.workCounter + 1;
if (action == START) {
for (; wc <= ncclShmem.channel.workCounter + ncclShmem.nWorks; wc++) {
if (!profilerEnabled(idx++)) continue;
ncclShmem.comm.workStarted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp = globaltimer();
ncclShmem.comm.workStarted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
}
} else {
for (; wc <= ncclShmem.channel.workCounter + ncclShmem.nWorks; wc++) {
if (!profilerEnabled(idx++)) continue;
ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp = globaltimer();
ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
}
ncclShmem.channel.workCounter += ncclShmem.nWorks;
if(!profilerEnabled()) return;
ncclShmem.comm.workStarted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
}
} else if (action == STOP) {
if (threadIdx.x == 0 && profilerEnabled()) {
ncclShmem.comm.workCompleted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
}
} else { // FINI
if (threadIdx.x == 0) {
// store the workCounter back to vidmem regardless of the profiler being active or not
((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
if (!profilerEnabled()) return;
ncclShmem.comm.workCompleted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
if (action == FINI) ((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
}
}
}
@@ -388,11 +382,6 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
}
__syncthreads(); // publish ncclShmem
if (tid == 0 && ncclShmem.args.workStorageType == ncclDevWorkStorageTypeFifo) {
// ncclShmem.workConsumed written by loadWorkBatchToShmem before __syncthreads()
ncclShmem.comm.workConsumed[ncclShmem.channelId] = ncclShmem.workConsumed;
}
while (ncclShmem.aborted == 0) {
profiler(START);
if (0 <= SpecializedFnId && ncclShmem.funcId == (unsigned)SpecializedFnId) {
@@ -407,11 +396,6 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
profiler(STOP);
loadWorkBatchToShmem(tid, tn, args, batchIx);
__syncthreads();
if (tid == 0 && ncclShmem.args.workStorageType == ncclDevWorkStorageTypeFifo) {
// ncclShmem.workConsumed written by loadWorkBatchToShmem before __syncthreads()
ncclShmem.comm.workConsumed[ncclShmem.channelId] = ncclShmem.workConsumed;
}
}
profiler(FINI);
}
+1 -1
Просмотреть файл
@@ -327,7 +327,7 @@ with open(os.path.join(gensrc, "rules.mk"), "w") as f:
out = f.write
impl_names = sorted(name_to_funcs.keys())
names = impl_names + ["host_table.cc", "device_table.cu"]
out("LIB_OBJS_GEN = $(patsubst %, $(OBJDIR)/genobj/%.o, {names})\n"
out("LIB_OBJS_GEN = $(patsubst %,$(OBJDIR)/genobj/%.o,{names})\n"
.format(names=" ".join(names)))
out("\n")
+86 -13
Просмотреть файл
@@ -99,37 +99,60 @@ template<>
union BytePack<0> {};
template<>
union BytePack<1> {
uint8_t u8, native;
uint8_t u8[1], native;
};
template<>
union BytePack<2> {
BytePack<1> half[2];
BytePack<1> b1[2];
uint8_t u8[2];
uint16_t u16, native;
uint16_t u16[1], native;
};
template<>
union BytePack<4> {
BytePack<2> half[2];
BytePack<1> b1[4];
BytePack<2> b2[2];
uint8_t u8[4];
uint16_t u16[2];
uint32_t u32, native;
uint32_t u32[1], native;
};
template<>
union BytePack<8> {
BytePack<4> half[2];
BytePack<1> b1[8];
BytePack<2> b2[4];
BytePack<4> b4[2];
uint8_t u8[8];
uint16_t u16[4];
uint32_t u32[2];
uint64_t u64, native;
uint64_t u64[1], native;
};
template<>
union alignas(16) BytePack<16> {
BytePack<8> half[2];
BytePack<1> b1[16];
BytePack<2> b2[8];
BytePack<4> b4[4];
BytePack<8> b8[2];
uint8_t u8[16];
uint16_t u16[8];
uint32_t u32[4];
uint64_t u64[2];
ulong2 ul2, native;
ulong2 ul2[1], native;
};
template<int Size>
union BytePack {
BytePack<Size/2> half[2];
BytePack<1> b1[Size];
BytePack<2> b2[Size/2];
BytePack<4> b4[Size/4];
BytePack<8> b8[Size/8];
BytePack<16> b16[Size/16];
uint8_t u8[Size];
uint16_t u16[Size/2];
uint32_t u32[Size/4];
uint64_t u64[Size/8];
};
template<typename T>
@@ -357,19 +380,19 @@ __device__ __forceinline__ void multimem_st_global<0>(uintptr_t addr, BytePack<0
}
template<>
__device__ __forceinline__ void multimem_st_global<1>(uintptr_t addr, BytePack<1> val) {
asm volatile("st.global.b8 [%0], %1;" :: "l"(addr), "r"((uint32_t)val.u8) : "memory");
asm volatile("st.global.b8 [%0], %1;" :: "l"(addr), "r"((uint32_t)val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<2>(uintptr_t addr, BytePack<2> val) {
asm volatile("st.global.b16 [%0], %1;" :: "l"(addr), "h"(val.u16) : "memory");
asm volatile("st.global.b16 [%0], %1;" :: "l"(addr), "h"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<4>(uintptr_t addr, BytePack<4> val) {
asm volatile("multimem.st.global.b32 [%0], %1;" :: "l"(addr), "r"(val.u32) : "memory");
asm volatile("multimem.st.global.b32 [%0], %1;" :: "l"(addr), "r"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<8>(uintptr_t addr, BytePack<8> val) {
asm volatile("multimem.st.global.b64 [%0], %1;" :: "l"(addr), "l"(val.u64) : "memory");
asm volatile("multimem.st.global.b64 [%0], %1;" :: "l"(addr), "l"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<16>(uintptr_t addr, BytePack<16> val) {
@@ -384,6 +407,56 @@ __device__ __forceinline__ void multimem_st_global(uintptr_t addr, BytePack<Size
}
#endif
// Load pack starting at index in array. Ignore elements past end (length of array).
template<typename Pack, typename T>
__device__ __forceinline__ Pack loadPack(T* ptr, int ix, int end) {
constexpr int Size = sizeof(Pack);
ptr += ix;
int n = end - ix;
if (alignof(T) == Size && sizeof(T) == Size) {
return *(Pack*)ptr;
} else if ((Size+3)/4 + 1 < Size/sizeof(T)) {
union { Pack ans; uint32_t part[Size/4]; };
int misalign = reinterpret_cast<uintptr_t>(ptr) % 4;
uint32_t* down = reinterpret_cast<uint32_t*>(reinterpret_cast<uintptr_t>(ptr) & -uintptr_t(4));
int i;
#pragma unroll
for (i=0; i < Size/4; i++) {
if (i*4/sizeof(T) < 1 || i*4/sizeof(T) < n) part[i] = down[i];
}
uint32_t extra;
if (misalign) extra = down[i];
#pragma unroll
for (i=0; i < Size/4; i++) {
part[i] = __funnelshift_r(part[i], part[i+1], 8*misalign);
}
if (misalign) part[i] = __funnelshift_r(part[i], extra, 8*misalign);
return ans;
} else {
union { Pack ans; BytePack<sizeof(T)> part[Size/sizeof(T)]; };
#pragma unroll
for (int i=0; i < Size/sizeof(T); i++) {
if (i < 1 || i < n) part[i] = ((BytePack<sizeof(T)>*)ptr)[i];
}
return ans;
}
}
// Store pack starting at index in array. Ignore elements past end (length of array).
template<typename Pack, typename T>
__device__ __forceinline__ void storePack(T* ptr, int ix, int end, Pack val) {
constexpr int Size = sizeof(Pack);
union { Pack tmp; BytePack<sizeof(T)> part[Size/sizeof(T)]; };
tmp = val;
ptr += ix;
int n = end - ix;
#pragma unroll
for (int i=0; i < Size/sizeof(T); i++) {
if (i < 1 || i < n) ((BytePack<sizeof(T)>*)ptr)[i] = part[i];
}
}
// Warp-uniform memory copy from shared address (not generic) to global memory.
// The number of bytes copied is `min(MaxBytes, nBytesAhead)`, a negative value
// is interpeted as zero. EltSize is the guaranteed alignment of the addresses and sizes.
@@ -426,10 +499,10 @@ __device__ __forceinline__ void copyGlobalShared_WarpUnrolled(
b4[3] = ld_shared<4>(srcAddr + 3*4);
if (srcMisalign != 0) {
BytePack<4> b4_4 = ld_shared<4>(srcAddr + 4*4);
b4[0].u32 = __funnelshift_r(b4[0].u32, b4[1].u32, srcMisalign*8);
b4[1].u32 = __funnelshift_r(b4[1].u32, b4[2].u32, srcMisalign*8);
b4[2].u32 = __funnelshift_r(b4[2].u32, b4[3].u32, srcMisalign*8);
b4[3].u32 = __funnelshift_r(b4[3].u32, b4_4.u32, srcMisalign*8);
b4[0].native = __funnelshift_r(b4[0].native, b4[1].native, srcMisalign*8);
b4[1].native = __funnelshift_r(b4[1].native, b4[2].native, srcMisalign*8);
b4[2].native = __funnelshift_r(b4[2].native, b4[3].native, srcMisalign*8);
b4[3].native = __funnelshift_r(b4[3].native, b4_4.native, srcMisalign*8);
}
if (Multimem) multimem_st_global<16>(dstAddr, b16);
else st_global<16>(dstAddr, b16);
+12 -6
Просмотреть файл
@@ -125,7 +125,7 @@ class Primitives<
void **ptrs = isSendNotRecv ? (ncclShmem.groups[group].dsts + Dst)
: (ncclShmem.groups[group].srcs + Src);
if (flags & NetRegMode) {
if ((flags & NetRegMode) && ((!isSendNotRecv && DirectRecv) || (isSendNotRecv && DirectSend))) {
if (P2p) {
ptrs[index] = NULL;
} else {
@@ -337,7 +337,7 @@ public:
}
template<int Recv, int Send, typename Fn>
__device__ __forceinline__ void process(Fn &&fn, uint32_t sendDirectFlag, uint32_t recvDirectFlag) {
__device__ __forceinline__ void process(Fn &&fn, uint32_t sendDirectFlag = 0, uint32_t recvDirectFlag = 0) {
#pragma unroll 1
for (int slice=0; slice < SlicePerChunk; slice++) {
if (tid < nworkers) {
@@ -361,7 +361,7 @@ public:
} else if (flags & DirectRead) { // empty send
ptrs[index] = nullptr;
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
} else {
if (flags & DirectRead) {
@@ -372,11 +372,11 @@ public:
else
ptrs[index] = nullptr;
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
}
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
}
subBarrier();
@@ -391,7 +391,7 @@ public:
} else {
nsend = fan.nsend();
}
fn.template operator() < SlicePerChunk, 0, Recv*MaxRecv, 0, Send*MaxSend >
fn.template operator()<SlicePerChunk, 0, Recv*MaxRecv, 0, Send*MaxSend, MultimemSrcs, MultimemDsts>
(tid, nworkers, slice, stepSize * StepPerSlice,
nrecv, ncclShmem.groups[group].srcs,
nsend, ncclShmem.groups[group].dsts, ncclShmem.groups[group].dstSizes, sendDirectFlag, recvDirectFlag);
@@ -896,6 +896,12 @@ private:
__device__ __forceinline__ void directRecvDirectSend(intptr_t inpIx, intptr_t outIx, int eltN, bool postOp=false) {
genericOp<1, 1, 1, 1, -1, -1>(inpIx, outIx, eltN, postOp);
}
__device__ __forceinline__ void recvDirectSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<0, 1, 1, 1, -1, -1>(-1, outIx, eltN, postOp);
}
__device__ __forceinline__ void directRecvSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<1, 0, 1, 1, -1, -1>(outIx, outIx, eltN, postOp);
}
__device__ __forceinline__ void recvCopyDirectSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<0, 1, 1, 1, -1, Output>(-1, outIx, eltN, postOp);
}
+312 -129
Просмотреть файл
@@ -38,18 +38,18 @@ struct IsFloatingPoint<double>: std::true_type {};
// 3. Have constructor taking `uint64_t opArg`.
template<typename T>
struct FuncCopy { using EltType = T; __device__ FuncCopy(uint64_t opArg=0) {}; };
struct FuncCopy { using EltType = T; __device__ __forceinline__ FuncCopy(uint64_t opArg=0) {}; };
template<typename T>
struct FuncSum { using EltType = T; __device__ FuncSum(uint64_t opArg=0) {}; };
struct FuncSum { using EltType = T; __device__ __forceinline__ FuncSum(uint64_t opArg=0) {}; };
template<typename T>
struct FuncProd { using EltType = T; __device__ FuncProd(uint64_t opArg=0) {}; };
struct FuncProd { using EltType = T; __device__ __forceinline__ FuncProd(uint64_t opArg=0) {}; };
template<typename T>
struct FuncMinMax {
using EltType = T;
BytePack<sizeof(T)> xormask; // only used by integers
bool isMinNotMax; // only used by floats
__device__ FuncMinMax(uint64_t opArg=0) {
__device__ __forceinline__ FuncMinMax(uint64_t opArg=0) {
xormask.native = opArg;
isMinNotMax = (opArg&1)==0;
}
@@ -64,13 +64,13 @@ template<typename T> struct FuncSumPostDiv;
template<typename Fn>
struct RedOpArg { // default case: no argument
static constexpr bool ArgUsed = false;
__device__ static uint64_t loadArg(void *ptr) { return 0; }
__device__ __forceinline__ static uint64_t loadArg(void *ptr) { return 0; }
};
template<typename T>
struct RedOpArg<FuncMinMax<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
union { uint64_t u64; T val; };
u64 = 0;
val = *(T*)ptr;
@@ -84,6 +84,11 @@ struct RedOpArg<FuncMinMax<T>> {
// of elements. These classes are intended to be specialized for specific
// combinations of reduction function and pack size.
template<typename A, typename B, int EltPerPackA>
struct Apply_Cast/*{
static BytePack<EltPerPackA*sizeof(B)/sizeof(A)> cast(BytePack<EltPerPackA*sizeof(A)> a);
}*/;
template<typename Fn, int EltPerPack>
struct Apply_Reduce /*{
static BytePack<EltPerPack*sizeof(T)> reduce(
@@ -111,16 +116,60 @@ struct Apply_LoadMultimem/*{
static BytePack<BytePerPack> load(Fn fn, uintptr_t addr);
}*/;
// Helpers for dealing with BytePack<0>'s
template<typename A, typename B, int EltPerPack>
struct Apply_Cast_MaybeEmpty: Apply_Cast<A, B, EltPerPack> {};
template<typename A, typename B>
struct Apply_Cast_MaybeEmpty<A, B, /*EltPerPack=*/0> {
__device__ constexpr static BytePack<0> cast(BytePack<0> a) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_Reduce_MaybeEmpty: Apply_Reduce<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_Reduce_MaybeEmpty<Fn, 0> {
__device__ constexpr static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_PreOp_MaybeEmpty: Apply_PreOp<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_PreOp_MaybeEmpty<Fn, 0> {
static constexpr bool IsIdentity = true;
__device__ constexpr static BytePack<0> preOp(Fn fn, BytePack<0> a) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_PostOp_MaybeEmpty: Apply_PostOp<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_PostOp_MaybeEmpty<Fn, 0> {
static constexpr bool IsIdentity = true;
__device__ constexpr static BytePack<0> postOp(Fn fn, BytePack<0> a) { return {}; }
};
template<typename Fn, int BytePerPack>
struct Apply_LoadMultimem_MaybeEmpty: Apply_LoadMultimem<Fn, BytePerPack> {};
template<typename Fn>
struct Apply_LoadMultimem_MaybeEmpty<Fn, 0> {
__device__ constexpr static BytePack<0> load(Fn fn, uintptr_t addr) { return {}; }
};
////////////////////////////////////////////////////////////////////////////////
// Public API for calling the trait classes. These take the data elements as a
// pack of any type, which could be a BytePack<?> or any integral type (uint64_t,
// uint32_t, etc.), and will return a new pack where each element has been
// transformed appropriately.
template<typename A, typename B, typename PackA>
__device__ __forceinline__ BytePack<BytePackOf<PackA>::Size*sizeof(B)/sizeof(A)> applyCast(PackA a) {
return Apply_Cast_MaybeEmpty<A, B, BytePackOf<PackA>::Size/sizeof(A)>::cast(toPack(a));
}
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyReduce(Fn fn, Pack a, Pack b) {
return fromPack<Pack>(
Apply_Reduce<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_Reduce_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::reduce(fn, toPack(a), toPack(b))
);
}
@@ -128,7 +177,7 @@ __device__ __forceinline__ Pack applyReduce(Fn fn, Pack a, Pack b) {
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyPreOp(Fn fn, Pack a) {
return fromPack<Pack>(
Apply_PreOp<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_PreOp_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::preOp(fn, toPack(a))
);
}
@@ -136,23 +185,107 @@ __device__ __forceinline__ Pack applyPreOp(Fn fn, Pack a) {
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyPostOp(Fn fn, Pack a) {
return fromPack<Pack>(
Apply_PostOp<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_PostOp_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::postOp(fn, toPack(a))
);
}
template<typename Fn, int BytePerPack>
__device__ __forceinline__ BytePack<BytePerPack> applyLoadMultimem(Fn fn, uintptr_t addr) {
return Apply_LoadMultimem<Fn, BytePerPack>::load(fn, addr);
return Apply_LoadMultimem_MaybeEmpty<Fn, BytePerPack>::load(fn, addr);
}
////////////////////////////////////////////////////////////////////////////////
// Apply_Cast
template<typename A, typename B, int EltPerPack>
struct Apply_Cast {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(B)> cast(BytePack<EltPerPack*sizeof(A)> a) {
BytePack<EltPerPack*sizeof(B)> b;
b.half[0] = Apply_Cast<A, B, EltPerPack/2>::cast(a.half[0]);
b.half[1] = Apply_Cast<A, B, EltPerPack/2>::cast(a.half[1]);
return b;
}
};
template<typename A, typename B>
struct Apply_Cast<A, B, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(B)> cast(BytePack<sizeof(A)> a) {
return toPack(B(fromPack<A>(a)));
}
};
template<>
struct Apply_Cast<__half, float, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(float)> cast(BytePack<sizeof(__half)> a) {
return toPack(__half2float(fromPack<__half>(a)));
}
};
template<>
struct Apply_Cast<float, __half, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(__half)> cast(BytePack<sizeof(float)> a) {
return toPack(__float2half_rn(fromPack<float>(a)));
}
};
template<>
struct Apply_Cast<__half, float, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<4*2> cast(BytePack<2*2> a) {
return toPack(__half22float2(fromPack<__half2>(a)));
}
};
template<>
struct Apply_Cast<float, __half, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<2*2> cast(BytePack<4*2> a) {
return toPack(__float22half2_rn(fromPack<float2>(a)));
}
};
#if defined(__CUDA_BF16_TYPES_EXIST__) && (CUDART_RUNTIME >= 12000 || __CUDA_ARCH__ >= 800)
template<>
struct Apply_Cast<__nv_bfloat16, float, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<4*2> cast(BytePack<2*2> a) {
return toPack(__bfloat1622float2(fromPack<__nv_bfloat162>(a)));
}
};
template<>
struct Apply_Cast<float ,__nv_bfloat16, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<2*2> cast(BytePack<4*2> a) {
return toPack(__float22bfloat162_rn(fromPack<float2>(a)));
}
};
#endif
#define EASY_CAST(A, B, EltPerPack, VecA, VecB) \
template<> \
struct Apply_Cast<A, B, EltPerPack> { \
__device__ __forceinline__ static BytePack<sizeof(B)*EltPerPack> cast(BytePack<sizeof(A)*EltPerPack> a) { \
return toPack(VecB(fromPack<VecA>(a))); \
} \
}; \
template<> \
struct Apply_Cast<B, A, EltPerPack> { \
__device__ __forceinline__ static BytePack<sizeof(A)*EltPerPack> cast(BytePack<sizeof(B)*EltPerPack> b) { \
return toPack(VecA(fromPack<VecB>(b))); \
} \
};
#if defined(__CUDA_FP8_TYPES_EXIST__)
EASY_CAST(__nv_fp8_e5m2, float, 2, __nv_fp8x2_e5m2, float2)
EASY_CAST(__nv_fp8_e5m2, float, 4, __nv_fp8x4_e5m2, float4)
EASY_CAST(__nv_fp8_e4m3, float, 2, __nv_fp8x2_e4m3, float2)
EASY_CAST(__nv_fp8_e4m3, float, 4, __nv_fp8x4_e4m3, float4)
#endif
#undef EASY_CAST
////////////////////////////////////////////////////////////////////////////////
// Apply_Reduce
// Nonsensical base case
template<typename Fn>
struct Apply_Reduce<Fn, /*EltPerPack=*/0> {
__device__ static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) {
__device__ __forceinline__ static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) {
return {};
}
};
@@ -164,7 +297,7 @@ struct Apply_Reduce<Fn, /*EltPerPack=*/0> {
template<typename Fn, int EltPerPack>
struct Apply_Reduce {
template<int Size>
__device__ static BytePack<Size> reduce(Fn fn, BytePack<Size> a, BytePack<Size> b) {
__device__ __forceinline__ static BytePack<Size> reduce(Fn fn, BytePack<Size> a, BytePack<Size> b) {
a.half[0] = Apply_Reduce<Fn, EltPerPack/2>::reduce(fn, a.half[0], b.half[0]);
a.half[1] = Apply_Reduce<Fn, EltPerPack/2>::reduce(fn, a.half[1], b.half[1]);
return a;
@@ -174,25 +307,25 @@ struct Apply_Reduce {
// Base case definitions (EltPerPack == 1)
template<typename T>
struct Apply_Reduce<FuncCopy<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncCopy<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncCopy<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return a;
}
};
template<typename T>
struct Apply_Reduce<FuncSum<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncSum<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncSum<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return toPack<T>(fromPack<T>(a) + fromPack<T>(b));
}
};
template<typename T>
struct Apply_Reduce<FuncProd<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncProd<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncProd<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return toPack<T>(fromPack<T>(a) * fromPack<T>(b));
}
};
template<typename T>
struct Apply_Reduce<FuncMinMax<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncMinMax<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncMinMax<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return (a.native ^ fn.xormask.native) < (b.native ^ fn.xormask.native) ? a : b;
}
};
@@ -200,7 +333,7 @@ struct Apply_Reduce<FuncMinMax<T>, /*EltPerPack=*/1> {
// Optimizations for specfic types and element count combinations:
template<>
struct Apply_Reduce<FuncSum<uint8_t>, /*EltPerPack=*/4> {
__device__ static BytePack<4> reduce(FuncSum<uint8_t> fn, BytePack<4> a, BytePack<4> b) {
__device__ __forceinline__ static BytePack<4> reduce(FuncSum<uint8_t> fn, BytePack<4> a, BytePack<4> b) {
constexpr uint32_t even = 0x00ff00ffu;
uint32_t x = (a.native & even) + (b.native & even);
uint32_t y = (a.native & ~even) + (b.native & ~even);
@@ -236,7 +369,7 @@ struct Apply_Reduce<FuncMinMax<uint8_t>, /*EltPerPack=*/4> {
template<>
struct Apply_Reduce<FuncProd<uint8_t>, /*EltPerPack=*/4> {
__device__ static BytePack<4> reduce(FuncProd<uint8_t> fn, BytePack<4> apack, BytePack<4> bpack) {
__device__ __forceinline__ static BytePack<4> reduce(FuncProd<uint8_t> fn, BytePack<4> apack, BytePack<4> bpack) {
uint32_t a = apack.native;
uint32_t b = bpack.native;
uint32_t ab0 = (a*b) & 0xffu;
@@ -332,7 +465,7 @@ template<typename Fn, int EltPerPack>
struct Apply_PreOp {
static constexpr bool IsIdentity = Apply_PreOp<Fn, EltPerPack/2>::IsIdentity;
template<int Size>
__device__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
#if __cpp_if_constexpr
if constexpr(!IsIdentity) {
#else
@@ -352,7 +485,7 @@ template<typename Fn>
struct Apply_PreOp<Fn, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = true;
template<int Size>
__device__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
return a;
}
};
@@ -360,7 +493,7 @@ struct Apply_PreOp<Fn, /*EltPerPack=*/1> {
template<typename Fn>
struct Apply_PreOp<Fn, /*EltPerPack=*/0> {
static constexpr bool IsIdentity = true;
__device__ static BytePack<0> preOp(Fn fn, BytePack<0> a) {
__device__ __forceinline__ static BytePack<0> preOp(Fn fn, BytePack<0> a) {
return {};
}
};
@@ -373,7 +506,7 @@ template<typename Fn, int EltPerPack>
struct Apply_PostOp {
static constexpr bool IsIdentity = Apply_PostOp<Fn, EltPerPack/2>::IsIdentity;
template<int Size>
__device__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
#if __cpp_if_constexpr
if constexpr(!IsIdentity) {
#else
@@ -393,7 +526,7 @@ template<typename Fn>
struct Apply_PostOp<Fn, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = true;
template<int Size>
__device__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
return a;
}
};
@@ -401,7 +534,7 @@ struct Apply_PostOp<Fn, /*EltPerPack=*/1> {
template<typename Fn>
struct Apply_PostOp<Fn, /*EltPerPack=*/0> {
static constexpr bool IsIdentity = true;
__device__ static BytePack<0> postOp(Fn fn, BytePack<0> a) {
__device__ __forceinline__ static BytePack<0> postOp(Fn fn, BytePack<0> a) {
return {};
}
};
@@ -413,7 +546,7 @@ struct Apply_PostOp<Fn, /*EltPerPack=*/0> {
template<typename T>
struct RedOpArg<FuncPreMulSum<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
union { uint64_t u64; T val; };
u64 = 0;
val = *(T*)ptr;
@@ -426,7 +559,7 @@ template<typename T>
struct FuncPreMulSum {
using EltType = T;
T scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; T val; };
u64 = opArg;
scalar = val;
@@ -441,7 +574,7 @@ struct FuncPreMulSum<half> {
using EltType = half;
#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
__half2 scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __half val; };
u64 = opArg;
scalar.x = val;
@@ -449,7 +582,7 @@ struct FuncPreMulSum<half> {
}
#else
float scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __half val; };
u64 = opArg;
scalar = (float)val;
@@ -466,7 +599,7 @@ struct FuncPreMulSum<half> {
using EltType = __nv_bfloat16;
#if __CUDA_ARCH__ >= 800
__nv_bfloat162 scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __nv_bfloat16 val; };
u64 = opArg;
scalar.x = val;
@@ -474,7 +607,7 @@ struct FuncPreMulSum<half> {
}
#else
float scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __nv_bfloat16 val; };
u64 = opArg;
scalar = __bfloat162float(val);
@@ -489,7 +622,7 @@ struct FuncPreMulSum<half> {
struct FuncPreMulSum<__nv_fp8_e4m3> {
using EltType = __nv_fp8_e4m3;
__half2 scalar2;
__device__ FuncPreMulSum(uint64_t opArg) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg) {
union { uint64_t u64; __nv_fp8_storage_t val; };
u64 = opArg;
scalar2.x = __half(__nv_cvt_fp8_to_halfraw(val, __NV_E4M3));
@@ -501,7 +634,7 @@ struct FuncPreMulSum<half> {
struct FuncPreMulSum<__nv_fp8_e5m2> {
using EltType = __nv_fp8_e5m2;
__half2 scalar2;
__device__ FuncPreMulSum(uint64_t opArg) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg) {
union { uint64_t u64; __nv_fp8_storage_t val; };
u64 = opArg;
scalar2.x = __half(__nv_cvt_fp8_to_halfraw(val, __NV_E5M2));
@@ -513,7 +646,7 @@ struct FuncPreMulSum<half> {
template<typename T, int EltPerPack>
struct Apply_Reduce<FuncPreMulSum<T>, EltPerPack> {
__device__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncPreMulSum<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncPreMulSum<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
// FuncPreMulSum reduce dispatches to FuncSum.
return Apply_Reduce<FuncSum<T>, EltPerPack>::reduce(FuncSum<T>(), a, b);
}
@@ -523,7 +656,7 @@ struct Apply_Reduce<FuncPreMulSum<T>, EltPerPack> {
template<typename T>
struct Apply_PreOp<FuncPreMulSum<T>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(T)> preOp(FuncPreMulSum<T> fn, BytePack<sizeof(T)> a) {
__device__ __forceinline__ static BytePack<sizeof(T)> preOp(FuncPreMulSum<T> fn, BytePack<sizeof(T)> a) {
return toPack<T>(fromPack<T>(a) * fn.scalar);
}
};
@@ -534,7 +667,7 @@ struct Apply_PreOp<FuncPreMulSum<T>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(half)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half)> a) {
__device__ __forceinline__ static BytePack<sizeof(half)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half)> a) {
#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
return toPack<half>(__hmul(fromPack<half>(a), fn.scalar.x));
#else
@@ -546,7 +679,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(half2)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half2)> a) {
__device__ __forceinline__ static BytePack<sizeof(half2)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half2)> a) {
return toPack<half2>(__hmul2(fromPack<half2>(a), fn.scalar));
}
};
@@ -559,7 +692,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_bfloat16>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_bfloat16)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_bfloat16)> preOp(
FuncPreMulSum<__nv_bfloat16> fn, BytePack<sizeof(__nv_bfloat16)> a
) {
#if __CUDA_ARCH__ >= 800
@@ -573,7 +706,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_bfloat16>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_bfloat162)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_bfloat162)> preOp(
FuncPreMulSum<__nv_bfloat16> fn, BytePack<sizeof(__nv_bfloat162)> a
) {
return toPack<__nv_bfloat162>(__hmul2(fromPack<__nv_bfloat162>(a), fn.scalar));
@@ -590,7 +723,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e4m3>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8_e4m3)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8_e4m3)> preOp(
FuncPreMulSum<__nv_fp8_e4m3> fn, BytePack<sizeof(__nv_fp8_e4m3)> a
) {
return toPack<__nv_fp8_e4m3>(__nv_fp8_e4m3(__hmul(__half(fromPack<__nv_fp8_e4m3>(a)), fn.scalar2.x)));
@@ -599,7 +732,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e4m3>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8x2_e4m3)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8x2_e4m3)> preOp(
FuncPreMulSum<__nv_fp8_e4m3> fn, BytePack<sizeof(__nv_fp8x2_e4m3)> a
) {
return toPack<__nv_fp8x2_e4m3>(__nv_fp8x2_e4m3(__hmul2(__half2(fromPack<__nv_fp8x2_e4m3>(a)), fn.scalar2)));
@@ -609,7 +742,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e5m2>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8_e5m2)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8_e5m2)> preOp(
FuncPreMulSum<__nv_fp8_e5m2> fn, BytePack<sizeof(__nv_fp8_e5m2)> a
) {
return toPack<__nv_fp8_e5m2>(__nv_fp8_e5m2(__hmul(__half(fromPack<__nv_fp8_e5m2>(a)), fn.scalar2.x)));
@@ -618,7 +751,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e5m2>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8x2_e5m2)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8x2_e5m2)> preOp(
FuncPreMulSum<__nv_fp8_e5m2> fn, BytePack<sizeof(__nv_fp8x2_e5m2)> a
) {
return toPack<__nv_fp8x2_e5m2>(__nv_fp8x2_e5m2(__hmul2(__half2(fromPack<__nv_fp8x2_e5m2>(a)), fn.scalar2)));
@@ -633,7 +766,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<typename T>
struct RedOpArg<FuncSumPostDiv<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
return *(uint64_t*)ptr;
}
};
@@ -646,12 +779,12 @@ struct FuncSumPostDiv {
uint32_t divisor:31, isSigned:1;
UintType recip;
__device__ FuncSumPostDiv(uint64_t opArg=0) {
__device__ __forceinline__ FuncSumPostDiv(uint64_t opArg=0) {
isSigned = opArg & 1;
divisor = opArg >> 1;
recip = UintType(-1)/divisor;
}
__device__ T divide(T x) {
__device__ __forceinline__ T divide(T x) {
// x is negative iff we are in signed mode and the top bit is set
bool xneg = isSigned && (x & ~(T(-1)>>1));
// Compute abs(x):
@@ -673,7 +806,7 @@ struct FuncSumPostDiv {
template<typename T, int EltPerPack>
struct Apply_Reduce<FuncSumPostDiv<T>, EltPerPack>:
Apply_Reduce<FuncSum<T>, EltPerPack> {
__device__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncSumPostDiv<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncSumPostDiv<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
// FuncSumPostDiv reduce dispatches to FuncSum.
return Apply_Reduce<FuncSum<T>, EltPerPack>::reduce(FuncSum<T>(), a, b);
}
@@ -682,7 +815,7 @@ struct Apply_Reduce<FuncSumPostDiv<T>, EltPerPack>:
template<typename T>
struct Apply_PostOp<FuncSumPostDiv<T>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(T)> postOp(FuncSumPostDiv<T> fn, BytePack<sizeof(T)> a) {
__device__ __forceinline__ static BytePack<sizeof(T)> postOp(FuncSumPostDiv<T> fn, BytePack<sizeof(T)> a) {
return toPack<T>(fn.divide(fromPack<T>(a)));
}
};
@@ -690,120 +823,145 @@ struct Apply_PostOp<FuncSumPostDiv<T>, /*EltPerPack=*/1> {
////////////////////////////////////////////////////////////////////////////////
// Apply_LoadMultimem
#define SIZEOF_BytePack_field_u16 2
#define PTX_REG_BytePack_field_u16 "h"
#define RegCode_for_size_1 "r"
#define RegCode_for_size_2 "h"
#define RegCode_for_size_4 "r"
#define RegCode_for_size_8 "l"
#define SIZEOF_BytePack_field_u32 4
#define PTX_REG_BytePack_field_u32 "r"
#define RegSize_for_size_1 4
#define RegSize_for_size_2 2
#define RegSize_for_size_4 4
#define RegSize_for_size_8 8
#define SIZEOF_BytePack_field_u64 8
#define PTX_REG_BytePack_field_u64 "l"
#define PtxAcc_for_u32
#define PtxAcc_for_s32
#define PtxAcc_for_s64
#define PtxAcc_for_u64
#define PtxAcc_for_f32
#define PtxAcc_for_f64
#if CUDART_VERSION >= 12020
#define PtxAcc_for_f16 ".acc::f32"
#define PtxAcc_for_bf16 ".acc::f32"
#define PtxAcc_for_f16x2 ".acc::f32"
#define PtxAcc_for_bf16x2 ".acc::f32"
#else
#define PtxAcc_for_f16
#define PtxAcc_for_bf16
#define PtxAcc_for_f16x2
#define PtxAcc_for_bf16x2
#endif
#define PtxAcc_for_e4m3 ".acc::f16"
#define PtxAcc_for_e5m2 ".acc::f16"
#define PtxAcc_for_e4m3x4 ".acc::f16"
#define PtxAcc_for_e5m2x4 ".acc::f16"
#define DEFINE_Apply_LoadMultimem_sum(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum(T, ptx_ty, PackSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, SIZEOF_BytePack_field_##pack_field> { \
static constexpr int PackSize = SIZEOF_BytePack_field_##pack_field; \
__device__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
struct Apply_LoadMultimem<FuncSum<T>, PackSize> { \
__device__ __forceinline__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<RegSize_for_size_##PackSize> reg; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty "." #ptx_ty " %0, [%1];" \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
BytePack<PackSize> ans; \
ans.native = reg.native; \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax(T, ptx_ty, PackSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, SIZEOF_BytePack_field_##pack_field> { \
static constexpr int PackSize = SIZEOF_BytePack_field_##pack_field; \
__device__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
struct Apply_LoadMultimem<FuncMinMax<T>, PackSize> { \
__device__ __forceinline__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<RegSize_for_size_##PackSize> reg; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
} \
BytePack<PackSize> ans; \
ans.native = reg.native; \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, 4*(SIZEOF_BytePack_field_##pack_field)> { \
static constexpr int PackSize = 4*(SIZEOF_BytePack_field_##pack_field); \
__device__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
struct Apply_LoadMultimem<FuncSum<T>, 4*(VecEltSize)> { \
static constexpr int PackSize = 4*(VecEltSize); \
__device__ __forceinline__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
union { BytePack<PackSize> ans; BytePack<VecEltSize> elts[4]; }; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty ".v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, 4*(SIZEOF_BytePack_field_##pack_field)> { \
static constexpr int PackSize = 4*(SIZEOF_BytePack_field_##pack_field); \
__device__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
struct Apply_LoadMultimem<FuncMinMax<T>, 4*(VecEltSize)> { \
static constexpr int PackSize = 4*(VecEltSize); \
__device__ __forceinline__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
union { BytePack<PackSize> ans; BytePack<VecEltSize> elts[4]; }; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
} \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(T, ptx_ty, pack_field) \
DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(T, ptx_ty, VecEltSize) \
DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, sizeof(T)> { \
__device__ static BytePack<sizeof(T)> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<2*sizeof(T)> tmp; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
return tmp.half[(addr/sizeof(T))%2]; \
__device__ __forceinline__ static BytePack<sizeof(T)> load(FuncSum<T> fn, uintptr_t addr) { \
union { BytePack<VecEltSize> tmp; BytePack<sizeof(T)> elts[(VecEltSize)/sizeof(T)]; }; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty "." #ptx_ty " %0, [%1];" \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
return elts[(addr/sizeof(T))%((VecEltSize)/sizeof(T))]; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(T, ptx_ty, pack_field) \
DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(T, ptx_ty, VecEltSize) \
DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, sizeof(T)> { \
__device__ static BytePack<sizeof(T)> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<2*sizeof(T)> tmp; \
__device__ __forceinline__ static BytePack<sizeof(T)> load(FuncMinMax<T> fn, uintptr_t addr) { \
union { BytePack<VecEltSize> tmp; BytePack<sizeof(T)> elts[(VecEltSize)/sizeof(T)]; }; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
} \
return tmp.half[(addr/sizeof(T))%2]; \
return elts[(addr/sizeof(T))%((VecEltSize)/sizeof(T))]; \
} \
};
template<typename Fn, int BytePerPack>
struct Apply_LoadMultimem {
__device__ static BytePack<BytePerPack> load(Fn fn, uintptr_t addr) {
__device__ __forceinline__ static BytePack<BytePerPack> load(Fn fn, uintptr_t addr) {
__trap();
return {};
}
@@ -826,29 +984,36 @@ struct Apply_LoadMultimem {
/*multimem.ld_reduce not supported:*/ 0;
};
DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, u32)
DEFINE_Apply_LoadMultimem_minmax(uint32_t, u32, u32)
DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, 4)
DEFINE_Apply_LoadMultimem_minmax(uint32_t, u32, 4)
DEFINE_Apply_LoadMultimem_sum(int32_t, s32, u32)
DEFINE_Apply_LoadMultimem_minmax(int32_t, s32, u32)
DEFINE_Apply_LoadMultimem_sum(int32_t, s32, 4)
DEFINE_Apply_LoadMultimem_minmax(int32_t, s32, 4)
DEFINE_Apply_LoadMultimem_sum(uint64_t, u64, u64)
DEFINE_Apply_LoadMultimem_minmax(uint64_t, u64, u64)
DEFINE_Apply_LoadMultimem_sum(uint64_t, u64, 8)
DEFINE_Apply_LoadMultimem_minmax(uint64_t, u64, 8)
DEFINE_Apply_LoadMultimem_sum(int64_t, u64, u64)
DEFINE_Apply_LoadMultimem_minmax(int64_t, s64, u64)
DEFINE_Apply_LoadMultimem_sum(int64_t, u64, 8)
DEFINE_Apply_LoadMultimem_minmax(int64_t, s64, 8)
DEFINE_Apply_LoadMultimem_sum(float, f32, u32)
DEFINE_Apply_LoadMultimem_sum_v4(float, f32, u32)
DEFINE_Apply_LoadMultimem_sum(float, f32, 4)
DEFINE_Apply_LoadMultimem_sum_v4(float, f32, 4)
DEFINE_Apply_LoadMultimem_sum(double, f64, u64)
DEFINE_Apply_LoadMultimem_sum(double, f64, 8)
DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(half, f16x2, u32)
DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(half, f16x2, u32)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(half, f16x2, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(half, f16x2, 4)
#if defined(__CUDA_BF16_TYPES_EXIST__)
DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(__nv_bfloat16, bf16x2, u32)
DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(__nv_bfloat16, bf16x2, u32)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
#endif
#if NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1000 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1010 || NCCL_CUDA_ARCH_SPECIFIC == 1200 || NCCL_CUDA_ARCH_SPECIFIC == 1210
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)
#endif
#else
template<typename Fn>
@@ -860,11 +1025,29 @@ struct Apply_LoadMultimem {
#undef DEFINE_Apply_LoadMultimem
#undef DEFINE_Apply_LoadMultimem_v4
#undef DEFINE_Apply_LoadMultimem_v4x2_and_subhalf
#undef SIZEOF_BytePack_field_u64
#undef PTX_REG_BytePack_field_u64
#undef SIZEOF_BytePack_field_u32
#undef PTX_REG_BytePack_field_u32
#undef SIZEOF_BytePack_field_u16
#undef PTX_REG_BytePack_field_u16
#undef RegCode_for_size_2
#undef RegCode_for_size_4
#undef RegCode_for_size_8
#undef RegSize_for_size_1
#undef RegSize_for_size_2
#undef RegSize_for_size_4
#undef RegSize_for_size_8
#undef PtxAcc_for_u32
#undef PtxAcc_for_s32
#undef PtxAcc_for_s64
#undef PtxAcc_for_u64
#undef PtxAcc_for_f32
#undef PtxAcc_for_f64
#undef PtxAcc_for_f16
#undef PtxAcc_for_bf16
#undef PtxAcc_for_f16x2
#undef PtxAcc_for_bf16x2
#undef PtxAcc_for_e4m3
#undef PtxAcc_for_e5m2
#undef PtxAcc_for_e4m3x4
#undef PtxAcc_for_e5m2x4
#endif // REDUCE_KERNEL_H_
+185 -61
Просмотреть файл
@@ -142,82 +142,206 @@ struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_PAT, NCCL_PROTO_SI
template<typename T, typename RedOp>
struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_NVLS, NCCL_PROTO_SIMPLE> {
template<bool ReduceSendNotRecv>
struct Scatterer {
struct ncclDevWorkColl* work;
int chunkCount;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
) {
static_assert(SlicePerChunk == 1, "require: SlicePerChunk==1");
static_assert(MaxDsts <= 1 || MaxSrcs <= 1, "require: MaxDsts<=1 || MaxSrcs<=1");
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
int nNodes = ncclShmem.comm.nNodes;
int nRails = nvls->nHeads;
int part = ncclShmem.channelId - work->channelLo;
void* inbuf = (void*)work->sendbuff;
ssize_t countPerRank = work->collnet.count;
ssize_t railAllBeg = min(railGridOffset + part * chunkCount, nNodes * countPerRank);
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
int railAllSize = railAllEnd - railAllBeg;
int rail = nvls->headRank;
int dst = 0;
if (ReduceSendNotRecv) {
if (work->regUsed) return;
rail = 0;
nSrcs = 1;
} else {
rail = nvls->headRank;
}
if (tid < nDsts) dstSizes[tid] = railAllSize;
do {
int node = railAllBeg / countPerRank;
int railAllOffset = 0;
while (railAllOffset < railAllSize) {
ssize_t railOneBeg = node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t railOneOffset = (railAllBeg + railAllOffset) - railOneBeg;
int delta = min(railAllEnd, railOneEnd) - (railAllBeg + railAllOffset);
int rank = ncclShmem.comm.collNetDenseToUserRank[node * nRails + rail];
ssize_t userOneBeg = rank * countPerRank + railOneOffset;
if (nDsts != 0) {
reduceCopy<ncclCollUnroll(), RedOp, T,
/*MultimemSrcs=*/MultimemSrcs, 1, 1 + MaxSrcs,
/*MultimemDsts,MinDsts,MaxDsts=*/MultimemDsts, 1, 1,
/*PreOpSrcs=*/1>
(tid, tn, work->redOpArg, &work->redOpArg, false,
/*nSrcs=*/nSrcs, [=]__device__(int s) {
return work->regUsed ? (T*)srcPtrs[s] + userOneBeg :
!ReduceSendNotRecv ? (T*)srcPtrs[s] + railAllOffset:
(T*)inbuf + userOneBeg;
},
/*nDsts=*/1, [=]__device__(int d/*==0*/) {
return (T*)dstPtrs[dst] + railAllOffset;
}, delta);
}
railAllOffset += delta;
node += 1;
}
dst += 1;
rail += 1;
} while (ReduceSendNotRecv && dst < nRails);
}
};
__device__ __forceinline__ void run(int tid, int/*nthreads*/, struct ncclDevWorkColl* work) {
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
size_t count;
size_t gridOffset;
size_t channelCount;
size_t chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
const int rank = ncclShmem.comm.rank;
const int nranks = ncclShmem.comm.nRanks;
size_t offset;
int nelem;
/* if we are direct NVLS, we only need to allocate 1 warp to scatter for sync;
* if not, based on #ranks, we allocate 7 or 5 warps to reduce to saturate bandwidth
* and the rest are allocated to scatter. */
const int nThreadsReduce = work->regUsed ? (NCCL_MAX_NTHREADS - WARP_SIZE) : (nranks <= 6 ? 7 * WARP_SIZE : 5 * WARP_SIZE);
const int nThreadsScatter = work->regUsed ? WARP_SIZE : (NCCL_MAX_NTHREADS - nThreadsReduce);
const int tidEndScatter = nThreadsScatter;
const int nThreadsNetRecv = work->oneNode ? 0 : (work->netRegUsed ? WARP_SIZE : 6 * WARP_SIZE);
const int nThreadsScatter = work->regUsed ? roundUp(nvls->nHeads << 2, WARP_SIZE) : 8 * WARP_SIZE;
const int nThreadsReduce = NCCL_MAX_NTHREADS - nThreadsNetRecv - nThreadsScatter;
const int tidEndNetRecv = nThreadsNetRecv;
const int tidEndScatter = tidEndNetRecv + nThreadsScatter;
const int tidEndReduce = tidEndScatter + nThreadsReduce;
if (!work->regUsed) {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, NULL, nvls->up, work->sendbuff, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.scatter(offset, nvls->nHeads * count, nelem, count, -1, 0);
if (work->oneNode) {
const int rank = ncclShmem.comm.rank;
size_t offset;
size_t count, gridOffset, channelCount, chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
if (!work->regUsed) {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, NULL, nvls->up, work->sendbuff, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.scatter(offset, nvls->nHeads * count, nelem, count, -1, 0);
}
// coverity[overrun-call] => Coverity think prims.index can be greater than 1
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, NULL, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.recv(offset, nelem);
}
}
// coverity[overrun-call] => Coverity think prims.index can be greater than 1
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, NULL, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.recv(offset, nelem);
} else {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.scatter(0, 0, 0, 0, -1, 0);
}
/* gather used as sync */
prims.gather(0, 0, 0, 0, -1, 0);
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->down, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0, work);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
size_t outOffset = gridOffset + elemOffset;
size_t inpOffset = outOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
// Coverity complains about a possible overrun inside the method invoked below, but that's actually
// a false positive.
// coverity[overrun-call:FALSE]
prims.directRecvCopy(inpOffset, outOffset, nelem);
}
/* send for sync */
prims.send(0, 0);
}
}
} else {
if (tid < tidEndScatter) {
// Scatter
// multi-node
int nNodes = ncclShmem.comm.nNodes;
int part = ncclShmem.channelId - work->channelLo;
ssize_t countPerRank = work->collnet.count;
const int nChannels = work->channelHi - work->channelLo + 1;
ssize_t chunkCount = work->collnet.chunkCount;
if (tid < tidEndNetRecv) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.scatter(0, 0, 0, 0, -1, 0);
if (work->netRegUsed) {
if (tid == 0) {
int steps = (int)divUp(nNodes * countPerRank, nChannels * chunkCount);
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>::recvPeerNotify(nvls->out, 0, steps);
}
__syncwarp();
} else {
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsNetRecv, &nvls->out, nullptr, nullptr, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 0, 0);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
ssize_t railAllBeg = railGridOffset + part * chunkCount;
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
ssize_t railOneBeg = ncclShmem.comm.node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t beg = max(railAllBeg, railOneBeg);
ssize_t end = min(railAllEnd, railOneEnd);
prims.recv(beg - railOneBeg, max(ssize_t(0), end - beg), /*postOp=*/true);
}
}
/* gather used as sync */
prims.gather(0, 0, 0, 0, -1, 0);
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->down, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0, work);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
size_t outOffset = gridOffset + elemOffset;
size_t inpOffset = outOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
// Coverity complains about a possible overrun inside the method invoked below, but that's actually
// a false positive.
// coverity[overrun-call:FALSE]
prims.directRecvCopy(inpOffset, outOffset, nelem);
} else {
if (tid < tidEndScatter) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndNetRecv, nThreadsScatter, nullptr, nvls->up, work->sendbuff, nullptr,
work->redOpArg, 1 * Proto::MaxGroupWidth, 1, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*ReduceSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkCount = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/0, /*Send=*/1>(scat);
}
} else if (tid < tidEndReduce) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->out, nullptr, nullptr,
work->redOpArg, 2 * Proto::MaxGroupWidth, 0, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*ReduceSendNotRecv=*/false> scat;
scat.work = work;
scat.chunkCount = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
}
}
/* send for sync */
prims.send(0, 0);
}
}
}
@@ -231,7 +355,7 @@ struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_COLLNET_DIRECT, NC
int chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts>
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
+367
Просмотреть файл
@@ -0,0 +1,367 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers>
static __device__ void bcastDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
char* inputHere, char* outputRank0, bool inPlace, int nIters
) {
using Pack = BytePack<BytePerPack>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpHere = (Pack*)inputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack tmp[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = inpHere[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
int dr = inPlace ? 1 : 0;
int r = rank + dr;
if (r == nRanks) r = 0;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && dr == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
add4G(outRank0, r*stride4G)[u*WARP_SIZE] = tmp[u];
}
if (++r == nRanks) r = 0;
}
}
}
inpHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = inpHere[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename T>
static __device__ void bcastEnds(
ncclSymPrims& prim, int tn, int t,
T* inputHere, T* outputRank0, bool inPlace, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpHere = (BytePack<sizeof(T)>*)inputHere;
BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nPreElts-nSufElts+i;
BytePack<sizeof(T)> tmp = inpHere[elt];
int dr = inPlace ? 1 : 0;
int r = rank + dr;
if (r == nRanks) r = 0;
#pragma unroll 1
for (; dr + UnrollPeers <= nRanks; dr += UnrollPeers) {
#pragma unroll UnrollPeers
for (int u=0; u < UnrollPeers; u++) {
*add4G(outRank0+elt, r*stride4G) = tmp;
if (++r == nRanks) r = 0;
}
}
#pragma unroll UnrollPeers
for (int u=0; u < UnrollPeers; u++) {
if (dr+u == nRanks) break;
*add4G(outRank0+elt, r*stride4G) = tmp;
if (++r == nRanks) r = 0;
}
}
}
template<typename T>
static __device__ void bcast(
ncclSymPrims& prim, int tn, int t, bool waitNeeded, T* input, T* output, size_t nElts
) {
bool inPlace = (input == output);
// Mpve to rank=0
output = prim.peerPtr(0, output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (128u - inputUptr)%128u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
bcastDeep<BytePerPack, UnrollPacks, UnrollPeers>(
prim, tn, t, waitNeeded,
(char*)input + cursor, (char*)output + cursor, inPlace,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
bcastDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers>(
prim, tn, t, waitNeeded,
(char*)input + cursor, (char*)output + cursor, inPlace,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
bcastEnds<UnrollPeers>(prim, tn, t, input, output, inPlace, nElts, nPreBytes/sizeof(T), nSufElts);
}
__device__ __forceinline__ void ncclSymRun_AllGather_ST(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
int const& rank = prim.rank;
// Threads numbered over rank.
int bt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int btn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
bcast(prim, btn, bt, /*waitNeeded=*/true, (char*)args->input, (char*)args->output + rank*args->nElts, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename T>
static __device__ void bcastMultimem(
ncclSymPrims& prim, int tn, int t, T* input, T* output, size_t nElts
) {
// Move output to multimem
output = prim.multimemPtr(output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16-inputUptr)%16;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 8;
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
uint32_t nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = *reinterpret_cast<BytePack<BytePerPack>*>(inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
multimem_st_global(outputUptr + cursor + u*WARP_SIZE*BytePerPack, tmp[u]);
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Get the prefix+suffix element one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = *reinterpret_cast<BytePack<sizeof(T)>*>(inputUptr + cursor);
multimem_st_global(outputUptr + cursor, val);
cursor += tn*sizeof(T);
}
}
__device__ __forceinline__ void ncclSymRun_AllGather_STMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
int const& rank = prim.rank;
char* input = args->input;
char* output = args->output;
size_t bytes = args->nElts;
// Round robin memory to blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
bcastMultimem(prim, tn, t, input, output + rank*bytes, bytes);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename EltType>
static __device__ void allgather_LL_body(
ncclSymPrims &prim, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts
) {
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(EltType);
ncclCoopCta cta;
int rank = prim.rank;
int nRanks = prim.nRanks;
constexpr int tn = ncclSymMaxThreads;
int t = threadIdx.x;
#pragma unroll 1
while (0 < nElts) {
int nIterPacks = min(nPacks, tn);
if (t < nIterPacks) {
Pack x = loadPack<Pack>(input, t*EltPerPack, nElts);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, x);
}
int tn_div_nPacks = tn/nIterPacks;
int tn_mod_nPacks = tn%nIterPacks;
int peer = t/nIterPacks;
int pack = t%nIterPacks;
#if 1
// NOTE: Unrolling speedup on eos nranks=8 size=64K: 5.7us vs 6.7us
constexpr int Unroll = 4;
#pragma unroll 1
for (int i = t; i < (nRanks*nIterPacks & -(Unroll*tn)); i += Unroll*tn) {
Pack got[Unroll];
prim.template recvLL<Unroll, Unroll>(i, Unroll, tn, /*&*/got);
#pragma unroll
for (int u=0; u < Unroll; u++) {
storePack<Pack>(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
}
int i = (nRanks*nIterPacks & -(Unroll*tn)) + t;
int n = (nRanks*nIterPacks)/tn % Unroll;
if (i + n*tn < nRanks*nIterPacks) n += 1;
if (n != 0) {
Pack got[Unroll];
prim.template recvLL<1, Unroll>(i, n, tn, /*&*/got);
#pragma unroll
for (int u=0; u < Unroll; u++) {
if (u != 0 && u == n) break;
storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
}
#else
// The non-unrolled but "obviously correct" implementation for reference.
#pragma unroll 1
for (int i = t; i < nRanks*nIterPacks; i += tn) {
Pack got = prim.template recvLL<Pack>(i);
storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
#endif
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
static __device__ void ncclSymRun_AllGather_LL_impl(ncclSymDevArgs const* args, bool multimem) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
using Pack = BytePack<8>;
constexpr int BytePerPack = 8;
int nElts = args->nElts;
int nPacks = divUp(nElts, BytePerPack);
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
int nBlockPacks = blockPackEnd - blockPackBegin;
int nBlockElts = nElts - blockPackBegin*BytePerPack;
nBlockElts = min(nBlockElts, nBlockPacks*BytePerPack);
char* blockInput = args->input + blockPackBegin*BytePerPack;
char* blockOutput = args->output + blockPackBegin*BytePerPack;
uint32_t lowBits = args->nElts;
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
if (__builtin_expect(lowBits%8 == 0, true)) {
// NOTE: Specializing for 8-byte alignment in one case help at size=65K: 8.9us vs 5.6us
allgather_LL_body(prim, (BytePack<8>*)blockInput, (BytePack<8>*)blockOutput, nBlockElts/8, nBlockPacks, nElts/8);
} else {
allgather_LL_body(prim, blockInput, blockOutput, nBlockElts, nBlockPacks, nElts);
}
}
__device__ __forceinline__ void ncclSymRun_AllGather_LL(ncclSymDevArgs const* args) {
ncclSymRun_AllGather_LL_impl(args, /*multimem=*/false);
}
__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(ncclSymDevArgs const* args) {
ncclSymRun_AllGather_LL_impl(args, /*multimem=*/true);
}
+432
Просмотреть файл
@@ -0,0 +1,432 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
static __device__ __forceinline__ void allreduceDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, char* inputRank0, char* outputRank0, int32_t nIters
) {
using Pack = BytePack<BytePerPack>;
using Acc = typename Red::EltType;
using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack acc0[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
AccPack acc1[UnrollPacks];
int r = rank;
if (++r == nRanks) r = 0;
{ Pack tmp1[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, applyCast<T, Acc>(acc0[u]), applyCast<T, Acc>(tmp1[u]));
}
}
if (++r == nRanks) r = 0;
int dr = 2;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
Pack tmp1[UnrollPeers][UnrollPacks];
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
if (++r == nRanks) r = 0;
}
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, acc1[u], applyCast<T, Acc>(tmp1[ur][u]));
}
}
}
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
dr = 0;
r = rank;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && dr == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
add4G(outRank0, r*stride4G)[u*WARP_SIZE] = acc0[u];
}
if (++r == nRanks) r = 0;
}
}
}
inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename Red, typename T>
static __device__ __forceinline__ void allreduceEnds(
ncclSymPrims& prim, int tn, int t, Red red,
T* inputRank0, T* outputRank0, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
using Acc = typename Red::EltType;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
BytePack<sizeof(Acc)> acc1;
BytePack<sizeof(T)> tmp[UnrollPeers];
int dr = 1;
int r = rank+1;
if (nRanks == r) r = 0;
bool first = true;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j = 0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
tmp[u] = *add4G(inpRank0+elt, r*stride4G);
r += 1;
if (r == nRanks) r = 0;
}
if (first) {
first = false;
acc1 = applyCast<T, Acc>(acc0);
}
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
acc1 = applyReduce(red, acc1, applyCast<T, Acc>(tmp[u]));
}
}
}
acc0 = applyCast<Acc, T>(acc1);
dr = 0;
r = rank;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j=0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && dr+u == nRanks) break;
*add4G(outRank0+elt, r*stride4G) = acc0;
r += 1;
if (r == nRanks) r = 0;
}
}
}
}
}
template<typename Red, typename T>
static __device__ void allreduce(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, T* input, T* output, size_t nElts
) {
int nRanks = prim.nRanks;
int nBlocks = prim.nBlocks;
// Mpve to rank=0
input = prim.peerPtr(0, input);
output = prim.peerPtr(0, output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16u - inputUptr)%16u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
allreduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
allreduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
allreduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
int /*const&*/ rank = prim.rank;
int /*const&*/ nRanks = prim.nRanks;
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
// Threads numbered globally such that we round robin warps by rank then block.
int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
rank, nRanks,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int gtn = nRanks*prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
allreduce(prim, gtn, gt, /*waitNeeded=*/true, red, (T*)args->input, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename Red, typename T>
static __device__ void allreduceMultimem(
ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
) {
// Mpve to multimem
input = prim.multimemPtr(input);
output = prim.multimemPtr(output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
constexpr int BytePerPack = LoadMultimem_BigPackSize<Red>::BigPackSize;
uint32_t nPreBytes = (BytePerPack - inputUptr)%BytePerPack;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if (alignof(T) == BytePerPack || (inputUptr-outputUptr)%BytePerPack == 0) {
constexpr int UnrollPacks = 16*8/BytePerPack;
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
int nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = applyLoadMultimem<Red, BytePerPack>(red, inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
multimem_st_global(outputUptr + cursor + u*WARP_SIZE*BytePerPack, tmp[u]);
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Get the prefix+suffix element one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
multimem_st_global(outputUptr + cursor, val);
cursor += tn*sizeof(T);
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
// Threads numbered globally such that we round robin warps by rank then block.
int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.rank, prim.nRanks,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int gtn = prim.nRanks*prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
allreduceMultimem(prim, gtn, gt, red, (T*)args->input, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R_impl(ncclSymDevArgs const* args, bool multimem) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
int /*const&*/ rank = prim.rank;
using Acc = typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type;
Red<Acc> red(args->redOpArg);
using Pack = BytePack<8>;
using AccPack = BytePack<8*sizeof(Acc)/sizeof(T)>;
constexpr int EltPerPack = 8/sizeof(T);
int nElts = args->nElts;
int nPacks = divUp(nElts, EltPerPack);
bool packAligned = 8 <= alignof(T) || (
args->nElts*sizeof(T) |
(uint32_t)reinterpret_cast<uintptr_t>(args->input) |
(uint32_t)reinterpret_cast<uintptr_t>(args->output)
)%8 == 0;
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
int begin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int end = begin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
nPacks = end - begin;
nElts -= begin*EltPerPack;
nElts = min(nElts, nPacks*EltPerPack);
T* input = (T*)args->input + begin*EltPerPack;
T* output = (T*)args->output + begin*EltPerPack;
ncclCoopCta cta;
int t = threadIdx.x;
int tn = ncclSymMaxThreads;
if (__builtin_expect(packAligned, true)) {
#pragma unroll 1
while (0 < nPacks) {
if (t < nPacks) {
int nIterPacks = min(nPacks, tn);
Pack inp = loadPack<Pack>((Pack*)input, t, nPacks);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack((Pack*)output, t, nPacks, out);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nPacks -= tn;
}
} else {
#pragma unroll 1
while (0 < nElts) {
if (t*EltPerPack < nElts) {
int nIterPacks = min(nPacks, tn);
Pack inp = loadPack<Pack>(input, t*EltPerPack, nElts);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack(output, t*EltPerPack, nElts, out);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(ncclSymDevArgs const* args) {
ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/false);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(ncclSymDevArgs const* args) {
ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/true);
}
Исполняемый файл
+294
Просмотреть файл
@@ -0,0 +1,294 @@
#!/usr/bin/env python3
import os
import sys
################################################################################
# The first command line argument is the path to the directory to generate and
# populate.
gensrc = sys.argv[1]
if os.path.exists(gensrc):
for name in os.listdir(gensrc):
os.remove(os.path.join(gensrc, name))
#os.truncate(os.path.join(gensrc, name), 0)
else:
os.mkdir(gensrc)
def paste(sep, *args):
return sep.join(args)
indents = 0
def emitln(f, lines):
global indents
for ln in ((lines,) if isinstance(lines, str) else lines):
f.write(' '*indents + ln + '\n')
def indent(s):
return '\n'.join(' '+l for l in s.splitlines())
class Rec(object):
def __init__(me, **kw):
me.__dict__.update(kw)
def __eq__(x, y):
if len(x) != len(y): return False
for k in x:
if k not in y: return False
if x[k] != y[k]: return False
return True
def __hash__(me):
h = 0
for k in me.__dict__:
h += hash((k, me.__dict__[k]))
return h
################################################################################
# Edit this region for introducing new algos etc
reductions = ["AllReduce","ReduceScatter"]
all_reds = ["sum"]
all_tys = ["f32","f16","bf16","f8e4m3","f8e5m2"]
nvls_algos_by_coll = {
"AllReduce": ["AGxLLMC_R","RSxLDMC_AGxSTMC"],
"ReduceScatter": ["LDMC"]
}
ldmc_algos = ["RSxLDMC_AGxSTMC", "LDMC"]
coll_to_lower = {
"AllGather": "all_gather",
"AllReduce": "all_reduce",
"ReduceScatter": "reduce_scatter"
}
red_to_ncclDevRedOp = {
"sum": "ncclDevSum"
}
red_to_Func = {
"sum": "FuncSum"
}
ty_to_ncclDataType = {
"f32": "ncclFloat32",
"f16": "ncclFloat16",
"bf16": "ncclBfloat16",
"f8e4m3": "ncclFloat8e4m3",
"f8e5m2": "ncclFloat8e5m2"
}
ty_to_cxxtype = {
"f32": "float",
"f16": "half",
"bf16": "__nv_bfloat16",
"f8e4m3": "__nv_fp8_e4m3",
"f8e5m2": "__nv_fp8_e5m2"
}
def enumerate_kernels():
for algo in ["LL","LLMC","ST","STMC"]:
yield Rec(coll="AllGather", algo=algo)
for red in all_reds:
for ty in all_tys:
for algo in ["AGxLL_R","AGxLLMC_R","RSxLD_AGxST","RSxLDMC_AGxSTMC"]:
yield Rec(coll="AllReduce", algo=algo, red=red, ty=ty)
for algo in ["LL","LD","LDMC"]:
yield Rec(coll="ReduceScatter", algo=algo, red=red, ty=ty)
def required_cuda(k):
cudart, arch, specific_sms = 0, 0, None
is_nvls = k.algo in nvls_algos_by_coll.get(k.coll, [])
if is_nvls:
cudart = max(cudart, 12010)
arch = 900
if k.coll in reductions:
if k.ty == "bf16":
cudart = max(cudart, 11000)
if k.ty.startswith("f8"):
cudart = max(cudart, 11080)
arch = 900
if k.algo in ldmc_algos:
cudart = 12070
arch = None
specific_sms = [100, 120]
return (cudart, arch, specific_sms)
################################################################################
def kernel_fdep(k):
return coll_to_lower[k.coll] + '.cu'
def kernel_fname(k):
if k.coll in reductions:
if k.algo in ldmc_algos and k.ty.startswith('f8'):
return paste('_', coll_to_lower[k.coll], k.red, k.ty, k.algo) + '.cu'
else:
return paste('_', coll_to_lower[k.coll], k.red, k.ty) + '.cu'
else:
return coll_to_lower[k.coll] + '.cu'
def kernel_gencode(k):
if k.coll in reductions and k.algo in ldmc_algos and k.ty.startswith('f8'):
return "$(NVCC_GENCODE_LDMC_FP8)"
else:
return "$(NVCC_GENCODE)"
def kernel_cname(k):
if k.coll in reductions:
return paste("_", "ncclSymDevKernel", k.coll, k.algo, k.red, k.ty)
else:
return paste("_", "ncclSymDevKernel", k.coll, k.algo)
def kernel_conds(k):
cudart, arch, specific_sms = required_cuda(k)
if cudart == 0: return (None, None)
cudart_cond = "CUDART_VERSION >= %d"%cudart
if not specific_sms:
arch_cond = "__CUDA_ARCH__ >= %d"%arch
else:
arch_cond = " || ".join(["0"] + ["NCCL_CUDA_ARCH_SPECIFIC==%d"%(10*sm) for sm in specific_sms])
return cudart_cond, arch_cond
def instantiate(k):
cudart_cond, arch_cond = kernel_conds(k)
if (cudart_cond, arch_cond) == (None, None):
form_red_ty = (
"__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" ncclSymRun_{id}<{red}, {ty}>(&args);\n"
"}}"
)
form = (
"__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" ncclSymRun_{id}(&args);\n"
"}}"
)
else:
form_red_ty = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" #if {arch_cond}\n"
" ncclSymRun_{id}<{red}, {ty}>(&args);\n"
" #endif\n"
" }}\n"
"#endif"
)
form = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" #if {arch_cond}\n"
" ncclSymRun_{id}(&args);\n"
" #endif\n"
" }}\n"
"#endif"
)
id = k.coll+'_'+k.algo
cname = kernel_cname(k)
if k.coll in reductions:
inst = form_red_ty.format(cname=cname, id=id, red=red_to_Func[k.red], ty=ty_to_cxxtype[k.ty], cudart_cond=cudart_cond, arch_cond=arch_cond)
else:
inst = form.format(cname=cname, id=id, cudart_cond=cudart_cond, arch_cond=arch_cond)
return inst
def prototype(k):
cudart_cond, arch_cond = kernel_conds(k)
if cudart_cond is None:
form = "__global__ void {cname}(ncclSymDevArgs const);"
else:
form = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs const);\n"
"#else\n"
" constexpr void* {cname} = nullptr;\n"
"#endif"
)
return form.format(cname=kernel_cname(k), cudart_cond=cudart_cond)
################################################################################
def partition(vals, keyfn):
ans = {}
for x in vals:
k = keyfn(x)
if k not in ans:
ans[k] = []
ans[k].append(x)
return ans
kernels_by_file = partition(enumerate_kernels(), lambda k: (kernel_fname(k), k.coll))
# Add dependency only files (e.g. allreduce.cu)
for coll in set(k.coll for k in enumerate_kernels()):
fname = coll_to_lower[coll]+'.cu'
if (fname, coll) not in kernels_by_file:
kernels_by_file[fname, coll] = []
# Generate each kernel instantiation file
for (fname, coll), ks in kernels_by_file.items():
with open(os.path.join(gensrc, fname), "w") as f:
emitln(f, '#include "symmetric.h"')
emitln(f, '#include "symmetric/kernel.cuh"')
emitln(f, '#include "symmetric/{coll}.cuh"'.format(coll=coll_to_lower[coll]))
for k in ks:
emitln(f, instantiate(k))
# Generate <gensrc>/symmetric_host.cc
with open(os.path.join(gensrc, "symmetric_kernels.cc"), "w") as f:
emitln(f, '#include "symmetric.h"')
emitln(f, '#include "device.h"')
emitln(f, '')
for k in enumerate_kernels():
emitln(f, prototype(k))
emitln(f, '')
emitln(f, 'extern int const ncclSymKernelCount = %d;' % len(list(enumerate_kernels())))
emitln(f, 'extern void* const ncclSymKernelList[] = {')
for k in enumerate_kernels():
emitln(f, '(void*){cname},'.format(cname=kernel_cname(k)))
emitln(f, 'nullptr};')
emitln(f, '')
emitln(f, 'void* ncclSymGetKernelPtr(ncclSymKernelId id, int red, ncclDataType_t ty) {')
indents += 1
emitln(f, 'switch (id) {')
emitln(f, 'default: return nullptr;')
for (coll, algo), coll_algo_ks in partition(enumerate_kernels(), lambda k: (k.coll, k.algo)).items():
emitln(f, 'case ncclSymKernelId_'+coll+'_'+algo+':')
indents += 1
if len(coll_algo_ks) == 1:
emitln(f, 'return (void*)&'+kernel_cname(coll_algo_ks[0])+';')
else:
emitln(f, 'switch ((ncclDevRedOp_t)red) {')
emitln(f, 'default: return nullptr;')
for red, coll_algo_red_ks in partition(coll_algo_ks, lambda k: k.red).items():
emitln(f, 'case '+red_to_ncclDevRedOp[red]+':')
indents += 1
emitln(f, 'switch (ty) {')
emitln(f, 'default: return nullptr;')
for k in coll_algo_red_ks:
emitln(f, 'case '+ty_to_ncclDataType[k.ty]+': return (void*)'+kernel_cname(k)+';')
emitln(f, '}')
indents -= 1
emitln(f, '}')
indents -=1
emitln(f, '}')
indents -= 1
emitln(f, '}')
# Generate <gensrc>/rules.mk
with open(os.path.join(gensrc, "rules.mk"), "w") as f:
inst_names = sorted(set(kernel_fname(k) for k in enumerate_kernels()))
names = inst_names + ["symmetric_kernels.cc"]
f.write("LIB_OBJS_SYM_GEN = $(patsubst %,$(OBJDIR)/genobj/symmetric/%.o,{names})\n"
.format(names=" ".join(names)))
f.write("\n")
inst_names = sorted(set((k.coll, kernel_fname(k), kernel_gencode(k)) for k in enumerate_kernels()))
for coll, name, gencode in inst_names:
f.write(
"$(OBJDIR)/genobj/symmetric/{name}.o: $(OBJDIR)/gensrc/symmetric $(OBJDIR)/genobj/symmetric/{coll}.cu.d\n"
"\t" "$(call COMPILE_SYM,$@,$(OBJDIR)/gensrc/symmetric/{name},{gencode})\n"
"\n"
.format(name=name, coll=coll_to_lower[coll], gencode=gencode)
)
+27
Просмотреть файл
@@ -0,0 +1,27 @@
#ifndef NCCL_DEVICE_SYMMETRIC_KERNEL_H_
#define NCCL_DEVICE_SYMMETRIC_KERNEL_H_
#include "symmetric.h"
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_LL(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_ST(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_STMC(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(struct ncclSymDevArgs const* args);
#endif
+420
Просмотреть файл
@@ -0,0 +1,420 @@
#ifndef NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
#define NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
#include "symmetric.h"
#include "bitops.h"
#include "collectives.h"
#include "op128.h"
#include "reduce_kernel.h"
#if __CUDA_ARCH__ >= 700
// __grid_constant__ appears to break cuda-gdb
#define NCCL_GRID_CONSTANT __grid_constant__
#else
#define NCCL_GRID_CONSTANT
#endif
// flattenIx(pos0, dim0, pos1, dim1, pos2, dim2, ...)
// Given a position vector `pos` in a rectangular index space with lengths in the `dim`
// vector, flatten that down to a linear index. The fastest moving dimension is given first.
__device__ __forceinline__ int flattenIx() { return 0; }
template<typename Int0, typename Int1, typename ...Ints>
static __device__ Int0 flattenIx(Int0 pos, Int1 size, Ints ...more) {
return pos + size*flattenIx(more...);
}
// Precomputed integer reciprocoals for denominator values 1..64 inclusive.
// Pass these to idivFast64() for fast division on the GPU.
static __device__ uint64_t idivRcp64_upto64(int x) {
static constexpr uint64_t table[65] = {
idivRcp64(0x01), idivRcp64(0x01), idivRcp64(0x02), idivRcp64(0x03),
idivRcp64(0x04), idivRcp64(0x05), idivRcp64(0x06), idivRcp64(0x07),
idivRcp64(0x08), idivRcp64(0x09), idivRcp64(0x0a), idivRcp64(0x0b),
idivRcp64(0x0c), idivRcp64(0x0d), idivRcp64(0x0e), idivRcp64(0x0f),
idivRcp64(0x10), idivRcp64(0x11), idivRcp64(0x12), idivRcp64(0x13),
idivRcp64(0x14), idivRcp64(0x15), idivRcp64(0x16), idivRcp64(0x17),
idivRcp64(0x18), idivRcp64(0x19), idivRcp64(0x1a), idivRcp64(0x1b),
idivRcp64(0x1c), idivRcp64(0x1d), idivRcp64(0x1e), idivRcp64(0x1f),
idivRcp64(0x20), idivRcp64(0x21), idivRcp64(0x22), idivRcp64(0x23),
idivRcp64(0x24), idivRcp64(0x25), idivRcp64(0x26), idivRcp64(0x27),
idivRcp64(0x28), idivRcp64(0x29), idivRcp64(0x2a), idivRcp64(0x2b),
idivRcp64(0x2c), idivRcp64(0x2d), idivRcp64(0x2e), idivRcp64(0x2f),
idivRcp64(0x30), idivRcp64(0x31), idivRcp64(0x32), idivRcp64(0x33),
idivRcp64(0x34), idivRcp64(0x35), idivRcp64(0x36), idivRcp64(0x37),
idivRcp64(0x38), idivRcp64(0x39), idivRcp64(0x3a), idivRcp64(0x3b),
idivRcp64(0x3c), idivRcp64(0x3d), idivRcp64(0x3e), idivRcp64(0x3f),
idivRcp64(0x40)
};
return table[x];
}
static __device__ uint32_t idivRcp32_upto64(int x) {
return idivRcp64_upto64(x)>>32;
}
namespace {
struct ncclCoopCta {
__device__ void sync() { __syncthreads(); }
__device__ int self() { return threadIdx.x; }
__device__ int count() { return blockDim.x; }
};
struct ncclCoopWarps {
int log2_nWarps;
__device__ void sync() {
asm volatile("barrier.sync %0, %1;" :: "r"(1 + (threadIdx.x>>(5+log2_nWarps))), "r"(32<<log2_nWarps) : "memory");
}
__device__ int self() { return threadIdx.x & ((32<<log2_nWarps)-1); }
__device__ int count() { return 32<<log2_nWarps; }
};
struct ncclCoopWarp {
__device__ void sync() { __syncwarp(); }
__device__ int self() { return threadIdx.x%32; }
__device__ int count() { return 32; }
};
}
namespace {
static constexpr int ncclSymPrims_UseBarrier = 1;
static constexpr int ncclSymPrims_UseLL = 2;
static constexpr int ncclSymPrims_UseMultimem = 4;
struct ncclSymPrims {
int flags;
int const &rank;
int const &nRanks;
uint32_t const &nRanks_rcp32;
int block, nBlocks;
uint32_t nBlocks_rcp32;
uint32_t nBlocks_nWarps_rcp32;
uint32_t nRanks_nBlocks_rcp32;
uint32_t nWarpPerRank, nWarpPerRank_rcp32;
struct ncclSymDevBase* const &base;
uintptr_t offsetMc;
uint32_t const &stride4G;
uint32_t barEpoch;
uint32_t llEpoch;
__device__ ncclSymPrims(ncclSymDevComm const &comm, int flags):
flags(flags),
rank(comm.rank),
nRanks(comm.nRanks),
nRanks_rcp32(comm.nRanks_rcp32),
block(blockIdx.x),
nBlocks(gridDim.x),
nBlocks_rcp32(idivRcp32_upto64(nBlocks)),
nBlocks_nWarps_rcp32(imulRcp32(nBlocks, nBlocks_rcp32, blockDim.x/32, idivRcp32_upto64(blockDim.x/32))),
nRanks_nBlocks_rcp32(imulRcp32(nRanks, nRanks_rcp32, gridDim.x, nBlocks_rcp32)),
nWarpPerRank(idivFast32(nBlocks*blockDim.x/32, nRanks, nRanks_rcp32)),
nWarpPerRank_rcp32(idivRcp32_upto64(nWarpPerRank)),
base(comm.base),
offsetMc((flags & ncclSymPrims_UseMultimem) ? (char*)comm.baseMc - (char*)base : 0x0),
stride4G(comm.stride4G) {
#if CUDART_VERSION >= 12030 && __CUDA_ARCH__ >= 900
cudaGridDependencySynchronize();
#endif
if ((flags & ncclSymPrims_UseBarrier) && threadIdx.x < nRanks) {
barEpoch = (flags & ncclSymPrims_UseMultimem) ? base->barEpochMc[block] : base->barEpochUc[block];
}
if (flags & ncclSymPrims_UseLL) llEpoch = base->llEpoch[block] + 2;
}
__device__ ~ncclSymPrims() {
if (threadIdx.x == 0) {
if (flags & ncclSymPrims_UseBarrier) {
((flags & ncclSymPrims_UseMultimem) ? base->barEpochMc : base->barEpochUc)[block] = barEpoch;
}
if (flags & ncclSymPrims_UseLL) base->llEpoch[block] = llEpoch - 2;
}
}
template<typename T>
__device__ T* peerPtr(int peer, T* selfPtr) {
return add4G(selfPtr, (peer-rank)*stride4G);
}
template<typename T>
__device__ T* multimemPtr(T* selfPtr) {
return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(selfPtr) + offsetMc);
}
__device__ void barrierArrive(ncclCoopCta cta, bool release) {
cta.sync();
#if __CUDA_ARCH__ < 700
if (release) {
if (cta.self() == 0) __threadfence_system();
cta.sync();
}
#endif
if (flags & ncclSymPrims_UseMultimem) {
#if __CUDA_ARCH__ >= 900 && CUDART_VERSION >= 12010
if (cta.self() == 0) {
uint32_t* inbox = &multimemPtr(base)->barInboxMc[block];
if (release) {
asm volatile("multimem.red.release.sys.add.u32 [%0],1;" :: "l"(inbox));
} else {
asm volatile("multimem.red.relaxed.sys.add.u32 [%0],1;" :: "l"(inbox));
}
}
#endif
} else {
int r = cta.self();
if (r != rank && r < nRanks) {
uint32_t* inbox = &peerPtr(r, base)->barInboxPerPeer[block*nRanks + rank];
#if __CUDA_ARCH__ >= 700
if (release) {
asm volatile("st.release.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
} else {
asm volatile("st.relaxed.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
}
#else
asm volatile("st.volatile.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
#endif
}
}
}
__device__ void barrierWait(ncclCoopCta cta, bool acquire) {
if (flags & ncclSymPrims_UseMultimem) {
#if __CUDA_ARCH__ >= 900
if (cta.self() == 0) {
uint32_t* inbox = &base->barInboxMc[block];
while (true) {
uint32_t got;
if (acquire) {
asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
} else {
asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
}
if (got-(barEpoch+nRanks) <= uint32_t(-1)>>1) break;
}
barEpoch += nRanks;
}
#endif
} else {
int r = cta.self();
if (r != rank && r < nRanks) {
uint32_t* inbox = &base->barInboxPerPeer[block*nRanks + r];
while (true) {
uint32_t got;
#if __CUDA_ARCH__ >= 700
if (acquire) {
asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
} else {
asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
}
#else
asm volatile("ld.volatile.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
#endif
if (got-(barEpoch+1) <= uint32_t(-1)>>1) break;
}
}
#if __CUDA_ARCH__ < 700
if (acquire) {
cta.sync();
if (cta.self() == 0) __threadfence();
}
#endif
barEpoch += 1;
}
cta.sync();
}
__device__ void endLL(ncclCoopCta cta) {
if (__builtin_expect(llEpoch >= -2u, false)) {
cta.sync();
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch);
int epochSize = ncclSymLLEpochSize(nRanks);
#pragma unroll 4
for (int i=cta.self(); i*16 < epochSize; i += cta.count()) {
buf[i] = uint4{0, 0, 0, 0};
}
}
cta.sync();
llEpoch += (llEpoch == -1u) ? 3 : 1;
}
template<typename T>
__device__ void sendLL(int peer, int slot, T val) {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* buf = ncclSymDevBase_getLLBuf(peerPtr(peer, base), nRanks, block, llEpoch) + slot;
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
}
template<typename T>
__device__ void bcastLL(int slot, T val) {
if (flags & ncclSymPrims_UseMultimem) {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* bufmc = ncclSymDevBase_getLLBuf(multimemPtr(base), nRanks, block, llEpoch) + slot;
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(bufmc + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
} else {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* buf0 = ncclSymDevBase_getLLBuf(peerPtr(0, base), nRanks, block, llEpoch) + slot;
int dr = 0;
int r = rank;
#pragma unroll 1
for (; dr+8 <= nRanks; dr += 8) {
#pragma unroll
for (int ur=0; ur < 8; ur++) {
uint4* buf = add4G(buf0, r*stride4G);
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
r += 1;
if (r == nRanks) r = 0;
}
}
#pragma unroll
for (int ur=0; ur < 8; ur++, dr++) {
if (dr == nRanks) break;
uint4* buf = add4G(buf0, r*stride4G);
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
r += 1;
if (r == nRanks) r = 0;
}
}
}
template<int nSlotsMin, int nSlotsMax, typename T>
__device__ void recvLL(int slot0, int nSlots, int stride, T(&elts)[nSlotsMax]) {
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0;
uint4 tmp[nSlotsMax][divUp(sizeof(T),8)];
//int spins=0;
while (true) {
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
if (u < nSlotsMin || u < nSlots) {
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(tmp[u][v].x), "=r"(tmp[u][v].y), "=r"(tmp[u][v].z), "=r"(tmp[u][v].w) : "l"(buf + u*stride + v*ncclSymLLMaxSlots(sizeof(T))));
}
}
}
bool okAll = true;
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
if (u < nSlotsMin || u < nSlots) {
bool ok = tmp[u][v].y == llEpoch &&
tmp[u][v].w == llEpoch;
okAll &= ok;
}
}
}
if (__builtin_expect(okAll, true)) break;
//if (spins++ == 10<<20) spins=0;
}
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
if (nSlotsMin <= u && u == nSlots) break;
union { T val; uint32_t u32[divUp(sizeof(T),8)][2]; };
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
u32[v][0] = tmp[u][v].x;
u32[v][1] = tmp[u][v].z;
}
elts[u] = val;
}
}
template<typename Pack, typename T, typename Red, int Unroll=8>
__device__ Pack recvReduceLL(int slot, int stride, Red red) {
using Acc = typename Red::EltType;
using AccPack = BytePack<sizeof(Pack)*sizeof(Acc)/sizeof(T)>;
AccPack acc;
bool first = true;
int r = 0;
#pragma unroll 1
for (; r+Unroll <= nRanks; r += Unroll) {
Pack got[Unroll];
this->template recvLL</*Min=*/Unroll>(slot + r*stride, Unroll, stride, got);
AccPack acc0 = applyCast<T, Acc>(got[0]);
acc = first ? acc0 : applyReduce(red, acc, acc0);
first = false;
#pragma unroll
for (int i=1; i < Unroll; i++) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
}
if (r < nRanks) {
Pack got[Unroll];
this->template recvLL</*Min=*/1>(slot + r*stride, nRanks-r, stride, got);
AccPack acc0 = applyCast<T, Acc>(got[0]);
acc = first ? acc0 : applyReduce(red, acc, acc0);
#pragma unroll
for (int i=1; i < Unroll-1; i++) {
if (r+i < nRanks) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
}
}
return applyCast<Acc, T>(acc);
}
template<typename T>
__device__ T recvLL(int slot) {
T one[1];
this->template recvLL<1, 1, T>(slot, 1, 0, one);
return one[0];
}
template<typename Coop, typename T>
__device__ void coopRecvLL(Coop coop, int slot0, int nSlots, T* dst) {
int me = coop.self();
if (me < nSlots) {
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0 + me;
uint4 got[divUp(sizeof(T), 8)];
//int spins=0;
#pragma unroll 1
while (true) {
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(got[u].x), "=r"(got[u].y), "=r"(got[u].z), "=r"(got[u].w) : "l"(buf + u*ncclSymLLMaxSlots(sizeof(T))));
}
bool ok = true;
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
ok &= got[u].y == llEpoch;
ok &= got[u].w == llEpoch;
}
if (__builtin_expect(ok, true)) break;
//if (++spins == 10<<20) { spins=0; printf("r=%d LL spin @ ix=%d got=%d want=%d\n", rank, slot0+me, got[0].y, llEpoch); }
}
union { T val; uint32_t u32[divUp(sizeof(T), 8)][2]; };
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
u32[u][0] = got[u].x;
u32[u][1] = got[u].z;
}
dst[slot0 + me] = val;
}
}
};
}
template<template<typename> typename Red, typename T, bool nvls>
struct ncclSymAccumType { using Type = T; };
// Only Red's whose opArg is invariant w.r.t. the datatype can have a different
// accumulator type. At the moment this excludes integer min/max, sumpostdiv,
// and premulsum.
template<> struct ncclSymAccumType<FuncSum, __half, false> { using Type = float; };
#if defined(__CUDA_BF16_TYPES_EXIST__)
template<> struct ncclSymAccumType<FuncSum, __nv_bfloat16, false> { using Type = float; };
#endif
#if defined(__CUDA_FP8_TYPES_EXIST__)
template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e4m3, false> { using Type = float; };
template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e5m2, false> { using Type = float; };
#endif
#endif
+387
Просмотреть файл
@@ -0,0 +1,387 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
static __device__ void reduceDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, char* inputRank0, char* outputHere, int32_t nIters
) {
using Pack = BytePack<BytePerPack>;
using Acc = typename Red::EltType;
using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outHere = (Pack*)outputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack acc0[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
AccPack acc1[UnrollPacks];
int r = rank+1;
if (r == nRanks) r = 0;
{ Pack tmp1[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, applyCast<T, Acc>(acc0[u]), applyCast<T, Acc>(tmp1[u]));
}
}
r += 1;
if (r == nRanks) r = 0;
int dr = 2;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
Pack tmp1[UnrollPeers][UnrollPacks];
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
r += 1;
if (r == nRanks) r = 0;
}
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, acc1[u], applyCast<T, Acc>(tmp1[ur][u]));
}
}
}
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) outHere[u*WARP_SIZE] = acc0[u];
inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename Red, typename T>
static __device__ void reduceEnds(
ncclSymPrims& prim, int tn, int t, Red red,
T* inputRank0, T* outputHere, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
using Acc = typename Red::EltType;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
BytePack<sizeof(T)>* outHere = (BytePack<sizeof(T)>*)outputHere;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
BytePack<sizeof(Acc)> acc1;
BytePack<sizeof(T)> tmp[UnrollPeers];
int dr = 1;
int r = rank+1;
if (nRanks == r) r = 0;
bool first = true;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j = 0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
tmp[u] = *add4G(inpRank0+elt, r*stride4G);
r += 1;
if (r == nRanks) r = 0;
}
if (first) {
first = false;
acc1 = applyCast<T, Acc>(acc0);
}
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
acc1 = applyReduce(red, acc1, applyCast<T, Acc>(tmp[u]));
}
}
}
acc0 = applyCast<Acc, T>(acc1);
outHere[elt] = acc0;
}
}
template<typename Red, typename T>
static __device__ void reduce(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, T* input, T* output, size_t nElts
) {
int nRanks = prim.nRanks;
int nBlocks = prim.nBlocks;
// Mpve input to rank=0
input = prim.peerPtr(0, input);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
uint32_t alignment = uint32_t(inputUptr - outputUptr);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16u - inputUptr)%16u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if (alignment%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
reduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && alignment%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
reduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
reduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
// Round robin warps over blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
reduce(prim, tn, t, /*waitNeeded=*/true, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename Red, typename T>
static __device__ void reduceMultimem(
ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
) {
// Mpve input to multimem
input = prim.multimemPtr(input);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
constexpr int BytePerPack = LoadMultimem_BigPackSize<Red>::BigPackSize;
uint32_t nPreBytes = (BytePerPack - inputUptr)%BytePerPack;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if (sizeof(T) == BytePerPack || (inputUptr-outputUptr)%BytePerPack == 0) {
constexpr int UnrollPacks = 8*(16/BytePerPack);
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
uint32_t nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = applyLoadMultimem<Red, BytePerPack>(red, inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
*reinterpret_cast<BytePack<BytePerPack>*>(outputUptr + cursor + u*WARP_SIZE*BytePerPack) = tmp[u];
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Get the prefix+suffix element one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
*reinterpret_cast<BytePack<sizeof(T)>*>(outputUptr + cursor) = val;
cursor += tn*sizeof(T);
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
// Round robin warps over blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
reduceMultimem(prim, tn, t, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
// T is user type, EltType is the most aligned type
template<typename T, typename Red, typename EltType>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL_body(
ncclSymPrims &prim, Red red, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts) {
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(EltType);
int nRanks = prim.nRanks;
int rank = prim.rank;
int t = threadIdx.x;
int tn = ncclSymMaxThreads;
ncclCoopCta cta;
#pragma unroll 1
while (0 < nElts) {
int nIterPacks = min(nPacks, tn);
int tn_div_nPacks = tn/nIterPacks;
int tn_mod_nPacks = tn%nIterPacks;
int peer = t/nIterPacks;
int pack = t%nIterPacks;
#pragma unroll 1
for (int i = t; i < nRanks*nIterPacks; i += tn) {
Pack got = loadPack<Pack>(input + peer*nStrideElts, pack*EltPerPack, nElts);
prim.sendLL(peer, rank*nIterPacks + pack, got);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
if (t < nIterPacks) {
Pack got = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack(output, t*EltPerPack, nElts, got);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(T);
int nAllElts = args->nElts;
int nAllPacks = divUp(nAllElts, EltPerPack);
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nAllPacks, prim.nBlocks, prim.nBlocks_rcp32);
int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
int nPacks = blockPackEnd - blockPackBegin;
int nElts = nAllElts - blockPackBegin*EltPerPack;
nElts = min(nElts, nPacks*EltPerPack);
T* input = (T*)args->input + blockPackBegin*EltPerPack;
T* output = (T*)args->output + blockPackBegin*EltPerPack;
uint32_t lowBits = args->nElts*sizeof(T);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
if (__builtin_expect(lowBits%8 == 0, true)) {
ncclSymRun_ReduceScatter_LL_body<T>(prim, red, (Pack*)input, (Pack*)output, nPacks, nPacks, nAllElts/EltPerPack);
} else {
ncclSymRun_ReduceScatter_LL_body<T>(prim, red, input, output, nElts, nPacks, nAllElts);
}
}
+252 -136
Просмотреть файл
@@ -13,6 +13,7 @@
#include "cudawrap.h"
#include "profiler.h"
#include "transport.h"
#include "register_inline.h"
#include <cstring> // std::memcpy
#include <cinttypes> // PRIx64
@@ -28,34 +29,41 @@ ncclResult_t ncclInitKernelsForDevice(int cudaArch, int maxSharedMem, size_t* ma
int carveout = ncclParamL1SharedMemoryCarveout();
int ncclMaxSharedMem = ncclShmemDynamicSize(cudaArch);
for (int k=0; k < ncclDevKernelCount; k++) {
void* fn = ncclDevKernelList[k];
cudaFuncAttributes attr = {0};
if (fn == nullptr) continue;
for (int sym=0; sym <= 1; sym++) {
int kcount = sym==0 ? ncclDevKernelCount : ncclSymKernelCount;
void* const* kptrs = sym==0 ? ncclDevKernelList : ncclSymKernelList;
for (int k=0; k < kcount; k++) {
void* fn = kptrs[k];
cudaFuncAttributes attr = {0};
if (fn == nullptr) continue;
CUDACHECKGOTO(cudaFuncGetAttributes(&attr, fn), result, ignore0);
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributePreferredSharedMemoryCarveout, carveout),
result, ignore1);
ignore1:;
}
if (ncclMaxSharedMem != 0) {
int sharedMemSize = ncclMaxSharedMem;
if (sharedMemSize > (maxSharedMem-attr.sharedSizeBytes)) {
WARN("cudaArch %d ncclMaxSharedMem %d exceeds device/fn maxSharedMem %zu",
cudaArch, sharedMemSize, maxSharedMem-attr.sharedSizeBytes);
return ncclSystemError;
cudaError_t errcode = cudaFuncGetAttributes(&attr, fn);
if (errcode == cudaErrorNoKernelImageForDevice) continue;
CUDACHECKGOTO(errcode, result, ignore0);
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributeMaxDynamicSharedMemorySize, sharedMemSize),
result, next_kernel);
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributePreferredSharedMemoryCarveout, carveout),
result, ignore1);
ignore1:;
}
if (ncclMaxSharedMem != 0) {
int sharedMemSize = ncclMaxSharedMem;
if (sharedMemSize > (maxSharedMem-attr.sharedSizeBytes)) {
WARN("cudaArch %d ncclMaxSharedMem %d exceeds device/fn maxSharedMem %zu",
cudaArch, sharedMemSize, maxSharedMem-attr.sharedSizeBytes);
return ncclSystemError;
}
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributeMaxDynamicSharedMemorySize, sharedMemSize),
result, next_kernel);
}
next_kernel:;
}
next_kernel:;
}
return result;
}
@@ -258,8 +266,8 @@ static bool testBudget(
ncclResult_t ncclTasksRegAndEnqueue(struct ncclComm* comm) {
struct ncclKernelPlanner* planner = &comm->planner;
if (planner->isSymColl) return ncclSuccess;
struct ncclTaskColl *task;
task = ncclIntruQueueHead(&planner->collTaskQueue);
while (task != nullptr) {
// Build a ncclDevWorkColl[Reg?] struct for each task.
@@ -331,6 +339,38 @@ ncclResult_t ncclPrepareTasks(struct ncclComm* comm, bool* algoNeedConnect, bool
int fnOpTyIndices[ncclNumFuncs*ncclNumDevRedOps*ncclNumTypes];
int fnOpTyCount = 0;
if (comm->nNodes == 1 && planner->nTasksColl == 1 && planner->nTasksP2p == 0) {
void* sendSymPtr;
void* recvSymPtr;
struct ncclReg* sendReg;
struct ncclReg* recvReg;
size_t size = task->count*ncclTypeSize(task->datatype);
NCCLCHECK(ncclRegFindSymmetric(comm, task->sendbuff, size, &sendSymPtr, &sendReg));
NCCLCHECK(ncclRegFindSymmetric(comm, task->recvbuff, size, &recvSymPtr, &recvReg));
bool implemented = ncclSymImplemented(task->func, task->opDev.op, task->datatype);
if (sendReg && recvReg && (sendReg->winFlags & recvReg->winFlags & NCCL_WIN_COLL_SYMMETRIC) && implemented) {
enum ncclSymKernelId kernel;
int nChannels, nWarps;
float estTimeUs = 1.e18;
NCCLCHECK(ncclSymPickKernel(comm, task->func, task->opDev.op, task->datatype, task->count, &estTimeUs, &kernel, &nChannels, &nWarps));
// We should only use symmetric kernel if it beats the asymmetric kernel. But the
// perf model accuracy from asymmetric kernels is too inaccurate and reports too high
// of a bandwidth. For now just always use symmetric if available.
if (kernel != ncclSymKernelId_Count) {
task->sendbuff = sendSymPtr;
task->recvbuff = recvSymPtr;
task->devFuncId = (int)kernel;
task->nMaxChannels = nChannels;
task->nWarps = nWarps;
ncclIntruQueueEnqueue(&planner->collTaskQueue, task);
planner->isSymColl = true;
return ncclSuccess;
}
}
}
// Walk the size sorted tasks, binning them by (fn,op,ty).
while (task != nullptr) {
struct ncclTaskColl* next = task->next;
@@ -603,6 +643,10 @@ static ncclResult_t scheduleCollTasksToPlan(
(countHi != 0 ? countHi : countLo) -= cells*elementsPerCell - task->count;
nChannels = (countLo!=0 ? 1 : 0) + nMidChannels + (cellsHi!=0 ? 1 : 0);
// Update number of channels propagated to the profiler
task->nChannels = (uint8_t)nChannels;
// Ensure room for worst case of one new batch per channel
if (!testBudget(budget, plan->nWorkBatches + nChannels, plan->workBytes + workNode->size)) {
return ncclSuccess;
@@ -860,6 +904,8 @@ static ncclResult_t addP2pToPlan(
partSize = divUp(bytes[dir], nChannels[dir]);
}
}
// Update number of channels propagated to the profiler
if (p2pTasks[dir]) p2pTasks[dir]->nChannels = nChannels[dir];
}
struct ncclWorkList* workNode = ncclMemoryStackAllocInlineArray<ncclWorkList, ncclDevWorkP2p>(&comm->memScoped, 1);
@@ -1052,47 +1098,17 @@ static ncclResult_t scheduleP2pTasksToPlan(
}
// Spin until its safe to increase comm->workFifoProduced to desiredProduced.
static void waitWorkFifoAvailable(struct ncclComm* comm, uint32_t desiredProduced) {
bool hasRoom = (desiredProduced - comm->workFifoConsumedLeast) <= comm->workFifoBytes;
if (hasRoom) return;
while (true) {
// We have to poll for notifications from device.
uint32_t* consumedLive = comm->workFifoConsumed;
uint32_t consumed[MAXCHANNELS];
for (int c=0; c < MAXCHANNELS; c++) {
consumed[c] = __atomic_load_n(&consumedLive[c], __ATOMIC_RELAXED);
static ncclResult_t waitWorkFifoAvailable(struct ncclComm* comm, uint32_t desiredProduced) {
bool hasRoom = (desiredProduced - comm->workFifoConsumed) <= comm->workFifoBytes;
if (!hasRoom) {
while (true) {
NCCLCHECK(ncclCommPollEventCallbacks(comm, /*waitSome=*/true));
hasRoom = (desiredProduced - comm->workFifoConsumed) <= comm->workFifoBytes;
if (hasRoom) break;
sched_yield();
}
// Compiler-only fence to prevent fusion of loops to encourage dense loads.
__atomic_signal_fence(__ATOMIC_SEQ_CST);
uint32_t produced = comm->workFifoProduced;
uint32_t consumedLeast = produced;
for (int c=0; c < MAXCHANNELS; c++) {
// consumedLeast is min over all non-quiesced channels
if (consumed[c] != comm->channels[c].workFifoProduced) {
if ((produced - consumedLeast) < (produced - consumed[c])) {
consumedLeast = consumed[c];
}
}
}
// Compiler only fence to prevent fusion of loops to encourage dense stores.
__atomic_signal_fence(__ATOMIC_SEQ_CST);
for (int c=0; c < MAXCHANNELS; c++) {
// Advance counter on quiesced channels so they don't lag behind
// too far where they could get lost in 32-bit wraparound.
if (consumed[c] == comm->channels[c].workFifoProduced) {
comm->channels[c].workFifoProduced = consumedLeast;
__atomic_store_n(&consumedLive[c], consumedLeast, __ATOMIC_RELAXED);
}
}
comm->workFifoConsumedLeast = consumedLeast;
hasRoom = (desiredProduced - comm->workFifoConsumedLeast) <= comm->workFifoBytes;
if (hasRoom) break;
sched_yield();
}
return ncclSuccess;
}
namespace {
@@ -1106,11 +1122,14 @@ namespace {
struct uploadWork_cleanup_t* me = (struct uploadWork_cleanup_t*)cb;
free(me->hostBuf);
CUDACHECK(cudaEventDestroy(me->base.event));
free(me);
return ncclSuccess;
}
}
static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* plan) {
if (plan->isSymColl) return ncclSuccess;
size_t workBytes = plan->workBytes;
size_t batchBytes = plan->nWorkBatches*sizeof(struct ncclDevWorkBatch);
void* fifoBufHost;
@@ -1127,7 +1146,7 @@ static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* pla
fifoBufHost = comm->workFifoBuf;
fifoCursor = comm->workFifoProduced;
fifoMask = comm->workFifoBytes-1;
waitWorkFifoAvailable(comm, fifoCursor + workBytes);
NCCLCHECK(waitWorkFifoAvailable(comm, fifoCursor + workBytes));
plan->kernelArgs->workBuf = comm->workFifoBufDev;
break;
case ncclDevWorkStorageTypePersistent:
@@ -1208,7 +1227,7 @@ static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* pla
ncclIntruQueueEnqueue(&comm->eventCallbackQueue, (struct ncclCommEventCallback *)cleanup);
NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->deviceStream, /*concurrent=*/false), result, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm), result, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm, /*waitSome=*/false), result, fail);
finish_scope:
if (mode != cudaStreamCaptureModeRelaxed) (void)cudaThreadExchangeStreamCaptureMode(&mode);
@@ -1226,6 +1245,7 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
uint64_t collOpCount = comm->sharedRes->collOpCount;
uint64_t p2pOpBump[MAXCHANNELS] = {/*0...*/};
// Advance comm's collOpCount by number of colls in this plan.
int hasp2p = 0;
comm->sharedRes->collOpCount += plan->collOpCount;
comm->collOpCount += plan->collOpCount;
@@ -1244,6 +1264,7 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
// remember last value to compute max.
p2pOpBump[op->channelId] = (oldId>>1) + 1; // +1 to ensure next plan doesn't collide
op->opCount = (comm->sharedRes->p2pOpCount[op->channelId]<<1) + oldId;
hasp2p = 1;
} else { // coll
op->opCount = (collOpCount<<1) + oldId;
}
@@ -1253,9 +1274,11 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
op = op->enqNext;
}
for (int c=0; c < MAXCHANNELS; c++) {
// Advance channel's p2pOpCount by number of p2p's in this plan channel.
comm->sharedRes->p2pOpCount[c] += p2pOpBump[c];
if (hasp2p) {
for (int c=0; c < MAXCHANNELS; c++) {
// Advance channel's p2pOpCount by number of p2p's in this plan channel.
comm->sharedRes->p2pOpCount[c] += p2pOpBump[c];
}
}
return ncclSuccess;
}
@@ -1263,8 +1286,10 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
static ncclResult_t hostStreamPlanTask(struct ncclComm* comm, struct ncclKernelPlan* plan) {
NCCLCHECK(ncclProfilerStartGroupEvent(plan));
NCCLCHECK(ncclProfilerStartTaskEvents(plan));
NCCLCHECK(uploadProxyOps(comm, plan));
NCCLCHECK(ncclProxyStart(comm));
if (ncclIntruQueueHead(&plan->proxyOpQueue)) {
NCCLCHECK(uploadProxyOps(comm, plan));
NCCLCHECK(ncclProxyStart(comm));
}
NCCLCHECK(ncclProfilerStopTaskEvents(plan));
NCCLCHECK(ncclProfilerStopGroupEvent(plan));
if (!plan->persistent) {
@@ -1281,7 +1306,6 @@ static void CUDART_CB hostStreamPlanCallback(void *plan_) {
if (result != ncclSuccess) {
WARN("hostStreamPlanCallback() failed : %s", ncclGetErrorString(result));
}
if (!plan->persistent) ncclAtomicRefCountDecrement(&plan->comm->sharedRes->noncapturedRefs);
return;
}
@@ -1357,9 +1381,8 @@ namespace {
static ncclResult_t getImplicitOrder(enum ncclImplicitOrder *mode, bool capturing, int driver=-1) {
if (ncclParamLaunchOrderImplicit()) {
// Due to an unresolved bug in CUDA ncclImplicitOrderLaunch is not supported in graphs
if (capturing) { *mode = ncclImplicitOrderSerial; return ncclSuccess; }
if (driver < 0) { NCCLCHECK(ncclCudaDriverVersion(&driver)); }
if (capturing && driver < 12090) { *mode = ncclImplicitOrderSerial; return ncclSuccess; }
*mode = 12030 <= std::min<int>(CUDART_VERSION, driver) ? ncclImplicitOrderLaunch : ncclImplicitOrderSerial;
return ncclSuccess;
}
@@ -1386,26 +1409,51 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
plan->workStorageType = persistent ? ncclDevWorkStorageTypePersistent
: ncclDevWorkStorageTypeFifo;
struct ncclKernelPlanBudget budget;
budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
// Non-persistent kernels fill up at most half of our fifo per kernel.
budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
if (planner->isSymColl) {
plan->workStorageType = ncclDevWorkStorageTypeArgs;
// Drain coll tasks first. This is essential since we partition tasks based
// on the work budget and p2p work isn't collective. If we were to drain p2p
// first, the place where we cut the kernel could vary by rank which would
// cause the "shortest channel first" channel picker to have divergent results.
if (planner->nTasksColl != 0) {
NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
}
// And only drain p2p tasks once colls are depleted.
if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
}
finishPlan(comm, plan);
if (plan->workBytes != 0) {
struct ncclTaskColl* task = ncclIntruQueueHead(&planner->collTaskQueue);
plan->isSymColl = true;
plan->kernelFn = ncclSymGetKernelPtr((ncclSymKernelId)task->devFuncId, task->opDev.op, task->datatype);
plan->threadPerBlock = task->nWarps*WARP_SIZE;
plan->channelMask = uint64_t(-1) >> (64-task->nMaxChannels);
plan->kernelArgsSize = sizeof(struct ncclSymDevArgs);
plan->kernelSymArgs = ncclMemoryStackAlloc<struct ncclSymDevArgs>(&comm->memScoped);
plan->kernelSymArgs->comm = comm->symDevComm;
plan->kernelSymArgs->rootRank = task->root;
plan->kernelSymArgs->redOpArg = task->opDev.scalarArg;
plan->kernelSymArgs->nElts = task->count;
plan->kernelSymArgs->input = (char*)task->sendbuff;
plan->kernelSymArgs->output = (char*)task->recvbuff;
planner->nTasksColl -= 1;
ncclIntruQueueEnqueue(&planner->planQueue, plan);
INFO(NCCL_TUNING, "%s [Symmetric]: %ld Bytes -> Kernel %s nchannels %d nthreads %d",
ncclFuncToString(task->func), task->count * ncclTypeSize(task->datatype), ncclSymKernelIdToString(task->devFuncId), task->nMaxChannels, plan->threadPerBlock);
nPlans += 1;
} else {
struct ncclKernelPlanBudget budget;
budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
// Non-persistent kernels fill up at most half of our fifo per kernel.
budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
// Drain coll tasks first. This is essential since we partition tasks based
// on the work budget and p2p work isn't collective. If we were to drain p2p
// first, the place where we cut the kernel could vary by rank which would
// cause the "shortest channel first" channel picker to have divergent results.
if (planner->nTasksColl != 0) {
NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
}
// And only drain p2p tasks once colls are depleted.
if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
}
finishPlan(comm, plan);
if (plan->workBytes != 0) {
ncclIntruQueueEnqueue(&planner->planQueue, plan);
nPlans += 1;
}
}
} while (planner->nTasksColl + planner->nTasksP2p != 0);
@@ -1428,6 +1476,7 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
bool capturing = ncclCudaGraphValid(planner->capturingGraph);
enum ncclImplicitOrder implicitOrder;
cudaError_t status = cudaSuccess;
NCCLCHECKGOTO(getImplicitOrder(&implicitOrder, capturing), result, failure);
if (implicitOrder != ncclImplicitOrderNone) {
@@ -1439,7 +1488,8 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
NCCLCHECKGOTO(ncclStreamWaitStream(launchStream, launchOrder, comm->sharedRes->scratchEvent), result, failure);
}
if (persistent || comm->sharedRes->persistentRefs != 0 || ncclCudaLaunchBlocking || __atomic_load_n(&comm->sharedRes->noncapturedRefs, __ATOMIC_ACQUIRE)) {
if (!persistent && comm->sharedRes->persistentRefs) status = cudaEventQuery(comm->sharedRes->hostStream.serialEvent);
if (persistent || ncclCudaLaunchBlocking || status == cudaErrorNotReady) {
// We have to launch host tasks to push proxy args. We are careful to only
// do this if necessary since host tasks impose a high performance cost in CUDA.
bool acquired = false;
@@ -1450,7 +1500,6 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
acquired = true;
NCCLCHECKGOTO(ncclStrongStreamAcquire(planner->capturingGraph, &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), result, failure);
}
if (!persistent) ncclAtomicRefCountIncrement(&comm->sharedRes->noncapturedRefs);
plan->isHostCbEnq = true;
CUDACHECKGOTO(cudaLaunchHostFunc(hostStream, hostStreamPlanCallback, plan), result, failure);
}
@@ -1485,6 +1534,8 @@ ncclResult_t ncclLaunchKernelBefore_NoUncapturedCuda(struct ncclComm* comm, stru
NCCL_PARAM(MemSyncDomain, "MEM_SYNC_DOMAIN", cudaLaunchMemSyncDomainRemote);
#endif
NCCL_PARAM(NvlinkUtilCentricSchedEnable, "NVLINK_UTIL_CENTRIC_SCHED_ENABLE", 0);
ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan) {
ncclResult_t ret = ncclSuccess;
struct ncclKernelPlanner* planner = &comm->planner;
@@ -1512,7 +1563,7 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
unsigned int clusterSize = (compCap >= 90) ? comm->config.cgaClusterSize : 0;
CUlaunchConfig launchConfig = {0};
CUlaunchAttribute launchAttrs[4] = {};
CUlaunchAttribute launchAttrs[6] = {};
int attrs = 0;
/* Cooperative Group Array (CGA)
* On sm90 and later we have an extra level of hierarchy where we
@@ -1549,6 +1600,18 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
launchAttrs[attrs].value.launchCompletionEvent.flags = 0;
attrs++;
}
if (comm->planner.isSymColl && compCap >= 90 && driverVersion >= 12030) {
launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
launchAttrs[attrs].value.programmaticStreamSerializationAllowed = 1;
attrs++;
}
#endif
#if CUDART_VERSION >= 13000
if (compCap >= 90 && driverVersion >= 13000) {
launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_NVLINK_UTIL_CENTRIC_SCHEDULING;
launchAttrs[attrs].value.nvlinkUtilCentricScheduling = ncclParamNvlinkUtilCentricSchedEnable();
attrs++;
}
#endif
launchConfig.gridDimX = grid.x;
launchConfig.gridDimY = grid.y;
@@ -1560,7 +1623,6 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
launchConfig.attrs = launchAttrs;
launchConfig.numAttrs = attrs;
launchConfig.hStream = launchStream;
CUCHECKGOTO(cuLaunchKernelEx(&launchConfig, fn, nullptr, extra), ret, do_return);
#endif
} else {
@@ -1573,21 +1635,30 @@ do_return:
}
ncclResult_t ncclLaunchKernelAfter_NoCuda(struct ncclComm* comm, struct ncclKernelPlan* plan) {
if (!(plan->persistent || ncclCudaLaunchBlocking || plan->isHostCbEnq)) {
// We are not using the host stream for proxy ops and reclaimation submission.
if (!plan->isHostCbEnq) {
// we are not using the host stream for proxy ops and reclaimation submission, call
// hostStreamPlanTask directly
NCCLCHECK(hostStreamPlanTask(comm, plan));
} else {
// We are using the host stream for proxy ops and reclaimation submission.
// Only plans with proxy ops have a callback pushed by ncclLaunchPrepare.
// Since non-persistent plans also require reclaimation, we have to do it
// here.
if (!plan->persistent && !plan->hasProxyOps) {
ncclIntruQueueMpscEnqueue(&comm->callbackQueue, &plan->reclaimer);
}
}
return ncclSuccess;
}
namespace {
struct KernelFinishCallback {
struct ncclCommEventCallback base;
uint32_t workFifoConsumed;
};
ncclResult_t KernelFinishCallback_fn(
struct ncclComm* comm, struct ncclCommEventCallback* cb
) {
struct KernelFinishCallback* me = (struct KernelFinishCallback*)cb;
comm->workFifoConsumed = me->workFifoConsumed;
CUDACHECK(cudaEventDestroy(me->base.event));
free(me);
return ncclSuccess;
}
}
ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
struct ncclKernelPlanner* planner = &comm->planner;
if (!ncclIntruQueueEmpty(&planner->planQueue)) {
@@ -1597,7 +1668,21 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
cudaStream_t launchStream = planner->streams->stream; // First user stream gets launch
cudaStream_t deviceStream, launchOrder;
CUDACHECK(cudaEventRecord(comm->sharedRes->scratchEvent, launchStream));
cudaEvent_t finishedEvent = comm->sharedRes->scratchEvent;
CUDACHECK(cudaEventRecord(finishedEvent, launchStream));
if (comm->workFifoProduced - comm->workFifoProducedLastRecorded > comm->workFifoBytes/8) {
comm->workFifoProducedLastRecorded = comm->workFifoProduced;
struct KernelFinishCallback* cb;
NCCLCHECK(ncclCalloc(&cb, 1));
cb->base.event = finishedEvent;
cb->base.fn = KernelFinishCallback_fn;
cb->workFifoConsumed = comm->workFifoProduced;
ncclIntruQueueEnqueue(&comm->eventCallbackQueue, &cb->base);
// We just stole scratchEvent so must create a new one.
CUDACHECK(cudaEventCreateWithFlags(&comm->sharedRes->scratchEvent, cudaEventDisableTiming));
}
// deviceStream waits on userStream[0]
NCCLCHECK(ncclStrongStreamAcquiredWorkStream(planner->capturingGraph, &comm->sharedRes->deviceStream, /*concurrent=*/false, &deviceStream));
@@ -1606,13 +1691,13 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
// on launchStream as a fast-forward. When building CUDA graphs fast forwards should
// be handled specially so as not to create graphs with a blowup in the number of edges.
// So we could do this:
// CUDACHECK(cudaStreamWaitEvent(deviceStream, comm->sharedRes->scratchEvent, 0));
// CUDACHECK(cudaStreamWaitEvent(deviceStream, finishedEvent, 0));
// But instead we do:
NCCLCHECK(ncclStreamAdvanceToEvent(planner->capturingGraph, deviceStream, comm->sharedRes->scratchEvent));
NCCLCHECK(ncclStreamAdvanceToEvent(planner->capturingGraph, deviceStream, finishedEvent));
// Each userStream[i] waits on userStream[0]
for (struct ncclCudaStreamList* l=planner->streams->next; l != nullptr; l = l->next) {
CUDACHECK(cudaStreamWaitEvent(l->stream, comm->sharedRes->scratchEvent, 0));
CUDACHECK(cudaStreamWaitEvent(l->stream, finishedEvent, 0));
}
bool capturing = ncclCudaGraphValid(planner->capturingGraph);
enum ncclImplicitOrder implicitOrder;
@@ -1623,7 +1708,7 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
// Incorporate launch event into per-device (context) launch order.
NCCLCHECK(ncclStrongStreamAcquiredWorkStream(planner->capturingGraph, &comm->context->launchOrder, concurrent, &launchOrder));
// If we don't have launch events (requires CUDA 12.3) then just use completion event (serialize execution).
CUDACHECK(cudaStreamWaitEvent(launchOrder, implicitOrder == ncclImplicitOrderLaunch ? comm->sharedRes->launchEvent : comm->sharedRes->scratchEvent));
CUDACHECK(cudaStreamWaitEvent(launchOrder, implicitOrder == ncclImplicitOrderLaunch ? comm->sharedRes->launchEvent : finishedEvent));
// Release launchOrder as acquired in ncclLaunchPrepare()
NCCLCHECK(ncclStrongStreamRelease(planner->capturingGraph, &comm->context->launchOrder, concurrent));
}
@@ -1645,7 +1730,7 @@ static inline ncclResult_t getCollNetSupport(
if (info->opDev.op == ncclDevPreMulSum || info->opDev.op == ncclDevSumPostDiv) {
netOp = ncclSum;
}
*collNetSupport = comm->collNetSupport;
*collNetSupport = comm->config.collnetEnable;
switch (info->func) {
case ncclFuncAllReduce:
case ncclFuncReduce:
@@ -1683,10 +1768,8 @@ static ncclResult_t updateCollCostTable(
if ((a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) && collNetSupport != 1) continue;
// CollNetDirect is only supported for up to 8 local GPUs
if (a == NCCL_ALGO_COLLNET_DIRECT && comm->maxLocalRanks > NCCL_MAX_DIRECT_ARITY+1) continue;
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && nvlsSupport != 1 && info->func != ncclFuncAllGather) continue;
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && (!nvlsSupport || (info->func != ncclFuncAllReduce && comm->localRanks > NCCL_MAX_NVLS_ARITY))) continue;
if (a == NCCL_ALGO_NVLS && collNetSupport != 1 && comm->nNodes > 1) continue;
/* now we only support single-node NVLS allgather and reducescatter */
if (a == NCCL_ALGO_NVLS && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter) && (comm->nNodes > 1 || comm->nRanks > NCCL_MAX_NVLS_ARITY)) continue;
/* Tree reduceScatter doesn't support scaling yet */
if (a == NCCL_ALGO_PAT && info->func == ncclFuncReduceScatter
&& (info->opDev.op == ncclDevPreMulSum || info->opDev.op == ncclDevSumPostDiv)) continue;
@@ -1801,7 +1884,14 @@ static ncclResult_t getAlgoInfo(
struct ncclComm* comm, struct ncclTaskColl* info,
int collNetSupport, int nvlsSupport, int numPipeOps, ncclSimInfo_t* simInfo/* = NULL*/
) {
size_t nBytes = ncclTypeSize(info->datatype)*ncclFuncMaxSendRecvCount(info->func, comm->nRanks, info->count);
size_t elementSize = ncclTypeSize(info->datatype);
size_t nBytes = elementSize * ncclFuncMaxSendRecvCount(info->func, comm->nRanks, info->count);
struct ncclReg* regSendBuf = NULL;
struct ncclReg* regRecvBuf = NULL;
int regBuff;
bool isSendValid, isRecvValid;
size_t sendbuffSize = elementSize * ncclFuncSendCount(info->func, comm->nRanks, info->count);
size_t recvbuffSize = elementSize * ncclFuncRecvCount(info->func, comm->nRanks, info->count);
info->algorithm = NCCL_ALGO_UNDEF;
info->protocol = NCCL_PROTO_UNDEF;
int nMaxChannels = 0;
@@ -1809,20 +1899,42 @@ static ncclResult_t getAlgoInfo(
initCollCostTable((float **)collCostTable);
NCCLCHECK(updateCollCostTable(comm, info, nBytes, collNetSupport, nvlsSupport, numPipeOps, (float **)collCostTable));
if (comm->tuner != NULL) {
size_t elementSize = ncclTypeSize(info->datatype);
size_t sendbuffSize = elementSize*ncclFuncSendCount(info->func, comm->nRanks, info->count);
size_t recvbuffSize = elementSize*ncclFuncRecvCount(info->func, comm->nRanks, info->count);
struct ncclReg* regSendBuf;
struct ncclReg* regRecvBuf;
NCCLCHECK(ncclRegFind(comm, info->sendbuff, sendbuffSize, &regSendBuf));
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &regRecvBuf));
int regBuff = ((regSendBuf && regRecvBuf) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister()));
NCCLCHECK(ncclRegLocalIsValid(regSendBuf, &isSendValid));
NCCLCHECK(ncclRegLocalIsValid(regRecvBuf, &isRecvValid));
regBuff = (regSendBuf && regRecvBuf && isSendValid && isRecvValid) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister());
NCCLCHECK(comm->tuner->getCollInfo(
comm->tunerContext, info->func, nBytes,
numPipeOps, (float **)collCostTable, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
regBuff, &nMaxChannels));
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
} else {
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
// NCCL_CTA_POLICY_EFFICIENCY requires user (non-symmetric) buffer registration (currently unsupported with MNNVL)
if (comm->config.CTAPolicy == NCCL_CTA_POLICY_EFFICIENCY && ncclGetEnv("NCCL_ALGO") == NULL && ncclGetEnv("NCCL_PROTO") == NULL && !comm->MNNVL) {
// make algorithm selection based on buffer registration
// there can be other specialized policies for algorithms and protocols pickup in the future
NCCLCHECK(ncclRegFind(comm, info->sendbuff, sendbuffSize, &regSendBuf));
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &regRecvBuf));
NCCLCHECK(ncclRegLocalIsValid(regSendBuf, &isSendValid));
NCCLCHECK(ncclRegLocalIsValid(regRecvBuf, &isRecvValid));
regBuff = (regSendBuf && regRecvBuf && isSendValid && isRecvValid) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister());
if (regBuff && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter)) {
if ((comm->nNodes > 1 && collNetSupport && nvlsSupport) || (comm->nNodes == 1 && nvlsSupport)) {
int recChannels;
NCCLCHECK(ncclNvlsRegResourcesQuery(comm, info, &recChannels));
if (recChannels <= info->nMaxChannels) {
info->algorithm = NCCL_ALGO_NVLS;
info->protocol = NCCL_PROTO_SIMPLE;
info->nMaxChannels = recChannels;
info->nWarps = comm->maxThreads[info->algorithm][info->protocol] / WARP_SIZE;
}
}
}
}
}
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
info->nMaxChannels = nMaxChannels == 0 ? info->nMaxChannels : nMaxChannels;
return ncclSuccess;
}
@@ -1892,16 +2004,20 @@ static ncclResult_t calcCollChunking(
while (nBytes / (nChannels * chunkSize) < comm->channels[0].collnetChain.depth * 8 && chunkSize > 65536) chunkSize /= 2;
while (nBytes / (nChannels * chunkSize) < comm->channels[0].collnetChain.depth && chunkSize > 32768) chunkSize /= 2;
} else if (info->algorithm == NCCL_ALGO_NVLS) {
int maxChunkSize = comm->nvlsChunkSize;
if (comm->nNodes > 1 && comm->bandwidths[ncclFuncAllReduce][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] < 150) maxChunkSize = 32768;
if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
// coverity[overflow_before_widen]
uint64_t concurrentOps = nChannels * comm->channels[0].nvls.nHeads;
if ((nBytes < (64 * (concurrentOps * chunkSize))) && (chunkSize > 65536)) chunkSize = 65536;
if ((nBytes < (8 * (concurrentOps * chunkSize))) && (chunkSize > 32768)) chunkSize = 32768;
if ((nBytes < (2 * (concurrentOps * chunkSize))) && (chunkSize > 16384)) chunkSize = 16384;
if ((info->regBufType & NCCL_NVLS_REG_BUFFER) && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter)) {
chunkSize = comm->buffSizes[NCCL_PROTO_SIMPLE] / NCCL_STEPS;
} else {
int maxChunkSize = comm->nvlsChunkSize;
if (comm->nNodes > 1 && comm->bandwidths[ncclFuncAllReduce][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] < 150) maxChunkSize = 32768;
if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
// coverity[overflow_before_widen]
uint64_t concurrentOps = nChannels * comm->channels[0].nvls.nHeads;
if ((nBytes < (64 * (concurrentOps * chunkSize))) && (chunkSize > 65536)) chunkSize = 65536;
if ((nBytes < (8 * (concurrentOps * chunkSize))) && (chunkSize > 32768)) chunkSize = 32768;
if ((nBytes < (2 * (concurrentOps * chunkSize))) && (chunkSize > 16384)) chunkSize = 16384;
}
} else if (info->algorithm == NCCL_ALGO_NVLS_TREE) {
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
@@ -2045,7 +2161,7 @@ static ncclResult_t calcCollChunking(
proxyOp->reg = 0;
}
if (pattern == ncclPatternCollnetDirect) {
if (pattern == ncclPatternCollnetDirect || pattern == ncclPatternNvls) {
proxyOp->specifics.collnetDirect.nNodes = comm->nNodes;
proxyOp->specifics.collnetDirect.node = comm->node;
if (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter) {
@@ -2168,7 +2284,7 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
bool isSendNotRecv = info->coll == ncclFuncSend;
// Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
ncclGroupCommJoin(info->comm);
ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
struct ncclTaskP2p* p2p = ncclMemoryPoolAlloc<struct ncclTaskP2p>(&comm->memPool_ncclTaskP2p, &comm->memPermanent);
p2p->func = info->coll;
p2p->buff = (void*)info->recvbuff;
@@ -2235,7 +2351,7 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
return ncclSuccess;
} else {
// Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
ncclGroupCommJoin(info->comm);
ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
struct ncclTaskColl* t = ncclMemoryPoolAlloc<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, &comm->memPermanent);
t->func = info->coll;
t->sendbuff = info->sendbuff;
+5 -5
Просмотреть файл
@@ -258,7 +258,7 @@ static ncclResult_t connectNvls(struct ncclComm* comm, int* nvlsHeads, int nHead
channel->nvls.out = -1; // NVLS+SHARP not yet implemented.
channel->nvls.headRank = headRank;
channel->nvls.treeUp = channel->nvls.treeDown[0] = channel->nvls.treeDown[1] = channel->nvls.treeDown[2] = -1;
if (comm->collNetSupport && channel->nvls.headRank != -1) channel->nvls.out = comm->nRanks;
if (comm->config.collnetEnable && channel->nvls.headRank != -1) channel->nvls.out = comm->nRanks;
}
if (comm->nNodes == 1) return ncclSuccess;
@@ -330,7 +330,7 @@ int ncclMinNchannels() {
if (ncclParamMinNrings() != -2) minNchannels = ncclParamMinNrings();
if (ncclParamMinNchannels() != -2) minNchannels = ncclParamMinNchannels();
if (minNchannels > MAXCHANNELS) {
WARN("User asked for a minimum of %d channels, limiting to %d", minNchannels, MAXCHANNELS);
INFO(NCCL_GRAPH|NCCL_ENV, "User asked for a minimum of %d channels, limiting to %d", minNchannels, MAXCHANNELS);
minNchannels = MAXCHANNELS;
}
if (minNchannels < 0) minNchannels = 0;
@@ -346,7 +346,7 @@ int ncclMaxNchannels() {
maxNchannels = std::min(maxNchannels, ncclDevMaxChannelsForArgsBytes(ncclParamWorkArgsBytes()));
if (maxNchannels > MAXCHANNELS) maxNchannels = MAXCHANNELS;
if (maxNchannels < 1) {
WARN("User asked for a maximum of %d channels, setting it to 1", maxNchannels);
INFO(NCCL_GRAPH|NCCL_ENV, "User asked for a maximum of %d channels, setting it to 1", maxNchannels);
maxNchannels = 1;
}
return maxNchannels;
@@ -379,7 +379,7 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
int nNodes = comm->nNodes;
int nChannels = comm->nChannels;
int minHeadNum = INT_MAX;
int shared = parent && parent->nvlsSupport && parent->config.splitShare;
int shared = parent && parent->nvlsSupport && parent->shareResources;
NCCLCHECK(ncclCalloc(&ringRecv, nNodes*MAXCHANNELS));
NCCLCHECKGOTO(ncclCalloc(&ringSend, nNodes*MAXCHANNELS), ret, fail);
NCCLCHECKGOTO(ncclCalloc(&ringPrev, nranks*MAXCHANNELS), ret, fail);
@@ -452,7 +452,7 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
nChannels = comm->nChannels = std::min(MAXCHANNELS,nChannels*2);
// Setup CollNet
if (comm->collNetSupport == 1) {
if (comm->config.collnetEnable) {
struct ncclTopoGraph* collNetChainGraph = graphs[NCCL_ALGO_COLLNET_CHAIN];
// Add more channels to saturate intra-node bandwidth, except the 1 PPN case
if (collNetChainGraph->bwIntra > collNetChainGraph->bwInter && comm->nRanks > comm->nNodes) {
+62 -36
Просмотреть файл
@@ -214,7 +214,7 @@ ncclResult_t ncclGetLevel(int* level, const char* disableEnv, const char* levelE
const char* str = ncclGetEnv(disableEnv);
if (str) {
int disable = strtol(str, NULL, 0);
if (disable == 1) l = 0;
if (disable == 1) l = PATH_LOC;
if (l >= 0) INFO(NCCL_ALL, "%s set by environment to %d", disableEnv, disable);
}
}
@@ -247,7 +247,18 @@ ncclResult_t ncclGetLevel(int* level, const char* disableEnv, const char* levelE
NCCL_PARAM(IgnoreDisabledP2p, "IGNORE_DISABLED_P2P", 0);
int ncclTopoUserP2pLevel = -1;
static int ncclTopoUserP2pLevel = -1; // Initially "uninitialized". When initialized but unset, changes to -2.
// Gets the user-provided value of NCCL_P2P_LEVEL/NCCL_P2P_DISABLE. If the user did not provide any, the value
// of the "level" argument is left unchanged.
ncclResult_t ncclGetUserP2pLevel(int* level) {
if (ncclTopoUserP2pLevel == -1)
NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
if (ncclTopoUserP2pLevel != -2)
*level = ncclTopoUserP2pLevel;
return ncclSuccess;
}
ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* system, int rank1, int rank2,
int* p2p, int *read, int* intermediateRank) {
int mnnvl = 0;
@@ -275,9 +286,9 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
// Get GPUs from topology
int g1, g2;
NCCLCHECK(ncclTopoRankToIndex(system, rank1, &g1));
NCCLCHECK(ncclTopoRankToIndex(system, rank1, &g1, /*showWarn=*/true));
struct ncclTopoNode* gpu1 = system->nodes[GPU].nodes+g1;
if (ncclTopoRankToIndex(system, rank2, &g2) == ncclInternalError) {
if (ncclTopoRankToIndex(system, rank2, &g2, /*showWarn=*/false) == ncclInternalError) {
// GPU not found, we can't use p2p.
return ncclSuccess;
}
@@ -302,15 +313,8 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
if ((arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_AMD) && system->nodes[GPU].count <= 2) p2pLevel = PATH_SYS;
// User override
if (ncclTopoUserP2pLevel == -1)
NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
if (ncclTopoUserP2pLevel != -2) {
p2pLevel = ncclTopoUserP2pLevel;
goto compare;
}
NCCLCHECK(ncclGetUserP2pLevel(&p2pLevel));
compare:
// Compute the PCI distance and compare with the p2pLevel.
if (path->type <= p2pLevel) *p2p = 1;
@@ -378,7 +382,8 @@ NCCL_PARAM(NetGdrRead, "NET_GDR_READ", -2);
int ncclTopoUserGdrLevel = -1;
const char* ncclTopoGdrModeStr[ncclTopoGdrModeNum] = { "Disabled", "Default", "PCI" };
NCCL_PARAM(NetGdrC2c, "NET_GDR_C2C", 0);
// On C2C platforms use GDRDMA on NICs which are connected to the CPUs
NCCL_PARAM(NetGdrC2c, "NET_GDR_C2C", 1);
ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t netId, int read, enum ncclTopoGdrMode* gdrMode) {
*gdrMode = ncclTopoGdrModeDisable;
@@ -387,7 +392,7 @@ ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t n
int n, g;
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &n));
struct ncclTopoNode* net = system->nodes[NET].nodes+n;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
// Check that both the NIC and GPUs support it
@@ -423,29 +428,29 @@ ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t n
// In case of PXN, use the intermediate GPU distance instead
int proxyRank;
NCCLCHECK(ncclTopoGetIntermediateRank(system, gpu->gpu.rank, netId, &proxyRank));
NCCLCHECK(ncclTopoRankToIndex(system, proxyRank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, proxyRank, &g, /*showWarn=*/true));
gpu = system->nodes[GPU].nodes+g;
distance = gpu->paths[NET][n].type;
}
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (ncclParamNetGdrC2c() && distance == PATH_PHB && gpu->paths[CPU][c].type == PATH_C2C) {
// On C2C platforms we can still use GDRDMA on NICs connected to the CPUs
INFO(NCCL_NET, "GPU %d / HCA %lx connected to CPU %d via C2C link", rank, netId, c);
// On C2C platforms we can still use GDRDMA on NICs connected to the CPUs
if (ncclParamNetGdrC2c() && distance == PATH_P2C) {
INFO(NCCL_GRAPH | NCCL_NET, "GPU %d / HCA %lx connected via C2C link", rank, netId);
distance = PATH_C2C;
}
if (distance > netGdrLevel) {
INFO(NCCL_NET,"GPU Direct RDMA Disabled for GPU %d / HCA %lx (distance %d > %d)", rank, netId, distance, netGdrLevel);
INFO(NCCL_GRAPH|NCCL_NET,"GPU Direct RDMA Disabled for GPU %d / HCA %lx (distance %d > %d)", rank, netId, distance, netGdrLevel);
return ncclSuccess;
}
// Force PCIe mapping if path goes through PCI on a C2C system
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (gpu->paths[CPU][c].type == PATH_C2C && distance != PATH_C2C) *gdrMode = ncclTopoGdrModePci;
else *gdrMode = ncclTopoGdrModeDefault;
INFO(NCCL_NET,"GPU Direct RDMA Enabled for GPU %d / HCA %lx (distance %d <= %d), read %d mode %s", rank, netId, distance, netGdrLevel, read, ncclTopoGdrModeStr[*gdrMode]);
INFO(NCCL_GRAPH|NCCL_NET,"GPU Direct RDMA Enabled for GPU %d / HCA %lx (distance %d <= %d), read %d mode %s", rank, netId, distance, netGdrLevel, read, ncclTopoGdrModeStr[*gdrMode]);
return ncclSuccess;
}
@@ -480,7 +485,7 @@ ncclResult_t ncclTopoNeedFlush(struct ncclComm* comm, int64_t netId, int netDev,
if (props.forceFlush == 1 || ncclParamNetForceFlush()) return ncclSuccess;
int g;
struct ncclTopoSystem* system = comm->topo;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
// Flush is required on Ampere and earlier
if (gpu->gpu.cudaCompCap >= 90) *flush = 0;
@@ -506,8 +511,8 @@ ncclResult_t ncclTopoCheckNet(struct ncclTopoSystem* system, int rank1, int rank
*net = 1;
// First check the current GPU-to-GPU speed.
int g1, g2;
if (ncclTopoRankToIndex(system, rank1, &g1) != ncclSuccess ||
ncclTopoRankToIndex(system, rank2, &g2) != ncclSuccess) {
if (ncclTopoRankToIndex(system, rank1, &g1, /*showWarn=*/false) != ncclSuccess ||
ncclTopoRankToIndex(system, rank2, &g2, /*showWarn=*/false) != ncclSuccess) {
return ncclSuccess;
}
@@ -533,7 +538,7 @@ ncclResult_t ncclTopoGetIntermediateRank(struct ncclTopoSystem* system, int rank
// Get GPU and NET
int n, g;
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &n));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
struct ncclTopoLinkList* path = gpu->paths[NET]+n;
if (path->type == PATH_PXN) {
@@ -601,6 +606,8 @@ ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks,
return ncclSuccess;
}
NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm* comm) {
// Precompute paths between GPUs/NICs.
@@ -659,6 +666,20 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
}
}
}
// update the GPU -> NIC path in the case of C2C + PHB
for (int n = 0; n < system->nodes[NET].count; n++) {
struct ncclTopoNode* netNode = system->nodes[NET].nodes + n;
for (int g = 0; g < system->nodes[GPU].count; g++) {
struct ncclTopoNode* gpuNode = system->nodes[GPU].nodes + g;
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (c == -1) continue;
if (gpuNode->paths[NET][n].type == PATH_PHB && gpuNode->paths[CPU][c].type == PATH_C2C) {
gpuNode->paths[NET][n].type = PATH_P2C;
netNode->paths[GPU][g].type = PATH_P2C;
}
}
}
// Update paths for NICs (no GPU Direct, PXN, ...)
for (int n=0; n<system->nodes[NET].count; n++) {
@@ -674,15 +695,20 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
// PXN = PCI + NVLink.
struct ncclTopoNode* peerNode = system->nodes[GPU].nodes+localGpuIndex;
// Only use PXN for NIC n if remote GPU p ...
if (peerNode->paths[NET][n].type <= PATH_PXB && // Is connected to the NIC through PCI
peerNode->paths[GPU][g].type <= PATH_NVL && // Is connected to us through NVLink
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) && // Is on the same node as us
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || // Has either higher BW to that NIC
gpu->paths[NET][n].type > PATH_PXB)) // or avoids going through a CPU
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
if (/* (1) is either connected to the NIC with PXB*/
(peerNode->paths[NET][n].type <= PATH_PXB ||
/* or with P2C and PxN over C2C is enabled */
(ncclParamPxnC2c() && peerNode->paths[NET][n].type == PATH_P2C)) &&
/* and (2) is connected to us through NVLink */
peerNode->paths[GPU][g].type <= PATH_NVL &&
/* and (3) is on the same node as us */
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) &&
/* and (4) has either higher bw to that NIC or avoid going through the CPU*/
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > PATH_PXB))
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
}
}
if (gpu->paths[NET][n].type < PATH_PHB) {
@@ -761,7 +787,7 @@ static ncclResult_t ncclTopoGetNchannels(struct ncclComm* comm, int g /*local gp
int peer;
struct ncclTopoSystem* system = comm->topo;
struct ncclTopoLinkList* path = NULL;
if (ncclTopoRankToIndex(system, peerRank, &peer) == ncclSuccess) {
if (ncclTopoRankToIndex(system, peerRank, &peer, /*showWarn=*/false) == ncclSuccess) {
// Same rank
if (g == peer) {
*nChannels = -1;
+16 -27
Просмотреть файл
@@ -137,6 +137,7 @@ static ncclResult_t ncclTopoFollowPath(struct ncclTopoSystem* system, struct ncc
float bw = intra ? graph->bwIntra : graph->bwInter;
int type = intra ? graph->typeIntra : graph->typeInter;
if (path->type >= PATH_DIS) return ncclSuccess;
if (mult == 1 && (path->type > type)) return ncclSuccess;
if (mult == 1 && (graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE ||
graph->pattern == NCCL_TOPO_PATTERN_TREE ||
@@ -328,8 +329,7 @@ ncclResult_t ncclTopoReplayGetGpu(struct ncclTopoSystem* system, struct ncclTopo
*g = i;
return ncclSuccess;
}
if (*g == -1) return ncclInternalError;
return ncclSuccess;
return ncclInternalError;
}
ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time);
@@ -658,24 +658,12 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
// Then try the most local GPUs
float maxBw = 0;
int minHops = 0xfffffff;
struct ncclTopoLinkList* paths = net->paths[GPU];
for (int g=0; g<system->nodes[GPU].count; g++) {
if (paths[g].bw > maxBw) {
maxBw = paths[g].bw;
minHops = paths[g].count;
} else if (paths[g].bw == maxBw && paths[g].count > 0 && paths[g].count < minHops) {
minHops = paths[g].count;
}
}
if (maxBw >= bw) {
for (int i=0; i<system->nodes[GPU].count; i++) {
int g = (graph->nChannels+i)%system->nodes[GPU].count;
if (paths[g].bw == maxBw && paths[g].count == minHops) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, g));
}
}
int localGpus[NCCL_TOPO_MAX_NODES], localGpuCount, pathType;
NCCLCHECK(ncclTopoGetLocal(system, NET, n, GPU, localGpus, &localGpuCount, &pathType));
// if no GPUs are connected, skip this net
if (pathType == PATH_DIS) continue;
for (int g = 0; g < localGpuCount; ++g) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpus[g]));
}
}
}
@@ -762,6 +750,7 @@ struct kvDict kvDictLinkType[] = {
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "PXN", PATH_PXN },
{ "P2C", PATH_P2C },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
{ NULL, 0 }
@@ -920,8 +909,8 @@ float sm90SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0,
#define NSPEEDSINTRA_SM90 (sizeof(sm90SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM90 (sizeof(sm90SpeedArrayInter)/sizeof(float))
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0 };
float sm100SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0, 18.0 };
float sm100SpeedArrayInter[] = { 47.9, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
#define NSPEEDSINTRA_SM100 (sizeof(sm100SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM100 (sizeof(sm100SpeedArrayInter)/sizeof(float))
@@ -1060,13 +1049,13 @@ search:
int maxIntra = system->nodes[NET].count > 0 ? tmpGraph.typeInter : maxTypeIntra;
if (tmpGraph.typeIntra < maxIntra && (graph->nChannels == 0 || tmpGraph.typeIntra < graph->typeIntra)) {
tmpGraph.typeIntra += 1;
goto search;
if (tmpGraph.typeIntra < PATH_DIS) goto search;
}
tmpGraph.typeIntra = minTypeIntra;
if (system->nodes[NET].count > 0 && tmpGraph.typeInter < maxTypeInter && (graph->nChannels == 0 || tmpGraph.typeInter < graph->typeInter || tmpGraph.typeInter < PATH_PXN)) {
tmpGraph.typeInter += 1;
goto search;
if (tmpGraph.typeInter < PATH_DIS) goto search;
}
tmpGraph.typeInter = minTypeInter;
@@ -1124,7 +1113,7 @@ done:
}
if (graph->nChannels == 0 && graph->collNet == 0 && graph->pattern != NCCL_TOPO_PATTERN_NVLS) {
WARN("Could not find a path for pattern %d, falling back to simple order", graph->pattern);
INFO(NCCL_GRAPH, "Could not find a path for pattern %d, falling back to simple order", graph->pattern);
for (int i=0; i<ngpus; i++) graph->intra[i] = system->nodes[GPU].nodes[i].gpu.rank;
graph->inter[0] = graph->inter[1] = 0;
graph->bwIntra = graph->bwInter = 0.1;
@@ -1248,7 +1237,7 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
}
if (pxnLevel == 1) {
int g, n;
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g, /*showWarn=*/true));
NCCLCHECK(ncclTopoIdToIndex(comm->topo, NET, netId, &n));
struct ncclTopoNode* gpu = comm->topo->nodes[GPU].nodes+g;
if (gpu->paths[NET][n].type <= PATH_PXN) {
@@ -1260,7 +1249,7 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
// Check which local GPU corresponds to that NIC and see if we can use PXN.
int n, g1, g2;
NCCLCHECK(ncclTopoIdToIndex(comm->topo, NET, netId, &n));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g1));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g1, /*showWarn=*/true));
NCCLCHECK(ncclTopoGetLocalGpu(comm->topo, netId, &g2));
if (g2 != -1) {
struct ncclTopoNode* peerGpu = comm->topo->nodes[GPU].nodes+g2;
+84 -50
Просмотреть файл
@@ -9,12 +9,10 @@
#include "topo.h"
#include "comm.h"
#include "nvmlwrap.h"
#include "net.h"
#include "coll_net.h"
#include "transport.h"
#include <sys/stat.h>
#include <fcntl.h>
#include "xml.h"
#include "cpuset.h"
#include "bootstrap.h"
@@ -22,8 +20,8 @@
#define BUSID_REDUCED_SIZE (sizeof("0000:00"))
const char* topoNodeTypeStr[] = { "GPU", "PCI", "NVS", "CPU", "NIC", "NET" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "PXN", "PHB", "SYS", "NET", "DIS" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "PXN", "P2C", "PHB", "SYS", "NET", "DIS" };
/******************************************************************/
/******************* Graph Creation Functions *********************/
@@ -251,7 +249,7 @@ ncclResult_t ncclTopoFlattenBcmSwitches(struct ncclTopoSystem* system) {
pciSwitch->pci.device |= 0xffff;
free(subSwIds);
// Restart, as system->nodes[PCI].nodes has changed.
s = 0;
s = -1; // Will be incremented to 0 in the next loop iteration
continue;
fail:
free(subSwIds);
@@ -404,7 +402,9 @@ ncclResult_t ncclTopoAddGpu(struct ncclXmlNode* xmlGpu, struct ncclTopoSystem* s
return ncclSuccess;
}
struct kvDict kvDictPciClass[] = { { "0x060400", PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
#define PCI_BRIDGE_DEVICE_CLASS "0x060400"
struct kvDict kvDictPciClass[] = { { PCI_BRIDGE_DEVICE_CLASS, PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
struct kvDict kvDictPciGen[] = {
{ "2.5 GT/s", 15 }, { "5 GT/s", 30 }, { "8 GT/s", 60 }, { "16 GT/s", 120 }, { "32 GT/s", 240 }, /* Kernel 5.6 and earlier */
{ "2.5 GT/s PCIe", 15 }, { "5.0 GT/s PCIe", 30 }, { "8.0 GT/s PCIe", 60 }, { "16.0 GT/s PCIe", 120 }, { "32.0 GT/s PCIe", 240 }, { "64.0 GT/s PCIe", 480 },
@@ -699,6 +699,7 @@ static ncclResult_t xmlInitAttrInt(struct ncclXmlNode* node, const char* attrNam
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "%d", value);
}
return ncclSuccess;
@@ -709,6 +710,7 @@ static ncclResult_t xmlInitAttrUint64(struct ncclXmlNode* node, const char* attr
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "0x%lx", value);
}
return ncclSuccess;
@@ -719,6 +721,7 @@ static ncclResult_t xmlInitAttrFloat(struct ncclXmlNode* node, const char* attrN
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "%f", value);
}
return ncclSuccess;
@@ -799,6 +802,17 @@ typedef struct xmlNodeStack {
} xmlNodeStack;
ncclResult_t ncclFindFirstPciParent(ncclXmlNode** parent) {
ncclXmlNode* newParent = *parent;
while (strcmp(newParent->name, "pci") != 0) {
newParent = newParent->parent;
if (newParent == nullptr) return ncclSuccess;
if (strcmp(newParent->name, "system") == 0) return ncclSuccess;
}
*parent = newParent;
return ncclSuccess;
}
// 1. Find the common parent xmlNode between the given set of nodes
ncclResult_t ncclTopoGetPath(ncclXmlNode** nodes, int nNodes, int* path, ncclXmlNode** parent) {
// Track a stack of parents per-net node being merged
@@ -897,6 +911,7 @@ ncclResult_t ncclTopoGetPath(ncclXmlNode** nodes, int nNodes, int* path, ncclXml
}
out:
ncclFindFirstPciParent(&common);
*parent = common;
free(parents);
return ncclSuccess;
@@ -960,13 +975,19 @@ ncclResult_t ncclTopoMakePciParent(struct ncclXml* xml, struct ncclXmlNode** par
return ncclSuccess;
}
ncclResult_t ncclTopoMakeVnic(ncclComm_t comm, struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
struct ncclXmlNode** physNetNodes, struct ncclXmlNode** netNode, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoMakeVnic(struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
struct ncclXmlNode** physNetNodes, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
if (vProps->ndevs > NCCL_NET_MAX_DEVS_PER_NIC) {
WARN("TOPO/NET : Tried to merge too many NICs. %d > %d", vProps->ndevs, NCCL_NET_MAX_DEVS_PER_NIC);
return ncclInternalError;
}
// Don't make vNics of size 1
if (vProps->ndevs == 1) {
TRACE(NCCL_GRAPH, "TOPO/NET : Skipping vNic of size 1");
return ncclSuccess;
}
// Trigger the merge, then get the new device's properties
int vDevIndex = 0;
ncclResult_t ret = makeVDevice(&vDevIndex, vProps);
@@ -976,11 +997,18 @@ struct ncclXmlNode** physNetNodes, struct ncclXmlNode** netNode, ncclResult_t (*
return ret;
}
// Mark original NICs as keep="0" in the topology
for (int i = 0; i < vProps->ndevs; i++) {
int dev = vProps->devs[i];
struct ncclXmlNode* netNode = physNetNodes[dev];
NCCLCHECK(xmlSetAttrInt(netNode, "keep", 0));
}
INFO(NCCL_GRAPH, "TOPO/NET : Made vNic %d", vDevIndex);
return ncclSuccess;
}
ncclResult_t ncclTopoForceMerge(ncclComm_t comm, struct ncclXml* xml, const char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoForceMerge(struct ncclXml* xml, char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ret = ncclSuccess;
INFO(NCCL_ENV|NCCL_NET, "TOPO/NET : Force-fusing NICs using NCCL_NET_FORCE_MERGE=%s", str);
char* ncStr;
@@ -1018,8 +1046,7 @@ ncclResult_t ncclTopoForceMerge(ncclComm_t comm, struct ncclXml* xml, const char
goto fail;
}
struct ncclXmlNode* netNode;
ret = ncclTopoMakeVnic(comm, xml, &vProps, physNetNodes, &netNode, makeVDevice);
ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
if (ret == ncclSuccess) {
// Only set that a device is "placed" after successfully making a vNic (it's possible to exit before this)
for (int i = 0; i < vProps.ndevs; i++) {
@@ -1041,7 +1068,7 @@ fail:
goto exit;
}
ncclResult_t ncclTopoAutoMerge(ncclComm_t comm, struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
// Compute the path type between each device
int* paths = NULL;
ncclResult_t res = ncclSuccess;
@@ -1085,8 +1112,7 @@ ncclResult_t ncclTopoAutoMerge(ncclComm_t comm, struct ncclXml* xml, int mergeLe
return ncclInternalError;
}
struct ncclXmlNode* netNode;
ncclResult_t ret = ncclTopoMakeVnic(comm, xml, &vProps, physNetNodes, &netNode, makeVDevice);
ncclResult_t ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
// Merging failed.
// Mark all as unplaced and increase their distance to disconnected (PATH_DIS)
@@ -1118,6 +1144,7 @@ struct kvDict nicPathKvList[] = {
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "PXN", PATH_PXN },
{ "P2C", PATH_P2C },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
{ NULL, 0 }
@@ -1139,14 +1166,19 @@ ncclResult_t ncclTopoGetVNicParent(struct ncclXml* xml, ncclResult_t (*getProper
if (path == PATH_LOC) {
*parent = NULL;
} else if (parent && strcmp((*parent)->name, "pci") == 0) {
// If the common parent is PCI, we must reparent the new NIC under a made up busId
NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
// Compare PCI class here to avoid NCCL WARN when the "class" attribute doesn't exist
const char* c;
NCCLCHECK(xmlGetAttrStr(*parent, "class", &c));
if (strcmp(c, PCI_BRIDGE_DEVICE_CLASS) == 0) {
// If the common parent is a PCI switch, we must reparent the new NIC under a made up pci device with a unique busid
NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
}
}
TRACE(NCCL_GRAPH, "Selected parent %s with path %d", (*parent)->name, path);
return ncclSuccess;
}
ncclResult_t ncclTopoMakeVNics(ncclComm_t comm, struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
ncclResult_t ncclTopoMakeVNics(struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
int* placedDevs = NULL;
struct ncclXmlNode** physNetNodes = NULL;
if (physicalDevs == 0) return ncclSuccess;
@@ -1170,15 +1202,15 @@ ncclResult_t ncclTopoMakeVNics(ncclComm_t comm, struct ncclXml* xml, ncclResult_
{ // Avoids warnings related to jumping to "out"
const char* mergeLevelEnv = ncclGetEnv("NCCL_NET_MERGE_LEVEL");
if (mergeLevelEnv) kvConvertToInt(mergeLevelEnv, &mergeLevel, nicPathKvList);
const char* forceMerge = ncclGetEnv("NCCL_NET_FORCE_MERGE");
char* forceMerge = (char*) ncclGetEnv("NCCL_NET_FORCE_MERGE");
NCCLCHECK(ncclCalloc(&placedDevs, physicalDevs));
memset(placedDevs, 0, sizeof(int)*physicalDevs);
if (forceMerge) {
NCCLCHECKGOTO(ncclTopoForceMerge(comm, xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
NCCLCHECKGOTO(ncclTopoForceMerge(xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
}
}
NCCLCHECKGOTO(ncclTopoAutoMerge(comm, xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
NCCLCHECKGOTO(ncclTopoAutoMerge(xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
out:
free(physNetNodes);
@@ -1187,7 +1219,7 @@ out:
return res;
}
static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int keep, int virtualNics) {
static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int virtualNics, bool dmaBufSupport) {
for (int n = startIndex; n < endIndex; n++) {
ncclNetProperties_t props;
NCCLCHECK(getProperties(n, &props));
@@ -1206,15 +1238,17 @@ static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int star
const char* colAttr;
NCCLCHECK(xmlGetAttr(netNode, "coll", &colAttr));
// If coll == 0 but the netNode is tagged as coll, don't update the keep value
if (colAttr == NULL || coll != 0 || strcmp(colAttr,"1") != 0) NCCLCHECK(xmlSetAttrInt(netNode, "keep", keep));
NCCLCHECK(xmlSetAttrInt(netNode, "keep", 1));
int dev;
xmlGetAttrIntDefault(netNode, "dev", &dev, -1);
if (dev != -1 && dev != n) INFO(NCCL_GRAPH, "TOPO/NET : Changing %s dev index from %d to %d", netName, dev, n);
NCCLCHECK(xmlSetAttrInt(netNode, "dev", n));
NCCLCHECK(xmlInitAttrInt(netNode, "latency", props.latency));
NCCLCHECK(xmlInitAttrInt(netNode, "speed", props.speed));
NCCLCHECK(xmlInitAttrInt(netNode, "port", props.port));
NCCLCHECK(xmlInitAttrUint64(netNode, "guid", props.guid));
NCCLCHECK(xmlInitAttrInt(netNode, "maxconn", props.maxComms));
bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (comm->dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
INFO(NCCL_NET,"NET/%s : GPU Direct RDMA %s for HCA %d '%s'", netName, gdrSupport ? "Enabled" : "Disabled", n, props.name);
NCCLCHECK(xmlInitAttrInt(netNode, "gdr", gdrSupport));
// Only set coll if it's not 0
@@ -1230,30 +1264,22 @@ static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int star
return ncclSuccess;
}
struct ncclTopoNetState {
int nVirtualNics;
int nPhysicalNics;
const char* name;
};
// Calls to network plugin APIs should be protected. This function should be called inside a per-process lock.
static ncclResult_t ncclTopoProcessNet(ncclComm_t comm, ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName) {
ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport) {
int usePhysicalDevices = (dumpXmlFile || makeVDevice == NULL);
if (state->nPhysicalNics == -1) NCCLCHECK(devices(&state->nPhysicalNics));
// Enumerate physical devices
NCCLCHECK(ncclTopoPopulateNics(comm, xml, 0, state->nPhysicalNics, getProperties, netName, coll, 1, 0));
NCCLCHECK(ncclTopoPopulateNics(xml, 0, state->nPhysicalNics, getProperties, netName, coll, false, dmaBufSupport));
if (!usePhysicalDevices) {
if (state->nVirtualNics == -1) {
NCCLCHECK(ncclTopoMakeVNics(comm, xml, makeVDevice, getProperties, state->nPhysicalNics));
NCCLCHECK(ncclTopoMakeVNics(xml, makeVDevice, getProperties, state->nPhysicalNics));
int nDevs;
NCCLCHECK(devices(&nDevs));
state->nVirtualNics = nDevs - state->nPhysicalNics;
}
// Remove keep=1 for physical collnets
if (state->nVirtualNics > 0) {
NCCLCHECK(ncclTopoPopulateNics(comm, xml, 0, state->nPhysicalNics, getProperties, netName, coll, 0, 0));
// Populate new devices
NCCLCHECK(ncclTopoPopulateNics(comm, xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, 1, 1));
NCCLCHECK(ncclTopoPopulateNics(xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, true, dmaBufSupport));
}
}
@@ -1301,6 +1327,15 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
// Try default XML topology location
NCCLCHECKGOTO(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0), ret, fail);
}
// Fixup the cpu's host_hashes.
struct ncclXmlNode* node;
// Update every cpu node's host_hash attribute since those are not
// intended to be preserved from the XML files that have been read.
NCCLCHECKGOTO(xmlFindTag(xml, "cpu", &node), ret, fail);
while (node != nullptr) {
NCCLCHECKGOTO(xmlSetAttrLong(node, "host_hash", getHostHash()), ret, fail);
NCCLCHECKGOTO(xmlFindNextTag(xml, "cpu", node, &node), ret, fail);
}
if (xml->maxIndex == 0) {
// Create top tag
struct ncclXmlNode* top;
@@ -1313,7 +1348,6 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
// Detect only the GPU managed by this process. We'll get any others through XML fusion.
char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
NCCLCHECKGOTO(int64ToBusId(comm->peerInfo[comm->rank].busId, busId), ret, fail);
struct ncclXmlNode* node;
NCCLCHECKGOTO(ncclTopoFillGpu(xml, busId, &node), ret, fail);
if (node) {
NCCLCHECKGOTO(xmlSetAttrInt(node, "keep", 1), ret, fail);
@@ -1330,12 +1364,12 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
state = NULL;
if (collNetSupport(comm)) {
NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclCollNet->name, collNetStates), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(comm, xml, 1, dumpXmlFile, state,
comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(xml, 1, dumpXmlFile, state,
comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name, comm->dmaBufSupport), ret, fail);
}
NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclNet->name, netStates), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(comm, xml, 0, dumpXmlFile, state,
comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(xml, 0, dumpXmlFile, state,
comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name, comm->dmaBufSupport), ret, fail);
pthread_mutex_unlock(&netLock);
netLockHeld = 0;
@@ -1399,7 +1433,7 @@ fail:
goto exit;
}
static ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType,
ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType,
int locals[NCCL_TOPO_MAX_NODES], int* localCount, int* pathType) {
int minType = PATH_DIS;
float maxBw = 0;
@@ -1452,7 +1486,7 @@ ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *c
ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int channelId, int64_t* id, int* dev) {
int gpu;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu, /*showWarn=*/true));
int localNets[NCCL_TOPO_MAX_NODES];
int localNetCount;
@@ -1517,7 +1551,7 @@ NCCL_PARAM(IgnoreCpuAffinity, "IGNORE_CPU_AFFINITY", 0);
ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu_set_t* affinity) {
struct ncclTopoNode* cpu = NULL, *gpu = NULL;
int gpuIndex, cpuIndex;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpuIndex));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpuIndex, /*showWarn=*/true));
NCCLCHECK(ncclGetLocalCpu(system, gpuIndex, &cpuIndex));
gpu = system->nodes[GPU].nodes+gpuIndex;
cpu = system->nodes[CPU].nodes+cpuIndex;
@@ -1529,8 +1563,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
#ifdef ENABLE_TRACE
{
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&mask, affinityStr));
TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev, affinityStr);
TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&mask, affinityStr, sizeof(affinityStr)));
}
#endif
@@ -1540,8 +1574,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
#ifdef ENABLE_TRACE
{
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&cpuMask, affinityStr));
TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev, affinityStr);
TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&cpuMask, affinityStr, sizeof(affinityStr)));
}
#endif
@@ -1558,8 +1592,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
// If there is a non empty set, use it to set affinity
if (CPU_COUNT(&finalMask)) {
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&finalMask, affinityStr));
INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev, affinityStr);
INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&finalMask, affinityStr, sizeof(affinityStr)));
}
return ncclSuccess;
}
+22 -8
Просмотреть файл
@@ -9,6 +9,8 @@
#include "graph.h"
#include "core.h"
#include "xml.h"
#include "net.h"
#define LOC_BW 5000.0
#define SM60_NVLINK_BW 18.0
@@ -50,9 +52,10 @@ extern const char* topoNodeTypeStr[];
#define LINK_PCI 4
// Skipping 5 for PATH_PXB
// Skipping 6 for PATH_PXN
// Skipping 7 for PATH_PHB
#define LINK_SYS 8
#define LINK_NET 9
// Skipping 7 for PATH_P2C
// Skipping 8 for PATH_PHB
#define LINK_SYS 9
#define LINK_NET 10
extern const char* topoLinkTypeStr[];
// Local (myself)
@@ -76,20 +79,23 @@ extern const char* topoLinkTypeStr[];
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 6
// Connection between a GPU and a NIC using the C2C connection to the CPU and the PCIe connection to the NIC
#define PATH_P2C 7
// Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
#define PATH_PHB 7
#define PATH_PHB 8
// Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
#define PATH_SYS 8
#define PATH_SYS 9
// Connection through the network
#define PATH_NET 9
#define PATH_NET 10
// New type of path which should precede PATH_PIX
#define PATH_PORT PATH_NVL
// Disconnected
#define PATH_DIS 10
#define PATH_DIS 11
extern const char* topoPathTypeStr[];
struct ncclTopoNode;
@@ -181,6 +187,13 @@ ncclResult_t ncclTopoGetGpuMinPath(struct ncclTopoSystem* system, int type, int*
ncclResult_t ncclTopoGetGpuMaxPath(struct ncclTopoSystem* system, int type, int* max);
ncclResult_t ncclTopoSplitNvLink(struct ncclTopoSystem* system, int* splitNvLink);
struct ncclTopoNetState {
int nVirtualNics;
int nPhysicalNics;
const char* name;
};
ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport);
#define NCCL_TOPO_XML_MAX_NODES 256
#define NCCL_GRAPH_XML_MAX_NODES 4096
ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem** topoSystem, uint64_t localHostHash);
@@ -200,7 +213,7 @@ static ncclResult_t ncclTopoIdToIndex(struct ncclTopoSystem* system, int type, i
return ncclInternalError;
}
static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank, int* index) {
static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank, int* index, bool showWarn) {
*index = -1;
for (int i=0; i<system->nodes[GPU].count; i++) {
if (system->nodes[GPU].nodes[i].gpu.rank == rank) {
@@ -208,6 +221,7 @@ static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank,
return ncclSuccess;
}
}
if (showWarn) WARN("ncclTopoRankToIndex could not find rank %d", rank);
return ncclInternalError;
}
+58 -34
Просмотреть файл
@@ -16,13 +16,13 @@ static int getNthreads(const char* name, int env, int min, int max, int def) {
int nt = env;
if (nt > 0) {
if (nt % WARP_SIZE != 0) {
WARN("Invalid %s %d (must be a multiple of %d)", name, nt, WARP_SIZE);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (must be a multiple of %d)", name, nt, WARP_SIZE);
nt = max;
} else if (nt > max) {
WARN("Invalid %s %d (maximum %d).", name, nt, max);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (maximum %d).", name, nt, max);
nt = max;
} else if (nt < min) {
WARN("Invalid %s %d (minimum %d).", name, nt, min);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (minimum %d).", name, nt, min);
nt = min;
}
} else {
@@ -51,11 +51,14 @@ static int getNthreads(const char* name, int env, int min, int max, int def) {
// NCCL_PROTO="^LL128;allreduce:LL128"
// Enable everything but LL128, but only LL128 for allreduce.
ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes, const char* elems[], int nelems, int* list) {
ncclResult_t ret = ncclSuccess;
char* fullStr = strdup(str);
char* tmpFullStr;
char* fullToken = strtok_r(fullStr, ";", &tmpFullStr);
char* subToken = nullptr;
char* tokStr = nullptr;
while (fullToken) {
char* subToken = strdup(fullToken);
subToken = strdup(fullToken);
char* tmpSubStr;
char* prefix = strtok_r(subToken, ":", &tmpSubStr);
char* elemList = strtok_r(NULL, ":", &tmpSubStr);
@@ -65,7 +68,8 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
// because then all the prefixes before the prefix-less entry would be
// overwritten.
WARN("All entries except the first must have a prefix: \"%s\"", str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
elemList = prefix;
prefix = NULL;
@@ -84,7 +88,7 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
foundPrefix = true;
for (int e=0; e<nelems; e++) list[p*nelems+e] = unset;
char* tokStr = strdup(elemList);
tokStr = strdup(elemList);
char* tmpStr;
char* elem = strtok_r(tokStr, ",", &tmpStr);
while (elem) {
@@ -97,22 +101,32 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
}
if (e==nelems) {
WARN("Unrecognized element token \"%s\" when parsing \"%s\"", elem, str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
elem = strtok_r(NULL, ",", &tmpStr);
}
free(tokStr);
tokStr = nullptr;
}
if (!foundPrefix) {
WARN("Unrecognized prefix token \"%s\" when parsing \"%s\"", prefix, str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
free(subToken);
subToken = nullptr;
fullToken = strtok_r(NULL, ";", &tmpFullStr);
}
exit:
free(tokStr);
free(subToken);
free(fullStr);
return ncclSuccess;
return ret;
fail:
goto exit;
}
// Latencies in us, Bandwidths in GB/s
@@ -194,6 +208,8 @@ static float getNetOverhead(struct ncclComm* comm) {
return 1.0;
}
NCCL_PARAM(Ll128C2c, "LL128_C2C", 1);
ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCompCap, struct ncclTopoGraph** graphs) {
int simpleDefaultThreads = (graphs[NCCL_ALGO_RING]->bwIntra*graphs[NCCL_ALGO_RING]->nChannels <= PCI_BW) ? 256 : NCCL_SIMPLE_MAX_NTHREADS;
comm->maxThreads[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] =
@@ -248,7 +264,14 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
&& a == NCCL_ALGO_PAT && (p != NCCL_PROTO_SIMPLE || ncclPatEnable(comm) == 0)) continue;
int collnet = (a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) ? 1 : 0;
float bw = nNodes <= 2 || collnet ? graphs[a]->bwIntra : graphs[a]->bwInter;
if (a == NCCL_ALGO_NVLS) bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
if (a == NCCL_ALGO_NVLS) {
if (coll == ncclFuncAllReduce) {
bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
} else {
// allgather and reducescatter
bw = std::min(graphs[a]->bwIntra * (ppn - 1.0f) / ppn, graphs[a]->bwInter * 0.9f);
}
}
if (a == NCCL_ALGO_NVLS_TREE) bw = std::min(graphs[a]->bwIntra, nNodes <= 2 ? graphs[a]->bwInter : graphs[a]->bwInter/2);
float busBw = graphs[a]->nChannels * bw;
@@ -264,19 +287,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (a == NCCL_ALGO_COLLNET_CHAIN && p != NCCL_PROTO_SIMPLE) busBw = 0; // Not used
if (a == NCCL_ALGO_COLLNET_DIRECT && p == NCCL_PROTO_SIMPLE) {
if (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) {
busBw = ppn * bw;
// AllGather/ReduceScatter requires 1:1 GPU:NIC
int nicPerNode = comm->collNetHeadsNum;
if (coll == ncclFuncAllGather && comm->nNodes > 1) {
if (!comm->ncclCollNet || !comm->ncclCollNet->iallgather || ppn > nicPerNode) busBw = 0;
}
if (coll == ncclFuncReduceScatter && comm->nNodes > 1) {
if (!comm->ncclCollNet || !comm->ncclCollNet->ireducescatter || ppn > nicPerNode) busBw = 0;
}
// Measured corrective ratio needed at 1 ppn and 8ppn. Here we hackishly
// interpolate the two.
float w = (ppn-1)/(8-1);
busBw *= w*0.85 + (1-w)*0.95;
busBw = ppn * std::min(graphs[a]->bwIntra, graphs[a]->bwInter * 0.9f);
} else {
// Collnet+Direct requires all GPUs to have a local NIC to work at full speed
float factor = ppn / (1.0*graphs[a]->nChannels); // GPU/NIC ratio
@@ -285,6 +296,26 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (minCompCap >= 90) busBw *= .85;
}
}
// disable collnet for allgather/reducescatter if #localranks > #heads
// AllGather/ReduceScatter requires 1:1 GPU:NIC
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_COLLNET_DIRECT) && p == NCCL_PROTO_SIMPLE && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) && comm->nNodes > 1) {
int nHeads = 0;
if (coll == ncclFuncAllGather && comm->nNodes > 1 && (!comm->ncclCollNet || !comm->ncclCollNet->iallgather)) busBw = 0.0f;
if (coll == ncclFuncReduceScatter && comm->nNodes > 1 && (!comm->ncclCollNet || !comm->ncclCollNet->ireducescatter)) busBw = 0.0f;
if (comm->config.collnetEnable)
nHeads = comm->collNetHeadsNum;
else
busBw = 0.0f;
if (busBw > 0.0f) {
for (int r = 0; r < comm->nRanks; r++) {
int node = comm->rankToNode[r];
if (comm->nodeRanks[node].localRanks > nHeads) {
busBw = 0.0f;
break;
}
}
}
}
// Convert bus BW to algorithm BW
if (!(a != NCCL_ALGO_RING && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter))) {
@@ -411,7 +442,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
// Disable NVLS Tree on a single node
if (comm->nNodes == 1 && a == NCCL_ALGO_NVLS_TREE) disable = 1;
// Disable Collnet+Direct, Collnet+Chain or Collnet+NVLS if collnet is not supported.
if (comm->collNetSupport == 0 &&
if (comm->config.collnetEnable == 0 &&
(a == NCCL_ALGO_COLLNET_DIRECT ||
a == NCCL_ALGO_COLLNET_CHAIN ||
(a == NCCL_ALGO_NVLS && comm->nNodes > 1))) disable = 1;
@@ -426,17 +457,10 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (pEnable == 2 && p == NCCL_PROTO_LL128) {
// Enable LL128 by default only on Volta/Ampere/Hopper/Blackwell+NVLink. Other cases are not tested and may cause silent data corruption.
pEnable = 1;
pEnable &= (graphs[a]->typeInter <= PATH_PXB || (minCompCap >= 90 && graphs[a]->typeInter <= PATH_PXN));
pEnable &= (graphs[a]->typeInter <= PATH_PXB || (minCompCap >= 90 && graphs[a]->typeInter <= (ncclParamLl128C2c() ? PATH_P2C : PATH_PXN)));
pEnable &= (graphs[a]->typeIntra <= PATH_NVB);
pEnable &= (minCompCap == maxCompCap);
switch (minCompCap) {
case 70: pEnable &= 1; break;
case 80: pEnable &= 1; break;
case 90: pEnable &= !(CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2); break;
case 100: pEnable &= 1; break;
case 120: pEnable &= 1; break;
default: pEnable &= 0; break;
}
pEnable &= !(minCompCap < 70 || (minCompCap == 90 && CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2));
}
if (pEnable == 0) comm->bandwidths[c][a][p] = 0;
if (algoEnable[c*NCCL_NUM_ALGORITHMS+a] == 0) comm->bandwidths[c][a][p] = 0;
@@ -483,7 +507,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
}
}
}
// Set per-thread amount of work before we increase nThreads and nChannels
for (int a=0; a<NCCL_NUM_ALGORITHMS; a++) {
comm->threadThresholds[a][NCCL_PROTO_LL] = NCCL_LL_THREAD_THRESHOLD;
+29 -8
Просмотреть файл
@@ -39,7 +39,13 @@ ncclResult_t xmlGetValue(FILE* file, char* value, char* last) {
#if INT_OK
int o = 0;
do {
value[o++] = c;
value[o] = c;
if (o == MAX_STR_LEN-1) {
value[o] = '\0';
WARN("Error : value %s too long (max %d)", value, MAX_STR_LEN);
return ncclInternalError;
}
o++;
NCCLCHECK(xmlGetChar(file, &c));
} while (c >= '0' && c <= '9');
value[o] = '\0';
@@ -51,10 +57,17 @@ ncclResult_t xmlGetValue(FILE* file, char* value, char* last) {
#endif
}
int o = 0;
char quote = c; // Remember which quote type we started with
do {
NCCLCHECK(xmlGetChar(file, &c));
value[o++] = c;
} while (c != '"');
value[o] = c;
if (o == MAX_STR_LEN-1) {
value[o] = '\0';
WARN("Error : value %s too long (max %d)", value, MAX_STR_LEN);
return ncclInternalError;
}
o++;
} while (c != quote);
value[o-1] = '\0';
NCCLCHECK(xmlGetChar(file, last));
return ncclSuccess;
@@ -267,7 +280,7 @@ ncclResult_t ncclTopoDumpXmlRec(int indent, FILE* file, struct ncclXmlNode* node
ncclResult_t ncclTopoDumpXmlToFile(const char* xmlTopoFile, struct ncclXml* xml) {
FILE* file = fopen(xmlTopoFile, "w");
if (file == NULL) {
WARN("Unable to open %s, not dumping topology.", xmlTopoFile);
INFO(NCCL_GRAPH|NCCL_ENV, "Unable to open %s, not dumping topology.", xmlTopoFile);
return ncclSuccess;
}
NCCLCHECK(ncclTopoDumpXmlRec(0, file, xml->nodes));
@@ -375,7 +388,7 @@ ncclResult_t ncclTopoGetXmlFromFile(const char* xmlTopoFile, struct ncclXml* xml
FILE* file = fopen(xmlTopoFile, "r");
if (file == NULL) {
if (warn) {
WARN("Could not open XML topology file %s : %s", xmlTopoFile, strerror(errno));
INFO(NCCL_GRAPH|NCCL_ENV, "Could not open XML topology file %s : %s", xmlTopoFile, strerror(errno));
}
return ncclSuccess;
}
@@ -759,7 +772,7 @@ ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvm
int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : (sm < 90) ? 12 : 18;
if (maxNvLinks > 0 && nvmlDev == NULL) {
WARN("No NVML device handle. Skipping nvlink detection.");
INFO(NCCL_GRAPH, "No NVML device handle. Skipping nvlink detection.");
maxNvLinks = 0;
}
@@ -961,8 +974,16 @@ ncclResult_t ncclTopoTrimXmlRec(struct ncclXmlNode* node, int* keep) {
NCCLCHECK(ncclTopoTrimXmlRec(subs[s], &k));
*keep += k;
}
if (*keep == 0 && // Trim PCI switches or CPU with no used GPU/NIC under them.
(strcmp(node->name, "pci") == 0 || strcmp(node->name, "cpu") == 0)) {
// Remove node if it has no children and no keep attribute
if (*keep == 0 && // Trim PCI switches, CPUs with no used GPU/NIC under them, or pruned NICs
(strcmp(node->name, "pci") == 0 || strcmp(node->name, "cpu") == 0 || strcmp(node->name, "nic") == 0 || strcmp(node->name, "net") == 0)) {
#ifdef ENABLE_TRACE
const char* name;
const char* busid;
NCCLCHECK(xmlGetAttr(node, "name", &name));
NCCLCHECK(xmlGetAttr(node, "busid", &busid));
TRACE(NCCL_GRAPH, "Removing node %s %s %s\n", node->name, name, busid);
#endif
NCCLCHECK(xmlRemoveNode(node));
}
}
+7 -4
Просмотреть файл
@@ -117,6 +117,13 @@ static ncclResult_t xmlGetAttrIntDefault(struct ncclXmlNode* node, const char* a
return ncclSuccess;
}
static ncclResult_t xmlGetAttrUint64(struct ncclXmlNode* node, const char* attrName, uint64_t* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
*value = strtoull(str, NULL, 0);
return ncclSuccess;
}
static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrName, int64_t* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
@@ -124,7 +131,6 @@ static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrNam
return ncclSuccess;
}
static ncclResult_t xmlGetAttrFloat(struct ncclXmlNode* node, const char* attrName, float* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
@@ -254,7 +260,6 @@ static ncclResult_t xmlSetAttrInt(struct ncclXmlNode* node, const char* attrName
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%d", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
@@ -267,7 +272,6 @@ static ncclResult_t xmlSetAttrFloat(struct ncclXmlNode* node, const char* attrNa
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%g", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
@@ -280,7 +284,6 @@ static ncclResult_t xmlSetAttrLong(struct ncclXmlNode* node, const char* attrNam
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%#lx", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
+243 -113
Просмотреть файл
@@ -12,16 +12,14 @@
#include <assert.h>
#include "bootstrap.h"
#define GROUP_MAX_RECLAIM_STEPS 10
__thread int ncclGroupDepth = 0; // depth of ncclGroupStart nesting
__thread ncclResult_t ncclGroupError = ncclSuccess;
__thread struct ncclComm* ncclGroupCommHead = nullptr;
__thread struct ncclComm* ncclGroupCommHead[ncclGroupTaskTypeNum] = {nullptr};
__thread struct ncclComm* ncclGroupCommPreconnectHead = nullptr;
__thread struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> ncclAsyncJobs;
__thread struct ncclGroupJob *ncclGroupJobMainPtr = NULL;
__thread struct ncclGroupJob ncclGroupJobMain;
__thread int ncclGroupBlocking = -1; /* default mode */
__thread bool ncclGroupJobAbortFlag = false;
void* ncclAsyncJobMain(void* arg);
ncclResult_t ncclAsyncLaunch(
@@ -191,6 +189,66 @@ fail:
goto exit;
}
struct ncclGroupSymmetricJob {
struct ncclAsyncJob base;
struct ncclComm* comm;
};
NCCL_PARAM(WinStride, "WIN_STRIDE", -1);
ncclResult_t ncclCommGroupRegisterSymmetric(struct ncclAsyncJob* job_) {
struct ncclGroupSymmetricJob* job = (struct ncclGroupSymmetricJob*)job_;
struct ncclComm* comm = job->comm;
ncclResult_t ret = ncclSuccess;
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
if (comm->baseStride == 0) {
cudaStream_t hostStream;
// first time to allocate symmetric VA space.
// calling into this function means symmetric is supported.
struct ncclSymDevBase* symBase = NULL;
size_t size = ncclSymDevBase::size(comm->localRanks);
if (ncclParamWinStride() != -1) {
comm->baseStride = ncclParamWinStride();
} else {
size_t maxStride = 0;
for (int r = 0; r < comm->nRanks; ++r)
if (comm->peerInfo[r].totalGlobalMem > maxStride) maxStride = comm->peerInfo[r].totalGlobalMem;
comm->baseStride = maxStride;
}
INFO(NCCL_INIT, "rank %d base stride %zuGB total VM %zuGB", comm->rank, comm->baseStride >> 30, (comm->baseStride * comm->localRanks) >> 30);
NCCLCHECKGOTO(ncclIpcSymmetricInit(comm), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricInit(comm), ret, fail);
comm->symAllocHead = 0;
// Allocate symmetric memory for NCCL internal usage
NCCLCHECKGOTO(ncclCommSymmetricAllocInternal(comm, size, alignof(struct ncclSymDevBase), (void**)&symBase), ret, fail);
assert((void*)symBase == (void*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride));
NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), ret, fail);
CUDACHECKGOTO(cudaMemsetAsync(symBase, 0, size, hostStream), ret, fail);
CUDACHECKGOTO(cudaStreamSynchronize(hostStream), ret, fail);
NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false), ret, fail);
comm->symDevComm.base = (struct ncclSymDevBase*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride);
comm->symDevComm.baseMc = (struct ncclSymDevBase*)comm->baseMCSymPtr;
comm->symDevComm.nRanks = comm->localRanks;
comm->symDevComm.nRanks_rcp32 = idivRcp32(comm->localRanks);
comm->symDevComm.rank = comm->localRank;
comm->symDevComm.stride4G = comm->baseStride >> 32;
}
while (!ncclIntruQueueEmpty(&comm->symRegTaskQueue)) {
struct ncclSymRegTask* task = ncclIntruQueueDequeue(&comm->symRegTaskQueue);
NCCLCHECKGOTO(ncclCommSymmetricRegisterInternal(comm, task->buff, task->baseSize, task->alignment, task->memHandle, task->regHandle), ret, fail);
free(task);
}
exit:
return ret;
fail:
goto exit;
}
static ncclResult_t doLaunches(struct ncclComm* head) {
ncclResult_t result = ncclSuccess;
struct ncclComm* cliqueHead = head;
@@ -207,7 +265,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), result, failure);
NCCLCHECKGOTO(ncclLaunchPrepare(comm), result, failure);
if (useBarrier) ncclCommIntraBarrierIn(comm, 1);
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
cliqueNextHead = comm;
@@ -224,7 +282,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
bool moreRounds = false;
comm = cliqueHead;
do { // Iterate clique members.
struct ncclComm* next = comm->groupNext;
struct ncclComm* next = comm->groupNext[ncclGroupTaskTypeCollective];
if (useBarrier) {
// Barrier reduction result tells us if this was the final round.
moreRounds = 0 != ncclCommIntraBarrierOut(comm);
@@ -259,64 +317,60 @@ failure:
return result;
}
static inline void groupResetJobState(struct ncclGroupJob* job) {
if (job) {
if (job->groupBlockingPtr) *job->groupBlockingPtr = -1;
if (job->abortFlagPtr) *job->abortFlagPtr = false;
if (job->groupErrorPtr) *job->groupErrorPtr = ncclSuccess;
if (job->groupCommHeadPtr) *job->groupCommHeadPtr = NULL;
if (job->groupCommPreconnectHeadPtr) *job->groupCommPreconnectHeadPtr = NULL;
memset(job, 0, sizeof(struct ncclGroupJob));
}
static inline void groupLocalResetJobState() {
ncclGroupError = ncclSuccess;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) ncclGroupCommHead[type] = NULL;
ncclGroupCommPreconnectHead = NULL;
ncclGroupBlocking = -1;
ncclIntruQueueConstruct(&ncclAsyncJobs);
return;
}
static void groupCleanup(struct ncclComm** groupCommHeadPtr, struct ncclComm** groupCommPreconnectHeadPtr, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncJobsPtr, ncclResult_t* groupErrorPtr, int* groupBlockingPtr, volatile bool* groupJobAbortFlagPtr, ncclResult_t error) {
struct ncclComm* comm = *groupCommHeadPtr;
/* reset all thread local variables */
*groupCommHeadPtr = NULL;
*groupCommPreconnectHeadPtr = NULL;
*groupErrorPtr = ncclSuccess;
*groupBlockingPtr = -1;
*groupJobAbortFlagPtr = false;
while (comm != nullptr) {
struct ncclComm* next = comm->groupNext;
(void) ncclGroupCommLeave(comm); // overwrites comm->groupNext
// We don't know if preconnect succeeded or happened at all, so clear
// the flags that let `taskAppend()` skip over checking if preconnect
// is needed.
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < comm->nRanks; i++) {
comm->connectSend[i] = 0UL;
comm->connectRecv[i] = 0UL;
}
// Reclaim abandoned kernel plan memory. Note ncclWork structs were already
// reclaimed by a `ncclMemoryStackPop(&comm->memScoped)` during `ncclGroupCommLeave()`.
while (!ncclIntruQueueEmpty(&comm->planner.planQueue)) {
struct ncclKernelPlan* plan = ncclIntruQueueDequeue(&comm->planner.planQueue);
// Persistent plans will be reclaimed via the callbackQueue when the
// graph drops its UserObject reference.
if (!plan->persistent) {
while (!ncclIntruQueueEmpty(&plan->proxyOpQueue)) {
struct ncclProxyOp* pxop = ncclIntruQueueDequeue(&plan->proxyOpQueue);
ncclMemoryPoolFree(&comm->memPool_ncclProxyOp, pxop);
static void groupCleanup(struct ncclComm** groupCommHeadPtr, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncJobsPtr, ncclResult_t error) {
struct ncclComm* comm;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
comm = groupCommHeadPtr[type];
// reset groupCommHeadPtr[type]
groupCommHeadPtr[type] = nullptr;
while (comm != nullptr) {
struct ncclComm* next = comm->groupNext[type];
(void)ncclGroupCommLeave(comm, type); // overwrites comm->groupNext
// We don't know if preconnect succeeded or happened at all, so clear
// the flags that let `taskAppend()` skip over checking if preconnect
// is needed.
if (type == ncclGroupTaskTypeCollective) {
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < comm->nRanks; i++) {
comm->connectSend[i] = 0UL;
comm->connectRecv[i] = 0UL;
}
// Reclaim abandoned kernel plan memory. Note ncclWork structs were already
// reclaimed by a `ncclMemoryStackPop(&comm->memScoped)` during `ncclGroupCommLeave()`.
while (!ncclIntruQueueEmpty(&comm->planner.planQueue)) {
struct ncclKernelPlan* plan = ncclIntruQueueDequeue(&comm->planner.planQueue);
// Persistent plans will be reclaimed via the callbackQueue when the
// graph drops its UserObject reference.
if (!plan->persistent) {
while (!ncclIntruQueueEmpty(&plan->proxyOpQueue)) {
struct ncclProxyOp* pxop = ncclIntruQueueDequeue(&plan->proxyOpQueue);
ncclMemoryPoolFree(&comm->memPool_ncclProxyOp, pxop);
}
ncclMemoryPoolFree(&comm->memPool_ncclKernelPlan, plan);
}
}
{ // Reset comm->planner to empty.
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (comm->planner.peers != NULL) memset(comm->planner.peers, 0, comm->nRanks * sizeof(comm->planner.peers[0]));
}
ncclMemoryPoolFree(&comm->memPool_ncclKernelPlan, plan);
}
}
{ // Reset comm->planner to empty.
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (comm->planner.peers != NULL) memset(comm->planner.peers, 0, comm->nRanks*sizeof(comm->planner.peers[0]));
if (!comm->config.blocking)
(void)ncclCommSetAsyncError(comm, error);
comm = next;
}
if (!comm->config.blocking)
(void) ncclCommSetAsyncError(comm, error);
comm = next;
}
/* reset everything */
@@ -393,11 +447,10 @@ fail:
static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInfo = NULL) {
ncclResult_t ret = ncclSuccess;
struct ncclGroupJob *gjob = (struct ncclGroupJob*) job_;
struct ncclComm *groupCommHeadMain = *gjob->groupCommHeadPtr;
struct ncclComm *groupCommPreconnectHeadMain = *gjob->groupCommPreconnectHeadPtr;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsMain = gjob->asyncJobsPtr;
bool *groupAbortFlag = gjob->abortFlagPtr;
struct ncclComm **groupCommHeadMain = gjob->groupCommHead;
struct ncclComm *groupCommPreconnectHeadMain = gjob->groupCommPreconnectHead;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsMain = &gjob->asyncJobs;
bool *groupAbortFlag = &gjob->abortFlag;
if (!simInfo && groupCommPreconnectHeadMain != nullptr) {
struct ncclComm* comm = groupCommPreconnectHeadMain;
@@ -421,9 +474,41 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
NCCLCHECKGOTO(asyncJobLaunch(asyncJobsMain, groupAbortFlag), ret, fail);
// only loop through sym alloc and register tasks
for (int type = ncclGroupTaskTypeSymRegister; type <= ncclGroupTaskTypeSymRegister; ++type) {
if (groupCommHeadMain[type]) {
struct ncclComm* cliqueHead = groupCommHeadMain[type];
struct ncclComm* comm = NULL;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncSymJobs;
ncclIntruQueueConstruct(&asyncSymJobs);
do {
comm = cliqueHead;
do {
struct ncclGroupSymmetricJob* job;
NCCLCHECKGOTO(ncclCalloc(&job, 1), ret, fail);
job->base.func = ncclCommGroupRegisterSymmetric;
job->base.undo = nullptr;
job->base.destructor = free;
job->base.state = ncclGroupJobRunning;
job->base.abortFlag = comm->abortFlag;
job->base.abortFlagDev = comm->abortFlagDev;
job->comm = comm;
ncclIntruQueueEnqueue(&asyncSymJobs, (struct ncclAsyncJob*)job);
comm = comm->groupNext[type];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
NCCLCHECKGOTO(asyncJobLaunch(&asyncSymJobs, groupAbortFlag), ret, fail);
while (!ncclIntruQueueEmpty(&asyncSymJobs)) {
struct ncclAsyncJob* job = ncclIntruQueueDequeue(&asyncSymJobs);
if (job->destructor) job->destructor((void*)job);
}
cliqueHead = comm;
} while (cliqueHead != nullptr);
}
}
/* Connect channels at runtime if cumem is supported */
if (groupCommHeadMain != nullptr) {
struct ncclComm* cliqueHead = groupCommHeadMain;
if (groupCommHeadMain[ncclGroupTaskTypeCollective] != nullptr) {
struct ncclComm* cliqueHead = groupCommHeadMain[ncclGroupTaskTypeCollective];
struct ncclComm* comm = NULL;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncCollJobs;
ncclIntruQueueConstruct(&asyncCollJobs);
@@ -454,7 +539,7 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
memcpy(job->algoNeedConnect, algoNeedConnect, sizeof(bool) * NCCL_NUM_ALGORITHMS);
ncclIntruQueueEnqueue(&asyncCollJobs, &job->base);
}
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
// connect
NCCLCHECKGOTO(asyncJobLaunch(&asyncCollJobs, groupAbortFlag), ret, fail);
@@ -466,42 +551,49 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
} while (cliqueHead != nullptr);
// done with all buffer allocation, start registration and enqueue
comm = groupCommHeadMain;
comm = groupCommHeadMain[ncclGroupTaskTypeCollective];
do {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
NCCLCHECKGOTO(ncclTasksRegAndEnqueue(comm), ret, fail);
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm);
}
if ((!simInfo) && (groupCommHeadMain != nullptr)) {
NCCLCHECKGOTO(doLaunches(groupCommHeadMain), ret, fail);
if ((!simInfo) && (groupCommHeadMain[ncclGroupTaskTypeCollective] != nullptr)) {
NCCLCHECKGOTO(doLaunches(groupCommHeadMain[ncclGroupTaskTypeCollective]), ret, fail);
}
while (!ncclIntruQueueEmpty(asyncJobsMain)) {
struct ncclAsyncJob* job = ncclIntruQueueDequeue(asyncJobsMain);
if (!job->destroyFlag && job->comm && !job->comm->config.blocking)
if (!job->destroyFlag && job->comm && !job->comm->config.blocking && groupCommHeadMain[ncclGroupTaskTypeCollective] == nullptr)
(void) ncclCommSetAsyncError(job->comm, ret);
if (job->destructor) job->destructor((void*)job);
}
while (groupCommHeadMain != nullptr) {
struct ncclComm* comm = groupCommHeadMain;
struct ncclComm* next = comm->groupNext;
// Poll for callbacks sent to us from other threads. Typically these free
// resources from to our memory pools and UB
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, /*waitSome=*/false), ret, fail);
(void) ncclGroupCommLeave(comm);
if (!comm->config.blocking) {
(void) ncclCommSetAsyncError(comm, ret);
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
while (groupCommHeadMain[type] != nullptr) {
struct ncclComm* comm = groupCommHeadMain[type];
struct ncclComm* next = comm->groupNext[type];
// Poll for callbacks sent to us from other threads. Typically these free
// resources from to our memory pools and UB
if (comm->reclaimSteps == GROUP_MAX_RECLAIM_STEPS) {
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, /*waitSome=*/false), ret, fail);
comm->reclaimSteps = 0;
} else {
comm->reclaimSteps++;
}
(void)ncclGroupCommLeave(comm, type);
if (!comm->config.blocking) {
(void)ncclCommSetAsyncError(comm, ret);
}
groupCommHeadMain[type] = next;
}
groupCommHeadMain = next;
}
exit:
return ret;
fail:
groupCleanup(gjob->groupCommHeadPtr, gjob->groupCommPreconnectHeadPtr, gjob->asyncJobsPtr, gjob->groupErrorPtr, gjob->groupBlockingPtr, gjob->abortFlagPtr, ret);
groupCleanup(gjob->groupCommHead, &gjob->asyncJobs, ret);
goto exit;
}
@@ -514,6 +606,8 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
ncclSimInfo_t internalSimInfo = NCCL_SIM_INFO_INITIALIZER;
ncclSimInfo_t* internalSimInfoPtr = NULL;
size_t realSize = 0;
bool hasCommHead = false;
ncclGroupJob* groupJob = NULL;
internalSimInfo.magic = 0;
@@ -539,72 +633,108 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
internalSimInfoPtr = &internalSimInfo;
}
if (ncclGroupCommHead != nullptr || !ncclIntruQueueEmpty(&ncclAsyncJobs) || ncclGroupCommPreconnectHead != nullptr) {
ncclGroupJobMain.groupCommHeadPtr = &ncclGroupCommHead;
ncclGroupJobMain.groupCommPreconnectHeadPtr = &ncclGroupCommPreconnectHead;
ncclGroupJobMain.groupErrorPtr = &ncclGroupError;
ncclGroupJobMain.asyncJobsPtr = &ncclAsyncJobs;
ncclGroupJobMain.abortFlagPtr = &ncclGroupJobAbortFlag;
ncclGroupJobMain.groupBlockingPtr = &ncclGroupBlocking;
ncclGroupJobMain.initialized = true;
ncclGroupJobMainPtr = &ncclGroupJobMain;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
if (ncclGroupCommHead[type]) {
hasCommHead = true;
break;
}
}
NCCLCHECKGOTO(ncclCalloc(&groupJob, 1), ret, fail);
ncclIntruQueueConstruct(&groupJob->asyncJobs);
groupJob->groupRefCount = 0;
groupJob->nonBlockingInit = false;
memcpy(groupJob->groupCommHead, ncclGroupCommHead, sizeof(ncclGroupCommHead));
groupJob->groupCommPreconnectHead = ncclGroupCommPreconnectHead;
groupJob->groupError = ncclSuccess;
groupJob->abortFlag = false;
groupJob->joined = false;
ncclIntruQueueTransfer(&groupJob->asyncJobs, &ncclAsyncJobs);
if (hasCommHead || !ncclIntruQueueEmpty(&groupJob->asyncJobs) || ncclGroupCommPreconnectHead != nullptr) {
/* make sure ncclGroupBlocking has been set. */
assert(ncclGroupBlocking == 0 || ncclGroupBlocking == 1);
if (ncclGroupBlocking == 0) {
/* nonblocking group */
if (!ncclIntruQueueEmpty(&ncclAsyncJobs)) {
ncclAsyncJob* job = ncclIntruQueueHead(&ncclAsyncJobs);
if (!ncclIntruQueueEmpty(&groupJob->asyncJobs)) {
ncclAsyncJob* job = ncclIntruQueueHead(&groupJob->asyncJobs);
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(job->comm, ncclInProgress), ret, fail);
job->comm->groupJob = ncclGroupJobMainPtr;
if (job->comm->groupJob == NULL) {
job->comm->groupJob = groupJob;
groupJob->groupRefCount++;
}
job = job->next;
} while (job);
}
if (ncclGroupCommHead) {
ncclComm_t comm = ncclGroupCommHead;
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(comm, ncclInProgress), ret, fail);
/* link group job to communicators. */
comm->groupJob = ncclGroupJobMainPtr;
comm = comm->groupNext;
} while (comm);
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
if (ncclGroupCommHead[type]) {
ncclComm_t comm = ncclGroupCommHead[type];
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(comm, ncclInProgress), ret, fail);
/* link group job to communicators. */
if (comm->groupJob == NULL) {
comm->groupJob = groupJob;
groupJob->groupRefCount++;
}
comm = comm->groupNext[type];
} while (comm);
}
}
ncclGroupJobMainPtr->base.func = groupLaunchNonBlocking;
PTHREADCHECKGOTO(pthread_create(&ncclGroupJobMainPtr->base.thread, NULL, ncclAsyncJobMain, (void*)&ncclGroupJobMainPtr->base), "pthread_create", ret, fail);
groupJob->base.func = groupLaunchNonBlocking;
PTHREADCHECKGOTO(pthread_create(&groupJob->base.thread, NULL, ncclAsyncJobMain, (void*)&groupJob->base), "pthread_create", ret, fail);
groupJob->nonBlockingInit = true;
ret = ncclInProgress;
} else {
/* blocking group */
int savedDev;
CUDACHECKGOTO(cudaGetDevice(&savedDev), ret, fail);
NCCLCHECKGOTO(groupLaunch(&ncclGroupJobMainPtr->base, internalSimInfoPtr), ret, fail);
NCCLCHECKGOTO(groupLaunch(&groupJob->base, internalSimInfoPtr), ret, fail);
CUDACHECKGOTO(cudaSetDevice(savedDev), ret, fail);
if (simInfo) memcpy((void*)simInfo, (void*)internalSimInfoPtr, realSize);
groupResetJobState(ncclGroupJobMainPtr);
free(groupJob);
}
}
/* Reset the job state for the next group call. */
groupLocalResetJobState();
exit:
return ret;
fail:
groupCleanup(&ncclGroupCommHead, &ncclGroupCommPreconnectHead, &ncclAsyncJobs, &ncclGroupError, &ncclGroupBlocking, &ncclGroupJobAbortFlag, ret);
if (groupJob) {
groupCleanup(groupJob->groupCommHead, &groupJob->asyncJobs, ret);
free(groupJob);
} else {
groupCleanup(ncclGroupCommHead, &ncclAsyncJobs, ret);
}
groupLocalResetJobState();
goto exit;
}
ncclResult_t ncclGroupJobComplete(struct ncclGroupJob* groupJob) {
ncclResult_t ret = ncclSuccess;
if (groupJob && groupJob->initialized) {
ret = ncclAsyncJobComplete(&groupJob->base);
groupResetJobState(groupJob);
if (groupJob && groupJob->nonBlockingInit) {
if (!__atomic_exchange_n(&groupJob->joined, true, __ATOMIC_ACQ_REL)) {
ret = ncclAsyncJobComplete(&groupJob->base);
}
if (ncclAtomicRefCountDecrement(&groupJob->groupRefCount) == 0) {
free(groupJob);
}
}
return ret;
}
ncclResult_t ncclGroupJobAbort(struct ncclGroupJob* groupJob) {
if (groupJob && groupJob->initialized) {
__atomic_store_n(groupJob->abortFlagPtr, true, __ATOMIC_RELEASE);
NCCLCHECK(ncclGroupJobComplete(groupJob));
if (groupJob && groupJob->nonBlockingInit) {
if (!__atomic_exchange_n(&groupJob->joined, true, __ATOMIC_ACQ_REL)) {
__atomic_store_n(&groupJob->abortFlag, true, __ATOMIC_RELAXED);
ncclAsyncJobComplete(&groupJob->base);
}
if (ncclAtomicRefCountDecrement(&groupJob->groupRefCount) == 0) {
free(groupJob);
}
}
return ncclSuccess;
}
+13
Просмотреть файл
@@ -0,0 +1,13 @@
/*************************************************************************
* Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_ALLOCATOR_H_
#define NCCL_ALLOCATOR_H_
ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr);
ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr);
#endif
+163 -23
Просмотреть файл
@@ -19,6 +19,28 @@
#endif
#endif
template<typename Int>
constexpr static __host__ __device__ Int minval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr static __host__ __device__ Int minval(Int a, Int b, More ...more) {
#if __CUDA_ARCH__
return minval(min(a, b), more...);
#else
return minval(a < b ? a : b, more...);
#endif
}
template<typename Int>
constexpr static __host__ __device__ Int maxval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr static __host__ __device__ Int maxval(Int a, Int b, More ...more) {
#if __CUDA_ARCH__
return maxval(max(a, b), more...);
#else
return maxval(a > b ? a : b, more...);
#endif
}
#define DIVUP(x, y) \
(((x)+(y)-1)/(y))
@@ -32,32 +54,150 @@
size = ((size + (align) - 1) / (align)) * (align);
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z divUp(X x, Y y) {
static __host__ __device__ constexpr Z divUp(X x, Y y) {
return (x+y-1)/y;
}
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z roundUp(X x, Y y) {
static __host__ __device__ constexpr Z roundUp(X x, Y y) {
return (x+y-1) - (x+y-1)%y;
}
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z roundDown(X x, Y y) {
static __host__ __device__ constexpr Z roundDown(X x, Y y) {
return x - x%y;
}
// assumes second argument is a power of 2
template<typename X, typename Z = decltype(X()+int())>
__host__ __device__ constexpr Z alignUp(X x, int a) {
static __host__ __device__ constexpr Z alignUp(X x, int a) {
return (x + a-1) & Z(-a);
}
// assumes second argument is a power of 2
template<typename X, typename Z = decltype(X()+int())>
__host__ __device__ constexpr Z alignDown(X x, int a) {
static __host__ __device__ constexpr Z alignDown(X x, int a) {
return x & Z(-a);
}
template<typename Int>
inline __host__ __device__ int countOneBits(Int x) {
constexpr __host__ __device__ bool isPow2(Int x) {
return (x & (x-1)) == 0;
}
template<typename T>
static __host__ __device__ T add4G(T base, int delta4G) {
union { T tmp; uint32_t u32[2]; };
tmp = base;
u32[1] += delta4G;
return tmp;
}
template<typename T>
static __host__ __device__ T incWrap4G(T ptr, uint32_t delta4G, uint32_t lo4G, uint32_t hi4G) {
union { T tmp; uint32_t u32[2]; };
tmp = ptr;
u32[1] += delta4G;
if (u32[1] >= hi4G) u32[1] -= hi4G-lo4G;
return tmp;
}
template<typename T>
static __host__ __device__ T decWrap4G(T ptr, uint32_t delta4G, uint32_t lo4G, uint32_t hi4G) {
union { T tmp; uint32_t u32[2]; };
tmp = ptr;
u32[1] -= delta4G;
if (u32[1] < lo4G) u32[1] += hi4G-lo4G;
return tmp;
}
// Produce the reciprocal of x for use in idivByRcp
constexpr __host__ __device__ uint32_t idivRcp32(uint32_t x) {
return uint32_t(uint64_t(0x100000000)/x);
}
constexpr __host__ __device__ uint64_t idivRcp64(uint64_t x) {
return uint64_t(-1)/x + isPow2(x);
}
static __host__ __device__ uint32_t mul32hi(uint32_t a, uint32_t b) {
#if __CUDA_ARCH__
return __umulhi(a, b);
#else
return uint64_t(a)*b >> 32;
#endif
}
static __host__ __device__ uint64_t mul64hi(uint64_t a, uint64_t b) {
#if __CUDA_ARCH__
return __umul64hi(a, b);
#else
return (uint64_t)(((unsigned __int128)a)*b >> 64);
#endif
}
// Produce the reciprocal of x*y given their respective reciprocals. This incurs
// no integer division on device.
static __host__ __device__ uint32_t imulRcp32(uint32_t x, uint32_t xrcp, uint32_t y, uint32_t yrcp) {
if (xrcp == 0) return yrcp;
if (yrcp == 0) return xrcp;
uint32_t rcp = mul32hi(xrcp, yrcp);
uint32_t rem = -x*y*rcp;
if (x*y <= rem) rcp += 1;
return rcp;
}
static __host__ __device__ uint64_t imulRcp64(uint64_t x, uint64_t xrcp, uint64_t y, uint64_t yrcp) {
if (xrcp == 0) return yrcp;
if (yrcp == 0) return xrcp;
uint64_t rcp = mul64hi(xrcp, yrcp);
uint64_t rem = -x*y*rcp;
if (x*y <= rem) rcp += 1;
return rcp;
}
// Fast integer division where divisor has precomputed reciprocal.
// idivFast(x, y, idivRcp(y)) == x/y
static __host__ __device__ void idivmodFast32(uint32_t *quo, uint32_t *rem, uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q = x, r = 0;
if (yrcp != 0) {
q = mul32hi(x, yrcp);
r = x - y*q;
if (r >= y) { q += 1; r -= y; }
}
*quo = q;
*rem = r;
}
static __host__ __device__ void idivmodFast64(uint64_t *quo, uint64_t *rem, uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q = x, r = 0;
if (yrcp != 0) {
q = mul64hi(x, yrcp);
r = x - y*q;
if (r >= y) { q += 1; r -= y; }
}
*quo = q;
*rem = r;
}
static __host__ __device__ uint32_t idivFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q, r;
idivmodFast32(&q, &r, x, y, yrcp);
return q;
}
static __host__ __device__ uint32_t idivFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q, r;
idivmodFast64(&q, &r, x, y, yrcp);
return q;
}
static __host__ __device__ uint32_t imodFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q, r;
idivmodFast32(&q, &r, x, y, yrcp);
return r;
}
static __host__ __device__ uint32_t imodFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q, r;
idivmodFast64(&q, &r, x, y, yrcp);
return r;
}
template<typename Int>
static __host__ __device__ int countOneBits(Int x) {
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(unsigned int)) {
return __popc((unsigned int)x);
@@ -83,7 +223,7 @@ inline __host__ __device__ int countOneBits(Int x) {
// Returns index of first one bit or returns -1 if mask is zero.
template<typename Int>
inline __host__ __device__ int firstOneBit(Int mask) {
static __host__ __device__ int firstOneBit(Int mask) {
int i;
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(int)) {
@@ -108,14 +248,14 @@ inline __host__ __device__ int firstOneBit(Int mask) {
}
template<typename Int>
inline __host__ __device__ int popFirstOneBit(Int* mask) {
static __host__ __device__ int popFirstOneBit(Int* mask) {
Int tmp = *mask;
*mask &= *mask-1;
return firstOneBit(tmp);
}
template<typename Int>
inline __host__ __device__ int log2Down(Int x) {
static __host__ __device__ int log2Down(Int x) {
int w, n;
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(int)) {
@@ -147,7 +287,7 @@ inline __host__ __device__ int log2Down(Int x) {
}
template<typename Int>
inline __host__ __device__ int log2Up(Int x) {
static __host__ __device__ int log2Up(Int x) {
int w, n;
if (x != 0) x -= 1;
#if __CUDA_ARCH__
@@ -180,19 +320,19 @@ inline __host__ __device__ int log2Up(Int x) {
}
template<typename Int>
inline __host__ __device__ Int pow2Up(Int x) {
static __host__ __device__ Int pow2Up(Int x) {
return Int(1)<<log2Up(x);
}
template<typename Int>
inline __host__ __device__ Int pow2Down(Int x) {
static __host__ __device__ Int pow2Down(Int x) {
// True, log2Down can return -1, but we don't normally pass 0 as an argument...
// coverity[negative_shift]
return Int(1)<<log2Down(x);
}
template<typename UInt, int nSubBits>
inline __host__ UInt reverseSubBits(UInt x) {
static __host__ UInt reverseSubBits(UInt x) {
if (nSubBits >= 16 && 8*sizeof(UInt) == nSubBits) {
switch (8*sizeof(UInt)) {
case 16: x = __builtin_bswap16(x); break;
@@ -225,7 +365,7 @@ template<> struct ncclToUnsigned<unsigned long long> { using type = unsigned lon
// Reverse the bottom nBits bits of x. The top bits will be overwritten with 0's.
template<typename Int>
inline __host__ __device__ Int reverseBits(Int x, int nBits) {
static __host__ __device__ Int reverseBits(Int x, int nBits) {
using UInt = typename ncclToUnsigned<Int>::type;
union { UInt ux; Int sx; };
sx = x;
@@ -249,7 +389,7 @@ inline __host__ __device__ Int reverseBits(Int x, int nBits) {
// has nearly the full range of uint32_t except it only keeps the top 3 bits
// beneath the leading 1 bit and thus has a max value of 0xf0000000.
inline __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
static __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
int log2x;
#if __CUDA_ARCH__
log2x = 31-__clz(x|1);
@@ -261,7 +401,7 @@ inline __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
return exponent<<bitsPerPow2 | mantissa;
}
inline __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
static __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
uint32_t exponent = x>>bitsPerPow2;
uint32_t mantissa = (x & ((1u<<bitsPerPow2)-1)) | (exponent!=0 ? 0x8 : 0);
if (exponent != 0) exponent -= 1;
@@ -270,16 +410,16 @@ inline __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
constexpr uint32_t u32fp8MaxValue() { return 0xf0000000; }
inline __host__ __device__ uint8_t u32fp8Encode(uint32_t x) {
static __host__ __device__ uint8_t u32fp8Encode(uint32_t x) {
return u32fpEncode(x, 3);
}
inline __host__ __device__ uint32_t u32fp8Decode(uint8_t x) {
static __host__ __device__ uint32_t u32fp8Decode(uint8_t x) {
return u32fpDecode(x, 3);
}
// The hash isn't just a function of the bytes but also where the bytes are split
// into different calls to eatHash().
inline __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size_t size) {
static __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size_t size) {
char const* ptr = (char const*)bytes;
acc[0] ^= size;
while (size != 0) {
@@ -302,11 +442,11 @@ inline __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size
}
template<typename T>
inline __host__ __device__ void eatHash(uint64_t acc[2], const T* bytes) {
static __host__ __device__ void eatHash(uint64_t acc[2], const T* bytes) {
eatHash(acc, (const void*)bytes, sizeof(T));
}
inline __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
static __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
uint64_t h = acc[0];
h ^= h >> 31;
h *= 0xbac3bd562846de6b;
@@ -316,13 +456,13 @@ inline __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
return h;
}
inline __host__ __device__ uint64_t getHash(const void* bytes, size_t size) {
static __host__ __device__ uint64_t getHash(const void* bytes, size_t size) {
uint64_t acc[2] = {1, 1};
eatHash(acc, bytes, size);
return digestHash(acc);
}
template<typename T>
inline __host__ __device__ uint64_t getHash(const T* bytes) {
static __host__ __device__ uint64_t getHash(const T* bytes) {
return getHash((const void*)bytes, sizeof(T));
}
+44 -15
Просмотреть файл
@@ -17,6 +17,7 @@
#include "register.h"
#include "graph.h"
#include "profiler.h"
#include "allocator.h"
#if CUDART_VERSION < 9000
struct cudaLaunchParams {
@@ -131,7 +132,6 @@ struct ncclSharedResources {
int* tpRankToLocalRank;
// Internal streams
struct ncclStrongStream deviceStream, hostStream;
int noncapturedRefs; // number of non-captured hostStreamPlanCallback on the stream
int persistentRefs;
cudaEvent_t launchEvent, scratchEvent;
@@ -218,6 +218,7 @@ struct ncclTaskColl {
// Profiler plugin
int eActivationMask;
void* eventHandle;
uint8_t nChannels;
};
struct ncclTaskP2p {
struct ncclTaskP2p* next;
@@ -231,6 +232,7 @@ struct ncclTaskP2p {
// Profiler plugin
int eActivationMask;
void* eventHandle;
uint8_t nChannels;
};
struct ncclKernelPlan {
@@ -243,10 +245,14 @@ struct ncclKernelPlan {
bool persistent; // aka captured in a graph
bool isHostCbEnq;
bool isSymColl;
enum ncclDevWorkStorageType workStorageType;
bool kernelSpecialized;
void *kernelFn;
struct ncclDevKernelArgs* kernelArgs;
void* kernelFn;
union {
struct ncclDevKernelArgs* kernelArgs;
struct ncclSymDevArgs* kernelSymArgs;
};
size_t kernelArgsSize;
uint64_t channelMask; // bitset of which channels are present
bool hasProxyOps; // does any channel have a non-empty proxyOpQueue
@@ -355,6 +361,7 @@ struct ncclKernelPlanner {
struct Peer* peers/*[nRanks]*/;
int nTasksColl, nTasksP2p;
bool persistent;
bool isSymColl;
// The list of user streams aggregated over all tasks present.
struct ncclCudaStreamList* streams;
@@ -404,6 +411,12 @@ struct ncclKernelPlanner {
#define NCCL_MAGIC 0x0280028002800280 // Nickel atomic number is 28.
typedef enum ncclGroupTaskType {
ncclGroupTaskTypeCollective = 0,
ncclGroupTaskTypeSymRegister = 1,
ncclGroupTaskTypeNum = 2,
} ncclGroupTaskType_t;
struct ncclComm {
uint64_t startMagic;
struct ncclMemoryStack memPermanent, memScoped;
@@ -420,9 +433,10 @@ struct ncclComm {
struct ncclTopoSystem* topo;
struct ncclProxyConnector* gproxyConn;
struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next> legacyRegCleanupQueue;
bool peerInfoValid;
int netPluginLoaded;
ncclNet_t* ncclNet;
int netPluginIndex;
int ncclNetVer;
ncclNetDeviceType netDeviceType;
ncclCollNet_t* ncclCollNet;
@@ -439,7 +453,6 @@ struct ncclComm {
uint64_t magic; // Magic number for all network communication. Not a security key -- only goal is to detect mismatches.
const char* commName;
uint64_t commHash;
int rank; // my rank in the communicator
int nRanks; // number of GPUs in communicator
@@ -515,6 +528,7 @@ struct ncclComm {
// Device side of the communicator (for cudaFree's)
struct ncclDevComm* devComm; // actually = &ncclDevCommAndChannels::comm
struct ncclSymDevComm symDevComm;
uint32_t workArgsBytes; // max size of kernel args
uint32_t workFifoBytes; // size of workFifoBuf, power of 2
@@ -522,12 +536,10 @@ struct ncclComm {
void* workFifoBufDev;
void* workFifoBufGdrHandle;
// Monotonic number of bytes (mod 1<<32) consumed per channel. In cudaHost memory.
uint32_t* workFifoConsumed/*[MAXCHANNELS]*/;
// Last observed value of: min(workFifoConsumed[c] for c < MAXCHANNELS)
uint32_t workFifoConsumedLeast;
// Monotonic number of bytes (mod 1<<32) sent to fifo.
uint32_t workFifoProduced;
uint32_t workFifoProducedLastRecorded;
uint32_t workFifoConsumed;
// Intra-process sync
struct ncclComm* intraComm0; // leader of intra-process comms (self possible)
@@ -543,10 +555,8 @@ struct ncclComm {
struct ncclProxyState* proxyState;
int proxyRefCountOld; /* store proxy post-atomic-sub refcount */
// Whether this communicator uses collNet
int collNetSupport;
bool isOneRPN;
uint8_t collNetSupportMatrix[4/*sum,prod,max,min*/][ncclNumTypes];
bool intraNodeP2pSupport;
int* collNetHeads;
int collNetHeadsNum;
int* collNetDenseToUserRank;
@@ -568,7 +578,7 @@ struct ncclComm {
// Next comm in this thread's active ncclGroup[Start|End](). Holds "0x1" when
// this comm is not yet in a group.
struct ncclComm* groupNext;
struct ncclComm* groupNext[ncclGroupTaskTypeNum];
// Subset of those in groupNext list. Holds 0x1 if not needing preconnect.
struct ncclComm* preconnectNext;
int localPersistentRefs; // number of persistent plan-lists capturing this comm
@@ -588,6 +598,7 @@ struct ncclComm {
ncclUserRedOp *userRedOps;
// Queue of things for the main thread to do
int reclaimSteps;
struct ncclIntruQueueMpsc<struct ncclCommCallback, &ncclCommCallback::next> callbackQueue;
ncclConfig_t config;
@@ -600,6 +611,9 @@ struct ncclComm {
// group job to support multi-thread FT
struct ncclGroupJob *groupJob;
// Flag indicating if this communicator shares resources with parent or children
bool shareResources;
// Tuning plugin
int tunerPluginLoaded;
ncclTuner_t* tuner;
@@ -613,9 +627,18 @@ struct ncclComm {
// buffer registration cache
struct ncclRegCache regCache;
int isAllNvlink;
bool isAllDirectP2p;
int symmetricSupport;
bool useNetPXN;
bool useGdr;
int splitCount;
// symmetric buffer
uint8_t* baseUCSymPtr;
uint8_t* baseMCSymPtr;
size_t baseStride;
size_t symAllocHead;
CUmemGenericAllocationHandle symMCHandle;
struct ncclIntruQueue<struct ncclSymRegTask, &ncclSymRegTask::next> symRegTaskQueue;
uint64_t endMagic;
};
@@ -647,15 +670,21 @@ inline ncclResult_t ncclCommPollCallbacks(struct ncclComm* comm, bool waitSome)
return ncclSuccess;
}
inline ncclResult_t ncclCommPollEventCallbacks(struct ncclComm *comm) {
inline ncclResult_t ncclCommPollEventCallbacks(struct ncclComm *comm, bool waitSome) {
ncclResult_t result = ncclSuccess;
cudaStreamCaptureMode mode = cudaStreamCaptureModeRelaxed;
CUDACHECK(cudaThreadExchangeStreamCaptureMode(&mode));
while (true) {
struct ncclCommEventCallback* cb = ncclIntruQueueHead(&comm->eventCallbackQueue);
if (cb == nullptr) break;
cudaError_t ok = cudaEventSynchronize(cb->event);
if (ok == cudaErrorNotReady) break;
cudaError_t ok;
if (waitSome) {
ok = cudaEventSynchronize(cb->event);
waitSome = false;
} else {
ok = cudaEventQuery(cb->event);
if (ok == cudaErrorNotReady) break;
}
ncclIntruQueueDequeue(&comm->eventCallbackQueue);
if (ok == cudaSuccess) {
NCCLCHECKGOTO(cb->fn(comm, cb), result, finish);
+25
Просмотреть файл
@@ -58,4 +58,29 @@ static ncclResult_t ncclCpusetToStr(cpu_set_t* mask, char* str) {
return ncclSuccess;
}
static char* ncclCpusetToRangeStr(cpu_set_t* mask, char* str, size_t len) {
int c = 0;
int start = -1;
// Iterate through all possible CPU bits plus one extra position
for (int cpu = 0; cpu <= CPU_SETSIZE; cpu++) {
int isSet = (cpu == CPU_SETSIZE) ? 0 : CPU_ISSET(cpu, mask);
// Start of a new range
if (isSet && start == -1) {
start = cpu;
}
// End of a range, add comma between ranges
if (!isSet && start != -1) {
if (cpu-1 == start) {
c += snprintf(str+c, len-c, "%s%d", c ? "," : "", start);
} else {
c += snprintf(str+c, len-c, "%s%d-%d", c ? "," : "", start, cpu-1);
}
if (c >= len-1) break;
start = -1;
}
}
if (c == 0) str[0] = '\0';
return str;
}
#endif
+37 -33
Просмотреть файл
@@ -36,6 +36,10 @@ extern CUmemAllocationHandleType ncclCuMemHandleType;
} \
} while(false)
#define CUCALL(cmd) do { \
pfn_##cmd; \
} while(false)
#define CUCHECKGOTO(cmd, res, label) do { \
CUresult err = pfn_##cmd; \
if( err != CUDA_SUCCESS ) { \
@@ -66,49 +70,49 @@ extern CUmemAllocationHandleType ncclCuMemHandleType;
} \
} while(0)
#define DECLARE_CUDA_PFN_EXTERN(symbol) extern PFN_##symbol pfn_##symbol
#define DECLARE_CUDA_PFN_EXTERN(symbol,version) extern PFN_##symbol##_v##version pfn_##symbol
#if CUDART_VERSION >= 11030
/* CUDA Driver functions loaded with cuGetProcAddress for versioning */
DECLARE_CUDA_PFN_EXTERN(cuDeviceGet);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGetAttribute);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorString);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorName);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAddressRange);
DECLARE_CUDA_PFN_EXTERN(cuCtxCreate);
DECLARE_CUDA_PFN_EXTERN(cuCtxDestroy);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetCurrent);
DECLARE_CUDA_PFN_EXTERN(cuCtxSetCurrent);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetDevice);
DECLARE_CUDA_PFN_EXTERN(cuPointerGetAttribute);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernel);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGet, 2000);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGetAttribute, 2000);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorString, 6000);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorName, 6000);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAddressRange, 3020);
DECLARE_CUDA_PFN_EXTERN(cuCtxCreate, 11040);
DECLARE_CUDA_PFN_EXTERN(cuCtxDestroy, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetCurrent, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxSetCurrent, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetDevice, 2000);
DECLARE_CUDA_PFN_EXTERN(cuPointerGetAttribute, 4000);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernel, 4000);
#if CUDART_VERSION >= 11080
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernelEx);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernelEx, 11060);
#endif
// cuMem API support
DECLARE_CUDA_PFN_EXTERN(cuMemAddressReserve);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressFree);
DECLARE_CUDA_PFN_EXTERN(cuMemCreate);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationGranularity);
DECLARE_CUDA_PFN_EXTERN(cuMemExportToShareableHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemImportFromShareableHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemMap);
DECLARE_CUDA_PFN_EXTERN(cuMemRelease);
DECLARE_CUDA_PFN_EXTERN(cuMemRetainAllocationHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemSetAccess);
DECLARE_CUDA_PFN_EXTERN(cuMemUnmap);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationPropertiesFromHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressReserve, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressFree, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemCreate, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationGranularity, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemExportToShareableHandle, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemImportFromShareableHandle, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemMap, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemRelease, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemRetainAllocationHandle, 11000);
DECLARE_CUDA_PFN_EXTERN(cuMemSetAccess, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemUnmap, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationPropertiesFromHandle, 10020);
#if CUDA_VERSION >= 11070
DECLARE_CUDA_PFN_EXTERN(cuMemGetHandleForAddressRange); // DMA-BUF support
DECLARE_CUDA_PFN_EXTERN(cuMemGetHandleForAddressRange, 11070); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
DECLARE_CUDA_PFN_EXTERN(cuMulticastAddDevice);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindMem);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindAddr);
DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate);
DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity);
DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind);
DECLARE_CUDA_PFN_EXTERN(cuMulticastAddDevice, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindMem, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindAddr, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind, 12010);
#endif
#endif
+37 -12
Просмотреть файл
@@ -10,6 +10,7 @@
#include "nccl.h"
#include "nccl_common.h"
#include "bitops.h"
#include "symmetric.h"
#include <algorithm>
#include <stdint.h>
#include <sys/types.h>
@@ -29,6 +30,30 @@ extern const char* ncclProtoStr[NCCL_NUM_PROTOCOLS];
#define NCCL_CUDA_ARCH 0
#endif
#ifdef __CUDA_ARCH_SPECIFIC__
#define NCCL_CUDA_ARCH_SPECIFIC __CUDA_ARCH_SPECIFIC__
#elif defined(__CUDA_ARCH_HAS_FEATURE__)
#if __CUDA_ARCH_HAS_FEATURE__(SM90_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 900
#elif __CUDA_ARCH_HAS_FEATURE__(SM100_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1000
#elif __CUDA_ARCH_HAS_FEATURE__(SM101_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1010
#elif __CUDA_ARCH_HAS_FEATURE__(SM120_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1200
#else
#define NCCL_CUDA_ARCH_SPECIFIC 0
#endif
#else
#define NCCL_CUDA_ARCH_SPECIFIC 0
#endif
#ifdef __CUDA_ARCH_FAMILY_SPECIFIC__
#define NCCL_CUDA_ARCH_FAMILY_SPECIFIC __CUDA_ARCH_FAMILY_SPECIFIC__
#else
#define NCCL_CUDA_ARCH_FAMILY_SPECIFIC 0
#endif
#include "net_device.h"
enum ncclDevRedOp_t {
@@ -380,6 +405,14 @@ struct alignas(16) ncclDevChannel {
uint64_t workCounter;
};
#define MAX_PROFILER_EVENTS_PER_CHANNEL 64
struct ncclDevProfiler {
struct {
uint64_t counter;
uint64_t timestamp;
} data[MAX_PROFILER_EVENTS_PER_CHANNEL];
};
struct ncclDevComm {
int rank;
int nRanks;
@@ -389,9 +422,6 @@ struct ncclDevComm {
int p2pChunkSize;
int isAllNvlink;
// Work fifo return credits
uint32_t* workConsumed/*[MAXCHANNELS]*/;
int* collNetDenseToUserRank;
// Flag to ask NCCL kernels to abort
@@ -402,8 +432,8 @@ struct ncclDevComm {
int* rankToLocalRank;
// Profiler counters
uint64_t* workStarted/*[MAXCHANNELS]*/;
uint64_t* workCompleted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workStarted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
};
struct alignas(16) ncclDevCommAndChannels {
@@ -476,7 +506,7 @@ __host__ __device__ constexpr int ncclCalcUnroll(int bytePerPack, int insns, int
__host__ __device__ constexpr int ncclCollUnroll(int cudaArch = NCCL_CUDA_ARCH) {
// Our collective unroll should move to the same bytes&insns model as NVLS.
return cudaArch >= 800 ? (cudaArch == 1200 ? 6 : 8) : 4;
return cudaArch >= 800 ? (cudaArch / 100 == 12 ? 6 : 8) : 4;
}
__host__ __device__ constexpr int ncclNvlsUnrollBytes(int cudaArch = NCCL_CUDA_ARCH) { return 4*16; }
@@ -507,7 +537,6 @@ extern int const ncclDevKernelCount;
extern void* const ncclDevKernelList[/*ncclDevKernelCount*/];
// Table of most specialized kernel function to run given func index.
extern int const ncclDevFuncIdCount;
extern int const ncclDevFuncRowToId[];
extern void* const ncclDevKernelForFunc[/*funcIndex*/];
extern bool const ncclDevKernelForFuncIsSpecialized[/*funcIndex*/];
@@ -535,11 +564,7 @@ inline bool ncclNvlsSupported(int devRedOp, int type) {
// `ncclDevFuncIndex()` needs to be in sync with "all_functions()" in "src/device/generate.py"
inline int ncclDevFuncId(int coll, int devRedOp, int type, int algo, int proto) {
#if defined(__CUDA_BF16_TYPES_EXIST__)
constexpr int NumTypes = ncclNumTypes;
#else
constexpr int NumTypes = ncclNumTypes + 1;
#endif
int row;
do {
row = 0; // ncclDevFuncIndex_P2p
@@ -564,7 +589,7 @@ inline int ncclDevFuncId(int coll, int devRedOp, int type, int algo, int proto)
}
row += nAlgos*NCCL_NUM_PROTOCOLS;
nAlgos = 6;
nAlgos = 6; // TREE RING COLLNET_DIRECT COLLNET_CHAIN NVLS NVLS_TREE
if (coll == ncclFuncAllReduce) {
row += ((devRedOp*NumTypes + type)*nAlgos + algo)*NCCL_NUM_PROTOCOLS + proto;
break;
+5 -1
Просмотреть файл
@@ -50,6 +50,8 @@ int ncclPxnDisable(struct ncclComm* comm);
ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks, int* nranks);
ncclResult_t ncclGetLocalCpu(struct ncclTopoSystem* system, int gpu, int* retCpu);
ncclResult_t ncclGetUserP2pLevel(int* level);
// Find CPU affinity
ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu_set_t* affinity);
@@ -74,7 +76,9 @@ ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int ch
ncclResult_t ncclTopoGetLocalGpu(struct ncclTopoSystem* system, int64_t netId, int* gpuIndex);
ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *count);
#define NCCL_TOPO_MAX_NODES 256
// Allows for up to 32 NICs per node on GB200-NVL72
#define NCCL_TOPO_MAX_NODES 576
ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType, int locals[NCCL_TOPO_MAX_NODES], int* localCount, int* pathType);
// Init search. Needs to be done before calling ncclTopoCompute
ncclResult_t ncclTopoSearchInit(struct ncclTopoSystem* system);
+27 -41
Просмотреть файл
@@ -9,9 +9,11 @@
#include "nccl.h"
#include "comm.h"
#include "allocator.h"
#include "register.h"
ncclResult_t ncclGroupErrCheck(ncclResult_t ret);
void ncclGroupCommJoin(struct ncclComm* comm);
void ncclGroupCommJoin(struct ncclComm* comm, int type);
void ncclGroupCommPreconnect(struct ncclComm* comm);
ncclResult_t ncclGroupCommLeave(struct ncclComm* comm);
ncclResult_t ncclGroupJobAbort(struct ncclGroupJob* groupJob);
@@ -52,13 +54,14 @@ ncclResult_t ncclAsyncLaunch(
struct ncclGroupJob {
struct ncclAsyncJob base;
struct ncclComm **groupCommHeadPtr;
struct ncclComm **groupCommPreconnectHeadPtr;
ncclResult_t *groupErrorPtr;
bool *abortFlagPtr;
int *groupBlockingPtr;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsPtr;
bool initialized;
int groupRefCount;
bool nonBlockingInit;
bool joined;
struct ncclComm *groupCommHead[ncclGroupTaskTypeNum];
struct ncclComm *groupCommPreconnectHead;
ncclResult_t groupError;
bool abortFlag;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncJobs;
};
ncclResult_t ncclGroupStartInternal();
@@ -69,27 +72,9 @@ ncclResult_t ncclAsyncJobComplete(struct ncclAsyncJob* job);
extern __thread int ncclGroupDepth; // depth of ncclGroupStart nesting
extern __thread ncclResult_t ncclGroupError;
extern __thread struct ncclComm* ncclGroupCommHead;
extern __thread struct ncclComm* ncclGroupCommHead[ncclGroupTaskTypeNum];
extern __thread struct ncclComm* ncclGroupCommPreconnectHead;
extern __thread int ncclGroupBlocking;
extern __thread struct ncclGroupJob *ncclGroupJobMainPtr;
extern __thread struct ncclGroupJob ncclGroupJobMain;
static inline void groupResetJobState() {
ncclGroupBlocking = -1;
ncclGroupJobMainPtr = NULL;
memset(&ncclGroupJobMain, 0, sizeof(struct ncclGroupJob));
return;
}
static inline ncclResult_t groupJobComplete(struct ncclGroupJob* job) {
ncclResult_t ret = ncclSuccess;
if (job) {
ret = ncclAsyncJobComplete(&job->base);
groupResetJobState();
}
return ret;
}
inline ncclResult_t ncclGroupStartInternal() {
ncclGroupDepth++;
@@ -104,31 +89,32 @@ inline ncclResult_t ncclGroupErrCheck(ncclResult_t ret) {
}
// Add comm to this thread's group
inline void ncclGroupCommJoin(struct ncclComm* comm) {
if (comm->groupNext == reinterpret_cast<struct ncclComm*>(0x1)) {
inline void ncclGroupCommJoin(struct ncclComm* comm, int type) {
if (comm->groupNext[type] == reinterpret_cast<struct ncclComm*>(0x1)) {
// Insert comm into ncclGroupCommHead adjacent to sibling comms. This preserves
// the users program order yet insures siblings occur consecutively. This
// is required by doLaunches() in "group.cc".
struct ncclComm** pp = &ncclGroupCommHead;
struct ncclComm** pp = &ncclGroupCommHead[type];
while (*pp != nullptr && comm->intraComm0 != (*pp)->intraComm0)
pp = &(*pp)->groupNext;
pp = &(*pp)->groupNext[type];
// didn't find its clique, we need to insert it with ascending order based on commHash
if (*pp == nullptr) {
pp = &ncclGroupCommHead;
while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext;
pp = &ncclGroupCommHead[type];
while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext[type];
}
comm->groupNext = *pp;
comm->groupNext[type] = *pp;
*pp = comm;
// Comms gets a new memory stack scope upon joining. Each task batched for
// this comm is allocated there.
ncclMemoryStackPush(&comm->memScoped);
// Initialize planner
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (type == ncclGroupTaskTypeCollective) {
// Initialize planner
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
}
}
ncclGroupBlocking = comm->config.blocking;
}
@@ -141,8 +127,8 @@ inline void ncclGroupCommPreconnect(struct ncclComm* comm) {
}
// Comm has left group
inline ncclResult_t ncclGroupCommLeave(struct ncclComm* comm) {
comm->groupNext = reinterpret_cast<struct ncclComm*>(0x1);
inline ncclResult_t ncclGroupCommLeave(struct ncclComm* comm, int type) {
comm->groupNext[type] = reinterpret_cast<struct ncclComm*>(0x1);
ncclMemoryStackPop(&comm->memScoped);
return ncclSuccess;
}
+18
Просмотреть файл
@@ -0,0 +1,18 @@
#ifndef NCCL_MLX5DV_CORE_H_
#define NCCL_MLX5DV_CORE_H_
/* Basic MLX5 direct verbs structs. Needed to dynamically load MLX5 direct verbs functions without
* explicit including of MLX5 direct verbs header.
*/
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include "ibvwrap.h"
enum mlx5dv_reg_dmabuf_access {
MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT = (1<<0),
};
#endif // NCCL_MLX5DV_CORE_H_
+23
Просмотреть файл
@@ -0,0 +1,23 @@
#ifndef NCCL_MLX5DV_SYMBOLS_H_
#define NCCL_MLX5DV_SYMBOLS_H_
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "nccl.h"
/* MLX5 Direct Verbs Function Pointers*/
struct ncclMlx5dvSymbols {
bool (*mlx5dv_internal_is_supported)(struct ibv_device *device);
int (*mlx5dv_internal_get_data_direct_sysfs_path)(struct ibv_context *context, char *buf, size_t buf_len);
/* DMA-BUF support */
struct ibv_mr * (*mlx5dv_internal_reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
};
/* Constructs MLX5 direct verbs symbols per rdma-core linking or dynamic loading mode */
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols);
#endif // NCCL_MLX5DV_SYMBOLS_H_
+41
Просмотреть файл
@@ -0,0 +1,41 @@
/*************************************************************************
* Copyright (c) 2004, 2005 Topspin Communications. All rights reserved.
* Copyright (c) 2004, 2011-2012 Intel Corporation. All rights reserved.
* Copyright (c) 2005, 2006, 2007 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2005 PathScale, Inc. All rights reserved.
*
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_MLX5DVWRAP_H_
#define NCCL_MLX5DVWRAP_H_
#include <arpa/inet.h>
#include <netinet/in.h>
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "core.h"
#include "ibvwrap.h"
#include <sys/types.h>
#include <unistd.h>
typedef enum mlx5dv_return_enum
{
MLX5DV_SUCCESS = 0, //!< The operation was successful
} mlx5dv_return_t;
ncclResult_t wrap_mlx5dv_symbols(void);
/* NCCL wrappers of MLX5 direct verbs functions */
bool wrap_mlx5dv_is_supported(struct ibv_device *device);
ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len);
/* DMA-BUF support */
ncclResult_t wrap_mlx5dv_reg_dmabuf_mr(struct ibv_mr **ret, struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
#endif // NCCL_MLX5DVWRAP_H_
+13 -1
Просмотреть файл
@@ -7,6 +7,8 @@
#ifndef NCCL_DEBUG_H_
#define NCCL_DEBUG_H_
#include <cstdint>
typedef enum {
NCCL_LOG_NONE = 0,
NCCL_LOG_VERSION = 1,
@@ -38,6 +40,16 @@ typedef enum {
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
// NCCL core profiler callback for network defined events instrumentation
enum {
ncclProfilerNetEventStart = 0,
ncclProfilerNetEventStop,
ncclProfilerNetEventUpdate,
ncclProfilerNetEventUpdateAndStop,
};
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData);
#define NCCL_NUM_FUNCTIONS 5 // Send/Recv not included for now
typedef enum {
ncclFuncBroadcast = 0,
@@ -51,7 +63,7 @@ typedef enum {
ncclNumFuncs = 8
} ncclFunc_t;
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*/PAT
#define NCCL_ALGO_UNDEF -1
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1
-2
Просмотреть файл
@@ -14,8 +14,6 @@
typedef char ncclNetHandle_t[NCCL_NET_HANDLE_MAXSIZE];
ncclResult_t ncclNetPluginLoad(struct ncclComm* comm);
ncclResult_t ncclNetPluginUnload(struct ncclComm* comm);
ncclResult_t ncclNetInit(struct ncclComm* comm);
ncclResult_t ncclNetFinalize(struct ncclComm* comm);
+2 -1
Просмотреть файл
@@ -31,10 +31,11 @@
#define NVTX_SID_CommInitRankScalable 12 // same schema as NVTX_SID_CommInitRank
#define NVTX_SID_CommSplit 13
#define NVTX_SID_CommFinalize 14
#define NVTX_SID_CommShrink 15
// When adding new schema IDs, DO NOT re-use/overlap with the enum schema ID below!
// Define static schema ID for the reduction operation.
#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 15 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 16 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
extern const nvtxDomainHandle_t ncclNvtxDomainHandle;
+10
Просмотреть файл
@@ -67,6 +67,16 @@ NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommSplit, static cons
)
)
NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommShrink, static constexpr,
NCCL_NVTX_PAYLOAD_ENTRIES(
(uint64_t, newcomm, TYPE_UINT64, nccl_nvtxCommStr),
(int, nranks, TYPE_INT, nccl_nvtxNranksStr),
(int, myrank, TYPE_INT, nccl_nvtxRankStr),
(int, cudaDev, TYPE_INT, nccl_nvtxCudaDevStr),
(int, num_exclude, TYPE_INT, "num_exclude")
)
)
NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommFinalize, static constexpr,
NCCL_NVTX_PAYLOAD_ENTRIES(
(uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr)
+3 -4
Просмотреть файл
@@ -28,10 +28,9 @@
#define NCCL_NET_MAX_REQUESTS 32
// Max number of ncclNet objects which can live in the same process
#define NCCL_NET_MAX_PLUGINS 3
// NCCL core profiler callback for network defined events instrumentation
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData);
#ifndef NCCL_NET_MAX_PLUGINS
#define NCCL_NET_MAX_PLUGINS 16
#endif
#include "net/net_v10.h"
#include "net/net_v9.h"
+32 -22
Просмотреть файл
@@ -19,43 +19,53 @@ enum {
};
typedef enum {
ncclProfilerProxyOpSendPosted,
ncclProfilerProxyOpSendRemFifoWait,
ncclProfilerProxyOpSendTransmitted,
ncclProfilerProxyOpSendDone,
ncclProfilerProxyOpRecvPosted,
ncclProfilerProxyOpRecvReceived,
ncclProfilerProxyOpRecvTransmitted,
ncclProfilerProxyOpRecvDone,
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,
/* Legacy proxy profiler states */
ncclProfilerProxyStepSendGPUWait,
ncclProfilerProxyStepSendWait,
ncclProfilerProxyStepRecvWait,
ncclProfilerProxyStepRecvFlushWait,
ncclProfilerProxyStepRecvGPUWait,
ncclProfilerProxyStepSendGPUWait = 8,
ncclProfilerProxyStepSendPeerWait_v4 = 20,
ncclProfilerProxyStepSendWait = 9,
ncclProfilerProxyStepRecvWait = 10,
ncclProfilerProxyStepRecvFlushWait = 11,
ncclProfilerProxyStepRecvGPUWait = 12,
/* Legacy proxy control states */
ncclProfilerProxyCtrlIdle,
ncclProfilerProxyCtrlActive,
ncclProfilerProxyCtrlSleep,
ncclProfilerProxyCtrlWakeup,
ncclProfilerProxyCtrlAppend,
ncclProfilerProxyCtrlAppendEnd,
ncclProfilerProxyCtrlIdle = 13,
ncclProfilerProxyCtrlActive = 14,
ncclProfilerProxyCtrlSleep = 15,
ncclProfilerProxyCtrlWakeup = 16,
ncclProfilerProxyCtrlAppend = 17,
ncclProfilerProxyCtrlAppendEnd = 18,
/* Network defined event states */
ncclProfilerNetPluginUpdate = 21,
/* Kernel event states */
ncclProfilerKernelChStop = 22,
} ncclProfilerEventState_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
#include <cstdint>
#include "profiler/profiler_v4.h"
#include "profiler/profiler_v3.h"
#include "profiler/profiler_v2.h"
#include "profiler/profiler_v1.h"
typedef ncclProfiler_v3_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v4_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
#define NCCL_PROFILER_NET_VER_BITS (16)
#define NCCL_PROFILER_NET_VER_MASK (~0U >> NCCL_PROFILER_NET_VER_BITS)
+123
Просмотреть файл
@@ -0,0 +1,123 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef PROFILER_V4_H_
#define PROFILER_V4_H_
typedef struct {
uint8_t type; // event type descriptor: ncclProfileColl, ...
void* parentObj; // pointer to the profiler parent object (for coll is the group)
int rank; // originating rank
union {
struct {
uint64_t seqNumber;
const char* func;
void const* sendBuff;
void* recvBuff;
size_t count;
int root;
const char* datatype;
uint8_t nChannels;
uint8_t nWarps;
const char* algo;
const char* proto;
} coll;
struct {
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer;
uint8_t nChannels;
} p2p;
struct {
pid_t pid; // pid of the originating process
uint8_t channelId; // channel id for this proxy operation
int peer; // remote rank for send/recv
int nSteps; // number of steps for this proxy operation
int chunkSize; // amount of data transferred by this proxy operation
int isSend;
} proxyOp;
struct {
int step;
} proxyStep;
struct {
uint8_t channelId;
uint64_t pTimer; // start timestamp from GPU globaltimer
} kernelCh;
struct {
int64_t id;
void* data;
} netPlugin;
};
} ncclProfilerEventDescr_v4_t;
typedef union {
struct {
size_t transSize;
} proxyStep;
struct {
int appendedProxyOps;
} proxyCtrl;
struct {
void* data;
} netPlugin;
struct {
uint64_t pTimer;
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
typedef struct {
const char* name;
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
// - context: opaque profiler context object
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside and event set
// Input
// - eHandle: handle to event object
ncclResult_t (*stopEvent)(void* eHandle);
// recordEventState - record event state transitions and event attribute updates
// Input
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v4_t;
#endif
+6 -7
Просмотреть файл
@@ -21,8 +21,8 @@ struct ncclProxyConnector;
struct ncclProfilerProxy {
bool initialized;
uint64_t* workStarted/*[MAXCHANNELS]*/;
uint64_t* workCompleted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workStarted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
uint64_t workCounter[MAXCHANNELS]; // host work counter
struct ncclProxyConnector sendProxyConn[MAXCHANNELS];
struct ncclProxyConnector recvProxyConn[MAXCHANNELS];
@@ -43,8 +43,7 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan);
ncclResult_t ncclProfilerStopTaskEvents(struct ncclKernelPlan* plan);
// Proxy Op Start/Stop Event Wrappers
ncclResult_t ncclProfilerStartSendProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStartRecvProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStartProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStopProxyOpEvent(int sub, struct ncclProxyArgs* args);
// Proxy Step Start/Stop Event Wrappers
@@ -57,11 +56,11 @@ ncclResult_t ncclProfilerStartProxyCtrlEvent(void* profilerContext, void** eHand
ncclResult_t ncclProfilerStopProxyCtrlEvent(void* eHandle);
// Kernel Channel Start/Stop Event Wrappers
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s);
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s);
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t start);
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t stop);
// Record Event Wrappers
ncclResult_t ncclProfilerRecordProxyOpEventState(int sub, struct ncclProxyArgs* args, int steps, size_t transSize, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyOpEventState(int sub, struct ncclProxyArgs* args, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyStepEventState(int sub, struct ncclProxyArgs* args, int stepId, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyCtrlEventState(void*eHandle, int appended, ncclProfilerEventState_t eState);
+10 -2
Просмотреть файл
@@ -105,6 +105,13 @@ struct ncclProxyOp {
struct ncclProxyOp *enqNext;
};
struct ncclProxySubArgs;
struct ncclProxyEventHandle {
void* stepEventHandle;
struct ncclProxySubArgs* subArgPtr;
};
struct ncclProxySubArgs {
struct ncclProxyConnection* connection;
int reg;
@@ -137,13 +144,12 @@ struct ncclProxySubArgs {
// Profiler plugin
int eActivationMask;
int rank;
uint64_t profilerSteps;
pid_t pid;
void* profilerContext;
void* taskEventHandle;
void* opEventHandle;
void* kernelEventHandle;
void* stepEventHandles[NCCL_STEPS];
struct ncclProxyEventHandle pHandles[NCCL_STEPS];
size_t transSize;
uint64_t workCounter;
@@ -226,6 +232,8 @@ struct ncclProxyPeer {
};
struct ncclSharedNetComms {
int activeConnect[MAXCHANNELS];
int activeAccept[MAXCHANNELS];
void* sendComm[MAXCHANNELS];
void* recvComm[MAXCHANNELS];
int sendRefCount[MAXCHANNELS];
+19 -5
Просмотреть файл
@@ -29,18 +29,24 @@ struct ncclRegNetHandles {
struct ncclRegNetHandles* next;
};
struct ncclSymRegTask {
struct ncclSymRegTask *next;
void* buff;
size_t baseSize;
CUmemGenericAllocationHandle memHandle;
struct ncclReg* regHandle;
size_t alignment;
};
struct ncclReg {
// common attributes
size_t pages;
uintptr_t begAddr, endAddr; // page aligned
int localRefs;
int graphRefs;
uintptr_t addr;
uint32_t state;
// net reg
struct ncclRegNetHandles* netHandleHead;
// nvls reg
uintptr_t baseAddr;
size_t baseSize;
CUdeviceptr regAddr;
size_t regUCSize, regMCSize;
int dev;
@@ -52,6 +58,10 @@ struct ncclReg {
// general ipc reg
struct ncclPeerRegIpcAddr regIpcAddrs;
struct ncclIpcRegInfo* ipcInfos[NCCL_MAX_LOCAL_RANKS];
// symmetric reg
void* baseSymPtr;
size_t symSize;
int winFlags;
};
struct ncclRegCache {
@@ -60,10 +70,14 @@ struct ncclRegCache {
uintptr_t pageSize;
};
struct ncclWindow {
struct ncclReg* handle;
};
ncclResult_t ncclRegCleanup(struct ncclComm* comm);
ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** reg);
ncclResult_t ncclCommGraphRegister(const ncclComm_t comm, void* buff, size_t size, void** handle);
ncclResult_t ncclCommGraphDeregister(const ncclComm_t comm, struct ncclReg *handle);
ncclResult_t ncclRegLocalIsValid(struct ncclReg *reg, bool *isValid);
ncclResult_t ncclCommSymmetricRegisterInternal(struct ncclComm* comm, void* buff, size_t baseSize, size_t alignment, CUmemGenericAllocationHandle memHandle, struct ncclReg* regHandle);
#endif
+33
Просмотреть файл
@@ -0,0 +1,33 @@
#ifndef NCCL_REGISTER_INLINE_H_
#define NCCL_REGISTER_INLINE_H_
#include "comm.h"
#include "register.h"
static inline ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** outReg) {
struct ncclRegCache* cache = &comm->regCache;
*outReg = NULL;
for (int slot=0; /*true*/; slot++) {
if (slot == cache->population) return ncclSuccess;
struct ncclReg *reg = cache->slots[slot];
if ((uintptr_t)data < reg->begAddr) return ncclSuccess;
if ((uintptr_t)data + size <= reg->endAddr) {
*outReg = reg;
return ncclSuccess;
}
}
}
static inline ncclResult_t ncclRegFindSymmetric(struct ncclComm* comm, const void* data, size_t size, void** symPtr, struct ncclReg** outReg) {
struct ncclReg* regRecord = NULL;
*symPtr = NULL;
*outReg = NULL;
NCCLCHECK(ncclRegFind(comm, data, size, &regRecord));
if (regRecord && regRecord->baseSymPtr) {
*symPtr = (void*)((uintptr_t)regRecord->baseSymPtr + (uintptr_t)data - (uintptr_t)regRecord->begAddr);
*outReg = regRecord;
}
return ncclSuccess;
}
#endif
+4 -2
Просмотреть файл
@@ -69,8 +69,10 @@ struct ncclSocket {
const char *ncclSocketToString(const union ncclSocketAddress *addr, char *buf, const int numericHostForm = 1);
ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char* ip_port_pair);
int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAddrs, union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int maxIfs);
int ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs);
ncclResult_t ncclFindInterfaceMatchSubnet(char* ifName, union ncclSocketAddress* localAddr,
union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int* found);
ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs,
int* nIfs);
// Initialize a socket
ncclResult_t ncclSocketInit(struct ncclSocket* sock, const union ncclSocketAddress* addr = NULL, uint64_t magic = NCCL_SOCKET_MAGIC, enum ncclSocketType type = ncclSocketTypeUnknown, volatile uint32_t* abortFlag = NULL, int asyncFlag = 0, int customRetry = 0);
+90
Просмотреть файл
@@ -0,0 +1,90 @@
#ifndef NCCL_DEVICE_SYMMETRIC_H_
#define NCCL_DEVICE_SYMMETRIC_H_
#include "nccl.h"
#include "nccl_common.h"
#include "bitops.h"
constexpr int ncclSymMaxBlocks = 64;
constexpr int ncclSymMaxThreads = 512;
constexpr int ncclSymLLMaxEltSize = 64;
constexpr __host__ __device__ int ncclSymLLMaxSlots(int eltSize = ncclSymLLMaxEltSize) {
return ncclSymMaxThreads*ncclSymLLMaxEltSize/eltSize;
}
constexpr __host__ __device__ int ncclSymLLEpochSize(int nRanks) {
return /*LL Overhead*/2 * maxval(ncclSymMaxThreads*nRanks*8, ncclSymLLMaxSlots(ncclSymLLMaxEltSize)*ncclSymLLMaxEltSize);
}
struct alignas(16) ncclSymDevBase {
uint32_t llEpoch[ncclSymMaxBlocks];
uint32_t barEpochMc[ncclSymMaxBlocks], barEpochUc[ncclSymMaxBlocks];
uint32_t barInboxMc[ncclSymMaxBlocks];
uint32_t barInboxPerPeer[];
static constexpr size_t size(int nRanks) {
return sizeof(ncclSymDevBase) +
alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16) +
ncclSymMaxBlocks * /*epochs=*/2 * ncclSymLLEpochSize(nRanks);
}
};
static __device__ uint4* ncclSymDevBase_getLLBuf(struct ncclSymDevBase* base, int nRanks, int block, uint32_t epoch) {
// Get pointer to buffer trailing the header struct.
char* ans = (char*)(base + 1);
// Skip over barInboxPerPeer[]
ans += alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16);
// Skip to our block
int epochSize = ncclSymLLEpochSize(nRanks);
ans += block * /*epochs=*/2 * epochSize;
ans += (epoch & 1)*epochSize;
return (uint4*)ans;
}
struct ncclSymDevComm {
ncclSymDevBase* base;
ncclSymDevBase* baseMc;
uint32_t stride4G;
int nRanks, rank;
uint32_t nRanks_rcp32; // idivRcp32(nRanks)
};
struct alignas(16) ncclSymDevArgs {
struct ncclSymDevComm comm;
int rootRank;
uint64_t redOpArg; // must be collectively uniform
size_t nElts;
char* input;
char* output;
};
enum ncclSymKernelId {
ncclSymKernelId_AllReduce_AGxLL_R,
ncclSymKernelId_AllReduce_AGxLLMC_R,
ncclSymKernelId_AllReduce_RSxLD_AGxST,
ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC,
ncclSymKernelId_AllGather_LL,
ncclSymKernelId_AllGather_LLMC,
ncclSymKernelId_AllGather_ST,
ncclSymKernelId_AllGather_STMC,
ncclSymKernelId_ReduceScatter_LL,
ncclSymKernelId_ReduceScatter_LD,
ncclSymKernelId_ReduceScatter_LDMC,
ncclSymKernelId_Count
};
bool ncclSymImplemented(ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
ncclResult_t ncclSymPickKernel(struct ncclComm* comm, ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts, float* estTimeUs, ncclSymKernelId* kernelId, int* nBlocks, int* nWarps);
// Generated by src/device/symmetric/generate.py
extern int const ncclSymKernelCount;
extern void* const ncclSymKernelList[];
void* ncclSymGetKernelPtr(ncclSymKernelId kernelId, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
const char* ncclSymKernelIdToString(int kernelId);
#endif
+17 -1
Просмотреть файл
@@ -22,6 +22,7 @@
#include "proxy.h"
#include "comm.h"
#include "bootstrap.h"
extern struct ncclTransport p2pTransport;
extern struct ncclTransport shmTransport;
@@ -46,6 +47,7 @@ struct ncclPeerInfo {
int64_t busId;
struct ncclComm* comm;
int cudaCompCap;
size_t totalGlobalMem;
// MNNVL support
nvmlGpuFabricInfoV_t fabricInfo;
int cuMemSupport;
@@ -53,6 +55,8 @@ struct ncclPeerInfo {
};
#define CONNECT_SIZE 256
#define NCCL_MAX_PAGE_SIZE (512L * 1024L * 1024L)
#define NCCL_REC_PAGE_SIZE (2L * 1024L * 1024L)
struct ncclConnect {
char data[CONNECT_SIZE];
};
@@ -80,6 +84,7 @@ struct ncclNvlsSharedRes {
char* ucBuff; // Unicast NVLS buffer address
char* ucCredit; // Unicast NVLS credit address
int nChannels;
int nHeads;
struct ncclShmemCollBuff nvlsShmem;
void *nvlsShmemHandle;
};
@@ -119,7 +124,8 @@ struct ncclTransport {
ncclResult_t ncclTransportP2pConnect(struct ncclComm* comm, int channelId, int nrecv, int* peerRecv, int nsend, int* peerSend, int connIndex);
ncclResult_t ncclTransportP2pSetup(struct ncclComm* comm, struct ncclTopoGraph* graph, int connIndex);
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* intraNodeP2pSupport, bool* directMode);
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* isAllDirectP2p, bool* directMode);
ncclResult_t ncclTransportIsAllDirectP2p(struct ncclComm* comm, int* isAllDirectP2p);
ncclResult_t ncclNvlsInit(struct ncclComm* comm);
ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent);
@@ -154,5 +160,15 @@ ncclResult_t ncclRegisterP2pIpcBuffer(struct ncclComm* comm, void* userbuff, siz
ncclResult_t ncclRegisterP2pNetBuffer(struct ncclComm* comm, void* userbuff, size_t size, struct ncclConnector* conn, int* regFlag, void** handle, struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue);
ncclResult_t ncclRegisterCollBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
ncclResult_t ncclRegisterCollNvlsBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels);
ncclResult_t ncclIpcSymmetricInit(struct ncclComm* comm);
ncclResult_t ncclIpcSymmetricMap(struct ncclComm* comm, size_t offset, size_t size, CUmemGenericAllocationHandle memHandle, void** symPtr);
ncclResult_t ncclIpcSymmetricFree(struct ncclComm* comm, size_t size, void* symPtr);
ncclResult_t ncclIpcSymmetricFinalize(struct ncclComm* comm);
ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm);
ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr);
ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr);
ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm);
#endif
+6
Просмотреть файл
@@ -43,6 +43,12 @@ static long log2i(long n) {
return log2Down(n);
}
// Comparator function for qsort/bsearch to compare integers
static int compareInts(const void *a, const void *b) {
int ia = *(const int*)a, ib = *(const int*)b;
return (ia > ib) - (ia < ib);
}
inline uint64_t clockNano() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
+260 -210
Просмотреть файл
@@ -18,6 +18,7 @@
#include "argcheck.h"
#include "tuner.h"
#include "ras.h"
#include "profiler.h"
#include "mnnvl.h"
#include <fcntl.h>
#include <string.h>
@@ -29,6 +30,7 @@
#include <unistd.h>
#include "param.h"
#include "nvtx_payload_schemas.h"
#include "utils.h"
#define STR2(v) #v
#define STR(v) STR2(v)
@@ -48,6 +50,10 @@ NCCL_PARAM(GroupCudaStream, "GROUP_CUDA_STREAM", NCCL_GROUP_CUDA_STREAM);
NCCL_PARAM(CheckPointers, "CHECK_POINTERS", 0);
NCCL_PARAM(CommBlocking, "COMM_BLOCKING", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(RuntimeConnect, "RUNTIME_CONNECT", 1);
NCCL_PARAM(WinEnable, "WIN_ENABLE", 1);
NCCL_PARAM(CollnetEnable, "COLLNET_ENABLE", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(CtaPolicy, "CTA_POLICY", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(NvlsChannels, "NVLS_NCHANNELS", NCCL_CONFIG_UNDEF_INT);
static ncclResult_t commReclaim(ncclComm_t comm);
@@ -174,6 +180,10 @@ static ncclResult_t commFree(ncclComm_t comm) {
if (comm == NULL)
return ncclSuccess;
if (comm->symmetricSupport && comm->symDevComm.base) {
NCCLCHECK(ncclCommSymmetricFreeInternal(comm, comm->baseUCSymPtr + comm->rank * comm->baseStride));
}
NCCLCHECK(ncclRasCommFini(comm));
/* in commReclaim, we have guaranteed only last rank which calls ncclCommDestroy() will
@@ -253,15 +263,16 @@ static ncclResult_t commFree(ncclComm_t comm) {
NCCLCHECK(ncclRegCleanup(comm));
if (comm->symmetricSupport) {
NCCLCHECK(ncclNvlsSymmetricFinalize(comm));
NCCLCHECK(ncclIpcSymmetricFinalize(comm));
}
INFO(NCCL_INIT,"comm %p rank %d nranks %d cudaDev %d busId %lx - %s COMPLETE", comm, comm->rank, comm->nRanks, comm->cudaDev, comm->busId, abort ? "Abort" : "Destroy");
commPoison(comm); // poison comm before free to avoid comm reuse.
NCCLCHECK(ncclProfilerPluginFinalize(comm));
NCCLCHECK(ncclNetFinalize(comm));
NCCLCHECK(ncclNetPluginUnload(comm));
ncclCudaContextDrop(comm->context);
free(comm);
return ncclSuccess;
@@ -271,7 +282,7 @@ NCCL_PARAM(DisableGraphHelper, "GRAPH_HELPER_DISABLE", 0);
// GDRCOPY support: FIFO_ENABLE when enabled locates a workFifo in CUDA memory
NCCL_PARAM(GdrCopyFifoEnable, "GDRCOPY_FIFO_ENABLE", 1);
#define NCCL_WORK_FIFO_BYTES_DEFAULT (1<<20)
NCCL_PARAM(WorkFifoBytes, "WORK_FIFO_BYTES", -1);
NCCL_PARAM(WorkFifoBytes, "WORK_FIFO_BYTES", NCCL_WORK_FIFO_BYTES_DEFAULT);
NCCL_PARAM(WorkArgsBytes, "WORK_ARGS_BYTES", INT64_MAX);
enum ncclLaunchMode ncclParamLaunchMode;
@@ -331,12 +342,10 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
comm->rank = rank;
comm->nRanks = ndev;
NCCLCHECK(ncclNetPluginLoad(comm));
NCCLCHECK(ncclNetInit(comm));
NCCLCHECK(ncclProfilerPluginInit(comm));
INFO(NCCL_INIT, "Using network %s", comm->ncclNet->name);
if (parent && parent->config.splitShare) {
if (parent && parent->shareResources) {
if (parent->ncclNet != comm->ncclNet) {
WARN("Split shares resources, but parent comm netName %s is different from child comm netName %s", parent->ncclNet->name, comm->ncclNet->name);
return ncclInvalidUsage;
@@ -361,13 +370,14 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
comm->checkPointers = ncclParamCheckPointers() == 1 ? true : false;
comm->dmaBufSupport = (dmaBufSupported(comm) == ncclSuccess) ? true : false;
comm->collNetSupport = 0;
memset(comm->collNetSupportMatrix, 0, sizeof(comm->collNetSupportMatrix));
ncclMemoryPoolConstruct(&comm->memPool_ncclKernelPlan);
ncclMemoryPoolConstruct(&comm->memPool_ncclProxyOp);
comm->groupNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < ncclGroupTaskTypeNum; i++) {
comm->groupNext[i] = reinterpret_cast<struct ncclComm*>(0x1);
}
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
static_assert(MAXCHANNELS <= sizeof(*comm->connectSend)*8, "comm->connectSend must have enough bits for all channels");
@@ -378,7 +388,7 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
// Mark channels as non initialized.
for (int c=0; c < MAXCHANNELS; c++) comm->channels[c].id = -1;
if (parent == NULL || !parent->config.splitShare) {
if (parent == NULL || !parent->shareResources) {
struct ncclSharedResources* sharedRes = NULL;
NCCLCHECK(ncclCalloc(&sharedRes, 1));
/* most of attributes are assigned later in initTransportsRank(). */
@@ -432,6 +442,7 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
bool ccEnable;
cudaStream_t deviceStream;
memset(&tmpCommAndChans, '\0', sizeof(tmpCommAndChans));
NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->deviceStream, /*concurrent=*/false, &deviceStream), ret, fail);
NCCLCHECKGOTO(ncclCudaCallocAsync(&devCommAndChans, 1, deviceStream), ret, fail);
ncclCommPushCudaFree(comm, devCommAndChans);
@@ -458,22 +469,12 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
if (ccEnable) {
comm->workFifoBytes = 0;
} else {
int64_t workFifoBytesParam = ncclParamWorkFifoBytes();
if (workFifoBytesParam == -1) {
if (comm->MNNVL && (comm->compCap >= 100)) {
// WAR: Disable work fifo for Blackwell all2all hang issue on MNNVL
INFO(NCCL_INIT, "Disabling work fifo");
comm->workFifoBytes = 0;
} else {
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
} else {
if (0 != (workFifoBytesParam & (workFifoBytesParam-1))) {
WARN("NCCL_WORK_FIFO_BYTES=%ld is being ignored because it is not a power of 2.", workFifoBytesParam);
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
comm->workFifoBytes = std::min<uint64_t>(workFifoBytesParam, 1ul<<30);
comm->workFifoBytes = ncclParamWorkFifoBytes();
if (0 != (comm->workFifoBytes & (comm->workFifoBytes-1))) {
WARN("NCCL_WORK_FIFO_BYTES=%d is being ignored because it is not a power of 2.", comm->workFifoBytes);
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
comm->workFifoBytes = std::min(comm->workFifoBytes, 1u<<30);
}
if (comm->rank == 0) {
@@ -492,11 +493,9 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
comm->workFifoBufDev = comm->workFifoBuf;
}
NCCLCHECKGOTO(ncclCudaHostCalloc(&comm->workFifoConsumed, MAXCHANNELS), ret, fail);
ncclCommPushCudaHostFree(comm, comm->workFifoConsumed);
comm->workFifoProduced = 0;
comm->workFifoConsumedLeast = 0;
tmpCommAndChans.comm.workConsumed = comm->workFifoConsumed;
comm->workFifoProducedLastRecorded = 0;
comm->workFifoConsumed = 0;
// Alloc profiler counters for the kernel
NCCLCHECKGOTO(ncclCudaHostCalloc(&comm->profiler.workStarted, MAXCHANNELS), ret, fail);
@@ -549,6 +548,7 @@ NCCL_PARAM(MNNVLUUID, "MNNVL_UUID", -1);
NCCL_PARAM(MNNVLCliqueId, "MNNVL_CLIQUE_ID", -1);
static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, uint64_t commHash) {
cudaDeviceProp prop;
info->rank = comm->rank;
info->cudaDev = comm->cudaDev;
info->nvmlDev = comm->nvmlDev;
@@ -556,6 +556,8 @@ static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, u
info->hostHash=getHostHash()+commHash;
info->pidHash=getPidHash()+commHash;
info->cuMemSupport = ncclCuMemEnable();
CUDACHECK(cudaGetDeviceProperties(&prop, comm->cudaDev));
info->totalGlobalMem = ROUNDUP(prop.totalGlobalMem, (1L << 32));
// Get the device MAJOR:MINOR of /dev/shm so we can use that
// information to decide whether we can use SHM for inter-process
@@ -700,6 +702,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
struct ncclTopoRanks topoRanks;
int cpuArch;
int cpuVendor;
int localRanks;
};
int nChannelsOrig;
@@ -711,12 +714,14 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
struct ncclProxyConnector proxyConn;
int* pxnPeers = NULL;
int *topParentLocalRanks = NULL;
int p2pLevel = -1;
timers[TIMER_INIT_ALLGATHER] = clockNano();
// AllGather1 - begin
NCCLCHECKGOTO(ncclCalloc(&comm->peerInfo, nranks+1), ret, fail); // Extra rank to represent CollNet root
NCCLCHECKGOTO(fillInfo(comm, comm->peerInfo+rank, comm->commHash), ret, fail);
NCCLCHECKGOTO(bootstrapAllGather(comm->bootstrap, comm->peerInfo, sizeof(struct ncclPeerInfo)), ret, fail);
__atomic_store_n(&comm->peerInfoValid, true, __ATOMIC_RELEASE);
comm->cuMemSupport = 1;
for (int i = 0; i < nranks; i++) {
@@ -738,7 +743,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
timers[TIMER_INIT_ALLGATHER] = clockNano() - timers[TIMER_INIT_ALLGATHER];
// Check for MNNVL support
if ((nNodes > 1 && ncclParamMNNVLEnable() != 0) || ncclParamMNNVLEnable() == 1) {
NCCLCHECKGOTO(ncclGetUserP2pLevel(&p2pLevel), ret, fail);
if ((nNodes > 1 && ncclParamMNNVLEnable() != 0 && p2pLevel != 0) || ncclParamMNNVLEnable() == 1) {
NCCLCHECKGOTO(ncclMnnvlCheck(comm), ret, fail);
}
@@ -829,14 +835,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
// Determine local CollNet support
if (collNetSupport(comm)) {
const char *collNetEnable = ncclGetEnv("NCCL_COLLNET_ENABLE");
if (collNetEnable != NULL) {
INFO(NCCL_ALL, "NCCL_COLLNET_ENABLE set by environment to %s.", collNetEnable);
if (strcmp(collNetEnable, "1") == 0) {
comm->collNetSupport = 1;
}
}
if (!collNetSupport(comm)) {
comm->config.collnetEnable = 0;
}
// Determine local Nvls support
@@ -873,7 +873,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
collNetDirectGraph->collNet = 1;
collNetDirectGraph->minChannels = 1;
collNetDirectGraph->maxChannels = MAXCHANNELS;
if (comm->collNetSupport) {
if (comm->config.collnetEnable) {
NCCLCHECKGOTO(ncclTopoCompute(comm->topo, collNetChainGraph), ret, fail);
NCCLCHECKGOTO(ncclTopoPrintGraph(comm->topo, collNetChainGraph), ret, fail);
NCCLCHECKGOTO(ncclTopoCompute(comm->topo, collNetDirectGraph), ret, fail);
@@ -1014,7 +1014,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
comm->maxTreePattern = std::max(comm->maxTreePattern, allGather3Data[i].graphInfo[NCCL_ALGO_TREE].pattern);
}
if (graphs[NCCL_ALGO_COLLNET_CHAIN]->nChannels == 0) comm->collNetSupport = 0;
if (graphs[NCCL_ALGO_COLLNET_CHAIN]->nChannels == 0) comm->config.collnetEnable = 0;
if (graphs[NCCL_ALGO_NVLS]->nChannels == 0) comm->nvlsSupport = comm->nvlsChannels = 0;
comm->nChannels = treeGraph->nChannels = ringGraph->nChannels = std::min(treeGraph->nChannels, ringGraph->nChannels);
@@ -1025,11 +1025,11 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
// Determine CollNet support after all-gather now that we know nNodes and each node localRanks
if (comm->collNetSupport == 1) {
if (comm->config.collnetEnable == 1) {
int collNetNodeThreshold = ncclParamCollNetNodeThreshold();
if (comm->nNodes < collNetNodeThreshold) {
INFO(NCCL_INIT, "Communicator has %d nodes which is less than CollNet node threshold %d, disabling CollNet", comm->nNodes, collNetNodeThreshold);
comm->collNetSupport = 0;
comm->config.collnetEnable = 0;
}
}
NCCLCHECK(ncclTopoPathAllNVLink(comm->topo, &comm->isAllNvlink));
@@ -1075,9 +1075,12 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
comm->topParentLocalRanks = topParentLocalRanks;
NCCLCHECKGOTO(ncclTransportCheckP2pType(comm, &comm->intraNodeP2pSupport, &comm->directMode), ret, fail);
// Profiler plugin context has to be initialized before proxy thread
NCCLCHECK(ncclProfilerPluginInit(comm));
NCCLCHECKGOTO(ncclTransportCheckP2pType(comm, &comm->isAllDirectP2p, &comm->directMode), ret, fail);
// Launch proxy service thread, after this, the proxy calls can be used.
if (parent && parent->config.splitShare) {
if (parent && parent->shareResources) {
comm->proxyState = parent->sharedRes->proxyState;
ncclAtomicRefCountIncrement(&parent->sharedRes->proxyState->refCount);
} else {
@@ -1147,10 +1150,10 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
for (int c=0; c<comm->nChannels; c++) {
NCCLCHECKGOTO(setupChannel(comm, c, rank, nranks, rings+c*nranks), ret, fail);
}
// Setup NVLS
// Attempt to setup NVLS, may silently fail and disable NVLS
NCCLCHECKGOTO(ncclNvlsSetup(comm, parent), ret, fail);
// Check if we can setup CollNet
if (comm->collNetSupport > 0) ncclCollNetSetup(comm, parent, graphs);
if (comm->config.collnetEnable) ncclCollNetSetup(comm, parent, graphs);
} else {
for (int c=0; c<comm->nChannels; c++) {
NCCLCHECKGOTO(setupChannel(comm, c, rank, nranks, rings+c*nranks), ret, fail);
@@ -1163,7 +1166,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
// Connect PAT only for communicators with 1 GPU per node
if (comm->maxLocalRanks == 1) NCCLCHECKGOTO(ncclTransportPatConnect(comm), ret, fail);
// Setup NVLS
// Attempt to setup NVLS, may silently fail and disable NVLS
NCCLCHECKGOTO(ncclNvlsSetup(comm, parent), ret, fail);
NCCLCHECKGOTO(ncclNvlsBufferSetup(comm), ret, fail);
@@ -1171,7 +1174,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
NCCLCHECKGOTO(ncclNvlsTreeConnect(comm), ret, fail);
// Check if we can setup CollNet
if (comm->collNetSupport > 0) {
if (comm->config.collnetEnable) {
ncclCollNetSetup(comm, parent, graphs);
NCCLCHECKGOTO(ncclCollNetChainBufferSetup(comm), ret, fail);
if (comm->maxLocalRanks <= NCCL_MAX_DIRECT_ARITY+1) {
@@ -1244,9 +1247,13 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
}
comm->symmetricSupport = comm->isAllDirectP2p && comm->nNodes == 1 && ncclParamWinEnable() && ncclCuMemEnable();
comm->baseStride = 0;
// Call devCommSetup before the last barrier, making sure we don't have a thread running in front and starting to
// launch NCCL kernels before all cuda mem allocation is complete. That could cause a deadlock.
NCCLCHECKGOTO(devCommSetup(comm), ret, fail);
timers[TIMER_INIT_CONNECT] = clockNano() - timers[TIMER_INIT_CONNECT];
/* Local intra-node barrier */
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
@@ -1260,7 +1267,7 @@ exit:
/* If split resource is shared, we are not able to unlink the proxy ops pool here since the child comm can
* attach the proxy ops pool of parent at any time; otherwise, unlink it here to make sure the pool will be
* properly cleaned up. */
if (comm->sharedRes->owner == comm && !comm->config.splitShare && ret == ncclSuccess && !ncclCuMemEnable()) ncclProxyShmUnlink(comm);
if (comm->sharedRes->owner == comm && !comm->shareResources && ret == ncclSuccess && !ncclCuMemEnable()) ncclProxyShmUnlink(comm);
free(allTopoRanks);
free(nodesTreePatterns);
free(nodesFirstRank);
@@ -1293,6 +1300,9 @@ struct ncclCommInitRankAsyncJob {
struct ncclComm* parent;
int color, key;
int splitCount;
// For Shrink
int* excludeRanksList;
int excludeRanksCount;
// name of the function calling
char funcName[NCCL_COMMINIT_FUNCNAME_LEN];
};
@@ -1303,6 +1313,7 @@ struct ncclCommFinalizeAsyncJob {
};
NCCL_PARAM(CommSplitShareResources, "COMM_SPLIT_SHARE_RESOURCES", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(CommShrinkShareResources, "COMM_SHRINK_SHARE_RESOURCES", NCCL_CONFIG_UNDEF_INT);
typedef struct{
int key;
@@ -1350,6 +1361,21 @@ fail:
goto exit;
}
static ncclResult_t getParentRanks(int parentRanks, int parentRank, int* excludeRanksList, int excludeRanksCount, int* nRanksRet, int* myRankRet, int* parentRanksRet) {
int count = 0, j = 0;
for (int i = 0; i < parentRanks; i++) {
// we assume excludeRanksList is sorted
if (j < excludeRanksCount && excludeRanksList[j] == i) {
j++;
continue;
}
if (i == parentRank) *myRankRet = count;
parentRanksRet[count++] = i;
}
*nRanksRet = parentRanks - excludeRanksCount;
return ncclSuccess;
}
static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
struct ncclCommInitRankAsyncJob* job = (struct ncclCommInitRankAsyncJob*)job_;
ncclComm_t comm = job->comm;
@@ -1383,9 +1409,13 @@ static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
if (job->parent) {
NCCLCHECKGOTO(ncclCalloc(&parentRanks, job->parent->nRanks), res, fail);
NCCLCHECKGOTO(commGetSplitInfo(comm, job->parent, job->color, job->key, &job->nranks, &job->myrank, parentRanks), res, fail);
// Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
if (job->excludeRanksCount) {
NCCLCHECKGOTO(getParentRanks(job->parent->nRanks, job->parent->rank, job->excludeRanksList, job->excludeRanksCount, &job->nranks, &job->myrank, parentRanks), res, fail);
} else {
NCCLCHECKGOTO(commGetSplitInfo(comm, job->parent, job->color, job->key, &job->nranks, &job->myrank, parentRanks), res, fail);
// Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
}
timers[TIMER_INIT_ALLOC] = clockNano();
NCCLCHECKGOTO(commAlloc(comm, job->parent, job->nranks, job->myrank), res, fail);
timers[TIMER_INIT_ALLOC] = clockNano() - timers[TIMER_INIT_ALLOC];
@@ -1477,6 +1507,10 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
int minCTAsEnv;
int maxCTAsEnv;
int splitShareEnv;
int collnetEnableEnv;
int ctaPolicyEnv;
int shrinkShareEnv;
int nvlsCTAsEnv;
/* override configuration from env variable. */
blockingEnv = ncclParamCommBlocking();
@@ -1522,6 +1556,25 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
if (splitShareEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.splitShare = splitShareEnv;
}
shrinkShareEnv = ncclParamCommShrinkShareResources();
if (shrinkShareEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.shrinkShare = shrinkShareEnv;
}
collnetEnableEnv = ncclParamCollnetEnable();
if (collnetEnableEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.collnetEnable = collnetEnableEnv;
}
ctaPolicyEnv = ncclParamCtaPolicy();
if (ctaPolicyEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.CTAPolicy = ctaPolicyEnv;
}
nvlsCTAsEnv = ncclParamNvlsChannels();
if (nvlsCTAsEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.nvlsCTAs = nvlsCTAsEnv;
}
/* cap channels if needed */
if (comm->config.minCTAs > MAXCHANNELS) {
@@ -1544,6 +1597,20 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
comm->config.splitShare = 0;
}
if (comm->config.collnetEnable != 1 && comm->config.collnetEnable != 0) {
INFO(NCCL_ENV, "collnetEnable %d is not a valid value 0/1, set it to 0", comm->config.collnetEnable);
comm->config.collnetEnable = 0;
}
if (comm->config.CTAPolicy < NCCL_CTA_POLICY_DEFAULT || comm->config.CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY) {
INFO(NCCL_ENV, "CTAPolicy %d is not a valid value, set it to %d", comm->config.CTAPolicy, NCCL_CTA_POLICY_DEFAULT);
comm->config.CTAPolicy = NCCL_CTA_POLICY_DEFAULT;
}
if (comm->config.nvlsCTAs != NCCL_CONFIG_UNDEF_INT && comm->config.nvlsCTAs <= 0) {
INFO(NCCL_ENV, "nvlsCTAs %d is not a valid value, NCCL will decide the default value automatically", comm->config.nvlsCTAs);
comm->config.nvlsCTAs = NCCL_CONFIG_UNDEF_INT;
}
return ret;
}
@@ -1584,6 +1651,17 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
internalConfigPtr->maxCTAs = defaultConfig.maxCTAs;
internalConfigPtr->netName = defaultConfig.netName;
}
if (internalConfigPtr->version < NCCL_VERSION(2, 25, 0)) {
internalConfigPtr->trafficClass = defaultConfig.trafficClass;
}
if (internalConfigPtr->version < NCCL_VERSION(2, 27, 0)) {
internalConfigPtr->collnetEnable = defaultConfig.collnetEnable;
internalConfigPtr->CTAPolicy = defaultConfig.CTAPolicy;
internalConfigPtr->shrinkShare = defaultConfig.shrinkShare;
internalConfigPtr->nvlsCTAs = defaultConfig.nvlsCTAs;
}
}
/* check input config attributes, -1 means user-undefined and we should use default value from NCCL. */
@@ -1615,6 +1693,31 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
goto fail;
}
if (internalConfigPtr->collnetEnable != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->collnetEnable < 0 || internalConfigPtr->collnetEnable > 1)) {
WARN("Invalid config collnetEnable attribute value %d", internalConfigPtr->collnetEnable);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->CTAPolicy != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->CTAPolicy < NCCL_CTA_POLICY_DEFAULT ||
internalConfigPtr->CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY)) {
WARN("Invalid config policy attribute value %d", internalConfigPtr->CTAPolicy);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->shrinkShare != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->shrinkShare != 0 && internalConfigPtr->shrinkShare != 1) {
WARN("Invalid config shrinkShare attribute value %d", internalConfigPtr->shrinkShare);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->nvlsCTAs != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->nvlsCTAs <= 0) {
WARN("Invalid config nvlsCTAs attribute value %d", internalConfigPtr->nvlsCTAs);
ret = ncclInvalidArgument;
goto fail;
}
/* default config value can be tuned on different platform. */
NCCL_CONFIG_DEFAULT(internalConfigPtr, blocking, NCCL_CONFIG_UNDEF_INT, 1, "Blocking", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, cgaClusterSize, NCCL_CONFIG_UNDEF_INT, 4, "CGA cluster size", "%d");
@@ -1623,6 +1726,11 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
NCCL_CONFIG_DEFAULT(internalConfigPtr, netName, NCCL_CONFIG_UNDEF_PTR, NULL, "Net name", "%s");
NCCL_CONFIG_DEFAULT(internalConfigPtr, splitShare, NCCL_CONFIG_UNDEF_INT, 0, "Split share", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, trafficClass, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "Traffic class", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, commName, NCCL_CONFIG_UNDEF_PTR, NULL, "Comm name", "%s");
NCCL_CONFIG_DEFAULT(internalConfigPtr, collnetEnable, NCCL_CONFIG_UNDEF_INT, 0, "Collnet enable", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, CTAPolicy, NCCL_CONFIG_UNDEF_INT, NCCL_CTA_POLICY_DEFAULT, "CTA policy flags", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, shrinkShare, NCCL_CONFIG_UNDEF_INT, 0, "shrinkShare", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, nvlsCTAs, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "nvlsCTAs", "%d");
/* assign config to communicator */
comm->config.blocking = internalConfigPtr->blocking;
@@ -1632,7 +1740,11 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
comm->config.netName = internalConfigPtr->netName;
comm->config.splitShare = internalConfigPtr->splitShare;
comm->config.trafficClass = internalConfigPtr->trafficClass;
comm->config.commName = internalConfigPtr->commName;
comm->config.collnetEnable = internalConfigPtr->collnetEnable;
comm->config.CTAPolicy = internalConfigPtr->CTAPolicy;
comm->config.shrinkShare = internalConfigPtr->shrinkShare;
comm->config.nvlsCTAs = internalConfigPtr->nvlsCTAs;
NCCLCHECKGOTO(envConfigOverride(comm), ret, fail);
exit:
@@ -1909,7 +2021,7 @@ static ncclResult_t commDestroySync(struct ncclAsyncJob* job_) {
WARN("commDestroySync: comm %p rank %d sync deviceStream error %d\n", comm, comm->rank, ret);
}
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm), ret, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm, true), ret, fail);
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, false), ret, fail);
// And keep polling until all graphs referencing us die.
while (comm->localPersistentRefs != 0) {
@@ -2052,7 +2164,6 @@ ncclResult_t ncclCommDestroy(ncclComm_t comm) {
NVTX3_PAYLOAD(comm->commHash, nranks, rank, cudaDev));
TRACE(NCCL_INIT, "comm %p rank %d nRanks %d cudaDev %d busId %lx", comm, rank, nranks, cudaDev, comm->busId);
NCCLCHECK(ncclGroupStartInternal());
// Try and prevent a double free of the comm struct (user error)
if (comm->rank == -1 || comm->nRanks == -1 || comm->cudaDev == -1 || comm->busId == -1) {
WARN("comm %p has already been destroyed", comm);
@@ -2067,13 +2178,22 @@ ncclResult_t ncclCommDestroy(ncclComm_t comm) {
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, commReclaim, NULL, free, comm), res, fail);
exit:
ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
fail:
goto exit;
}
static ncclResult_t setCommAbortFlags(ncclComm_t comm, int value) {
// Set abort flags
if (comm->childAbortFlag != nullptr) {
__atomic_store_n(comm->childAbortFlag, value, __ATOMIC_RELEASE);
__atomic_store_n(comm->childAbortFlagDev, value, __ATOMIC_RELEASE);
}
__atomic_store_n(comm->abortFlag, value, __ATOMIC_RELEASE);
__atomic_store_n(comm->abortFlagDev, value, __ATOMIC_RELEASE);
return ncclSuccess;
}
NCCL_API(ncclResult_t, ncclCommAbort, ncclComm_t comm);
ncclResult_t ncclCommAbort(ncclComm_t comm) {
NVTX3_RANGE(NcclNvtxParamsCommAbort);
@@ -2081,14 +2201,8 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
if (comm == NULL) {
return ncclSuccess;
}
NCCLCHECK(ncclGroupStartInternal());
// Ask anything that might still be running on the device to quit
if (comm->childAbortFlag != nullptr) {
__atomic_store_n(comm->childAbortFlag, 1, __ATOMIC_RELEASE);
__atomic_store_n(comm->childAbortFlagDev, 1, __ATOMIC_RELEASE);
}
__atomic_store_n(comm->abortFlag, 1, __ATOMIC_RELEASE);
__atomic_store_n(comm->abortFlagDev, 1, __ATOMIC_RELEASE);
NCCLCHECK(setCommAbortFlags(comm,1));
comm->destroyFlag = 1;
/* init thread must be joined before we destroy the comm,
* and we should ignore the init error here. */
@@ -2109,38 +2223,51 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, commReclaim, NULL, free, comm), res, fail);
exit:
ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return ncclSuccess;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclCommSplit, ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config) {
static void childCommCleanupJob(void* job) {
struct ncclCommInitRankAsyncJob* initJob = (struct ncclCommInitRankAsyncJob*)job;
if (initJob->excludeRanksList) free(initJob->excludeRanksList);
free(job);
}
// initializing a child communicator (for both split and shrink)
static ncclResult_t ncclCommInitChildComm(ncclComm_t comm, ncclComm_t* newcomm, bool isShrink, int flags, int color, int key, int* excludeRanksList, int excludeRanksCount,
ncclConfig_t* config, const char* caller) {
struct ncclCommInitRankAsyncJob *job = NULL;
struct ncclComm* childComm = NCCL_COMM_NULL;
ncclResult_t res = ncclSuccess;
NVTX3_RANGE(NcclNvtxParamsCommSplit)
int oldDev;
CUDACHECK(cudaGetDevice(&oldDev));
NCCLCHECKGOTO(CommCheck(comm, caller, "comm"), res, exit);
NCCLCHECKGOTO(PtrCheck(newcomm, caller, "newcomm"), res, exit);
if (isShrink) {
NCCLCHECKGOTO(PtrCheck(excludeRanksList, caller, "excludeRanksList"), res, exit);
NCCLCHECKGOTO(excludeRanksCount > 0 ? ncclSuccess : ncclInvalidArgument, res, exit);
// excludeRanksList may not be sorted, need to sort it
qsort(excludeRanksList, excludeRanksCount, sizeof(int), compareInts);
// ranks in excludeRanksList should not call into this function
NCCLCHECKGOTO(bsearch(&comm->rank, excludeRanksList, excludeRanksCount, sizeof(int), compareInts) ? ncclInvalidArgument : ncclSuccess, res, exit);
}
NCCLCHECKGOTO(ncclCommEnsureReady(comm), res, exit);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), res, exit);
NCCLCHECK(ncclGroupStartInternal());
NCCLCHECKGOTO(CommCheck(comm, "CommSplit", "comm"), res, fail);
NCCLCHECKGOTO(PtrCheck(newcomm, "CommSplit", "newcomm"), res, fail);
NCCLCHECKGOTO(ncclCommEnsureReady(comm), res, fail);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), res, fail);
/* *newcomm should be NCCL_COMM_NULL until comm split fully complete. */
*newcomm = NCCL_COMM_NULL;
if (color == NCCL_SPLIT_NOCOLOR) {
if (!isShrink && color == NCCL_SPLIT_NOCOLOR) {
INFO(NCCL_INIT, "Rank %d has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator", comm->rank);
} else {
NCCLCHECKGOTO(ncclCalloc(&childComm, 1), res, fail);
childComm->startMagic = childComm->endMagic = NCCL_MAGIC;
if (comm->config.splitShare) {
// Set the shareResource field, this is used throughout the init and must be reset every time.
// If we shrink, we only reuse resources if we shrink in the default mode
comm->shareResources = isShrink ? (!(flags & NCCL_SHRINK_ABORT) && comm->config.shrinkShare) : comm->config.splitShare;
if (comm->shareResources) {
childComm->abortFlag = comm->abortFlag;
childComm->abortFlagDev = comm->abortFlagDev;
childComm->abortFlagRefCount = comm->abortFlagRefCount;
@@ -2161,38 +2288,39 @@ ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newc
NCCLCHECKGOTO(parseCommConfig(childComm, config), res, fail);
}
/* start with ncclInProgress and will be changed to ncclSuccess if init succeeds. */
childComm->initState = ncclInProgress;
/* start with ncclInternalError and will be changed to ncclSuccess if init succeeds. */
childComm->initState = ncclInternalError;
}
NCCLCHECKGOTO(ncclCalloc(&job, 1), res, fail);
job->comm = childComm;
job->newcomm = newcomm;
job->parent = comm;
job->splitCount = ++comm->splitCount;
job->color = color;
job->key = key;
if (excludeRanksList) {
// need to copy the list of ranks to exclude because the job is async
job->excludeRanksCount = excludeRanksCount;
NCCLCHECKGOTO(ncclCalloc(&job->excludeRanksList, excludeRanksCount), res, fail);
memcpy(job->excludeRanksList, excludeRanksList, excludeRanksCount * sizeof(int));
} else {
// each split has to lead to a unique comm, so increment the splitCount
job->splitCount = ++comm->splitCount;
job->excludeRanksList = NULL;
}
job->cudaDev = comm->cudaDev;
snprintf(job->funcName, NCCL_COMMINIT_FUNCNAME_LEN, "%s", __func__);
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, ncclCommInitRankFunc, NULL, free, comm), res, fail);
snprintf(job->funcName, NCCL_COMMINIT_FUNCNAME_LEN, "%s", caller);
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, ncclCommInitRankFunc, /*undo=*/NULL, /*destructor=*/childCommCleanupJob, comm), res, fail);
exit:
(void)cudaSetDevice(oldDev);
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
if (res == ncclSuccess && *newcomm) {
NVTX3_RANGE_ADD_PAYLOAD(CommSplit, NcclNvtxParamsCommSplitSchema,
NVTX3_PAYLOAD((*newcomm)->commHash, comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, color, key));
}
return res;
fail:
if (childComm) {
if (!comm->config.splitShare) {
free(childComm->abortFlag);
if (!comm->shareResources) {
if (childComm->abortFlag) free(childComm->abortFlag);
if (childComm->abortFlagDev) ncclCudaHostFree(childComm->abortFlagDev);
free(childComm->abortFlagRefCount);
if (childComm->abortFlagRefCount) free(childComm->abortFlagRefCount);
}
free(childComm);
}
@@ -2200,6 +2328,44 @@ fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclCommShrink, ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t* newcomm, ncclConfig_t* config, int shrinkFlags);
ncclResult_t ncclCommShrink(ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t* config, int shrinkFlags) {
NVTX3_RANGE(NcclNvtxParamsCommShrink)
ncclResult_t res = ncclSuccess;
NCCLCHECK(ncclGroupStartInternal());
// Handle error mode by setting abort flags and waiting for kernels to complete and unset the flags to avoid bootstrap issues
if (shrinkFlags & NCCL_SHRINK_ABORT) {
NCCLCHECKGOTO(setCommAbortFlags(comm, 1), res, exit);
NCCLCHECKGOTO(ncclStrongStreamSynchronize(&comm->sharedRes->deviceStream), res, exit);
NCCLCHECKGOTO(setCommAbortFlags(comm, 0), res, exit);
}
NCCLCHECKGOTO(ncclCommInitChildComm(comm, newcomm, /*isShrink=*/true, shrinkFlags, /*color=*/0, /*key=*/comm->rank, excludeRanksList, excludeRanksCount, config, __func__), res, exit);
if (*newcomm) NVTX3_RANGE_ADD_PAYLOAD(CommShrink, NcclNvtxParamsCommShrinkSchema, NVTX3_PAYLOAD(comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, excludeRanksCount));
exit:
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
}
NCCL_API(ncclResult_t, ncclCommSplit, ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config) {
NVTX3_RANGE(NcclNvtxParamsCommSplit)
ncclResult_t res = ncclSuccess;
NCCLCHECK(ncclGroupStartInternal());
NCCLCHECKGOTO(ncclCommInitChildComm(comm, newcomm, /*isShrink=*/false, /*shrink mode=*/NCCL_SHRINK_DEFAULT, color, key, NULL, 0, config, __func__), res, exit);
if (*newcomm)
NVTX3_RANGE_ADD_PAYLOAD(CommSplit, NcclNvtxParamsCommSplitSchema, NVTX3_PAYLOAD((*newcomm)->commHash, comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, color, key));
exit:
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
}
NCCL_API(const char*, ncclGetErrorString, ncclResult_t code);
const char* ncclGetErrorString(ncclResult_t code) {
switch (code) {
@@ -2277,119 +2443,3 @@ ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank) {
*rank = comm->rank;
return ncclSuccess;
}
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
#if CUDART_VERSION >= 12010
size_t memGran = 0;
CUdevice currentDev;
CUmemAllocationProp memprop = {};
CUmemAccessDesc accessDesc = {};
CUmemGenericAllocationHandle handle;
int cudaDev;
int flag;
int dcnt;
if (ptr == NULL || size == 0) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUDACHECK(cudaGetDevice(&cudaDev));
CUCHECK(cuDeviceGet(&currentDev, cudaDev));
if (ncclCuMemEnable()) {
size_t handleSize = size;
int requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
// Query device to see if FABRIC handle support is available
flag = 0;
(void) CUPFN(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, currentDev));
if (flag) requestedHandleTypes |= CU_MEM_HANDLE_TYPE_FABRIC;
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
memprop.location.id = currentDev;
// Query device to see if RDMA support is available
flag = 0;
CUCHECK(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, currentDev));
if (flag) memprop.allocFlags.gpuDirectRDMACapable = 1;
CUCHECK(cuMemGetAllocationGranularity(&memGran, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
CUDACHECK(cudaGetDeviceCount(&dcnt));
ALIGN_SIZE(handleSize, memGran);
if (requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) {
/* First try cuMemCreate() with FABRIC handle support and then remove if it fails */
CUresult err = CUPFN(cuMemCreate(&handle, handleSize, &memprop, 0));
if (err == CUDA_ERROR_NOT_PERMITTED || err == CUDA_ERROR_NOT_SUPPORTED) {
requestedHandleTypes &= ~CU_MEM_HANDLE_TYPE_FABRIC;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
} else {
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
/* Reserve a virtual address range */
CUCHECK(cuMemAddressReserve((CUdeviceptr*)ptr, handleSize, memGran, 0, 0));
/* Map the virtual address range to the physical allocation */
CUCHECK(cuMemMap((CUdeviceptr)*ptr, handleSize, 0, handle, 0));
/* Now allow RW access to the newly mapped memory */
for (int i = 0; i < dcnt; ++i) {
int p2p = 0;
if (i == cudaDev || ((cudaDeviceCanAccessPeer(&p2p, cudaDev, i) == cudaSuccess) && p2p)) {
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = i;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECK(cuMemSetAccess((CUdeviceptr)*ptr, handleSize, &accessDesc, 1));
}
if (0 == p2p && i != cudaDev) INFO(NCCL_ALLOC, "P2P not supported between GPU%d and GPU%d", cudaDev, i);
}
goto exit;
}
fallback:
#endif
// Coverity is right to complain that we may pass a NULL ptr to cudaMalloc. That's deliberate though:
// we want CUDA to return an error to the caller.
// coverity[var_deref_model]
CUDACHECKGOTO(cudaMalloc(ptr, size), ret, fail);
exit:
return ret;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
int saveDevice;
CUDACHECK(cudaGetDevice(&saveDevice));
#if CUDART_VERSION >= 12010
CUdevice ptrDev = 0;
if (ptr == NULL) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUCHECKGOTO(cuPointerGetAttribute((void*)&ptrDev, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr), ret, fail);
CUDACHECKGOTO(cudaSetDevice((int)ptrDev), ret, fail);
if (ncclCuMemEnable()) {
NCCLCHECKGOTO(ncclCuMemFree(ptr), ret, fail);
goto exit;
}
fallback:
#endif
CUDACHECKGOTO(cudaFree(ptr), ret, fail);
exit:
CUDACHECK(cudaSetDevice(saveDevice));
return ret;
fail:
goto exit;
}
+77 -68
Просмотреть файл
@@ -105,53 +105,53 @@ error:
#endif
}
#define DECLARE_CUDA_PFN(symbol) PFN_##symbol pfn_##symbol = nullptr
#define DECLARE_CUDA_PFN(symbol,version) PFN_##symbol##_v##version pfn_##symbol = nullptr
#if CUDART_VERSION >= 11030
/* CUDA Driver functions loaded with cuGetProcAddress for versioning */
DECLARE_CUDA_PFN(cuDeviceGet);
DECLARE_CUDA_PFN(cuDeviceGetAttribute);
DECLARE_CUDA_PFN(cuGetErrorString);
DECLARE_CUDA_PFN(cuGetErrorName);
DECLARE_CUDA_PFN(cuDeviceGet, 2000);
DECLARE_CUDA_PFN(cuDeviceGetAttribute, 2000);
DECLARE_CUDA_PFN(cuGetErrorString, 6000);
DECLARE_CUDA_PFN(cuGetErrorName, 6000);
/* enqueue.cc */
DECLARE_CUDA_PFN(cuMemGetAddressRange);
DECLARE_CUDA_PFN(cuLaunchKernel);
DECLARE_CUDA_PFN(cuMemGetAddressRange, 3020);
DECLARE_CUDA_PFN(cuLaunchKernel, 4000);
#if CUDA_VERSION >= 11080
DECLARE_CUDA_PFN(cuLaunchKernelEx);
DECLARE_CUDA_PFN(cuLaunchKernelEx, 11060);
#endif
/* proxy.cc */
DECLARE_CUDA_PFN(cuCtxCreate);
DECLARE_CUDA_PFN(cuCtxDestroy);
DECLARE_CUDA_PFN(cuCtxGetCurrent);
DECLARE_CUDA_PFN(cuCtxSetCurrent);
DECLARE_CUDA_PFN(cuCtxGetDevice);
DECLARE_CUDA_PFN(cuCtxCreate, 11040);
DECLARE_CUDA_PFN(cuCtxDestroy, 4000);
DECLARE_CUDA_PFN(cuCtxGetCurrent, 4000);
DECLARE_CUDA_PFN(cuCtxSetCurrent, 4000);
DECLARE_CUDA_PFN(cuCtxGetDevice, 2000);
/* cuMem API support */
DECLARE_CUDA_PFN(cuMemAddressReserve);
DECLARE_CUDA_PFN(cuMemAddressFree);
DECLARE_CUDA_PFN(cuMemCreate);
DECLARE_CUDA_PFN(cuMemGetAllocationGranularity);
DECLARE_CUDA_PFN(cuMemExportToShareableHandle);
DECLARE_CUDA_PFN(cuMemImportFromShareableHandle);
DECLARE_CUDA_PFN(cuMemMap);
DECLARE_CUDA_PFN(cuMemRelease);
DECLARE_CUDA_PFN(cuMemRetainAllocationHandle);
DECLARE_CUDA_PFN(cuMemSetAccess);
DECLARE_CUDA_PFN(cuMemUnmap);
DECLARE_CUDA_PFN(cuMemGetAllocationPropertiesFromHandle);
DECLARE_CUDA_PFN(cuMemAddressReserve, 10020);
DECLARE_CUDA_PFN(cuMemAddressFree, 10020);
DECLARE_CUDA_PFN(cuMemCreate, 10020);
DECLARE_CUDA_PFN(cuMemGetAllocationGranularity, 10020);
DECLARE_CUDA_PFN(cuMemExportToShareableHandle, 10020);
DECLARE_CUDA_PFN(cuMemImportFromShareableHandle, 10020);
DECLARE_CUDA_PFN(cuMemMap, 10020);
DECLARE_CUDA_PFN(cuMemRelease, 10020);
DECLARE_CUDA_PFN(cuMemRetainAllocationHandle, 11000);
DECLARE_CUDA_PFN(cuMemSetAccess, 10020);
DECLARE_CUDA_PFN(cuMemUnmap, 10020);
DECLARE_CUDA_PFN(cuMemGetAllocationPropertiesFromHandle, 10020);
/* ncclMemAlloc/Free */
DECLARE_CUDA_PFN(cuPointerGetAttribute);
DECLARE_CUDA_PFN(cuPointerGetAttribute, 4000);
#if CUDA_VERSION >= 11070
/* transport/collNet.cc/net.cc*/
DECLARE_CUDA_PFN(cuMemGetHandleForAddressRange); // DMA-BUF support
DECLARE_CUDA_PFN(cuMemGetHandleForAddressRange, 11070); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
DECLARE_CUDA_PFN(cuMulticastAddDevice);
DECLARE_CUDA_PFN(cuMulticastBindMem);
DECLARE_CUDA_PFN(cuMulticastBindAddr);
DECLARE_CUDA_PFN(cuMulticastCreate);
DECLARE_CUDA_PFN(cuMulticastGetGranularity);
DECLARE_CUDA_PFN(cuMulticastUnbind);
DECLARE_CUDA_PFN(cuMulticastAddDevice, 12010);
DECLARE_CUDA_PFN(cuMulticastBindMem, 12010);
DECLARE_CUDA_PFN(cuMulticastBindAddr, 12010);
DECLARE_CUDA_PFN(cuMulticastCreate, 12010);
DECLARE_CUDA_PFN(cuMulticastGetGranularity, 12010);
DECLARE_CUDA_PFN(cuMulticastUnbind, 12010);
#endif
#endif
@@ -162,8 +162,17 @@ bool ncclCudaLaunchBlocking = false;
#if CUDART_VERSION >= 11030
#if CUDART_VERSION >= 12000
#define LOAD_SYM(symbol, ignore) do { \
#if CUDART_VERSION >= 13000
#define LOAD_SYM(symbol, version, ignore) do { \
cudaDriverEntryPointQueryResult driverStatus = cudaDriverEntryPointSymbolNotFound; \
res = cudaGetDriverEntryPointByVersion(#symbol, (void **) (&pfn_##symbol), version, cudaEnableDefault, &driverStatus); \
if (res != cudaSuccess || driverStatus != cudaDriverEntryPointSuccess) { \
if (!ignore) { \
WARN("Retrieve %s version %d failed with %d status %d", #symbol, version, res, driverStatus); \
return ncclSystemError; } \
} } while(0)
#elif CUDART_VERSION >= 12000
#define LOAD_SYM(symbol, version, ignore) do { \
cudaDriverEntryPointQueryResult driverStatus = cudaDriverEntryPointSymbolNotFound; \
res = cudaGetDriverEntryPoint(#symbol, (void **) (&pfn_##symbol), cudaEnableDefault, &driverStatus); \
if (res != cudaSuccess || driverStatus != cudaDriverEntryPointSuccess) { \
@@ -172,7 +181,7 @@ bool ncclCudaLaunchBlocking = false;
return ncclSystemError; } \
} } while(0)
#else
#define LOAD_SYM(symbol, ignore) do { \
#define LOAD_SYM(symbol, version, ignore) do { \
res = cudaGetDriverEntryPoint(#symbol, (void **) (&pfn_##symbol), cudaEnableDefault); \
if (res != cudaSuccess) { \
if (!ignore) { \
@@ -188,46 +197,46 @@ static ncclResult_t cudaPfnFuncLoader(void) {
cudaError_t res;
LOAD_SYM(cuGetErrorString, 0);
LOAD_SYM(cuGetErrorName, 0);
LOAD_SYM(cuDeviceGet, 0);
LOAD_SYM(cuDeviceGetAttribute, 0);
LOAD_SYM(cuMemGetAddressRange, 1);
LOAD_SYM(cuCtxCreate, 1);
LOAD_SYM(cuCtxDestroy, 1);
LOAD_SYM(cuCtxGetCurrent, 1);
LOAD_SYM(cuCtxSetCurrent, 1);
LOAD_SYM(cuCtxGetDevice, 1);
LOAD_SYM(cuLaunchKernel, 1);
LOAD_SYM(cuGetErrorString, 6000, 0);
LOAD_SYM(cuGetErrorName, 6000, 0);
LOAD_SYM(cuDeviceGet, 2000, 0);
LOAD_SYM(cuDeviceGetAttribute, 2000, 0);
LOAD_SYM(cuMemGetAddressRange, 3020, 1);
LOAD_SYM(cuCtxCreate, 11040, 1);
LOAD_SYM(cuCtxDestroy, 4000, 1);
LOAD_SYM(cuCtxGetCurrent, 4000, 1);
LOAD_SYM(cuCtxSetCurrent, 4000, 1);
LOAD_SYM(cuCtxGetDevice, 2000, 1);
LOAD_SYM(cuLaunchKernel, 4000, 1);
#if CUDA_VERSION >= 11080
LOAD_SYM(cuLaunchKernelEx, 1);
LOAD_SYM(cuLaunchKernelEx, 11060, 1);
#endif
/* cuMem API support */
LOAD_SYM(cuMemAddressReserve, 1);
LOAD_SYM(cuMemAddressFree, 1);
LOAD_SYM(cuMemCreate, 1);
LOAD_SYM(cuMemGetAllocationGranularity, 1);
LOAD_SYM(cuMemExportToShareableHandle, 1);
LOAD_SYM(cuMemImportFromShareableHandle, 1);
LOAD_SYM(cuMemMap, 1);
LOAD_SYM(cuMemRelease, 1);
LOAD_SYM(cuMemRetainAllocationHandle, 1);
LOAD_SYM(cuMemSetAccess, 1);
LOAD_SYM(cuMemUnmap, 1);
LOAD_SYM(cuMemGetAllocationPropertiesFromHandle, 1);
LOAD_SYM(cuMemAddressReserve, 10020, 1);
LOAD_SYM(cuMemAddressFree, 10020, 1);
LOAD_SYM(cuMemCreate, 10020, 1);
LOAD_SYM(cuMemGetAllocationGranularity, 10020, 1);
LOAD_SYM(cuMemExportToShareableHandle, 10020, 1);
LOAD_SYM(cuMemImportFromShareableHandle, 10020, 1);
LOAD_SYM(cuMemMap, 10020, 1);
LOAD_SYM(cuMemRelease, 10020, 1);
LOAD_SYM(cuMemRetainAllocationHandle, 11000, 1);
LOAD_SYM(cuMemSetAccess, 10020, 1);
LOAD_SYM(cuMemUnmap, 10020, 1);
LOAD_SYM(cuMemGetAllocationPropertiesFromHandle, 10020, 1);
/* ncclMemAlloc/Free */
LOAD_SYM(cuPointerGetAttribute, 1);
LOAD_SYM(cuPointerGetAttribute, 4000, 1);
#if CUDA_VERSION >= 11070
LOAD_SYM(cuMemGetHandleForAddressRange, 1); // DMA-BUF support
LOAD_SYM(cuMemGetHandleForAddressRange, 11070, 1); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
LOAD_SYM(cuMulticastAddDevice, 1);
LOAD_SYM(cuMulticastBindMem, 1);
LOAD_SYM(cuMulticastBindAddr, 1);
LOAD_SYM(cuMulticastCreate, 1);
LOAD_SYM(cuMulticastGetGranularity, 1);
LOAD_SYM(cuMulticastUnbind, 1);
LOAD_SYM(cuMulticastAddDevice, 12010, 1);
LOAD_SYM(cuMulticastBindMem, 12010, 1);
LOAD_SYM(cuMulticastBindAddr, 12010, 1);
LOAD_SYM(cuMulticastCreate, 12010, 1);
LOAD_SYM(cuMulticastGetGranularity, 12010, 1);
LOAD_SYM(cuMulticastUnbind, 12010, 1);
#endif
return ncclSuccess;
}
+4
Просмотреть файл
@@ -8,7 +8,11 @@
#include <sys/types.h>
#include <unistd.h>
#ifdef NCCL_BUILD_RDMA_CORE
#include <infiniband/verbs.h>
#else
#include "ibvcore.h"
#endif
#include "ibvsymbols.h"
static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
+74
Просмотреть файл
@@ -0,0 +1,74 @@
#include <sys/types.h>
#include <unistd.h>
#include "mlx5/mlx5dvsymbols.h"
#ifdef NCCL_BUILD_MLX5DV
/* Mlx5dv linking mode. Symbols are pointers to linked MLX5 Direct Verbs */
#define ASSIGN_SYM(container, symbol, name) container->name= &symbol;
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_is_supported, mlx5dv_internal_is_supported);
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path);
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_reg_dmabuf_mr, mlx5dv_internal_reg_dmabuf_mr);
return ncclSuccess;
}
#else
/* Mlx5dv dynamic loading mode. Symbols are loaded from shared objects. */
#include <dlfcn.h>
#include "core.h"
// MLX5DV Library versioning
#define MLX5DV_VERSION "MLX5_1.8"
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
static void* mlx5dvhandle = NULL;
void* tmp;
void** cast;
mlx5dvhandle=dlopen("libmlx5.so", RTLD_NOW);
if (!mlx5dvhandle) {
mlx5dvhandle=dlopen("libmlx5.so.1", RTLD_NOW);
if (!mlx5dvhandle) {
INFO(NCCL_INIT, "Failed to open libmlx5.so[.1]");
goto teardown;
}
}
#define LOAD_SYM(handle, symbol, funcptr) do { \
cast = (void**)&funcptr; \
tmp = dlvsym(handle, symbol, MLX5DV_VERSION); \
if (tmp == NULL) { \
WARN("dlvsym failed on %s - %s version %s", symbol, dlerror(), MLX5DV_VERSION); \
goto teardown; \
} \
*cast = tmp; \
} while (0)
// Attempt to load a specific symbol version - fail silently
#define LOAD_SYM_VERSION(handle, symbol, funcptr, version) do { \
cast = (void**)&funcptr; \
*cast = dlvsym(handle, symbol, version); \
} while (0)
LOAD_SYM(mlx5dvhandle, "mlx5dv_is_supported", mlx5dvSymbols->mlx5dv_internal_is_supported);
// Cherry-pick the mlx5dv_get_data_direct_sysfs_path API from MLX5 1.25
LOAD_SYM_VERSION(mlx5dvhandle, "mlx5dv_get_data_direct_sysfs_path", mlx5dvSymbols->mlx5dv_internal_get_data_direct_sysfs_path, "MLX5_1.25");
// Cherry-pick the ibv_reg_dmabuf_mr API from MLX5 1.25
LOAD_SYM_VERSION(mlx5dvhandle, "mlx5dv_reg_dmabuf_mr", mlx5dvSymbols->mlx5dv_internal_reg_dmabuf_mr, "MLX5_1.25");
return ncclSuccess;
teardown:
mlx5dvSymbols->mlx5dv_internal_is_supported = NULL;
mlx5dvSymbols->mlx5dv_internal_get_data_direct_sysfs_path = NULL;
mlx5dvSymbols->mlx5dv_internal_reg_dmabuf_mr = NULL;
if (mlx5dvhandle != NULL) dlclose(mlx5dvhandle);
return ncclSystemError;
}
#endif
+75
Просмотреть файл
@@ -0,0 +1,75 @@
/*************************************************************************
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "mlx5/mlx5dvwrap.h"
#include <sys/types.h>
#include <unistd.h>
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "mlx5/mlx5dvsymbols.h"
static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
static ncclResult_t initResult;
struct ncclMlx5dvSymbols mlx5dvSymbols;
ncclResult_t wrap_mlx5dv_symbols(void) {
pthread_once(&initOnceControl,
[](){ initResult = buildMlx5dvSymbols(&mlx5dvSymbols); });
return initResult;
}
/* CHECK_NOT_NULL: helper macro to check for NULL symbol */
#define CHECK_NOT_NULL(container, internal_name) \
if (container.internal_name == NULL) { \
WARN("lib wrapper not initialized."); \
return ncclInternalError; \
}
#define MLX5DV_PTR_CHECK_ERRNO(container, internal_name, call, retval, error_retval, name) \
CHECK_NOT_NULL(container, internal_name); \
retval = container.call; \
if (retval == error_retval) { \
WARN("Call to " name " failed with error %s", strerror(errno)); \
return ncclSystemError; \
} \
return ncclSuccess;
#define MLX5DV_INT_CHECK_RET_ERRNO(container, internal_name, call, success_retval, name) \
CHECK_NOT_NULL(container, internal_name); \
int ret = container.call; \
if (ret != success_retval) { \
INFO(NCCL_NET, "Call to " name " failed with error %s errno %d", strerror(ret), ret); \
return ncclSystemError; \
} \
return ncclSuccess;
bool wrap_mlx5dv_is_supported(struct ibv_device *device) {
if (mlx5dvSymbols.mlx5dv_internal_is_supported == NULL) {
return 0;
}
return mlx5dvSymbols.mlx5dv_internal_is_supported(device);
}
ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len) {
MLX5DV_INT_CHECK_RET_ERRNO(mlx5dvSymbols, mlx5dv_internal_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path(context, buf, buf_len), 0, "mlx5dv_get_data_direct_sysfs_path");
}
/* DMA-BUF support */
ncclResult_t wrap_mlx5dv_reg_dmabuf_mr(struct ibv_mr **ret, struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access) {
MLX5DV_PTR_CHECK_ERRNO(mlx5dvSymbols, mlx5dv_internal_reg_dmabuf_mr, mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access), *ret, NULL, "mlx5dv_reg_dmabuf_mr");
}
struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access) {
if (mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr == NULL) {
errno = EOPNOTSUPP; // ncclIbDmaBufSupport() requires this errno being set
return NULL;
}
return mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access);
}
+101 -67
Просмотреть файл
@@ -68,7 +68,8 @@ static ncclResult_t socketProgress(int op, struct ncclSocket* sock, void* ptr, i
return ncclSuccess;
} else {
char line[SOCKET_NAME_MAXLEN+1];
WARN("socketProgress: Connection closed by remote peer %s", ncclSocketToString(&sock->addr, line, 0));
WARN("socketProgress: Connection closed by remote peer %s",
ncclSocketToString(&sock->addr, line, /*numericHostForm*/0));
return ncclRemoteError;
}
}
@@ -86,17 +87,22 @@ static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int s
* Output: "IPv4/IPv6 address<port>"
*/
const char *ncclSocketToString(const union ncclSocketAddress *addr, char *buf, const int numericHostForm /*= 1*/) {
if (buf == NULL || addr == NULL) return NULL;
const struct sockaddr *saddr = &addr->sa;
if (saddr->sa_family != AF_INET && saddr->sa_family != AF_INET6) { buf[0]='\0'; return buf; }
const struct sockaddr *saddr;
char host[NI_MAXHOST], service[NI_MAXSERV];
int flag = NI_NUMERICSERV | (numericHostForm ? NI_NUMERICHOST : 0);
if (buf == NULL || addr == NULL) goto fail;
saddr = &addr->sa;
if (saddr->sa_family != AF_INET && saddr->sa_family != AF_INET6) goto fail;
/* NI_NUMERICHOST: If set, then the numeric form of the hostname is returned.
* (When not set, this will still happen in case the node's name cannot be determined.)
*/
int flag = NI_NUMERICSERV | (numericHostForm ? NI_NUMERICHOST : 0);
(void) getnameinfo(saddr, sizeof(union ncclSocketAddress), host, NI_MAXHOST, service, NI_MAXSERV, flag);
if (getnameinfo(saddr, sizeof(union ncclSocketAddress), host, NI_MAXHOST, service, NI_MAXSERV, flag)) goto fail;
sprintf(buf, "%s<%s>", host, service);
return buf;
fail:
if (buf)
buf[0] = '\0';
return buf;
}
static uint16_t socketToPort(union ncclSocketAddress *addr) {
@@ -120,7 +126,8 @@ static int envSocketFamily(void) {
return family;
}
static int findInterfaces(const char* prefixList, char* names, union ncclSocketAddress *addrs, int sock_family, int maxIfNameSize, int maxIfs) {
static ncclResult_t findInterfaces(const char* prefixList, char* names, union ncclSocketAddress *addrs, int sock_family,
int maxIfNameSize, int maxIfs, int* found) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
#endif
@@ -131,10 +138,10 @@ static int findInterfaces(const char* prefixList, char* names, union ncclSocketA
if (searchExact) prefixList++;
int nUserIfs = parseStringList(prefixList, userIfs, MAX_IFS);
int found = 0;
*found = 0;
struct ifaddrs *interfaces, *interface;
getifaddrs(&interfaces);
for (interface = interfaces; interface && found < maxIfs; interface = interface->ifa_next) {
SYSCHECK(getifaddrs(&interfaces), "getifaddrs");
for (interface = interfaces; interface && *found < maxIfs; interface = interface->ifa_next) {
if (interface->ifa_addr == NULL) continue;
/* We only support IPv4 & IPv6 */
@@ -162,23 +169,23 @@ static int findInterfaces(const char* prefixList, char* names, union ncclSocketA
// Check that this interface has not already been saved
// getifaddrs() normal order appears to be; IPv4, IPv6 Global, IPv6 Link
bool duplicate = false;
for (int i = 0; i < found; i++) {
for (int i = 0; i < *found; i++) {
if (strcmp(interface->ifa_name, names+i*maxIfNameSize) == 0) { duplicate = true; break; }
}
if (!duplicate) {
// Store the interface name
strncpy(names+found*maxIfNameSize, interface->ifa_name, maxIfNameSize);
strncpy(names + (*found)*maxIfNameSize, interface->ifa_name, maxIfNameSize);
// Store the IP address
int salen = (family == AF_INET) ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
memset(addrs+found, '\0', sizeof(*addrs));
memcpy(addrs+found, interface->ifa_addr, salen);
found++;
memset(addrs + *found, '\0', sizeof(*addrs));
memcpy(addrs + *found, interface->ifa_addr, salen);
(*found)++;
}
}
freeifaddrs(interfaces);
return found;
return ncclSuccess;
}
static bool matchSubnet(struct ifaddrs local_if, union ncclSocketAddress* remote) {
@@ -219,20 +226,21 @@ static bool matchSubnet(struct ifaddrs local_if, union ncclSocketAddress* remote
same &= (local_addr->sin6_scope_id == remote_addr.sin6_scope_id);
return same;
} else {
WARN("Net : Unsupported address family type");
INFO(NCCL_NET, "Net : Unsupported address family type");
return false;
}
}
int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAddrs, union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int maxIfs) {
ncclResult_t ncclFindInterfaceMatchSubnet(char* ifName, union ncclSocketAddress* localAddr,
union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int* found) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
#endif
char line_a[SOCKET_NAME_MAXLEN+1];
int found = 0;
#endif
*found = 0;
struct ifaddrs *interfaces, *interface;
getifaddrs(&interfaces);
for (interface = interfaces; interface && !found; interface = interface->ifa_next) {
SYSCHECK(getifaddrs(&interfaces), "getifaddrs");
for (interface = interfaces; interface && !*found; interface = interface->ifa_next) {
if (interface->ifa_addr == NULL) continue;
/* We only support IPv4 & IPv6 */
@@ -247,21 +255,18 @@ int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAd
// Store the local IP address
int salen = (family == AF_INET) ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
memcpy(localAddrs+found, interface->ifa_addr, salen);
memcpy(localAddr, interface->ifa_addr, salen);
// Store the interface name
strncpy(ifNames+found*ifNameMaxSize, interface->ifa_name, ifNameMaxSize);
strncpy(ifName, interface->ifa_name, ifNameMaxSize);
TRACE(NCCL_INIT|NCCL_NET,"NET : Found interface %s:%s in the same subnet as remote address %s", interface->ifa_name, ncclSocketToString(localAddrs+found, line), ncclSocketToString(remoteAddr, line_a));
found++;
if (found == maxIfs) break;
TRACE(NCCL_INIT|NCCL_NET,"NET : Found interface %s:%s in the same subnet as remote address %s",
interface->ifa_name, ncclSocketToString(localAddr, line), ncclSocketToString(remoteAddr, line_a));
*found = 1;
}
if (found == 0) {
WARN("Net : No interface found in the same subnet as remote address %s", ncclSocketToString(remoteAddr, line_a));
}
freeifaddrs(interfaces);
return found;
return ncclSuccess;
}
ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char* ip_port_pair) {
@@ -344,40 +349,41 @@ ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char
return ncclSuccess;
}
int ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs) {
ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs,
int* nIfs) {
static int shownIfName = 0;
int nIfs = 0;
// Allow user to force the INET socket family selection
int sock_family = envSocketFamily();
// User specified interface
const char* env = ncclGetEnv("NCCL_SOCKET_IFNAME");
*nIfs = 0;
if (env && strlen(env) > 1) {
INFO(NCCL_ENV, "NCCL_SOCKET_IFNAME set by environment to %s", env);
// Specified by user : find or fail
if (shownIfName++ == 0) INFO(NCCL_NET, "NCCL_SOCKET_IFNAME set to %s", env);
nIfs = findInterfaces(env, ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
NCCLCHECK(findInterfaces(env, ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
} else {
// Try to automatically pick the right one
// Start with IB
nIfs = findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
NCCLCHECK(findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
// else see if we can get some hint from COMM ID
if (nIfs == 0) {
if (*nIfs == 0) {
const char* commId = ncclGetEnv("NCCL_COMM_ID");
if (commId && strlen(commId) > 1) {
INFO(NCCL_ENV, "NCCL_COMM_ID set by environment to %s", commId);
// Try to find interface that is in the same subnet as the IP in comm id
union ncclSocketAddress idAddr;
ncclSocketGetAddrFromString(&idAddr, commId);
nIfs = ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, maxIfs);
NCCLCHECK(ncclSocketGetAddrFromString(&idAddr, commId));
NCCLCHECK(ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, nIfs));
}
}
// Then look for anything else (but not docker or lo)
if (nIfs == 0) nIfs = findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (*nIfs == 0) NCCLCHECK(findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
// Finally look for docker, then lo.
if (nIfs == 0) nIfs = findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (nIfs == 0) nIfs = findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (*nIfs == 0) NCCLCHECK(findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
if (*nIfs == 0) NCCLCHECK(findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
}
return nIfs;
return ncclSuccess;
}
ncclResult_t ncclSocketListen(struct ncclSocket* sock) {
@@ -439,17 +445,20 @@ static ncclResult_t socketTryAccept(struct ncclSocket* sock) {
/* per accept's man page, for linux sockets, the following errors might be already pending errors
* and should be considered as EAGAIN. To avoid infinite loop in case of errors, we use the retry count*/
if (++sock->errorRetries == ncclParamRetryCnt()) {
WARN("socketTryAccept: exceeded error retry count (%d), %s", sock->errorRetries, strerror(errno));
WARN("socketTryAccept: exceeded error retry count after %d attempts, %s", sock->errorRetries, strerror(errno));
return ncclSystemError;
}
INFO(NCCL_ALL, "Call to accept returned %s, retrying", strerror(errno));
} else if (errno != EAGAIN && errno != EWOULDBLOCK) {
INFO(NCCL_NET|NCCL_INIT, "Call to accept returned %s, retrying", strerror(errno));
} else if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
WARN("socketTryAccept: Accept failed: %s", strerror(errno));
return ncclSystemError;
}
return ncclSuccess;
}
NCCL_PARAM(SocketMaxRecvBuff, "SOCKET_RCVBUF", -1);
NCCL_PARAM(SocketMaxSendBuff, "SOCKET_SNDBUF", -1);
static ncclResult_t socketSetFlags(struct ncclSocket* sock) {
const int one = 1;
/* Set socket as non-blocking if async or if we need to be able to abort */
@@ -458,34 +467,55 @@ static ncclResult_t socketSetFlags(struct ncclSocket* sock) {
SYSCHECK(flags = fcntl(sock->fd, F_GETFL), "fcntl");
SYSCHECK(fcntl(sock->fd, F_SETFL, flags | O_NONBLOCK), "fcntl");
}
SYSCHECK(setsockopt(sock->fd, IPPROTO_TCP, TCP_NODELAY, (char*)&one, sizeof(int)), "setsockopt");
SYSCHECK(setsockopt(sock->fd, IPPROTO_TCP, TCP_NODELAY, (char*)&one, sizeof(int)), "setsockopt TCP NODELAY");
// setsockopt should not fail even if the sizes are too large, do not change the default if unset by the user (=-1)
int rcvBuf = ncclParamSocketMaxRecvBuff(), sndBuf = ncclParamSocketMaxSendBuff();
if (sndBuf > 0) SYSCHECK(setsockopt(sock->fd, SOL_SOCKET, SO_SNDBUF, (char*)&sndBuf, sizeof(int)), "setsockopt SO_SNDBUF");
if (rcvBuf > 0) SYSCHECK(setsockopt(sock->fd, SOL_SOCKET, SO_RCVBUF, (char*)&rcvBuf, sizeof(int)), "setsockopt SO_RCVBUF");
return ncclSuccess;
}
static void socketResetAccept(struct ncclSocket* sock) {
char line[SOCKET_NAME_MAXLEN+1];
INFO(NCCL_NET|NCCL_INIT, "socketFinalizeAccept: didn't receive a valid magic from %s",
ncclSocketToString(&sock->addr, line));
// Ignore spurious connection and accept again
(void)close(sock->fd);
sock->fd = -1;
sock->state = ncclSocketStateAccepting;
sock->finalizeCounter = 0;
}
static ncclResult_t socketFinalizeAccept(struct ncclSocket* sock) {
uint64_t magic;
enum ncclSocketType type;
int received;
char line[SOCKET_NAME_MAXLEN+1];
// once accepted, linux sockets do NOT inherit file status flags such as O_NONBLOCK (BSD ones do)
NCCLCHECK(socketSetFlags(sock));
if (sock->asyncFlag == 0 || sock->finalizeCounter < sizeof(magic)) {
if (sock->asyncFlag == 0) {
received = 0;
NCCLCHECK(socketWait(NCCL_SOCKET_RECV, sock, &magic, sizeof(magic), &received));
if (socketWait(NCCL_SOCKET_RECV, sock, &magic, sizeof(magic), &received) != ncclSuccess) {
socketResetAccept(sock);
return ncclSuccess;
}
} else {
int closed = 0;
received = sock->finalizeCounter;
NCCLCHECK(socketProgress(NCCL_SOCKET_RECV, sock, sock->finalizeBuffer, sizeof(magic), &received));
NCCLCHECK(socketProgress(NCCL_SOCKET_RECV, sock, sock->finalizeBuffer, sizeof(magic), &received, &closed));
sock->finalizeCounter = received;
if (received < sizeof(magic)) return ncclSuccess;
if (received < sizeof(magic)) {
if (closed) {
socketResetAccept(sock);
}
return ncclSuccess;
}
memcpy(&magic, sock->finalizeBuffer, sizeof(magic));
}
if (magic != sock->magic) {
WARN("socketFinalizeAccept: wrong magic %lx != %lx", magic, sock->magic);
close(sock->fd);
sock->fd = -1;
// Ignore spurious connection and accept again
sock->state = ncclSocketStateAccepting;
socketResetAccept(sock);
return ncclSuccess;
}
}
@@ -500,7 +530,7 @@ static ncclResult_t socketFinalizeAccept(struct ncclSocket* sock) {
memcpy(&type, sock->finalizeBuffer, sizeof(type));
}
if (type != sock->type) {
WARN("socketFinalizeAccept: wrong type %d != %d", type, sock->type);
WARN("socketFinalizeAccept from %s: wrong type %d != %d", ncclSocketToString(&sock->addr, line), type, sock->type);
sock->state = ncclSocketStateError;
close(sock->fd);
sock->fd = -1;
@@ -532,32 +562,38 @@ cleanup:
}
goto exit;
}
static ncclResult_t socketConnectCheck(struct ncclSocket* sock, int errCode, const char funcName[]) {
char line[SOCKET_NAME_MAXLEN+1];
if (errCode == 0) {
sock->state = ncclSocketStateConnected;
} else if (errCode == EINPROGRESS) {
sock->state = ncclSocketStateConnectPolling;
} else if (errCode == ETIMEDOUT || errCode == EHOSTUNREACH || errCode == ECONNREFUSED) {
} else if (errCode == EINTR || errCode == EWOULDBLOCK || errCode == EAGAIN || errCode == ETIMEDOUT ||
errCode == EHOSTUNREACH || errCode == ECONNREFUSED) {
if (sock->customRetry == 0) {
if (sock->errorRetries++ == ncclParamRetryCnt()) {
sock->state = ncclSocketStateError;
WARN("%s: connect returned %s, exceeded error retry count (%d)", funcName, strerror(errCode), sock->errorRetries);
WARN("%s: connect to %s returned %s, exceeded error retry count after %d attempts",
funcName, ncclSocketToString(&sock->addr, line), strerror(errCode), sock->errorRetries);
return ncclRemoteError;
}
unsigned int sleepTime = sock->errorRetries * ncclParamRetryTimeOut();
INFO(NCCL_ALL, "%s: connect returned %s, retrying (%d/%ld) after sleep for %u msec", funcName, strerror(errCode), sock->errorRetries, ncclParamRetryCnt(), sleepTime);
INFO(NCCL_NET|NCCL_INIT, "%s: connect to %s returned %s, retrying (%d/%ld) after sleep for %u msec",
funcName, ncclSocketToString(&sock->addr, line), strerror(errCode),
sock->errorRetries, ncclParamRetryCnt(), sleepTime);
msleep(sleepTime);
}
NCCLCHECK(socketResetFd(sock)); /* in case of failure in connect, socket state is unspecified */
sock->state = ncclSocketStateConnecting;
} else {
char line[SOCKET_NAME_MAXLEN+1];
sock->state = ncclSocketStateError;
WARN("%s: Connect to %s failed : %s", funcName, ncclSocketToString(&sock->addr, line), strerror(errCode));
WARN("%s: connect to %s failed : %s", funcName, ncclSocketToString(&sock->addr, line), strerror(errCode));
return ncclSystemError;
}
return ncclSuccess;
}
static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
/* blocking/non-blocking connect() is determined by asyncFlag. */
int ret = connect(sock->fd, &sock->addr.sa, sock->salen);
@@ -568,6 +604,7 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
struct pollfd pfd;
int timeout = 1, ret;
socklen_t rlen = sizeof(int);
char line[SOCKET_NAME_MAXLEN+1];
memset(&pfd, 0, sizeof(struct pollfd));
pfd.fd = sock->fd;
@@ -577,10 +614,7 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
if (ret == 0 || (ret < 0 && errno == EINTR)) {
return ncclSuccess;
} else if (ret < 0) {
WARN("socketPollConnect poll() failed with error %s", strerror(errno));
return ncclRemoteError;
} else if (ret != 1 || (pfd.revents & POLLOUT) == 0) {
WARN("socketPollConnect poll() returned %d%s", ret, (pfd.revents & POLLOUT) ? "" : ", no POLLOUT events");
WARN("socketPollConnect to %s failed with error %s", ncclSocketToString(&sock->addr, line), strerror(errno));
return ncclSystemError;
}
@@ -899,7 +933,7 @@ ncclResult_t ncclSocketTryRecv(struct ncclSocket* sock, void* ptr, int size, int
ncclResult_t ncclSocketShutdown(struct ncclSocket* sock, int how) {
if (sock != NULL) {
if (sock->fd >= 0) {
shutdown(sock->fd, how);
SYSCHECK(shutdown(sock->fd, how), "shutdown");
}
sock->state = ncclSocketStateTerminating;
}
@@ -921,8 +955,8 @@ ncclResult_t ncclSocketClose(struct ncclSocket* sock, bool wait) {
* by refcount of fd, but close() is. close() won't close a fd and send FIN packet if
* the fd is duplicated (e.g. fork()). So shutdown() guarantees the correct and graceful
* connection close here. */
shutdown(sock->fd, SHUT_RDWR);
close(sock->fd);
(void)shutdown(sock->fd, SHUT_RDWR);
(void)close(sock->fd);
}
sock->state = ncclSocketStateClosed;
sock->fd = -1;
+34
Просмотреть файл
@@ -9,6 +9,12 @@
#include "checks.h"
#include "param.h"
#if CUDART_VERSION >= 13000
#define cudaStreamGetCaptureInfo_v3 cudaStreamGetCaptureInfo
#define cudaGraphAddDependencies_v2 cudaGraphAddDependencies
#define cudaStreamUpdateCaptureDependencies_v2 cudaStreamUpdateCaptureDependencies
#endif
// Tracks the captured work a given graph captured identified by its graph id.
struct ncclStrongStreamCapture {
struct ncclStrongStreamCapture* next;
@@ -89,7 +95,11 @@ ncclResult_t ncclCudaGetCapturingGraph(
} else {
#if CUDART_VERSION >= 11030
cudaStreamCaptureStatus status;
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamGetCaptureInfo_v3(stream, &status, &graph->graphId, &graph->graph, nullptr, nullptr, nullptr));
#else
CUDACHECK(cudaStreamGetCaptureInfo_v2(stream, &status, &graph->graphId, &graph->graph, nullptr, nullptr));
#endif
if (status != cudaStreamCaptureStatusActive) {
graph->origin = nullptr;
graph->graph = nullptr;
@@ -224,7 +234,11 @@ ncclResult_t ncclStrongStreamAcquire(
CUDACHECK(cudaEventRecord(scratch, graph.origin));
CUDACHECK(cudaStreamWaitEvent(cap->captureStream, scratch, 0));
CUDACHECK(cudaEventDestroy(scratch));
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(cap->captureStream, nullptr, nullptr, 0, cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(cap->captureStream, nullptr, 0, cudaStreamSetCaptureDependencies));
#endif
if (mixing && firstCapture) {
CUDACHECK(cudaEventRecord(ss->serialEvent, ss->liveStream));
@@ -284,7 +298,11 @@ ncclResult_t ncclStrongStreamRelease(
// Make this record order after previous record on this stream.
if (cap->lastRecord != nullptr) {
#if CUDART_VERSION >= 13000
CUDACHECK(cudaGraphAddDependencies_v2(graph.graph, &cap->lastRecord, &recordNode, nullptr, 1));
#else
CUDACHECK(cudaGraphAddDependencies(graph.graph, &cap->lastRecord, &recordNode, 1));
#endif
}
cap->lastRecord = recordNode;
@@ -292,7 +310,11 @@ ncclResult_t ncclStrongStreamRelease(
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
size_t count = 0;
#if CUDART_VERSION >= 13000
cudaError_t res = cudaStreamGetCaptureInfo_v3(cap->captureStream, &status, nullptr, nullptr, &nodes, nullptr, &count);
#else
cudaError_t res = cudaStreamGetCaptureInfo_v2(cap->captureStream, &status, nullptr, nullptr, &nodes, &count);
#endif
#if CUDART_VERSION >= 12030
if (res == cudaErrorLossyQuery) { // CUDA is telling us the dependencies have edge annotations.
@@ -308,7 +330,11 @@ ncclResult_t ncclStrongStreamRelease(
else {
CUDACHECK(res /* = cudaStreamGetCaptureInfo_v2(...)*/);
for (int i=0; i < (int)count; i++) {
#if CUDART_VERSION >= 13000
CUDACHECK(cudaGraphAddDependencies_v2(graph.graph, &nodes[i], &recordNode, nullptr, 1));
#else
CUDACHECK(cudaGraphAddDependencies(graph.graph, &nodes[i], &recordNode, 1));
#endif
}
}
@@ -339,7 +365,11 @@ ncclResult_t ncclStreamAdvanceToEvent(struct ncclCudaGraph g, cudaStream_t s, cu
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
size_t count = 0;
#if CUDART_VERSION >= 13000
cudaError_t res = cudaStreamGetCaptureInfo_v3(tmp, &status, nullptr, nullptr, &nodes, nullptr, &count);
#else
cudaError_t res = cudaStreamGetCaptureInfo_v2(tmp, &status, nullptr, nullptr, &nodes, &count);
#endif
#if CUDART_VERSION >= 12030
if (res == cudaErrorLossyQuery) { // CUDA is telling us the dependencies have edge annotations.
@@ -352,7 +382,11 @@ ncclResult_t ncclStreamAdvanceToEvent(struct ncclCudaGraph g, cudaStream_t s, cu
#endif
else {
CUDACHECK(res /* = cudaStreamGetCaptureInfo_v2(...)*/);
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(s, (cudaGraphNode_t*)nodes, nullptr, count, cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(s, (cudaGraphNode_t*)nodes, count, cudaStreamSetCaptureDependencies));
#endif
}
CUDACHECK(cudaStreamDestroy(tmp));
+7 -2
Просмотреть файл
@@ -58,7 +58,12 @@ ncclResult_t ncclMnnvlCheck(struct ncclComm* comm) {
// Allocate FABRIC handle compatible memory
ncclResult_t ret = ncclCuMemAlloc(&ptr, &handle, CU_MEM_HANDLE_TYPE_FABRIC, CUDA_IPC_MIN);
if (ret != ncclSuccess) return ncclSuccess;
if (ret != ncclSuccess) {
// Return an error if this is a MNNVL capable system but FABRIC handles are not supported
WARN("MNNVL (cliqueSize %d) is available but not working on this system. Check the IMEX channel configuration (/dev/nvidia-caps-imex-channels). Set NCCL_MNNVL_ENABLE=0 to ignore this issue.",
comm->clique.size);
return ncclSystemError;
}
err = CUPFN(cuMemExportToShareableHandle(&cuDesc, handle, CU_MEM_HANDLE_TYPE_FABRIC, 0));
if (err != CUDA_SUCCESS ||
(err = CUPFN(cuMemImportFromShareableHandle(&handle, &cuDesc, CU_MEM_HANDLE_TYPE_FABRIC))) != CUDA_SUCCESS) {
@@ -66,7 +71,7 @@ ncclResult_t ncclMnnvlCheck(struct ncclComm* comm) {
(void) pfn_cuGetErrorString(err, &errStr);
NCCLCHECK(ncclCuMemFree(ptr));
// Return an error if this is a MNNVL capable system but it's not working
WARN("MNNVL (cliqueSize %d) is available but not supported on this system. Check the IMEX configuration.",
WARN("MNNVL (cliqueSize %d) is available but not working on this system. Check the IMEX configuration (nvidia-imex-ctl -N). Set NCCL_MNNVL_ENABLE=0 to ignore this issue.",
comm->clique.size);
return ncclSystemError;
}
+40 -1
Просмотреть файл
@@ -31,6 +31,7 @@ extern "C" {
#include <limits.h>
/* Opaque handle to communicator */
typedef struct ncclComm* ncclComm_t;
typedef struct ncclWindow* ncclWindow_t;
#define NCCL_COMM_NULL NULL
#define NCCL_UNIQUE_ID_BYTES 128
@@ -52,9 +53,21 @@ typedef enum { ncclSuccess = 0,
#define NCCL_SPLIT_NOCOLOR -1
#define NCCL_UNDEF_FLOAT -1.0f
/* Window Registration flags */
#define NCCL_WIN_DEFAULT 0x00
#define NCCL_WIN_COLL_SYMMETRIC 0x01
/* NCCL performance policy */
#define NCCL_CTA_POLICY_DEFAULT 0x00
#define NCCL_CTA_POLICY_EFFICIENCY 0x01
/* ncclCommShrink flags*/
#define NCCL_SHRINK_DEFAULT 0x00 /* shrink the parent communicator */
#define NCCL_SHRINK_ABORT 0x01 /* First, terminate ongoing parent operations, and then shrink the parent communicator */
/* Communicator configuration. Users can assign value to attributes to specify the
* behavior of a communicator. */
typedef struct ncclConfig_v21700 {
typedef struct ncclConfig_v22700 {
/* attributes that users should never touch. */
size_t size;
unsigned int magic;
@@ -67,6 +80,11 @@ typedef struct ncclConfig_v21700 {
const char *netName;
int splitShare;
int trafficClass;
const char *commName;
int collnetEnable;
int CTAPolicy;
int shrinkShare;
int nvlsCTAs;
} ncclConfig_t;
/* Config initializer must be assigned to initialize config structure when it is created.
@@ -82,6 +100,11 @@ typedef struct ncclConfig_v21700 {
NCCL_CONFIG_UNDEF_PTR, /* netName */ \
NCCL_CONFIG_UNDEF_INT, /* splitShare */ \
NCCL_CONFIG_UNDEF_INT, /* trafficClass */ \
NCCL_CONFIG_UNDEF_PTR, /* commName */ \
NCCL_CONFIG_UNDEF_INT, /* collnetEnable */ \
NCCL_CONFIG_UNDEF_INT, /* CTAPolicy */ \
NCCL_CONFIG_UNDEF_INT, /* shrinkShare */ \
NCCL_CONFIG_UNDEF_INT, /* nvlsCTAs */ \
}
/* This struct will be used by ncclGroupSimulateEnd() API to query information about simulation. */
@@ -173,6 +196,14 @@ ncclResult_t pncclCommAbort(ncclComm_t comm);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t* config);
ncclResult_t pncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t* config);
/* Shrink existing communicator.
* Ranks in excludeRanksList will be removed form the existing communicator.
* Within the new communicator, ranks will be re-ordered to fill the gap of removed ones.
* If config is NULL, the new communicator will inherit the original communicator's configuration
* The flag enables NCCL to adapt to various states of the parent communicator, see NCCL_SHRINK flags.*/
ncclResult_t ncclCommShrink(ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t* newcomm, ncclConfig_t* config, int shrinkFlags);
ncclResult_t pncclCommShrink(ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t* newcomm, ncclConfig_t* config, int shrinkFlags);
/* Creates a new communicator (multi thread/process version), similar to ncclCommInitRankConfig.
* Allows to use more than one ncclUniqueId (up to one per rank), indicated by nId, to accelerate the init operation.
* The number of ncclUniqueIds and their order must be the same for every rank.
@@ -216,6 +247,14 @@ ncclResult_t pncclCommRegister(const ncclComm_t comm, void* buff, size_t size, v
ncclResult_t ncclCommDeregister(const ncclComm_t comm, void* handle);
ncclResult_t pncclCommDeregister(const ncclComm_t comm, void* handle);
/* Register memory window */
ncclResult_t ncclCommWindowRegister(ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags);
ncclResult_t pncclCommWindowRegister(ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags);
/* Deregister symmetric memory */
ncclResult_t ncclCommWindowDeregister(ncclComm_t comm, ncclWindow_t win);
ncclResult_t pncclCommWindowDeregister(ncclComm_t comm, ncclWindow_t win);
/* Reduction operation selector */
typedef enum { ncclNumOps_dummy = 5 } ncclRedOp_dummy_t;
typedef enum { ncclSum = 0,
+219 -171
Просмотреть файл
@@ -8,6 +8,7 @@
#include "bootstrap.h"
#include "checks.h"
#include "plugin.h"
#include "nccl_net.h"
#include <string.h>
#include <errno.h>
@@ -15,137 +16,100 @@
//#include <sys/stat.h>
//#include <unistd.h>
extern ncclNet_t* getNcclNet_v6(void* netPluginLib);
extern ncclNet_t* getNcclNet_v7(void* netPluginLib);
extern ncclNet_t* getNcclNet_v8(void* netPluginLib);
extern ncclNet_t* getNcclNet_v9(void* netPluginLib);
extern ncclNet_t* getNcclNet_v10(void* netPluginLib);
typedef ncclNet_t* getNcclNet_t(void* netPluginLib);
typedef ncclCollNet_t* getNcclCollNet_t(void* netPluginLib);
extern ncclCollNet_t* getNcclCollNet_v6(void* netPluginLib);
extern ncclCollNet_t* getNcclCollNet_v7(void* netPluginLib);
extern ncclCollNet_t* getNcclCollNet_v8(void* netPluginLib);
extern ncclCollNet_t* getNcclCollNet_v9(void* netPluginLib);
extern ncclCollNet_t* getNcclCollNet_v10(void* netPluginLib);
static pthread_mutex_t netLock = PTHREAD_MUTEX_INITIALIZER;
ncclNet_t* ncclNets[NCCL_NET_MAX_PLUGINS] = { nullptr, &ncclNetIb, &ncclNetSocket };
static int ncclNetsVer[NCCL_NET_MAX_PLUGINS] = { -1, 10, 10 };
ncclCollNet_t* ncclCollNets[NCCL_NET_MAX_PLUGINS] = { nullptr, nullptr, nullptr };
enum ncclNetState {
ncclNetStateInit = 0,
ncclNetStateEnabled = 1,
ncclNetStateDisabled = 2
};
enum ncclNetState ncclNetStates[NCCL_NET_MAX_PLUGINS] = { ncclNetStateInit, ncclNetStateInit, ncclNetStateInit };
enum ncclNetState ncclCollNetStates[NCCL_NET_MAX_PLUGINS] = { ncclNetStateInit, ncclNetStateInit, ncclNetStateInit };
extern getNcclNet_t getNcclNet_v6;
extern getNcclNet_t getNcclNet_v7;
extern getNcclNet_t getNcclNet_v8;
extern getNcclNet_t getNcclNet_v9;
extern getNcclNet_t getNcclNet_v10;
extern getNcclCollNet_t getNcclCollNet_v6;
extern getNcclCollNet_t getNcclCollNet_v7;
extern getNcclCollNet_t getNcclCollNet_v8;
extern getNcclCollNet_t getNcclCollNet_v9;
extern getNcclCollNet_t getNcclCollNet_v10;
NCCL_PARAM(NetPluginRefCount, "NET_PLUGIN_REF_COUNT", 1);
#define NCCL_NET_VERSION_COUNT 5
int ncclNetVersion[NCCL_NET_VERSION_COUNT] = {10, 9, 8, 7, 6};
getNcclNet_t* getNcclNet[NCCL_NET_VERSION_COUNT] = {getNcclNet_v10, getNcclNet_v9, getNcclNet_v8, getNcclNet_v7, getNcclNet_v6};
getNcclCollNet_t* getNcclCollNet[NCCL_NET_VERSION_COUNT] = {getNcclCollNet_v10, getNcclCollNet_v9, getNcclCollNet_v8, getNcclCollNet_v7, getNcclCollNet_v6};
#define NCCL_NET_NUM_INTERNAL_PLUGINS 2
typedef enum ncclNetPluginState {
ncclNetPluginStateDisabled = -2, // Plugin library failed to initialize
ncclNetPluginStateLoadFailed = -1, // Plugin library failed to load
ncclNetPluginStateLoadReady = 0, // Plugin library is ready to be loaded
ncclNetPluginStateInitReady = 1, // Plugin library is loaded and ready to be initialized
ncclNetPluginStateEnabled = 2, // Plugin library is loaded and initialized
} ncclNetPluginState_t;
#define MAX_STR_LEN 255
typedef struct netPluginLib {
char name[MAX_STR_LEN]; // Name of the plugin library
void* dlHandle; // Handle to the plugin library
ncclNet_t* ncclNet; // Pointer to the ncclNet_t structure
int ncclNetVer; // Version of the nccl net plugin
ncclCollNet_t* ncclCollNet; // Pointer to the ncclCollNet_t structure
ncclNetPluginState_t ncclNetPluginState; // State of the nccl net plugin
ncclNetPluginState_t ncclCollNetPluginState; // State of the nccl coll net plugin
int ncclNetPluginRefCount; // Reference count for the nccl net plugin
} netPluginLib_t;
int pluginCount = 0;
bool netPluginLibsInitialized = false;
netPluginLib_t netPluginLibs[NCCL_NET_MAX_PLUGINS] = { 0 };
static pthread_mutex_t netPluginLock = PTHREAD_MUTEX_INITIALIZER;
static void* netPluginLib;
static pthread_once_t initPluginLibsOnceControl = PTHREAD_ONCE_INIT;
static int netPluginRefCount;
static void initNetPluginRefCountOnce(void) { netPluginRefCount = ncclParamNetPluginRefCount();}
enum {
netPluginLoadFailed = -1,
netPluginLoadReady = 0,
netPluginLoadSuccess = 1,
};
static int netPluginStatus = netPluginLoadReady;
ncclResult_t ncclNetPluginLoad(struct ncclComm* comm) {
static pthread_once_t netPluginRefCountOnce = PTHREAD_ONCE_INIT;
pthread_once(&netPluginRefCountOnce, initNetPluginRefCountOnce);
pthread_mutex_lock(&netPluginLock);
if (netPluginLoadFailed == netPluginStatus) {
goto exit;
static ncclResult_t ncclNetPluginUnload(netPluginLib_t* pluginLib) {
if ((pluginLib->dlHandle) && ((pluginLib->ncclNetPluginRefCount) == 0)) {
INFO(NCCL_INIT|NCCL_NET, "Unloading plugin %s", pluginLib->name);
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle));
memset(pluginLib, 0, sizeof(netPluginLib_t));
}
if (netPluginLoadSuccess == netPluginStatus) {
++netPluginRefCount;
goto exit;
}
netPluginLib = ncclOpenNetPluginLib(ncclGetEnv("NCCL_NET_PLUGIN"));
if (netPluginLib == nullptr) {
goto fail;
}
ncclNets[0] = getNcclNet_v10(netPluginLib);
if (ncclNets[0]) ncclNetsVer[0] = 10;
if (ncclNets[0] == nullptr) {
// Try v9 plugin
ncclNets[0] = getNcclNet_v9(netPluginLib);
if (ncclNets[0]) ncclNetsVer[0] = 9;
}
if (ncclNets[0] == nullptr) {
// Try v8 plugin
ncclNets[0] = getNcclNet_v8(netPluginLib);
if (ncclNets[0]) ncclNetsVer[0] = 8;
}
if (ncclNets[0] == nullptr) {
// Try v7 plugin
ncclNets[0] = getNcclNet_v7(netPluginLib);
if (ncclNets[0]) ncclNetsVer[0] = 7;
}
if (ncclNets[0] == nullptr) {
// Try v6 plugin
ncclNets[0] = getNcclNet_v6(netPluginLib);
if (ncclNets[0]) ncclNetsVer[0] = 6;
}
if (ncclNets[0] == nullptr) {
goto fail;
}
// Check for CollNet
ncclCollNets[0] = getNcclCollNet_v10(netPluginLib);
if (ncclCollNets[0] == nullptr) {
ncclCollNets[0] = getNcclCollNet_v9(netPluginLib);
}
if (ncclCollNets[0] == nullptr) {
ncclCollNets[0] = getNcclCollNet_v8(netPluginLib);
}
if (ncclCollNets[0] == nullptr) {
ncclCollNets[0] = getNcclCollNet_v7(netPluginLib);
}
if (ncclCollNets[0] == nullptr) {
ncclCollNets[0] = getNcclCollNet_v6(netPluginLib);
}
++netPluginRefCount;
netPluginStatus = netPluginLoadSuccess;
comm->netPluginLoaded = 1;
exit:
pthread_mutex_unlock(&netPluginLock);
return ncclSuccess;
fail:
if (netPluginLib) NCCLCHECK(ncclClosePluginLib(netPluginLib));
netPluginStatus = netPluginLoadFailed;
goto exit;
}
ncclResult_t ncclNetPluginUnload(struct ncclComm* comm) {
pthread_mutex_lock(&netPluginLock);
if (comm->netPluginLoaded && 0 == (--netPluginRefCount)) {
if (ncclNets[0]) {
INFO(NCCL_NET, "NET/Plugin: Closing net plugin '%s'", ncclNets[0]->name);
}
if (ncclCollNets[0]) {
INFO(NCCL_NET, "NET/Plugin: Closing collnet plugin '%s'", ncclCollNets[0]->name);
}
NCCLCHECK(ncclClosePluginLib(netPluginLib));
netPluginLib = nullptr;
ncclNets[0] = nullptr;
ncclCollNets[0] = nullptr;
netPluginStatus = netPluginLoadReady;
comm->netPluginLoaded = 0;
for (int i = 0; i < NCCL_NET_MAX_PLUGINS; ++i)
ncclCollNetStates[i] = ncclNetStates[i] = ncclNetStateInit;
static ncclResult_t ncclNetPluginLoad(netPluginLib_t* pluginLib) {
pluginLib->dlHandle = ncclOpenNetPluginLib(pluginLib->name);
if (pluginLib->dlHandle == nullptr) goto fail;
// load ncclNet
for (int i = 0; i < NCCL_NET_VERSION_COUNT; i++) {
pluginLib->ncclNetVer = ncclNetVersion[i];
pluginLib->ncclNet = getNcclNet[i](pluginLib->dlHandle);
if (pluginLib->ncclNet) break;
}
pthread_mutex_unlock(&netPluginLock);
// if we fail to find a net, exit
if (pluginLib->ncclNet == nullptr) goto fail;
pluginLib->ncclNetPluginState = ncclNetPluginStateInitReady;
// load ncclColNet
for (int i = 0; i < NCCL_NET_VERSION_COUNT; i++) {
pluginLib->ncclCollNet = getNcclCollNet[i](pluginLib->dlHandle);
if (pluginLib->ncclCollNet) break;
}
if (pluginLib->ncclCollNet == nullptr)
pluginLib->ncclCollNetPluginState = ncclNetPluginStateLoadFailed;
else
pluginLib->ncclCollNetPluginState = ncclNetPluginStateInitReady;
INFO(NCCL_INIT|NCCL_NET, "Successfully loaded external plugin %s", pluginLib->name);
exit:
return ncclSuccess;
fail:
if (pluginLib->dlHandle) {
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle));
}
pluginLib->ncclNetPluginState = ncclNetPluginStateLoadFailed;
pluginLib->ncclCollNetPluginState = ncclNetPluginStateLoadFailed;
goto exit;
}
ncclResult_t ncclNetCheckDeviceVersion(struct ncclComm* comm, ncclNet_t* net, int dev) {
@@ -172,72 +136,156 @@ ncclResult_t ncclNetCheckDeviceVersion(struct ncclComm* comm, ncclNet_t* net, in
return ncclSuccess;
}
static ncclResult_t netGetState(int i, enum ncclNetState* state) {
pthread_mutex_lock(&netLock);
if (ncclNetStates[i] == ncclNetStateInit) {
int ndev;
if (ncclNets[i]->init(ncclDebugLog, ncclProfilerCallback) != ncclSuccess) ncclNetStates[i] = ncclNetStateDisabled;
else if (ncclNets[i]->devices(&ndev) != ncclSuccess || ndev <= 0) ncclNetStates[i] = ncclNetStateDisabled;
else ncclNetStates[i] = ncclNetStateEnabled;
static ncclResult_t ncclNetPluginInit(netPluginLib_t* pluginLib) {
int ndev;
if (pluginLib->ncclNetPluginState == ncclNetPluginStateInitReady && pluginLib->ncclNet) {
if (pluginLib->ncclNet->init(ncclDebugLog, ncclProfilerCallback) != ncclSuccess) goto fail;
if (pluginLib->ncclNet->devices(&ndev) != ncclSuccess || ndev <= 0) goto fail;
}
pluginLib->ncclNetPluginState = ncclNetPluginStateEnabled;
INFO(NCCL_INIT|NCCL_NET, "Initialized NET plugin %s", pluginLib->ncclNet->name);
if (pluginLib->ncclCollNetPluginState == ncclNetPluginStateInitReady && pluginLib->ncclCollNet) {
if (pluginLib->ncclCollNet->init(ncclDebugLog) != ncclSuccess) pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
else if (pluginLib->ncclCollNet->devices(&ndev) != ncclSuccess || ndev <= 0) pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
else {
pluginLib->ncclCollNetPluginState = ncclNetPluginStateEnabled;
}
}
exit:
return ncclSuccess;
fail:
pluginLib->ncclNetPluginState = ncclNetPluginStateDisabled;
pluginLib->ncclCollNetPluginState = ncclNetPluginStateDisabled;
goto exit;
}
static ncclResult_t ncclNetPluginAssignToComm(struct ncclComm* comm, int pluginIndex, bool* isAssigned) {
const char* netName = comm->config.netName;
if (netName && strcasecmp(netName, netPluginLibs[pluginIndex].ncclNet->name) != 0) goto fail;
if (ncclSuccess != ncclNetCheckDeviceVersion(comm, netPluginLibs[pluginIndex].ncclNet, 0)) goto fail;
if (netPluginLibs[pluginIndex].ncclNetPluginState >= ncclNetPluginStateEnabled) {
comm->ncclNet = netPluginLibs[pluginIndex].ncclNet;
comm->ncclNetVer = netPluginLibs[pluginIndex].ncclNetVer;
comm->netPluginIndex = pluginIndex;
netPluginLibs[pluginIndex].ncclNetPluginRefCount++;
*isAssigned = true;
INFO(NCCL_INIT|NCCL_NET, "Assigned NET plugin %s to comm", netPluginLibs[pluginIndex].ncclNet->name);
if (netPluginLibs[pluginIndex].ncclCollNetPluginState >= ncclNetPluginStateEnabled) {
comm->ncclCollNet = netPluginLibs[pluginIndex].ncclCollNet;
}
}
exit:
return ncclSuccess;
fail:
*isAssigned = false;
netPluginLibs[pluginIndex].ncclNetPluginState = ncclNetPluginStateEnabled;
netPluginLibs[pluginIndex].ncclCollNetPluginState = ncclNetPluginStateEnabled;
goto exit;
}
static ncclResult_t ncclNetPluginDisableOtherExternal(int pluginIndex) {
// Only if an external plugin is enabled, disable other external plugins
if (pluginIndex >= (pluginCount - NCCL_NET_NUM_INTERNAL_PLUGINS)) return ncclSuccess;
char names[MAX_STR_LEN*(NCCL_NET_MAX_PLUGINS - NCCL_NET_NUM_INTERNAL_PLUGINS)] = { 0 };
for (int i = 0; i < (pluginCount - NCCL_NET_NUM_INTERNAL_PLUGINS); i++) {
if (i != pluginIndex) {
// Append all disabled plugin names to a string
snprintf(names+strlen(names), sizeof(names)-strlen(names), (strlen(names) == 0) ? "%s" : ", %s", netPluginLibs[i].name);
netPluginLibs[i].ncclNetPluginState = ncclNetPluginStateDisabled;
}
}
if(strlen(names) > 0) {
INFO(NCCL_INIT|NCCL_NET, "Disabling external plugins: %s", names);
}
*state = ncclNetStates[i];
pthread_mutex_unlock(&netLock);
return ncclSuccess;
}
static ncclResult_t collNetGetState(int i, enum ncclNetState* state) {
pthread_mutex_lock(&netLock);
if (ncclCollNetStates[i] == ncclNetStateInit) {
int ndev;
if (ncclCollNets[i]->init(ncclDebugLog) != ncclSuccess) ncclCollNetStates[i] = ncclNetStateDisabled;
else if (ncclCollNets[i]->devices(&ndev) != ncclSuccess || ndev <= 0) ncclCollNetStates[i] = ncclNetStateDisabled;
else ncclCollNetStates[i] = ncclNetStateEnabled;
static void initPluginLibsOnceFunc() {
char* netPluginName = nullptr;
const char* defaultNetPlugin = "libnccl-net.so";
const char* envNetPlugin = nullptr;
char* envNetPluginList = nullptr;
char* savePtr = nullptr;
int pluginCounter = 0;
memset(netPluginLibs, 0, NCCL_NET_MAX_PLUGINS * sizeof(netPluginLib_t));
envNetPlugin = ncclGetEnv("NCCL_NET_PLUGIN");
if (envNetPlugin) {
envNetPluginList = strdup(envNetPlugin);
// Iterate over list until the list is empty
netPluginName = strtok_r(envNetPluginList, ",", &savePtr);
while(netPluginName) {
// We have 2 internal plugins (ib and socket)
// So, we can have at most( NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS)) in the NCCL_NET_PLUGIN list
if (pluginCounter >= (NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS))) {
INFO(NCCL_NET|NCCL_INIT,"NCCL_NET_PLUGIN list contains more than %d plugins, ignoring the rest", (NCCL_NET_MAX_PLUGINS - (NCCL_NET_NUM_INTERNAL_PLUGINS + 1)));
break;
}
// need to leave space for the name + "\n"
if((strlen(netPluginName)+1) <= MAX_STR_LEN) {
netPluginLibs[pluginCounter].ncclNetPluginState = ncclNetPluginStateLoadReady;
netPluginLibs[pluginCounter].ncclNetPluginRefCount = ncclParamNetPluginRefCount();
strcpy(netPluginLibs[pluginCounter].name, netPluginName);
pluginCounter++;
} else {
INFO(NCCL_NET|NCCL_INIT,"NCCL_NET_PLUGIN list contains a plugin name %s longer than %d characters, ignoring it.", netPluginName, MAX_STR_LEN);
}
netPluginName = strtok_r(nullptr, ",", &savePtr);
}
if (envNetPluginList) free(envNetPluginList);
} else {
// Add default net plugin
netPluginLibs[pluginCounter].ncclNetPluginState = ncclNetPluginStateLoadReady;
netPluginLibs[pluginCounter].ncclNetPluginRefCount = ncclParamNetPluginRefCount();
strcpy(netPluginLibs[pluginCounter++].name, defaultNetPlugin);
}
*state = ncclCollNetStates[i];
pthread_mutex_unlock(&netLock);
return ncclSuccess;
// Add 2 internal ib and socket plugins
netPluginLibs[pluginCounter].ncclNet = &ncclNetIb;
netPluginLibs[pluginCounter++].ncclNetPluginState = ncclNetPluginStateInitReady;
netPluginLibs[pluginCounter].ncclNet = &ncclNetSocket;
netPluginLibs[pluginCounter++].ncclNetPluginState = ncclNetPluginStateInitReady;
pluginCount = pluginCounter;
}
ncclResult_t ncclNetInit(struct ncclComm* comm) {
// Initialize main communication network
const char* netName;
bool ok = false;
netName = comm->config.netName;
for (int i=0; i<3; i++) {
if (ncclNets[i] == nullptr) continue;
enum ncclNetState state;
NCCLCHECK(netGetState(i, &state));
if (state != ncclNetStateEnabled) continue;
if (netName && strcasecmp(netName, ncclNets[i]->name) != 0) continue;
if (ncclSuccess != ncclNetCheckDeviceVersion(comm, ncclNets[i], 0)) {
// Mismatched device plugin version
continue;
bool ncclNetPluginInitialized = false;
pthread_once(&initPluginLibsOnceControl, initPluginLibsOnceFunc);
pthread_mutex_lock(&netPluginLock);
for (int pluginIndex = 0; pluginIndex < pluginCount; pluginIndex++) {
if ((pluginIndex < (pluginCount - NCCL_NET_NUM_INTERNAL_PLUGINS)) && (netPluginLibs[pluginIndex].ncclNetPluginState == ncclNetPluginStateLoadReady)) {
NCCLCHECK(ncclNetPluginLoad(&netPluginLibs[pluginIndex]));
}
comm->ncclNet = ncclNets[i];
comm->ncclNetVer = ncclNetsVer[i];
ok = true;
if (ncclCollNets[i]) {
NCCLCHECK(collNetGetState(i, &state));
if (state == ncclNetStateEnabled) {
comm->ncclCollNet = ncclCollNets[i];
if (netPluginLibs[pluginIndex].ncclNetPluginState == ncclNetPluginStateInitReady) {
NCCLCHECK(ncclNetPluginInit(&netPluginLibs[pluginIndex]));
}
if (netPluginLibs[pluginIndex].ncclNetPluginState == ncclNetPluginStateEnabled) {
bool isAssigned = false;
NCCLCHECK(ncclNetPluginAssignToComm(comm, pluginIndex, &isAssigned));
if (isAssigned) {
// If one external plugin is assigned to a comm, then disable all other external plugins
ncclNetPluginDisableOtherExternal(pluginIndex);
ncclNetPluginInitialized = true;
break;
}
}
break;
}
if (!ok) {
WARN("Error: network %s not found.", netName ? netName : "");
return ncclInvalidUsage;
}
return ncclSuccess;
pthread_mutex_unlock(&netPluginLock);
if (ncclNetPluginInitialized) return ncclSuccess;
WARN("Failed to initialize any NET plugin");
return ncclInvalidUsage;
}
ncclResult_t ncclNetFinalize(struct ncclComm* comm) {
comm->ncclNet = nullptr;
comm->ncclCollNet = nullptr;
int pluginIndex = comm->netPluginIndex;
pthread_mutex_lock(&netPluginLock);
netPluginLibs[pluginIndex].ncclNetPluginRefCount--;
for (int i = 0; i < (pluginCount - NCCL_NET_NUM_INTERNAL_PLUGINS); i++) {
NCCLCHECK(ncclNetPluginUnload(&netPluginLibs[i]));
}
pthread_mutex_unlock(&netPluginLock);
return ncclSuccess;
}
+36 -29
Просмотреть файл
@@ -23,7 +23,7 @@ enum ncclPluginType {
static void *libHandles[NUM_LIBS];
static const char *pluginNames[NUM_LIBS] = { "NET", "TUNER", "PROFILER" };
static const char *pluginPrefix[NUM_LIBS] = { "libnccl-net", "libnccl-tuner", "libnccl-profiler" };
static const char *pluginFallback[NUM_LIBS] = { "Using internal net plugin.", "Using internal tuner plugin.", "" };
static const char *pluginFallback[NUM_LIBS] = { "", "Using internal tuner plugin.", "" };
static unsigned long subsys[NUM_LIBS] = { NCCL_INIT|NCCL_NET, NCCL_INIT|NCCL_TUNING, NCCL_INIT };
static void* tryOpenLib(char* name, int* err, char* errStr) {
@@ -49,10 +49,9 @@ static void* tryOpenLib(char* name, int* err, char* errStr) {
return handle;
}
static void appendNameToList(char* nameList, int *nameListLen, char* name) {
snprintf(nameList, *nameListLen, " %s", name);
nameList += strlen(name) + 1;
*nameListLen -= strlen(name) + 1;
static void appendNameToList(char* nameList, int *leftChars, char* name) {
snprintf(nameList + PATH_MAX - *leftChars, *leftChars, " %s", name);
*leftChars -= strlen(name) + 1;
}
static void* openPluginLib(enum ncclPluginType type, const char* libName) {
@@ -62,28 +61,31 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
char eNoEntNameList[PATH_MAX] = { 0 };
if (libName && strlen(libName)) {
snprintf(libName_, MAX_STR_LEN, "%s", libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
appendNameToList(eNoEntNameList, &len, libName_);
// match names that start with 'lib' and end with '.so'
if (strlen(libName) >= strlen("libX.so") && strncmp(libName, "lib", strlen("lib")) == 0 && strncmp(libName + strlen(libName) - strlen(".so"), ".so", strlen(".so")) == 0) {
snprintf(libName_, MAX_STR_LEN, "%s", libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
appendNameToList(eNoEntNameList, &len, libName_);
} else {
INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
}
} else {
INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
}
snprintf(libName_, MAX_STR_LEN, "%s-%s.so", pluginPrefix[type], libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
appendNameToList(eNoEntNameList, &len, libName_);
} else {
INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
snprintf(libName_, MAX_STR_LEN, "%s-%s.so", pluginPrefix[type], libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
appendNameToList(eNoEntNameList, &len, libName_);
} else {
INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
}
}
} else {
snprintf(libName_, MAX_STR_LEN, "%s.so", pluginPrefix[type]);
@@ -123,12 +125,17 @@ void* ncclGetNetPluginLib(void) {
}
ncclResult_t ncclClosePluginLib(void* handle) {
bool found = false;
for (int l=0; l<NUM_LIBS; l++) {
if (libHandles[l] == handle) {
libHandles[l] = nullptr;
dlclose(handle);
return ncclSuccess;
if (!found) {
if (handle) {
dlclose(handle);
}
found = true;
}
}
}
return ncclInternalError;
return ncclSuccess;
}
+43 -48
Просмотреть файл
@@ -17,6 +17,7 @@
extern ncclProfiler_t* getNcclProfiler_v1(void* lib);
extern ncclProfiler_t* getNcclProfiler_v2(void* lib);
extern ncclProfiler_t* getNcclProfiler_v3(void* lib);
extern ncclProfiler_t* getNcclProfiler_v4(void* lib);
static pthread_mutex_t profilerLock = PTHREAD_MUTEX_INITIALIZER;
static int profilerPluginRefCount;
@@ -49,7 +50,10 @@ static ncclResult_t ncclProfilerPluginLoad(void) {
goto fail;
}
ncclProfiler = getNcclProfiler_v3(profilerPluginLib);
ncclProfiler = getNcclProfiler_v4(profilerPluginLib);
if (ncclProfiler == nullptr) {
ncclProfiler = getNcclProfiler_v3(profilerPluginLib);
}
if (ncclProfiler == nullptr) {
ncclProfiler = getNcclProfiler_v2(profilerPluginLib);
}
@@ -162,7 +166,7 @@ ncclResult_t ncclProfilerPluginInit(struct ncclComm* comm) {
TIME_START_EVENT(init);
ncclProfilerPluginLoad();
if (__builtin_expect(ncclProfiler != NULL, 0)) {
int err = ncclProfiler->init(&comm->profilerContext, &ncclProfilerEventMask);
int err = ncclProfiler->init(&comm->profilerContext, &ncclProfilerEventMask, comm->config.commName, comm->commHash, comm->nNodes, comm->nRanks, comm->rank, ncclDebugLog);
if (err) {
WARN("Profiler init failed with error (%d). Continue without profiler.", err);
ncclProfiler = NULL;
@@ -239,8 +243,6 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
eDescr.type = ncclProfileColl;
eDescr.parentObj = plan->groupEventHandle;
eDescr.rank = plan->comm->rank;
eDescr.coll.name = plan->comm->commName;
eDescr.coll.commHash = plan->comm->commHash;
eDescr.coll.seqNumber = plan->comm->seqNumber[ct->func];
eDescr.coll.func = ncclFuncToString(ct->func);
eDescr.coll.sendBuff = ct->sendbuff;
@@ -248,7 +250,7 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
eDescr.coll.count = ct->count;
eDescr.coll.root = ct->root;
eDescr.coll.datatype = ncclDatatypeToString(ct->datatype);
eDescr.coll.nMaxChannels = ct->nMaxChannels;
eDescr.coll.nChannels = ct->nChannels;
eDescr.coll.nWarps = ct->nWarps;
eDescr.coll.algo = ncclAlgoToString(ct->algorithm);
eDescr.coll.proto = ncclProtoToString(ct->protocol);
@@ -264,7 +266,7 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
// gives the consistency.
if (!plan->persistent || (__builtin_expect(ncclProfiler != NULL, 0) && plan->groupEventHandle &&
(ct->eActivationMask & ncclProfileKernelCh)))
plan->comm->seqNumber[ct->func]++;
__atomic_fetch_add(&plan->comm->seqNumber[ct->func], 1, __ATOMIC_RELAXED);
ct = ct->next;
}
if (__builtin_expect(ncclProfiler != NULL, 0)) {
@@ -277,13 +279,12 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
eDescr.type = ncclProfileP2p;
eDescr.parentObj = plan->groupEventHandle;
eDescr.rank = plan->comm->rank;
eDescr.p2p.name = plan->comm->commName;
eDescr.p2p.commHash = plan->comm->commHash;
eDescr.p2p.func = ncclFuncToString(pt->func);
eDescr.p2p.buff = pt->buff;
eDescr.p2p.count = pt->count;
eDescr.p2p.datatype = ncclDatatypeToString(pt->datatype);
eDescr.p2p.peer = pt->root;
eDescr.p2p.nChannels = pt->nChannels;
ncclProfiler->startEvent(plan->comm->profilerContext, &pt->eventHandle, &eDescr);
}
pt = pt->next;
@@ -319,7 +320,7 @@ ncclResult_t ncclProfilerStopTaskEvents(struct ncclKernelPlan* plan) {
// made of sliceSteps steps rather than one step. In the profiler we are still
// interested in whole network transfers though, so we account for this when
// computing the actual network step number.
ncclResult_t ncclProfilerStartSendProxyOpEvent(int s, struct ncclProxyArgs* args) {
ncclResult_t ncclProfilerStartProxyOpEvent(int s, struct ncclProxyArgs* args) {
TIME_START_EVENT(proxyOpStart);
struct ncclProxySubArgs* sub = &args->subs[s];
if (__builtin_expect(ncclProfiler != NULL, 0)) {
@@ -333,29 +334,7 @@ ncclResult_t ncclProfilerStartSendProxyOpEvent(int s, struct ncclProxyArgs* args
eDescr.proxyOp.peer = sub->peer;
eDescr.proxyOp.nSteps = DIVUP(sub->nsteps, args->sliceSteps);
eDescr.proxyOp.chunkSize = args->chunkSize * args->sliceSteps;
eDescr.proxyOp.isSend = 1;
ncclProfiler->startEvent(sub->profilerContext, &sub->opEventHandle, &eDescr);
}
}
TIME_STOP_EVENT(proxyOpStart);
return ncclSuccess;
}
ncclResult_t ncclProfilerStartRecvProxyOpEvent(int s, struct ncclProxyArgs* args) {
TIME_START_EVENT(proxyOpStart);
struct ncclProxySubArgs* sub = &args->subs[s];
if (__builtin_expect(ncclProfiler != NULL, 0)) {
if (sub->eActivationMask & (ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileNetPlugin)) {
ncclProfilerEventDescr_t eDescr = { 0 };
eDescr.type = ncclProfileProxyOp;
eDescr.parentObj = sub->taskEventHandle;
eDescr.rank = sub->rank;
eDescr.proxyOp.pid = sub->pid;
eDescr.proxyOp.channelId = sub->channelId;
eDescr.proxyOp.peer = sub->peer;
eDescr.proxyOp.nSteps = DIVUP(sub->nsteps, args->sliceSteps);
eDescr.proxyOp.chunkSize = args->chunkSize * args->sliceSteps;
eDescr.proxyOp.isSend = 0;
eDescr.proxyOp.isSend = args->progress == ncclTransports[TRANSPORT_NET]->send.proxyProgress ? 1 : 0;
ncclProfiler->startEvent(sub->profilerContext, &sub->opEventHandle, &eDescr);
}
}
@@ -385,7 +364,8 @@ ncclResult_t ncclProfilerStartSendProxyStepEvent(int s, struct ncclProxyArgs* ar
eDescr.parentObj = sub->opEventHandle;
eDescr.rank = sub->rank;
eDescr.proxyStep.step = step_;
ncclProfiler->startEvent(sub->profilerContext, &sub->stepEventHandles[step_%NCCL_STEPS], &eDescr);
ncclProfiler->startEvent(sub->profilerContext, &sub->pHandles[step_%NCCL_STEPS].stepEventHandle, &eDescr);
sub->pHandles[step_%NCCL_STEPS].subArgPtr = sub;
}
}
TIME_STOP_EVENT(proxyStepStart);
@@ -403,7 +383,8 @@ ncclResult_t ncclProfilerStartRecvProxyStepEvent(int s, struct ncclProxyArgs* ar
eDescr.parentObj = sub->opEventHandle;
eDescr.rank = sub->rank;
eDescr.proxyStep.step = step_;
ncclProfiler->startEvent(sub->profilerContext, &sub->stepEventHandles[step_%NCCL_STEPS], &eDescr);
ncclProfiler->startEvent(sub->profilerContext, &sub->pHandles[step_%NCCL_STEPS].stepEventHandle, &eDescr);
sub->pHandles[step_%NCCL_STEPS].subArgPtr = sub;
}
}
TIME_STOP_EVENT(proxyStepStart);
@@ -415,9 +396,9 @@ ncclResult_t ncclProfilerStopProxyStepEvent(int s, struct ncclProxyArgs* args, i
struct ncclProxySubArgs* sub = &args->subs[s];
if (__builtin_expect(ncclProfiler != NULL, 0)) {
int step_ = DIVUP(stepId, args->sliceSteps);
if (sub->stepEventHandles[step_%NCCL_STEPS]) {
ncclProfiler->stopEvent(sub->stepEventHandles[step_%NCCL_STEPS]);
sub->stepEventHandles[step_%NCCL_STEPS] = NULL;
if (sub->pHandles[step_%NCCL_STEPS].stepEventHandle) {
ncclProfiler->stopEvent(sub->pHandles[step_%NCCL_STEPS].stepEventHandle);
sub->pHandles[step_%NCCL_STEPS].stepEventHandle = NULL;
}
}
TIME_STOP_EVENT(proxyStepStop);
@@ -451,7 +432,7 @@ ncclResult_t ncclProfilerStopProxyCtrlEvent(void* eHandle) {
return ncclSuccess;
}
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s) {
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t start) {
if (__builtin_expect(ncclProfiler != NULL, 0)) {
struct ncclProxySubArgs* sub = &args->subs[s];
if (sub->eActivationMask & ncclProfileKernelCh) {
@@ -459,29 +440,31 @@ ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s) {
eDescr.type = ncclProfileKernelCh;
eDescr.parentObj = sub->taskEventHandle;
eDescr.kernelCh.channelId = sub->channelId;
eDescr.kernelCh.pTimer = start;
ncclProfiler->startEvent(sub->profilerContext, &sub->kernelEventHandle, &eDescr);
}
}
return ncclSuccess;
}
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s) {
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t stop) {
if (__builtin_expect(ncclProfiler != NULL, 0)) {
struct ncclProxySubArgs* sub = &args->subs[s];
if (sub->kernelEventHandle) {
ncclProfilerEventStateArgs_t a = { };
a.kernelCh.pTimer = stop;
ncclProfiler->recordEventState(sub->kernelEventHandle, ncclProfilerKernelChStop, &a);
ncclProfiler->stopEvent(sub->kernelEventHandle);
}
}
return ncclSuccess;
}
ncclResult_t ncclProfilerRecordProxyOpEventState(int s, struct ncclProxyArgs* args, int steps, size_t transSize, ncclProfilerEventState_t eState) {
ncclResult_t ncclProfilerRecordProxyOpEventState(int s, struct ncclProxyArgs* args, ncclProfilerEventState_t eState) {
TIME_START_EVENT(proxyOpRecord);
struct ncclProxySubArgs* sub = &args->subs[s];
if (__builtin_expect(ncclProfiler != NULL, 0) && sub->opEventHandle) {
ncclProfilerEventStateArgs_t a = { };
a.proxyOp.steps = DIVUP(steps, args->sliceSteps);
a.proxyOp.transSize = transSize;
ncclProfiler->recordEventState(sub->opEventHandle, eState, &a);
}
TIME_STOP_EVENT(proxyOpRecord);
@@ -493,8 +476,10 @@ ncclResult_t ncclProfilerRecordProxyStepEventState(int s, struct ncclProxyArgs*
struct ncclProxySubArgs* sub = &args->subs[s];
if (__builtin_expect(ncclProfiler != NULL, 0) && sub->opEventHandle) {
int step_ = DIVUP(stepId, args->sliceSteps);
if (sub->stepEventHandles[step_%NCCL_STEPS]) {
ncclProfiler->recordEventState(sub->stepEventHandles[step_%NCCL_STEPS], eState, 0);
if (sub->pHandles[step_%NCCL_STEPS].stepEventHandle) {
ncclProfilerEventStateArgs_t a = { };
a.proxyStep.transSize = sub->transSize;
ncclProfiler->recordEventState(sub->pHandles[step_%NCCL_STEPS].stepEventHandle, eState, &a);
}
}
TIME_STOP_EVENT(proxyStepRecord);
@@ -547,18 +532,28 @@ bool ncclProfilerPluginLoaded(void) {
ncclResult_t ncclProfilerCallback(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData) {
if (__builtin_expect(ncclProfiler != NULL, 0)) {
struct ncclProxySubArgs* sub = (struct ncclProxySubArgs*)pHandle;
if (type == 0) { // start
if (type == ncclProfilerNetEventStart) { // start
struct ncclProxyEventHandle* p = (struct ncclProxyEventHandle*)pHandle;
struct ncclProxySubArgs* sub = p->subArgPtr;
if (sub->eActivationMask & ncclProfileNetPlugin) {
ncclProfilerEventDescr_t eDescr = { 0 };
eDescr.type = ncclProfileNetPlugin;
eDescr.parentObj = sub->stepEventHandles[sub->profilerSteps%NCCL_STEPS];
eDescr.parentObj = p->stepEventHandle;
eDescr.rank = sub->rank;
eDescr.netPlugin.id = pluginId;
eDescr.netPlugin.data = extData;
ncclProfiler->startEvent(sub->profilerContext, eHandle, &eDescr);
}
} else { // stop
} else if (type == ncclProfilerNetEventStop) { // stop
ncclProfiler->stopEvent(*eHandle);
} else if (type == ncclProfilerNetEventUpdate) { // update
ncclProfilerEventStateArgs_t args = { };
args.netPlugin.data = extData;
ncclProfiler->recordEventState(*eHandle, ncclProfilerNetPluginUpdate, &args);
} else { // update and stop
ncclProfilerEventStateArgs_t args = { };
args.netPlugin.data = extData;
ncclProfiler->recordEventState(*eHandle, ncclProfilerNetPluginUpdate, &args);
ncclProfiler->stopEvent(*eHandle);
}
}
+27 -13
Просмотреть файл
@@ -53,6 +53,7 @@ static uint8_t ncclStringToDatatype(const char* dt) {
}
static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclProfilerEventDescr_t* eDescr) {
*eHandle = NULL;
ncclProfilerEventDescr_v1_t eDescr_v1 = { 0 };
eDescr_v1.type = eDescr->type;
eDescr_v1.parentObj = eDescr->parentObj;
@@ -60,8 +61,8 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
switch(eDescr->type) {
case ncclProfileGroup: break;
case ncclProfileColl: {
eDescr_v1.coll.name = eDescr->coll.name;
eDescr_v1.coll.commHash = eDescr->coll.commHash;
eDescr_v1.coll.name = nullptr; // removed in v4
eDescr_v1.coll.commHash = 0; // removed in v4
eDescr_v1.coll.seqNumber = eDescr->coll.seqNumber;
eDescr_v1.coll.func = ncclStringToFunc(eDescr->coll.func);
eDescr_v1.coll.sendBuff = eDescr->coll.sendBuff;
@@ -71,14 +72,14 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
eDescr_v1.coll.datatype = ncclStringToDatatype(eDescr->coll.datatype);
eDescr_v1.coll.op = 0; // removed in v2
eDescr_v1.coll.trafficBytes = 0; // removed in v3
eDescr_v1.coll.nMaxChannels = eDescr->coll.nMaxChannels;
eDescr_v1.coll.nMaxChannels = eDescr->coll.nChannels;
eDescr_v1.coll.nWarps = eDescr->coll.nWarps;
eDescr_v1.coll.algo = ncclStringToAlgo(eDescr->coll.algo);
eDescr_v1.coll.proto = ncclStringToProto(eDescr->coll.proto);
} break;
case ncclProfileP2p: {
eDescr_v1.p2p.name = eDescr->p2p.name;
eDescr_v1.p2p.commHash = eDescr->p2p.commHash;
eDescr_v1.p2p.name = nullptr; // removed in v4
eDescr_v1.p2p.commHash = 0; // removed in v4
eDescr_v1.p2p.func = ncclStringToFunc(eDescr->p2p.func);
eDescr_v1.p2p.buff = eDescr->p2p.buff;
eDescr_v1.p2p.count = eDescr->p2p.count;
@@ -97,21 +98,34 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
eDescr_v1.proxyStep.step = eDescr->proxyStep.step;
} break;
case ncclProfileProxyCtrl: break;
case ncclProfileKernelCh:
case ncclProfileNetPlugin: {
*eHandle = NULL;
return ncclSuccess;
}
default:;
default: return ncclSuccess;
}
return ncclProfiler_v1->startEvent(context, eHandle, &eDescr_v1);
}
static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEventState_t eState, ncclProfilerEventStateArgs_t* eStateArgs) {
return ncclProfiler_v1->recordEventState(eHandle, eState, (ncclProfilerEventStateArgs_v1_t*)eStateArgs);
ncclProfilerEventStateArgs_v1_t args = { };
switch (eState) {
case ncclProfilerProxyCtrlIdle:
case ncclProfilerProxyCtrlActive:
case ncclProfilerProxyCtrlSleep:
case ncclProfilerProxyCtrlWakeup:
case ncclProfilerProxyCtrlAppend:
case ncclProfilerProxyCtrlAppendEnd:
args.proxyCtrl.appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
break;
case ncclProfilerProxyStepSendGPUWait:
case ncclProfilerProxyStepSendWait:
case ncclProfilerProxyStepRecvWait:
case ncclProfilerProxyStepRecvFlushWait:
case ncclProfilerProxyStepRecvGPUWait:
break;
default: return ncclSuccess;
}
return ncclProfiler_v1->recordEventState(eHandle, eState, &args);
}
static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask) {
static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
NCCLCHECK(ncclProfiler_v1->init(context, eActivationMask));
ncclProfiler.startEvent = ncclProfiler_startEvent;
ncclProfiler.stopEvent = ncclProfiler_v1->stopEvent;
+25 -7
Просмотреть файл
@@ -20,8 +20,8 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
switch(eDescr->type) {
case ncclProfileGroup: break;
case ncclProfileColl: {
eDescr_v2.coll.name = eDescr->coll.name;
eDescr_v2.coll.commHash = eDescr->coll.commHash;
eDescr_v2.coll.name = nullptr; // removed in v4
eDescr_v2.coll.commHash = 0; // removed in v4
eDescr_v2.coll.seqNumber = eDescr->coll.seqNumber;
eDescr_v2.coll.func = eDescr->coll.func;
eDescr_v2.coll.sendBuff = eDescr->coll.sendBuff;
@@ -30,14 +30,14 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
eDescr_v2.coll.root = eDescr->coll.root;
eDescr_v2.coll.datatype = eDescr->coll.datatype;
eDescr_v2.coll.trafficBytes = 0; // removed in v3
eDescr_v2.coll.nMaxChannels = eDescr->coll.nMaxChannels;
eDescr_v2.coll.nMaxChannels = eDescr->coll.nChannels;
eDescr_v2.coll.nWarps = eDescr->coll.nWarps;
eDescr_v2.coll.algo = eDescr->coll.algo;
eDescr_v2.coll.proto = eDescr->coll.proto;
} break;
case ncclProfileP2p: {
eDescr_v2.p2p.name = eDescr->p2p.name;
eDescr_v2.p2p.commHash = eDescr->p2p.commHash;
eDescr_v2.p2p.name = nullptr; // removed in v4
eDescr_v2.p2p.commHash = 0; // removed in v4
eDescr_v2.p2p.func = eDescr->p2p.func;
eDescr_v2.p2p.buff = eDescr->p2p.buff;
eDescr_v2.p2p.count = eDescr->p2p.count;
@@ -62,10 +62,28 @@ static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclP
}
static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEventState_t eState, ncclProfilerEventStateArgs_t* eStateArgs) {
return ncclProfiler_v2->recordEventState(eHandle, eState, (ncclProfilerEventStateArgs_v2_t *)eStateArgs);
ncclProfilerEventStateArgs_v2_t args = { };
switch (eState) {
case ncclProfilerProxyCtrlIdle:
case ncclProfilerProxyCtrlActive:
case ncclProfilerProxyCtrlSleep:
case ncclProfilerProxyCtrlWakeup:
case ncclProfilerProxyCtrlAppend:
case ncclProfilerProxyCtrlAppendEnd:
args.proxyCtrl.appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
break;
case ncclProfilerProxyStepSendGPUWait:
case ncclProfilerProxyStepSendWait:
case ncclProfilerProxyStepRecvWait:
case ncclProfilerProxyStepRecvFlushWait:
case ncclProfilerProxyStepRecvGPUWait:
break;
default: return ncclSuccess;
}
return ncclProfiler_v2->recordEventState(eHandle, eState, &args);
}
static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask) {
static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
NCCLCHECK(ncclProfiler_v2->init(context, eActivationMask));
ncclProfiler.startEvent = ncclProfiler_startEvent;
ncclProfiler.stopEvent = ncclProfiler_v2->stopEvent;
+92 -1
Просмотреть файл
@@ -6,14 +6,105 @@
#include "comm.h"
#include "nccl_profiler.h"
#include "checks.h"
static ncclProfiler_t ncclProfiler;
static ncclProfiler_v3_t* ncclProfiler_v3;
static ncclResult_t ncclProfiler_startEvent(void* context, void** eHandle, ncclProfilerEventDescr_t* eDescr) {
*eHandle = nullptr;
ncclProfilerEventDescr_v3_t eDescr_v3 = { };
eDescr_v3.type = eDescr->type;
eDescr_v3.parentObj = eDescr->parentObj;
eDescr_v3.rank = eDescr->rank;
switch(eDescr->type) {
case ncclProfileGroup: break;
case ncclProfileColl: {
eDescr_v3.coll.name = nullptr; // removed in v4
eDescr_v3.coll.commHash = 0; // removed in v4
eDescr_v3.coll.seqNumber = eDescr->coll.seqNumber;
eDescr_v3.coll.func = eDescr->coll.func;
eDescr_v3.coll.sendBuff = eDescr->coll.sendBuff;
eDescr_v3.coll.recvBuff = eDescr->coll.recvBuff;
eDescr_v3.coll.count = eDescr->coll.count;
eDescr_v3.coll.root = eDescr->coll.root;
eDescr_v3.coll.datatype = eDescr->coll.datatype;
eDescr_v3.coll.nMaxChannels = eDescr->coll.nChannels;
eDescr_v3.coll.nWarps = eDescr->coll.nWarps;
eDescr_v3.coll.algo = eDescr->coll.algo;
eDescr_v3.coll.proto = eDescr->coll.proto;
} break;
case ncclProfileP2p: {
eDescr_v3.p2p.name = nullptr; // removed in v4
eDescr_v3.p2p.commHash = 0; // removed in v4
eDescr_v3.p2p.func = eDescr->p2p.func;
eDescr_v3.p2p.buff = eDescr->p2p.buff;
eDescr_v3.p2p.count = eDescr->p2p.count;
eDescr_v3.p2p.datatype = eDescr->p2p.datatype;
eDescr_v3.p2p.peer = eDescr->p2p.peer;
} break;
case ncclProfileProxyOp: {
eDescr_v3.proxyOp.pid = eDescr->proxyOp.pid;
eDescr_v3.proxyOp.channelId = eDescr->proxyOp.channelId;
eDescr_v3.proxyOp.peer = eDescr->proxyOp.peer;
eDescr_v3.proxyOp.nSteps = eDescr->proxyOp.nSteps;
eDescr_v3.proxyOp.chunkSize = eDescr->proxyOp.chunkSize;
eDescr_v3.proxyOp.isSend = eDescr->proxyOp.isSend;
} break;
case ncclProfileProxyStep: {
eDescr_v3.proxyStep.step = eDescr->proxyStep.step;
} break;
case ncclProfileProxyCtrl: break;
case ncclProfileKernelCh: {
eDescr_v3.kernelCh.channelId = eDescr->kernelCh.channelId;
} break;
case ncclProfileNetPlugin: {
eDescr_v3.netPlugin.id = eDescr->netPlugin.id;
eDescr_v3.netPlugin.data = eDescr->netPlugin.data;
} break;
default: return ncclSuccess;
}
return ncclProfiler_v3->startEvent(context, eHandle, &eDescr_v3);
}
static ncclResult_t ncclProfiler_recordEventState(void* eHandle, ncclProfilerEventState_t eState, ncclProfilerEventStateArgs_t* eStateArgs) {
ncclProfilerEventStateArgs_v3_t args = { };
switch (eState) {
case ncclProfilerProxyCtrlIdle:
case ncclProfilerProxyCtrlActive:
case ncclProfilerProxyCtrlSleep:
case ncclProfilerProxyCtrlWakeup:
case ncclProfilerProxyCtrlAppend:
case ncclProfilerProxyCtrlAppendEnd:
args.proxyCtrl.appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
break;
case ncclProfilerProxyStepSendGPUWait:
case ncclProfilerProxyStepSendWait:
case ncclProfilerProxyStepRecvWait:
case ncclProfilerProxyStepRecvFlushWait:
case ncclProfilerProxyStepRecvGPUWait:
break;
default: return ncclSuccess;
}
return ncclProfiler_v3->recordEventState(eHandle, eState, &args);
}
static ncclResult_t ncclProfiler_init(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
NCCLCHECK(ncclProfiler_v3->init(context, eActivationMask));
ncclProfiler.startEvent = ncclProfiler_startEvent;
ncclProfiler.stopEvent = ncclProfiler_v3->stopEvent;
ncclProfiler.recordEventState = ncclProfiler_recordEventState;
ncclProfiler.finalize = ncclProfiler_v3->finalize;
return ncclSuccess;
}
ncclProfiler_t* getNcclProfiler_v3(void* lib) {
ncclProfiler_v3 = (ncclProfiler_v3_t*)dlsym(lib, "ncclProfiler_v3");
if (ncclProfiler_v3) {
ncclProfiler.name = ncclProfiler_v3->name;
ncclProfiler.init = ncclProfiler_init;
INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: loaded %s", ncclProfiler_v3->name);
return ncclProfiler_v3;
return &ncclProfiler;
}
INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: failed to find ncclProfiler_v3");
return NULL;
+21
Просмотреть файл
@@ -0,0 +1,21 @@
/*************************************************************************
* Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "comm.h"
#include "nccl_profiler.h"
#include "checks.h"
static ncclProfiler_v4_t* ncclProfiler_v4;
ncclProfiler_t* getNcclProfiler_v4(void* lib) {
ncclProfiler_v4 = (ncclProfiler_v4_t*)dlsym(lib, "ncclProfiler_v4");
if (ncclProfiler_v4) {
INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: loaded %s", ncclProfiler_v4->name);
return ncclProfiler_v4;
}
INFO(NCCL_INIT|NCCL_ENV, "PROFILER/Plugin: failed to find ncclProfiler_v4");
return NULL;
}
+10 -7
Просмотреть файл
@@ -416,6 +416,7 @@ static ncclResult_t ncclProxyOpToArgs(struct ncclProxyOp* op, struct ncclProxyAr
args->state = ncclProxyOpReady;
args->progress = op->connection->tcomm->proxyProgress;
args->proxyAppendPtr = op->connection->proxyAppendPtr;
if (args->pattern != ncclPatternProfiler) ncclProfilerStartProxyOpEvent(subIndex, args);
return ncclSuccess;
}
@@ -634,10 +635,10 @@ ncclResult_t ncclProxySaveOp(struct ncclComm* comm, struct ncclProxyOp* op, bool
const int rank = comm->rank, nranks = comm->nRanks;
int *nstepsSend = NULL, *nstepsRecv = NULL;
PatRSAlgorithm<char> algo(op->chunkSize, NCCL_STEPS, 16, 0, size, size, op->chunkSize, rank, nranks);
struct ncclPatStep ps = {0};
NCCLCHECKGOTO(ncclCalloc(&nstepsSend, log2Up(nranks)), result, exit_pat_up);
NCCLCHECKGOTO(ncclCalloc(&nstepsRecv, log2Up(nranks)), result, exit_pat_up);
struct ncclPatStep ps;
do {
algo.getNextOp(&ps);
if (ps.flags & PatSkipped) continue;
@@ -668,10 +669,10 @@ ncclResult_t ncclProxySaveOp(struct ncclComm* comm, struct ncclProxyOp* op, bool
const int rank = comm->rank, nranks = comm->nRanks;
int *nstepsSend = NULL, *nstepsRecv = NULL;
PatAGAlgorithm<char> algo(op->chunkSize, NCCL_STEPS, 16, 0, size, size, op->chunkSize, rank, nranks);
struct ncclPatStep ps = {0};
NCCLCHECKGOTO(ncclCalloc(&nstepsSend, log2Up(nranks)), result, exit_pat_down);
NCCLCHECKGOTO(ncclCalloc(&nstepsRecv, log2Up(nranks)), result, exit_pat_down);
struct ncclPatStep ps;
do {
algo.getNextOp(&ps);
if (ps.flags & PatSkipped) continue;
@@ -933,11 +934,13 @@ void* ncclProxyProgress(void *proxyState_) {
INFO(NCCL_ALL,"%s:%d -> %d [Progress Thread]", __FILE__, __LINE__, ret);
break;
}
void* eHandle;
ncclProfilerStartProxyCtrlEvent(proxyState->profilerContext, &eHandle);
if (lastIdle == 0 && idle == 1) ncclProfilerRecordProxyCtrlEventState(eHandle, 0, ncclProfilerProxyCtrlIdle);
if (lastIdle == 1 && idle == 0) ncclProfilerRecordProxyCtrlEventState(eHandle, 0, ncclProfilerProxyCtrlActive);
ncclProfilerStopProxyCtrlEvent(eHandle);
if ((lastIdle == 0 && idle == 1) || (lastIdle == 1 && idle == 0)) {
void* eHandle;
ncclProfilerStartProxyCtrlEvent(proxyState->profilerContext, &eHandle);
if (lastIdle == 0 && idle == 1) ncclProfilerRecordProxyCtrlEventState(eHandle, 0, ncclProfilerProxyCtrlIdle);
if (lastIdle == 1 && idle == 0) ncclProfilerRecordProxyCtrlEventState(eHandle, 0, ncclProfilerProxyCtrlActive);
ncclProfilerStopProxyCtrlEvent(eHandle);
}
if (idle || !state->active || (++proxyOpAppendCounter == ncclParamProgressAppendOpFreq())) {
int added = 0;
proxyOpAppendCounter = 0;
+10 -4
Просмотреть файл
@@ -606,6 +606,10 @@ static ncclResult_t rasCollCommsInit(struct rasCollRequest** pReq, size_t* pReqL
for (int commIdx = 0; commIdx < nNcclComms; commIdx++) {
if (ncclComms[commIdx] == nullptr) // nullptr's are always at the end after sorting.
break;
if (!__atomic_load_n(&ncclComms[commIdx]->peerInfoValid, __ATOMIC_ACQUIRE)) {
// Critical data is not yet initialized -- ignore the communicator.
continue;
}
// A process may manage multiple GPUs and thus have multiple communicators with the same commHash.
// Comparing just the commHash is OK though within communicators that are part of the same process.
if (commIdx == 0 || ncclComms[commIdx]->commHash != ncclComms[commIdx-1]->commHash) {
@@ -651,6 +655,8 @@ static ncclResult_t rasCollCommsInit(struct rasCollRequest** pReq, size_t* pReqL
// collCommIdx counts rasCollComms::comm (comm); commIdx indexes ncclComms.
for (int collCommIdx = 0, commIdx = 0; collCommIdx < nComms; collCommIdx++) {
struct ncclComm* ncclComm = ncclComms[commIdx];
if (!__atomic_load_n(&ncclComm->peerInfoValid, __ATOMIC_ACQUIRE))
continue;
comm->commId.commHash = ncclComm->commHash;
comm->commId.hostHash = ncclComm->peerInfo->hostHash;
@@ -663,15 +669,15 @@ static ncclResult_t rasCollCommsInit(struct rasCollRequest** pReq, size_t* pReqL
commIdx++) {
ncclComm = ncclComms[commIdx];
struct rasCollComms::comm::rank* rank = comm->ranks+comm->nRanks;
ncclResult_t asyncError;
rank->commRank = ncclComm->rank;
// rasNetSendCollReq initializes coll->peers[0] to our rasNetListeningSocket.addr, so peerIdx is initially
// always 0. It will increase after we send this response back to the peer we got the request from.
rank->peerIdx = 0;
memcpy(rank->collOpCounts, ncclComm->seqNumber, sizeof(rank->collOpCounts));
rank->status.initState = ncclComm->initState;
if (ncclCommGetAsyncError(ncclComm, &asyncError) == ncclSuccess)
rank->status.asyncError = asyncError;
rank->status.asyncError = __atomic_load_n(&ncclComm->asyncResult, __ATOMIC_ACQUIRE);
if (rank->status.asyncError == ncclSuccess && ncclComm->proxyState)
rank->status.asyncError = __atomic_load_n(&ncclComm->proxyState->asyncResult, __ATOMIC_ACQUIRE);
rank->status.finalizeCalled = (ncclComm->finalizeCalled != 0);
rank->status.destroyFlag = (ncclComm->destroyFlag != 0);
rank->status.abortFlag = (__atomic_load_n(ncclComm->abortFlag, __ATOMIC_ACQUIRE) != 0);
@@ -680,7 +686,7 @@ static ncclResult_t rasCollCommsInit(struct rasCollRequest** pReq, size_t* pReqL
comm->nRanks++;
} // for (commIdx)
if (firstNewSkipMissingIdx != -1 &&
if (__atomic_load_n(&ncclComm->peerInfoValid, __ATOMIC_ACQUIRE) && firstNewSkipMissingIdx != -1 &&
memcmp(req->comms.skipMissingRanksComms+firstNewSkipMissingIdx, &comm->commId, sizeof(comm->commId)) == 0) {
// Fill in the missingRanks array that follows the comm->ranks.
struct rasCollCommsMissingRank* missingRanks = (struct rasCollCommsMissingRank*)(comm->ranks+comm->nRanks);
+17 -11
Просмотреть файл
@@ -365,15 +365,16 @@ ncclResult_t rasNetAcceptNewSocket() {
NCCLCHECKGOTO(ncclSocketAccept(&sock->sock, &rasNetListeningSocket), ret, fail);
NCCLCHECKGOTO(ncclSocketReady(&sock->sock, &ready), ret, fail);
if (sock->sock.fd != -1) {
NCCLCHECKGOTO(rasGetNewPollEntry(&sock->pfd), ret, fail);
rasPfds[sock->pfd].fd = sock->sock.fd;
rasPfds[sock->pfd].events = POLLIN; // Initially we'll just wait for a handshake from the other side. This also
// helps the code tell the sides apart.
sock->status = RAS_SOCK_CONNECTING;
if (sock->sock.fd == -1)
goto fail; // We'll return ncclSuccess, but we need to clean up the incomplete socket first.
INFO(NCCL_RAS, "RAS new incoming socket connection from %s", ncclSocketToString(&sock->sock.addr, rasLine));
}
NCCLCHECKGOTO(rasGetNewPollEntry(&sock->pfd), ret, fail);
rasPfds[sock->pfd].fd = sock->sock.fd;
rasPfds[sock->pfd].events = POLLIN; // Initially we'll just wait for a handshake from the other side. This also
// helps the code tell the sides apart.
sock->status = RAS_SOCK_CONNECTING;
INFO(NCCL_RAS, "RAS new incoming socket connection from %s", ncclSocketToString(&sock->sock.addr, rasLine));
exit:
return ret;
@@ -480,7 +481,10 @@ void rasSocksHandleTimeouts(int64_t now, int64_t* nextWakeup) {
// Once we get an EOF when receiving data, we finalize the termination.
// For not fully established sockets, we can terminate immediately as there's no useful data to extract.
void rasSocketTerminate(struct rasSocket* sock, bool finalize, uint64_t startRetryOffset, bool retry) {
assert(sock->status != RAS_SOCK_CLOSED);
if (sock->status == RAS_SOCK_CLOSED) {
INFO(NCCL_RAS, "RAS socket in closed state passed for termination -- internal error?");
// The code below can actually handle such a case gracefully.
}
if (sock->conn) {
struct rasConnection* conn = sock->conn;
// If the sock of the connection points back to us, it means that we are the current socket of this
@@ -542,8 +546,10 @@ void rasSocketTerminate(struct rasSocket* sock, bool finalize, uint64_t startRet
} else {
// Either the caller requested finalization or we cannot receive on it.
(void)ncclSocketClose(&sock->sock);
rasPfds[sock->pfd].fd = -1;
rasPfds[sock->pfd].events = rasPfds[sock->pfd].revents = 0;
if (sock->pfd != -1) {
rasPfds[sock->pfd].fd = -1;
rasPfds[sock->pfd].events = rasPfds[sock->pfd].revents = 0;
}
free(sock->recvMsg);
freeSockEntry(sock);
}
+23 -20
Просмотреть файл
@@ -1,6 +1,7 @@
#include "register.h"
#include "transport.h"
#include "enqueue.h"
#include "register_inline.h"
static ncclResult_t registerCheckP2PConnection(struct ncclComm* comm, struct ncclConnector* conn, struct ncclTopoGraph* graph, int peer, bool* needReg) {
if (conn->connected) {
@@ -61,32 +62,34 @@ ncclResult_t ncclRegisterCollNvlsBuffers(
if (nvlsReged && comm->nNodes > 1 && info->algorithm == NCCL_ALGO_NVLS) {
if (comm->planner.persistent && ncclParamGraphRegister()) {
ncclCollnetGraphRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetSend, &collnetReged, &sendHandle, cleanupQueue, &info->nCleanupQueueElts);
if (collnetReged) ncclCollnetGraphRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetRecv, &collnetReged, &recvHandle, cleanupQueue, &info->nCleanupQueueElts);
if (info->func == ncclFuncAllGather) {
ncclCollnetGraphRegisterBuffer(comm, info->sendbuff, sendbuffSize, collNetSend, &collnetReged, &sendHandle, cleanupQueue, &info->nCleanupQueueElts);
} else if (info->func == ncclFuncReduceScatter) {
ncclCollnetGraphRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetRecv, &collnetReged, &recvHandle, cleanupQueue, &info->nCleanupQueueElts);
} else if (info->func == ncclFuncAllReduce) {
ncclCollnetGraphRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetRecv, &collnetReged, &recvHandle, cleanupQueue, &info->nCleanupQueueElts);
if (collnetReged) ncclCollnetGraphRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetSend, &collnetReged, &sendHandle, cleanupQueue, &info->nCleanupQueueElts);
}
}
if (collnetReged == 0 && ncclParamLocalRegister()) {
ncclCollnetLocalRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetSend, &collnetReged, &sendHandle);
if (collnetReged) ncclCollnetLocalRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetRecv, &collnetReged, &recvHandle);
if (info->func == ncclFuncAllGather) {
ncclCollnetLocalRegisterBuffer(comm, info->sendbuff, sendbuffSize, collNetSend, &collnetReged, &sendHandle);
} else if (info->func == ncclFuncReduceScatter) {
ncclCollnetLocalRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetRecv, &collnetReged, &recvHandle);
} else if (info->func == ncclFuncAllReduce) {
ncclCollnetLocalRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetRecv, &collnetReged, &recvHandle);
if (collnetReged) ncclCollnetLocalRegisterBuffer(comm, info->recvbuff, recvbuffSize, collNetSend, &collnetReged, &sendHandle);
}
}
}
if (nvlsReged) {
*regNeedConnect = 0;
/* tweak NVLS channels usage; for registered NVLS buffer to saturate bandwidth. */
if (comm->nNodes == 1) {
if (info->func == ncclFuncReduceScatter) {
// RS: Further tweaks for Blackwell with NVLS registered buffers
info->nMaxChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, (comm->compCap >= 100) ? 6 : 5));
}
else {
// AR/AG: Further tweaks for Blackwell with NVLS registered buffers
info->nMaxChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, (comm->compCap >= 100) ? 8 : 4));
}
} else {
// Further tweaks for Blackwell with NVLS registered buffers
info->nMaxChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, (comm->compCap >= 100) ? 7 : 6));
}
int recChannels;
NCCLCHECK(ncclNvlsRegResourcesQuery(comm, info, &recChannels));
info->nMaxChannels = recChannels;
info->regBufType |= NCCL_NVLS_REG_BUFFER;
}
@@ -188,7 +191,7 @@ ncclResult_t ncclRegisterCollBuffers(
struct ncclChannel* channel = comm->channels;
int ipcRegFlag = 0, netSendRegFlag = 0, netRecvRegFlag = 0;
void *sendHandle, *recvHandle;
if (info->func != ncclFuncReduceScatter && comm->intraNodeP2pSupport) {
if (info->func != ncclFuncReduceScatter && comm->isAllDirectP2p) {
for (int r = 0; r < NCCL_MAX_DIRECT_ARITY; ++r) {
for (int down = 0; down < 2; ++down) {
int peer = down ? channel->collnetDirect.down[r] : channel->collnetDirect.up[r];
@@ -308,7 +311,7 @@ ncclResult_t ncclRegisterCollBuffers(
}
}
}
if (nPeers > 0 && comm->intraNodeP2pSupport) {
if (nPeers > 0 && comm->isAllDirectP2p) {
if (comm->planner.persistent && ncclParamGraphRegister()) {
ncclIpcGraphRegisterBuffer(comm, info->recvbuff, recvbuffSize, peerRanks, nPeers, NCCL_IPC_COLLECTIVE, &regBufFlag, &info->recvbuffOffset, &info->recvbuffRmtAddrs, cleanupQueue, &info->nCleanupQueueElts);
}
@@ -365,7 +368,7 @@ ncclResult_t ncclRegisterCollBuffers(
void *sendHandle, *recvHandle;
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &recvRegRecord));
if (recvRegRecord == NULL && !(comm->planner.persistent && ncclParamGraphRegister())) goto exit;
if (comm->intraNodeP2pSupport) {
if (comm->isAllDirectP2p) {
for (int c = 0; c < comm->nChannels; ++c) {
struct ncclChannel* channel = comm->channels + c;
struct ncclTree* tree = NULL;
+119 -21
Просмотреть файл
@@ -10,24 +10,21 @@
#include "net.h"
#include "register.h"
#include "transport.h"
#include "group.h"
ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** reg) {
NCCL_PARAM(LocalRegister, "LOCAL_REGISTER", 1);
static ncclResult_t regFindHandleFromSymAddr(struct ncclComm* comm, void* baseSymPtr, struct ncclReg** handle) {
struct ncclRegCache* cache = &comm->regCache;
uintptr_t pageSize = cache->pageSize;
uintptr_t addr = (uintptr_t)data & -pageSize;
size_t pages = ((uintptr_t)data + size - addr + pageSize-1)/pageSize;
*reg = NULL;
for (int slot=0; /*true*/; slot++) {
if (slot == cache->population || addr < cache->slots[slot]->addr) return ncclSuccess;
if ((addr >= cache->slots[slot]->addr) &&
((addr-cache->slots[slot]->addr)/pageSize+pages) <= cache->slots[slot]->pages) {
*reg = cache->slots[slot];
return ncclSuccess;
*handle = NULL;
for (int slot = 0; slot < cache->population; slot++) {
if (baseSymPtr == cache->slots[slot]->baseSymPtr) {
*handle = cache->slots[slot];
break;
}
}
return ncclSuccess;
}
NCCL_PARAM(LocalRegister, "LOCAL_REGISTER", 1);
ncclResult_t ncclRegLocalIsValid(struct ncclReg *reg, bool *isValid) {
if (reg && isValid) {
@@ -43,14 +40,14 @@ ncclResult_t ncclRegister(struct ncclComm* comm, void* data, size_t size, bool i
NCCLCHECK(CommCheck(comm, "ncclCommRegister", "comm"));
struct ncclRegCache* cache = &comm->regCache;
uintptr_t pageSize = cache->pageSize;
uintptr_t addr = (uintptr_t)data & -pageSize;
size_t pages = ((uintptr_t)data + size - addr + pageSize-1)/pageSize;
uintptr_t begAddr = (uintptr_t)data & -pageSize;
uintptr_t endAddr = ((uintptr_t)data + size + pageSize-1) & -pageSize;
if (comm->checkPointers) NCCLCHECK(CudaPtrCheck(data, comm, "buff", "ncclCommRegister"));
INFO(NCCL_REG, "register comm %p buffer %p size %zi", comm, data, size);
for (int slot=0; /*true*/; slot++) {
if ((slot == cache->population) || (addr < cache->slots[slot]->addr)) {
if ((slot == cache->population) || (begAddr < cache->slots[slot]->begAddr)) {
if (cache->population == cache->capacity) { // must grow cache
cache->capacity = cache->capacity < 32 ? 32 : 2*cache->capacity;
NCCLCHECK(ncclRealloc(&cache->slots, cache->population, cache->capacity));
@@ -58,15 +55,15 @@ ncclResult_t ncclRegister(struct ncclComm* comm, void* data, size_t size, bool i
memmove(cache->slots+slot+1, cache->slots+slot, (cache->population-slot)*sizeof(struct ncclReg*));
NCCLCHECK(ncclCalloc(cache->slots+slot, 1));
struct ncclReg* regSlot = cache->slots[slot];
regSlot->addr = addr;
regSlot->pages = pages;
regSlot->begAddr = begAddr;
regSlot->endAddr = endAddr;
if (isGraph) regSlot->graphRefs = 1;
else regSlot->localRefs = 1;
cache->population += 1;
*handle = regSlot;
goto exit;
} else if ((addr >= cache->slots[slot]->addr) &&
((addr-cache->slots[slot]->addr)/pageSize+pages) <= cache->slots[slot]->pages) {
} else if ((cache->slots[slot]->begAddr <= begAddr) &&
(cache->slots[slot]->endAddr >= endAddr)) {
if (isGraph) cache->slots[slot]->graphRefs++;
else cache->slots[slot]->localRefs++;
*handle = cache->slots[slot];
@@ -120,7 +117,7 @@ ncclResult_t ncclRegCleanup(struct ncclComm* comm) {
struct ncclRegCache* cache = &comm->regCache;
for (int i = 0; i < cache->population; i++) {
struct ncclReg* reg = cache->slots[i];
INFO(NCCL_INIT, "Cleanup buffer %p pages %lx", (void*)reg->addr, reg->pages);
INFO(NCCL_INIT, "Cleanup buffer %p pages %lx", (void*)reg->begAddr, (reg->endAddr-reg->begAddr)/cache->pageSize);
NCCLCHECK(regCleanup(comm, reg));
free(reg);
}
@@ -177,3 +174,104 @@ ncclResult_t ncclCommGraphDeregister(const ncclComm_t comm, struct ncclReg *hand
NCCLCHECK(commDeregister(comm, true, handle));
return ncclSuccess;
}
ncclResult_t ncclCommSymmetricRegisterInternal(struct ncclComm* comm, void* buff, size_t baseSize, size_t alignment, CUmemGenericAllocationHandle memHandle, struct ncclReg* regHandle) {
ncclResult_t ret = ncclSuccess;
void* regSymAddr = NULL;
ALIGN_SIZE(comm->symAllocHead, alignment);
NCCLCHECKGOTO(ncclIpcSymmetricMap(comm, comm->symAllocHead, baseSize, memHandle, &regSymAddr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricMap(comm, comm->symAllocHead, baseSize, regSymAddr), ret, fail);
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
comm->symAllocHead += baseSize;
regHandle->baseSymPtr = regSymAddr;
regHandle->symSize = baseSize;
exit:
return ret;
fail:
regHandle->baseSymPtr = NULL;
regHandle->symSize = 0;
goto exit;
}
NCCL_API(ncclResult_t, ncclCommWindowRegister, ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags);
ncclResult_t ncclCommWindowRegister(ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags) {
ncclResult_t ret = ncclSuccess;
CUmemGenericAllocationHandle memHandle;
size_t baseSize;
void* baseAddr = NULL;
struct ncclReg* regHandle = NULL;
int saveDev;
*win = NULL;
CUDACHECK(cudaGetDevice(&saveDev));
NCCLCHECK(ncclGroupStartInternal());
if (!ncclParamLocalRegister() || !ncclCuMemEnable()) {
goto exit;
}
NCCLCHECKGOTO(ncclCommEnsureReady(comm), ret, fail);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
if (comm && buff && size && win) {
size_t alignment = 0;
CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr*)&baseAddr, &baseSize, (CUdeviceptr)buff), ret, fail);
// size and alignment check
if (!((uintptr_t)baseAddr % NCCL_REC_PAGE_SIZE == 0 && baseSize % NCCL_REC_PAGE_SIZE == 0 && (uintptr_t)buff + size <= (uintptr_t)baseAddr + baseSize)) {
WARN("buffer %p (baseAddr %p align %d) size %zu (baseSize %ld align %d) does not satisfy symmetric registration requirements", buff, baseAddr, (uintptr_t)baseAddr % NCCL_REC_PAGE_SIZE == 0, size, baseSize, baseSize % NCCL_REC_PAGE_SIZE == 0);
goto fail;
}
NCCLCHECKGOTO(ncclRegister(comm, baseAddr, baseSize, false, (void**)&regHandle), ret, fail);
NCCLCHECKGOTO(ncclCalloc(win, 1), ret, fail);
(*win)->handle = regHandle;
regHandle->winFlags = winFlags;
if (regHandle->baseSymPtr == NULL && comm->symmetricSupport) {
struct ncclSymRegTask* task;
CUCHECKGOTO(cuMemRetainAllocationHandle(&memHandle, baseAddr), ret, fail);
CUCHECKGOTO(cuMemRelease(memHandle), ret, fail);
alignment = baseSize >= NCCL_REC_PAGE_SIZE * 72L ? NCCL_MAX_PAGE_SIZE : NCCL_REC_PAGE_SIZE;
NCCLCHECKGOTO(ncclCalloc(&task, 1), ret, fail);
task->buff = buff;
task->baseSize = baseSize;
task->memHandle = memHandle;
task->regHandle = regHandle;
task->alignment = alignment;
ncclIntruQueueEnqueue(&comm->symRegTaskQueue, task);
ncclGroupCommJoin(comm, ncclGroupTaskTypeSymRegister);
}
}
exit:
ncclGroupErrCheck(ret);
NCCLCHECK(ret = ncclGroupEndInternal());
cudaSetDevice(saveDev);
return ret;
fail:
free(*win);
*win = NULL;
goto exit;
}
NCCL_API(ncclResult_t, ncclCommWindowDeregister, ncclComm_t comm, ncclWindow_t win);
ncclResult_t ncclCommWindowDeregister(ncclComm_t comm, ncclWindow_t win) {
ncclResult_t ret = ncclSuccess;
int saveDev;
struct ncclReg* regHandle;
CUDACHECK(cudaGetDevice(&saveDev));
if (win == NULL) goto exit;
regHandle = win->handle;
if (regHandle && ncclParamLocalRegister() && ncclCuMemEnable()) {
if (regHandle->baseSymPtr) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricFree(comm, regHandle->symSize, regHandle->baseSymPtr), ret, fail);
NCCLCHECKGOTO(ncclIpcSymmetricFree(comm, regHandle->symSize, regHandle->baseSymPtr), ret, fail);
}
NCCLCHECKGOTO(commDeregister(comm, false, regHandle), ret, fail);
}
free(win);
exit:
CUDACHECK(cudaSetDevice(saveDev));
return ret;
fail:
goto exit;
}
+296
Просмотреть файл
@@ -0,0 +1,296 @@
#include "symmetric.h"
#include "comm.h"
#include "device.h"
#include <cmath>
constexpr char const* kernelName[] = {
// Must align with enum ncclSymKernelId definition in src/include/symmetric.h
"AllReduce_AGxLL_R",
"AllReduce_AGxLLMC_R",
"AllReduce_RSxLD_AGxST",
"AllReduce_RSxLDMC_AGxSTMC",
"AllGather_LL",
"AllGather_LLMC",
"AllGather_ST",
"AllGather_STMC",
"ReduceScatter_LL",
"ReduceScatter_LD",
"ReduceScatter_LDMC"
};
constexpr uint32_t kernelMask_STMC = 1<<ncclSymKernelId_AllGather_LLMC |
1<<ncclSymKernelId_AllGather_STMC |
1<<ncclSymKernelId_AllReduce_AGxLLMC_R |
1<<ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC |
1<<ncclSymKernelId_ReduceScatter_LDMC;
constexpr uint32_t kernelMask_LDMC = 1<<ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC |
1<<ncclSymKernelId_ReduceScatter_LDMC;
constexpr uint32_t kernelMask_LL = 1<<ncclSymKernelId_AllReduce_AGxLL_R |
1<<ncclSymKernelId_AllReduce_AGxLLMC_R |
1<<ncclSymKernelId_AllGather_LL |
1<<ncclSymKernelId_AllGather_LLMC |
1<<ncclSymKernelId_ReduceScatter_LL;
constexpr uint32_t kernelMask_AG = 1<<ncclSymKernelId_AllGather_LL |
1<<ncclSymKernelId_AllGather_LLMC |
1<<ncclSymKernelId_AllGather_ST |
1<<ncclSymKernelId_AllGather_STMC;
constexpr uint32_t kernelMask_AR = 1<<ncclSymKernelId_AllReduce_AGxLLMC_R |
1<<ncclSymKernelId_AllReduce_AGxLL_R |
1<<ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC |
1<<ncclSymKernelId_AllReduce_RSxLD_AGxST;
constexpr uint32_t kernelMask_RS = 1<<ncclSymKernelId_ReduceScatter_LD |
1<<ncclSymKernelId_ReduceScatter_LDMC |
1<<ncclSymKernelId_ReduceScatter_LL;
static uint32_t kernelMask_coll(ncclFunc_t coll) {
switch (coll) {
case ncclFuncAllGather: return kernelMask_AG;
case ncclFuncAllReduce: return kernelMask_AR;
case ncclFuncReduceScatter: return kernelMask_RS;
default: return 0;
}
}
static uint32_t kernelMask_user() {
static uint32_t cache = -1u;
uint32_t got = __atomic_load_n(&cache, __ATOMIC_RELAXED);
if (got == -1u) {
// TODO: Enhance this to be a pattern match. I like regex's but we also have
// the parseList() used by NCCL_ALGO/PROTO.
char const* name = ncclGetEnv("NCCL_SYM_KERNEL");
if (name == nullptr || strcmp(name, "^") == 0) {
static_assert((int)ncclSymKernelId_Count < 32, "Use more than 32 bits");
got = (1<<(int)ncclSymKernelId_Count)-1;
} else {
got = 0;
for (int k=0; k < (int)ncclSymKernelId_Count; k++) {
if (strcmp(kernelName[k], name) == 0) {
__atomic_store_n(&cache, 1<<k, __ATOMIC_RELAXED);
got = 1<<k;
break;
}
}
}
__atomic_store_n(&cache, got, __ATOMIC_RELAXED);
}
return got;
}
NCCL_PARAM(SymCTAs, "SYM_CTAS", 0)
static double softmin(double x, double ceiling, double softness) {
// looks like a smooth version of: min(x, ceiling)
return ceiling - softness*std::log1p((std::exp(ceiling/softness) - 1)*std::exp(-x/softness));
}
static double softplus(double x, double softness) {
// looks like a smooth version of: max(0, x)
double z = x/softness;
return 100.0 <= z ? x : softness*std::log1p(std::exp(z));
}
static double model(double busBytes, double baseLat, int nSMs, double smBw, double busMultiplier, double peakBw) {
double bw = softmin(nSMs*smBw*busMultiplier, peakBw, smBw);
return baseLat + softplus(busBytes/bw - 1, 1);
}
// Given the kernel and bytes, return the minimum number of blocks to run on such that
// perf is 99% of running at max blocks, and return the estimate runtime for that
// block count.
static void queryModel(struct ncclComm* comm, ncclSymKernelId k, size_t nBytes, float* timeUs, int* nBlocks) {
constexpr double LL_BusFactor = 9; // 2X the bytes, plus some processing, plus no unrolling
int nRanks = comm->nRanks;
int nMaxBlocks = ncclSymMaxBlocks;
int nMaxBlocksNvls = divUp((comm->cudaArch < 1000 ? 16 : 32), nRanks);
size_t busBytes; // max(bytes sent, bytes received)
double busMultiplier = 1;
switch (k) {
default:
busBytes = size_t(1)<<50;
break;
case ncclSymKernelId_AllReduce_AGxLL_R:
busBytes = nRanks*nBytes*LL_BusFactor;
break;
case ncclSymKernelId_AllReduce_AGxLLMC_R:
busBytes = nRanks*nBytes*LL_BusFactor;
busMultiplier = 1.1; // To beat non-MC LL
break;
case ncclSymKernelId_AllReduce_RSxLD_AGxST:
busBytes = 2*nBytes*(nRanks-1)/nRanks;
break;
case ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC:
busBytes = nBytes/nRanks + nBytes;
busMultiplier = nRanks;
nMaxBlocks = nMaxBlocksNvls;
break;
case ncclSymKernelId_AllGather_LL:
busBytes = nRanks*nBytes*LL_BusFactor;
break;
case ncclSymKernelId_AllGather_LLMC:
busBytes = nRanks*nBytes*LL_BusFactor;
busMultiplier = 1.1; // To beat non-MC LL
break;
case ncclSymKernelId_AllGather_ST:
busBytes = (nRanks-1)*nBytes;
break;
case ncclSymKernelId_AllGather_STMC:
busBytes = (nRanks-1)*nBytes; // Wrong. Should be nRanks*nBytes but we want to beat non-MC.
busMultiplier = 0.55*nRanks;
nMaxBlocks = nMaxBlocksNvls;
break;
case ncclSymKernelId_ReduceScatter_LL:
busBytes = nRanks*nBytes*LL_BusFactor;
break;
case ncclSymKernelId_ReduceScatter_LD:
busBytes = (nRanks-1)*nBytes;
break;
case ncclSymKernelId_ReduceScatter_LDMC:
busBytes = (nRanks-1)*nBytes; // Wrong. Should be nRanks*nBytes but we want to beat non-MC.
busMultiplier = 0.55*nRanks;
nMaxBlocks = nMaxBlocksNvls;
break;
}
nMaxBlocks = std::min<int>(nMaxBlocks, comm->config.maxCTAs);
int nMinBlocks = comm->config.minCTAs;
int nUserCTAs = std::min<int>(ncclSymMaxBlocks, ncclParamSymCTAs());
if (nUserCTAs > 0) nMinBlocks = nMaxBlocks = nUserCTAs;
bool isLL = kernelMask_LL>>k & 1;
bool isAG = kernelMask_AG>>k & 1;
bool isAR = kernelMask_AR>>k & 1;
constexpr double GBps = (1<<30)/1.e6;
double baseLat, smBw, peakBw;
if (comm->cudaArch < 1000) {
baseLat = isLL ? 4.5 : 7.8;
smBw = isAR ? 65*GBps : 44*GBps;
peakBw = k == ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC ? 480*GBps : 320*GBps;
} else {
baseLat = isLL ? (isAG ? 8.5 : 11) : (isAR ? 19.5 : 13.0);
smBw = 55*GBps;
peakBw = k == ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC ? 1000*GBps : 600*GBps;
}
*nBlocks = nMaxBlocks;
*timeUs = model(busBytes, baseLat, nMaxBlocks, smBw, busMultiplier, peakBw);
// Use least number of blocks that puts us within a tolerance of peak performance.
for (int bn = nMinBlocks; bn < nMaxBlocks; bn++) {
double time = model(busBytes, baseLat, bn, smBw, busMultiplier, peakBw);
if (time <= 1.025*(*timeUs)) {
*nBlocks = bn;
*timeUs = time;
break;
}
}
}
bool ncclSymImplemented(ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty) {
bool isFloat;
switch (ty) {
case ncclFloat64:
case ncclFloat32:
case ncclFloat16:
case ncclBfloat16:
case ncclFloat8e4m3:
case ncclFloat8e5m2:
isFloat = true;
break;
default:
isFloat = false;
break;
}
switch (coll) {
case ncclFuncAllGather:
return true;
case ncclFuncAllReduce:
case ncclFuncReduceScatter:
return red == ncclDevSum && isFloat && ty != ncclFloat64;
default:
return false;
}
}
ncclResult_t ncclSymPickKernel(
struct ncclComm* comm, ncclFunc_t coll, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts,
float* estTimeUs, ncclSymKernelId* kernelId, int* nBlocks, int* nWarps
) {
uint32_t kmask = kernelMask_coll(coll);
kmask &= kernelMask_user();
bool hasSTMC = comm->nvlsSupport;
bool hasLDMC = false;
if (comm->nvlsSupport) {
switch (ty) {
case ncclInt32:
case ncclUint32:
case ncclInt64:
case ncclUint64:
case ncclFloat16:
case ncclBfloat16:
hasLDMC = red == ncclDevSum || red == ncclDevMinMax;
break;
case ncclFloat8e4m3:
case ncclFloat8e5m2:
hasLDMC = red == ncclDevSum || red == ncclDevMinMax;
hasLDMC &= comm->compCap >= 100;
break;
case ncclFloat:
case ncclDouble:
hasLDMC = red == ncclDevSum;
break;
default: break;
}
}
if (!hasSTMC) kmask &= ~kernelMask_STMC;
if (!hasLDMC) kmask &= ~kernelMask_LDMC;
size_t nBytes = nElts*ncclTypeSize(ty);
size_t nBusBytes = (coll == ncclFuncAllReduce ? 1 : comm->nRanks)*nBytes;
// LL kernels use 32-bit ints to track element counts and indices.
if (nBusBytes >= (size_t(2)<<30)) kmask &= ~kernelMask_LL;
// Any kernel might use 32-bit int to track unrolled loop chunks (which are going
// to be at least 32 bytes per chunk)
if (nBusBytes >= 32*(size_t(2)<<30)) kmask = 0;
ncclSymKernelId bestKernel = ncclSymKernelId_Count;
float bestTime = 1.e30f;
int bestBlocks = 999;
constexpr float smPenalty = .025f; // 2.5% percent increase in time per SM
uint32_t kmaskRemain = kmask;
while (kmaskRemain != 0) {
ncclSymKernelId k = (ncclSymKernelId)popFirstOneBit(&kmaskRemain);
float kTime;
int kBlocks;
queryModel(comm, k, nBytes, &kTime, &kBlocks);
if (kTime*(1.0f + smPenalty*kBlocks) < bestTime*(1.0f + smPenalty*bestBlocks)) {
bestKernel = k;
bestTime = kTime;
bestBlocks = kBlocks;
}
}
*kernelId = bestKernel;
*estTimeUs = kmask==0 || kernelMask_user() == (1<<ncclSymKernelId_Count)-1 ? bestTime : 0.0f;
*nBlocks = bestBlocks;
*nWarps = 16;
return ncclSuccess;
}
const char* ncclSymKernelIdToString(int kernelId) {
if (kernelId < 0 || kernelId >= ncclSymKernelId_Count) {
return "Unknown";
}
return kernelName[kernelId];
}
+6 -5
Просмотреть файл
@@ -71,7 +71,7 @@ NCCL_PARAM(ConnectRoundMaxPeers, "CONNECT_ROUND_MAX_PEERS", 128);
NCCL_PARAM(ReportConnectProgress, "REPORT_CONNECT_PROGRESS", 0);
#include <sys/time.h>
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* intraNodeP2pSupport, bool* directMode) {
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* isAllDirectP2p, bool* directMode) {
bool supportFlag = true;
bool directFlag = false;
if (comm->localRanks == 1) {
@@ -84,8 +84,9 @@ ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* intraNodeP2p
struct ncclPeerInfo* ipeerInfo = &comm->peerInfo[ipeer];
struct ncclPeerInfo* jpeerInfo = &comm->peerInfo[jpeer];
int canConnect = 0;
NCCLCHECK(ncclTransports[0]->canConnect(&canConnect, comm, NULL, ipeerInfo, jpeerInfo));
if (!canConnect && supportFlag == true) {
int intermediateRank = -1;
NCCLCHECK(ncclTopoCheckP2p(comm, comm->topo, ipeerInfo->rank, jpeerInfo->rank, &canConnect, NULL, &intermediateRank));
if (!canConnect || intermediateRank != -1) {
supportFlag = false;
}
if (ipeerInfo->hostHash == jpeerInfo->hostHash && ipeerInfo->pidHash == jpeerInfo->pidHash) directFlag = true;
@@ -93,9 +94,9 @@ ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* intraNodeP2p
}
}
}
*intraNodeP2pSupport = supportFlag;
*isAllDirectP2p = supportFlag;
*directMode = directFlag;
if (comm->rank == 0) INFO(NCCL_INIT, "Check P2P Type intraNodeP2pSupport %d directMode %d", supportFlag, directFlag);
if (comm->rank == 0) INFO(NCCL_INIT, "Check P2P Type isAllDirectP2p %d directMode %d", supportFlag, directFlag);
return ncclSuccess;
}
+7 -9
Просмотреть файл
@@ -13,6 +13,7 @@
#include "assert.h"
#include "bootstrap.h"
#include "channel.h"
#include "register_inline.h"
int64_t ncclParamGdrCopySyncEnable();
int64_t ncclParamGdrCopyFlushEnable();
@@ -1188,7 +1189,7 @@ static ncclResult_t collnetRegisterBuffer(struct ncclComm* comm, const void* use
goto exit;
} else {
/* start register collnet buffer */
struct collnetRegInfo info = { regRecord->addr, regRecord->pages * comm->regCache.pageSize };
struct collnetRegInfo info = { regRecord->begAddr, regRecord->endAddr - regRecord->begAddr };
void* handle = NULL;
struct ncclConnInfo* conn = (type == collNetRecv) ? &comm->channels[0].peers[comm->nRanks]->recv[type].conn : &comm->channels[0].peers[comm->nRanks]->send[type].conn;
@@ -1389,7 +1390,7 @@ ncclResult_t ncclCollNetChainBufferSetup(ncclComm_t comm) {
ncclResult_t ret = ncclSuccess;
char line[1024];
if (comm->collNetSupport == 0) goto exit;
if (comm->config.collnetEnable == 0) goto exit;
// Connect Collnet + chain
for (int c = 0; c < comm->nChannels; c++) {
struct ncclChannel* channel = comm->channels + c;
@@ -1421,7 +1422,7 @@ fail:
ncclResult_t ncclCollNetDirectBufferSetup(ncclComm_t comm) {
ncclResult_t ret = ncclSuccess;
if (comm->collNetSupport == 0) goto exit;
if (comm->config.collnetEnable == 0) goto exit;
// Connect intra-node CollNet + Direct
for (int c = 0; c < comm->nChannels; c++) {
@@ -1498,8 +1499,8 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
comm->collNetHeads = headsUnique;
comm->collNetHeadsNum = nHeadsUnique;
if (parent && parent->collNetSupport && parent->nNodes == comm->nNodes) {
if (!parent->config.splitShare) {
if (parent && parent->config.collnetEnable && parent->nNodes == comm->nNodes) {
if (!parent->shareResources) {
collNetSetupFail = 1;
goto fail;
}
@@ -1547,9 +1548,6 @@ ncclResult_t ncclCollNetSetup(ncclComm_t comm, ncclComm_t parent, struct ncclTop
NCCLCHECKGOTO(collNetInitRailRankMap(comm), ret, fail);
} else {
/* TODO: CX-6 and CX-7 both do not support multiple sharp resources per process, if child comm cannot
* share the sharp resource from parent, we cannot use sharp in this case. This restriction might be
* lifted by sharp plugin/IB hardware in the future. */
collNetSetupFail = 1;
if (comm->rank == 0) {
WARN("Child comms (nRanks %d) fails to share parent comms (nRanks %d) sharp resources", comm->nRanks, parent->nRanks);
@@ -1629,7 +1627,7 @@ exit:
return ret;
fail:
ncclTransportCollNetFree(comm);
comm->collNetSupport = 0;
comm->config.collnetEnable = 0;
goto exit;
}
+29 -23
Просмотреть файл
@@ -16,6 +16,7 @@
#include "transport.h"
#include "shm.h"
#include <assert.h>
#include "register_inline.h"
static_assert(sizeof(ncclNetHandle_t) <= CONNECT_SIZE, "NET Connect info is too large");
@@ -629,8 +630,6 @@ static ncclResult_t sendProxySetup(struct ncclProxyConnection* connection, struc
resources->netDeviceVersion = props.netDeviceVersion;
resources->netDeviceType = props.netDeviceType;
resources->netDeviceVersion = props.netDeviceVersion;
resources->netDeviceType = props.netDeviceType;
/* point-to-point size limits*/
resources->maxP2pBytes = props.maxP2pBytes;
if((resources->maxP2pBytes <= 0) || (resources->maxP2pBytes > NCCL_MAX_NET_SIZE_BYTES)) {
@@ -732,7 +731,14 @@ static ncclResult_t sendProxyConnect(struct ncclProxyConnection* connection, str
NCCLCHECK(ncclCalloc(progressState->netComms + resources->netDev, proxyState->tpnRanks));
}
struct ncclSharedNetComms* comms = progressState->netComms[resources->netDev] + resources->tpRemoteRank;
if (comms->sendComm[resources->channelId] == NULL) ret = proxyState->ncclNet->connect(resources->netDev, &commConfig, req->handle, comms->sendComm + resources->channelId, &resources->netDeviceHandle);
// let only one localrank connect to a tpRemoteRank to avoid duplicate connections
if (comms->activeConnect[resources->channelId] == 0)
comms->activeConnect[resources->channelId] = (resources->tpLocalRank + 1);
if (comms->sendComm[resources->channelId] == NULL
&& comms->activeConnect[resources->channelId] == (resources->tpLocalRank + 1)) {
ret = proxyState->ncclNet->connect(resources->netDev, &commConfig, req->handle,
comms->sendComm + resources->channelId, &resources->netDeviceHandle);
}
resources->netSendComm = comms->sendComm[resources->channelId];
if (comms->sendComm[resources->channelId]) comms->sendRefCount[resources->channelId]++;
} else {
@@ -886,7 +892,15 @@ static ncclResult_t recvProxyConnect(struct ncclProxyConnection* connection, str
NCCLCHECK(ncclCalloc(progressState->netComms + resources->netDev, proxyState->tpnRanks));
}
struct ncclSharedNetComms* comms = progressState->netComms[resources->netDev] + resources->tpRemoteProxyRank;
if (comms->recvComm[resources->channelId] == NULL) ret = proxyState->ncclNet->accept(resources->netListenComm, comms->recvComm+resources->channelId, &resources->netDeviceHandle);
// reuse handle to for netdev/remote rank to avoid duplicate connections
if (comms->activeAccept[resources->channelId] == 0)
comms->activeAccept[resources->channelId] = (resources->tpLocalRank + 1);
//try connecting while comm is null
if (comms->recvComm[resources->channelId] == NULL
&& comms->activeAccept[resources->channelId] == (resources->tpLocalRank + 1)) {
ret = proxyState->ncclNet->accept(resources->netListenComm,
comms->recvComm+resources->channelId, &resources->netDeviceHandle);
}
resources->netRecvComm = comms->recvComm[resources->channelId];
if (comms->recvComm[resources->channelId]) comms->recvRefCount[resources->channelId]++;
} else {
@@ -1101,7 +1115,7 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
// Set step base for next op
resources->step = sub->base + sub->nsteps;
sub->posted = sub->transmitted = sub->done = 0;
ncclProfilerStartSendProxyOpEvent(s, args);
ncclProfilerRecordProxyOpEventState(s, args, ncclProfilerProxyOpInProgress_v4);
if (!sub->reg)
sub->sendMhandle = resources->mhandles[args->protocol];
}
@@ -1140,7 +1154,6 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
} else {
sub->posted += args->sliceSteps;
}
ncclProfilerRecordProxyOpEventState(s, args, sub->posted, sub->transSize, ncclProfilerProxyOpSendPosted);
ncclProfilerRecordProxyStepEventState(s, args, postedStepId, ncclProfilerProxyStepSendGPUWait);
args->idle = 0;
continue;
@@ -1188,18 +1201,17 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
}
}
if (ready) {
ncclProfilerRecordProxyOpEventState(s, args, sub->transmitted+args->sliceSteps, sub->transSize, ncclProfilerProxyOpSendRemFifoWait);
ncclProfilerRecordProxyStepEventState(s, args, transmittedStepId, ncclProfilerProxyStepSendPeerWait_v4);
// Data is ready, try to send.
// Coverity complains about the size here as pointing to an out-of-scope temporary. Which is nonsense,
// since size is a plain integer.
// coverity[use_invalid:FALSE]
NCCLCHECK(proxyState->ncclNet->isend(resources->netSendComm, buff, size, resources->tpRank, sub->sendMhandle, sub, sub->requests+buffSlot));
void* phandle = &sub->pHandles[DIVUP(transmittedStepId, args->sliceSteps)%NCCL_STEPS];
NCCLCHECK(proxyState->ncclNet->isend(resources->netSendComm, buff, size, resources->tpRank, sub->sendMhandle, phandle, sub->requests+buffSlot));
if (sub->requests[buffSlot] != NULL) {
TRACE(NCCL_NET, "sendProxy [%ld/%d/%d] Isend posted, req %p, buff %p, size %d, proto %d, myRank %d, channelId %d, mhandle %p", sub->transmitted, buffSlot, sub->nsteps, sub->requests[buffSlot], buff, size, p, proxyState->tpRank, sub->channelId, sub->sendMhandle);
sub->transSize += size;
sub->transSize = size;
sub->transmitted += args->sliceSteps;
sub->profilerSteps++;
ncclProfilerRecordProxyOpEventState(s, args, sub->transmitted, sub->transSize, ncclProfilerProxyOpSendTransmitted);
ncclProfilerRecordProxyStepEventState(s, args, transmittedStepId, ncclProfilerProxyStepSendWait);
args->idle = 0;
continue;
@@ -1220,7 +1232,6 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
TRACE(NCCL_NET, "sendProxy [%ld/%d/%d] request %p done", sub->done, buffSlot, sub->nsteps, sub->requests[buffSlot]);
sub->done += args->sliceSteps;
ncclProfilerStopProxyStepEvent(s, args, doneStepId);
ncclProfilerRecordProxyOpEventState(s, args, sub->done, sub->transSize, ncclProfilerProxyOpSendDone);
if (resources->shared == 0) {
volatile uint64_t* sendHead = resources->gdcSync ? resources->gdcSync : &resources->sendMem->head;
@@ -1282,7 +1293,7 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
sub->posted = sub->received = sub->transmitted = sub->done = 0;
sub->regBufferReady = 0;
for (int i=0; i<groupSize; i++) sub[-i].groupSize = groupSize;
ncclProfilerStartRecvProxyOpEvent(s, args);
ncclProfilerRecordProxyOpEventState(s, args, ncclProfilerProxyOpInProgress_v4);
if (!sub->reg)
sub->recvMhandle = resources->mhandles[args->protocol];
}
@@ -1343,7 +1354,7 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
if (sub->nbytes < sizes[subCount]) sizes[subCount] = sub->nbytes;
tags[subCount] = resources->tpRemoteRank;
mhandles[subCount] = sub->recvMhandle;
phandles[subCount] = sub;
phandles[subCount] = &sub->pHandles[DIVUP(postedStepId, args->sliceSteps)%NCCL_STEPS];
subCount++;
}
}
@@ -1362,8 +1373,6 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
int postedStepId = sub->posted;
TRACE(NCCL_NET, "recvProxy [%ld/%ld/%d] Irecv posted, buff %p, size %ld, myRank %d, channelId %d, mhandle %p", sub->posted, (sub->base + sub->posted) % NCCL_STEPS, sub->nsteps, ptrs[i], sizes[i], proxyState->tpRank, sub->channelId, mhandles[i]);
sub->posted += args->sliceSteps;
sub->profilerSteps++;
ncclProfilerRecordProxyOpEventState(s+i, args, sub->posted, sub->transSize, ncclProfilerProxyOpRecvPosted);
ncclProfilerRecordProxyStepEventState(s+i, args, postedStepId, ncclProfilerProxyStepRecvWait);
}
args->idle = 0;
@@ -1393,9 +1402,8 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
struct recvNetResources* resources = (struct recvNetResources*)(sub->connection->transportResources);
volatile struct ncclConnFifo* connFifo = (volatile struct ncclConnFifo*)resources->recvMem->connFifo;
connFifo[buffSlot].size = -1;
sub->transSize += sizes[i];
sub->transSize = sizes[i];
sub->received += args->sliceSteps;
ncclProfilerRecordProxyOpEventState(s+i, args, sub->received, sub->transSize, ncclProfilerProxyOpRecvReceived);
ncclProfilerRecordProxyStepEventState(s+i, args, receivedStepId, ncclProfilerProxyStepRecvFlushWait);
if (step < sub->nsteps) {
struct recvNetResources* resources = (struct recvNetResources*) (sub->connection->transportResources);
@@ -1459,7 +1467,6 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
int transmittedStepId = sub->transmitted;
sub->transmitted += args->sliceSteps;
ncclProfilerRecordProxyOpEventState(s+i, args, sub->transmitted, sub->transSize, ncclProfilerProxyOpRecvTransmitted);
ncclProfilerRecordProxyStepEventState(s+i, args, transmittedStepId, ncclProfilerProxyStepRecvGPUWait);
if (step < sub->nsteps) {
__sync_synchronize();
@@ -1479,7 +1486,6 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
struct ncclProxySubArgs* subGroup = args->subs+s;
for (int i=0; i<subGroup->groupSize; i++) {
struct ncclProxySubArgs* sub = subGroup + i;
int doneStepId = sub->done;
if (sub->done == sub->nsteps) continue;
if (sub->transmitted > sub->done) {
struct recvNetResources* resources = (struct recvNetResources*) (sub->connection->transportResources);
@@ -1494,9 +1500,9 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
NCCLCHECK(proxyState->ncclNet->irecvConsumed(resources->netRecvComm, subGroup->recvRequestsSubCount, subGroup->recvRequestsCache[sub->done%NCCL_STEPS]));
subGroup->recvRequestsCache[sub->done%NCCL_STEPS] = NULL;
}
int doneStepId = sub->done;
sub->done += args->sliceSteps;
ncclProfilerStopProxyStepEvent(s+i, args, doneStepId);
ncclProfilerRecordProxyOpEventState(s+i, args, sub->done, sub->transSize, ncclProfilerProxyOpRecvDone);
args->idle = 0;
if (sub->done == sub->nsteps) {
args->done++;
@@ -1547,9 +1553,9 @@ static ncclResult_t netRegisterBuffer(ncclComm* comm, const void* userbuff, size
if (found) {
*outRegBufFlag = 1;
outHandle[p] = netHandle->handle;
INFO(NCCL_REG, "rank %d - NET reuse buffer %p size %ld (baseAddr %p size %ld) handle %p", comm->rank, userbuff, buffSize, (void*)regRecord->addr, regRecord->pages * comm->regCache.pageSize, netHandle->handle);
INFO(NCCL_REG, "rank %d - NET reuse buffer %p size %ld (baseAddr %p size %ld) handle %p", comm->rank, userbuff, buffSize, (void*)regRecord->begAddr, regRecord->endAddr - regRecord->begAddr, netHandle->handle);
} else {
struct netRegInfo info = { regRecord->addr, regRecord->pages * comm->regCache.pageSize };
struct netRegInfo info = { regRecord->begAddr, regRecord->endAddr - regRecord->begAddr };
void* handle = NULL;
if (peerConn->conn.flags & NCCL_DIRECT_NIC) {
+197 -105
Просмотреть файл
@@ -25,8 +25,10 @@
#include "timer.h"
#include "ibvwrap.h"
#include "mlx5/mlx5dvwrap.h"
#define MAXNAMESIZE 64
#define MAXSUFFIXSIZE 16
#define MAXNAMESIZE (64 + MAXSUFFIXSIZE)
static char ncclIbIfName[MAX_IF_NAME_SIZE+1];
static union ncclSocketAddress ncclIbIfAddr;
@@ -55,6 +57,17 @@ struct ncclIbStats {
int fatalErrorCount;
};
enum ncclIbProvider {
IB_PROVIDER_NONE = 0,
IB_PROVIDER_MLX5 = 1,
IB_PROVIDER_MAX = 2,
};
const char* ibProviderName[] = {
"None",
"Mlx5",
};
static int ncclNIbDevs = -1;
struct alignas(64) ncclIbDev {
pthread_mutex_t lock;
@@ -77,6 +90,12 @@ struct alignas(64) ncclIbDev {
struct ibv_port_attr portAttr;
struct ncclIbStats stats;
int dmaBufSupported;
enum ncclIbProvider ibProvider;
union {
struct {
int dataDirect;
} mlx5;
} capsProvider;
};
#define MAX_IB_DEVS 32
@@ -106,6 +125,7 @@ NCCL_PARAM(IbAdaptiveRouting, "IB_ADAPTIVE_ROUTING", -2);
NCCL_PARAM(IbFifoTc, "IB_FIFO_TC", -1);
NCCL_PARAM(IbAsyncEvents,"IB_RETURN_ASYNC_EVENTS",1);
NCCL_PARAM(IbEceEnable,"IB_ECE_ENABLE",1);
NCCL_PARAM(IbDataDirect,"IB_DATA_DIRECT",1);
static ncclResult_t ncclIbStatsInit(struct ncclIbStats* stat) {
__atomic_store_n(&stat->fatalErrorCount, 0, __ATOMIC_RELAXED);
@@ -451,6 +471,10 @@ static ncclResult_t ncclIbGetPciPath(char* devName, char** path, int* realPort)
if (p == NULL) {
WARN("Could not find real path of %s (%s)", devName, devicePath);
} else {
// Merge multi-port NICs into the same PCI device
p[strlen(p)-1] = '0';
// Also merge virtual functions (VF) into the same device
if (ncclParamIbMergeVfs()) p[strlen(p)-3] = p[strlen(p)-4] = '0';
// Keep the real port aside (the ibv port is always 1 on recent cards)
*realPort = 0;
for (int d=0; d<ncclNIbDevs; d++) {
@@ -495,9 +519,29 @@ static int ncclIbRelaxedOrderingCapable(void) {
return r == ncclInternalError ? 0 : 1;
}
static bool ncclMlx5dvDmaBufCapable(ibv_context *context){
ncclResult_t res;
int dev_fail = 0;
struct ibv_pd* pd;
NCCLCHECKGOTO(wrap_ibv_alloc_pd(&pd, context), res, failure);
// Test kernel DMA-BUF support with a dummy call (fd=-1)
(void)wrap_direct_ibv_reg_dmabuf_mr(pd, 0ULL /*offset*/, 0ULL /*len*/, 0ULL /*iova*/, -1 /*fd*/, 0 /*flags*/);
// ibv_reg_dmabuf_mr() will fail with EOPNOTSUPP/EPROTONOSUPPORT if not supported (EBADF otherwise)
(void)wrap_direct_mlx5dv_reg_dmabuf_mr(pd, 0ULL /*offset*/, 0ULL /*len*/, 0ULL /*iova*/, -1 /*fd*/, 0 /*flags*/, 0 /* mlx5 flags*/);
// mlx5dv_reg_dmabuf_mr() will fail with EOPNOTSUPP/EPROTONOSUPPORT if not supported (EBADF otherwise)
dev_fail |= (errno == EOPNOTSUPP) || (errno == EPROTONOSUPPORT);
NCCLCHECKGOTO(wrap_ibv_dealloc_pd(pd), res, failure);
// stop the search and goto failure
if (dev_fail) goto failure;
return true;
failure:
return false;
}
ncclResult_t ncclIbMakeVDeviceInternal(int* d, ncclNetVDeviceProps_t* props) {
if (ncclParamIbMergeNics() == 0 && props->ndevs > 1) {
WARN("NET/IB : Trying to merge multiple devices together when NCCL_IB_MERGE_NICS=0. Please enable it or disable device merging in NCCL.");
INFO(NCCL_NET, "NET/IB : Skipping makeVDevice, NCCL_IB_MERGE_NICS=0");
return ncclInvalidUsage;
}
@@ -565,14 +609,17 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
if (ncclParamIbDisable()) return ncclInternalError;
static int shownIbHcaEnv = 0;
if(wrap_ibv_symbols() != ncclSuccess) { return ncclInternalError; }
if(wrap_mlx5dv_symbols() != ncclSuccess) { INFO(NCCL_NET, "NET/IB : Failed to open mlx5dv symbols. Advance features like CX-8 Direct-NIC will be disabled."); }
if (ncclNIbDevs == -1) {
pthread_mutex_lock(&ncclIbLock);
wrap_ibv_fork_init();
if (ncclNIbDevs == -1) {
int nIpIfs = 0;
ncclNIbDevs = 0;
ncclNMergedIbDevs = 0;
if (ncclFindInterfaces(ncclIbIfName, &ncclIbIfAddr, MAX_IF_NAME_SIZE, 1) != 1) {
NCCLCHECK(ncclFindInterfaces(ncclIbIfName, &ncclIbIfAddr, MAX_IF_NAME_SIZE, 1, &nIpIfs));
if (nIpIfs != 1) {
WARN("NET/IB : No IP interface found.");
ret = ncclInternalError;
goto fail;
@@ -600,6 +647,17 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
WARN("NET/IB : Unable to open device %s", devices[d]->name);
continue;
}
enum ncclIbProvider ibProvider = IB_PROVIDER_NONE;
char dataDirectDevicePath[PATH_MAX];
int dataDirectSupported = 0;
if (wrap_mlx5dv_is_supported(devices[d])) {
ibProvider = IB_PROVIDER_MLX5;
snprintf(dataDirectDevicePath, PATH_MAX, "/sys");
if((ncclMlx5dvDmaBufCapable(context)) && (wrap_mlx5dv_get_data_direct_sysfs_path(context, dataDirectDevicePath + 4, PATH_MAX - 4) == ncclSuccess)) {
INFO(NCCL_NET, "Data Direct DMA Interface is detected for device:%s", devices[d]->name);
if(ncclParamIbDataDirect()) dataDirectSupported = 1;
}
}
int nPorts = 0;
struct ibv_device_attr devAttr;
memset(&devAttr, 0, sizeof(devAttr));
@@ -609,58 +667,69 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
continue;
}
for (int port_num = 1; port_num <= devAttr.phys_port_cnt; port_num++) {
struct ibv_port_attr portAttr;
if (ncclSuccess != wrap_ibv_query_port(context, port_num, &portAttr)) {
WARN("NET/IB : Unable to query port_num %d", port_num);
continue;
for (int dataDirect = 0; dataDirect < 1 + dataDirectSupported; ++dataDirect) {
struct ibv_port_attr portAttr;
if (ncclSuccess != wrap_ibv_query_port(context, port_num, &portAttr)) {
WARN("NET/IB : Unable to query port_num %d", port_num);
continue;
}
if (portAttr.state != IBV_PORT_ACTIVE) continue;
if (portAttr.link_layer != IBV_LINK_LAYER_INFINIBAND
&& portAttr.link_layer != IBV_LINK_LAYER_ETHERNET) continue;
// check against user specified HCAs/ports
if (! (matchIfList(devices[d]->name, port_num, userIfs, nUserIfs, searchExact) ^ searchNot)) {
continue;
}
pthread_mutex_init(&ncclIbDevs[ncclNIbDevs].lock, NULL);
ncclIbDevs[ncclNIbDevs].device = d;
ncclIbDevs[ncclNIbDevs].ibProvider = ibProvider;
ncclIbDevs[ncclNIbDevs].guid = devAttr.sys_image_guid;
ncclIbDevs[ncclNIbDevs].portAttr = portAttr;
ncclIbDevs[ncclNIbDevs].portNum = port_num;
ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
ncclIbDevs[ncclNIbDevs].context = context;
ncclIbDevs[ncclNIbDevs].pdRefs = 0;
ncclIbDevs[ncclNIbDevs].pd = NULL;
if (!dataDirect) {
strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
NCCLCHECKGOTO(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort), ret, fail);
}
else {
snprintf(ncclIbDevs[ncclNIbDevs].devName, MAXNAMESIZE, "%s_dma", devices[d]->name);
NCCLCHECK(ncclCalloc(&ncclIbDevs[ncclNIbDevs].pciPath, PATH_MAX));
strncpy(ncclIbDevs[ncclNIbDevs].pciPath, dataDirectDevicePath, PATH_MAX);
ncclIbDevs[ncclNIbDevs].capsProvider.mlx5.dataDirect = 1;
}
ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
ncclIbDevs[ncclNIbDevs].mrCache.capacity = 0;
ncclIbDevs[ncclNIbDevs].mrCache.population = 0;
ncclIbDevs[ncclNIbDevs].mrCache.slots = NULL;
NCCLCHECK(ncclIbStatsInit(&ncclIbDevs[ncclNIbDevs].stats));
// Enable ADAPTIVE_ROUTING by default on IB networks
// But allow it to be overloaded by an env parameter
ncclIbDevs[ncclNIbDevs].ar = (portAttr.link_layer == IBV_LINK_LAYER_INFINIBAND) ? 1 : 0;
if (ncclParamIbAdaptiveRouting() != -2) ncclIbDevs[ncclNIbDevs].ar = ncclParamIbAdaptiveRouting();
INFO(NCCL_NET,"NET/IB: [%d] %s:%s:%d/%s provider=%s speed=%d context=%p pciPath=%s ar=%d", d, devices[d]->name, devices[d]->dev_name, ncclIbDevs[ncclNIbDevs].portNum,
NCCL_IB_LLSTR(portAttr.link_layer), ibProviderName[ncclIbDevs[ncclNIbDevs].ibProvider], ncclIbDevs[ncclNIbDevs].speed, context, ncclIbDevs[ncclNIbDevs].pciPath, ncclIbDevs[ncclNIbDevs].ar);
PTHREADCHECKGOTO(pthread_create(&ncclIbAsyncThread, NULL, ncclIbAsyncThreadMain, ncclIbDevs + ncclNIbDevs), "pthread_create", ret, fail);
ncclSetThreadName(ncclIbAsyncThread, "NCCL IbAsync %2d", ncclNIbDevs);
PTHREADCHECKGOTO(pthread_detach(ncclIbAsyncThread), "pthread_detach", ret, fail); // will not be pthread_join()'d
// Add this plain physical device to the list of virtual devices
int vDev;
ncclNetVDeviceProps_t vProps = {0};
vProps.ndevs = 1;
vProps.devs[0] = ncclNIbDevs;
NCCLCHECK(ncclIbMakeVDeviceInternal(&vDev, &vProps));
ncclNIbDevs++;
nPorts++;
}
if (portAttr.state != IBV_PORT_ACTIVE) continue;
if (portAttr.link_layer != IBV_LINK_LAYER_INFINIBAND
&& portAttr.link_layer != IBV_LINK_LAYER_ETHERNET) continue;
// check against user specified HCAs/ports
if (! (matchIfList(devices[d]->name, port_num, userIfs, nUserIfs, searchExact) ^ searchNot)) {
continue;
}
pthread_mutex_init(&ncclIbDevs[ncclNIbDevs].lock, NULL);
ncclIbDevs[ncclNIbDevs].device = d;
ncclIbDevs[ncclNIbDevs].guid = devAttr.sys_image_guid;
ncclIbDevs[ncclNIbDevs].portAttr = portAttr;
ncclIbDevs[ncclNIbDevs].portNum = port_num;
ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
ncclIbDevs[ncclNIbDevs].context = context;
ncclIbDevs[ncclNIbDevs].pdRefs = 0;
ncclIbDevs[ncclNIbDevs].pd = NULL;
strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
NCCLCHECKGOTO(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort), ret, fail);
ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
ncclIbDevs[ncclNIbDevs].mrCache.capacity = 0;
ncclIbDevs[ncclNIbDevs].mrCache.population = 0;
ncclIbDevs[ncclNIbDevs].mrCache.slots = NULL;
NCCLCHECK(ncclIbStatsInit(&ncclIbDevs[ncclNIbDevs].stats));
// Enable ADAPTIVE_ROUTING by default on IB networks
// But allow it to be overloaded by an env parameter
ncclIbDevs[ncclNIbDevs].ar = (portAttr.link_layer == IBV_LINK_LAYER_INFINIBAND) ? 1 : 0;
if (ncclParamIbAdaptiveRouting() != -2) ncclIbDevs[ncclNIbDevs].ar = ncclParamIbAdaptiveRouting();
TRACE(NCCL_NET,"NET/IB: [%d] %s:%s:%d/%s speed=%d context=%p pciPath=%s ar=%d", d, devices[d]->name, devices[d]->dev_name, ncclIbDevs[ncclNIbDevs].portNum,
NCCL_IB_LLSTR(portAttr.link_layer), ncclIbDevs[ncclNIbDevs].speed, context, ncclIbDevs[ncclNIbDevs].pciPath, ncclIbDevs[ncclNIbDevs].ar);
PTHREADCHECKGOTO(pthread_create(&ncclIbAsyncThread, NULL, ncclIbAsyncThreadMain, ncclIbDevs + ncclNIbDevs), "pthread_create", ret, fail);
ncclSetThreadName(ncclIbAsyncThread, "NCCL IbAsync %2d", ncclNIbDevs);
PTHREADCHECKGOTO(pthread_detach(ncclIbAsyncThread), "pthread_detach", ret, fail); // will not be pthread_join()'d
// Add this plain physical device to the list of virtual devices
int vDev;
ncclNetVDeviceProps_t vProps = {0};
vProps.ndevs = 1;
vProps.devs[0] = ncclNIbDevs;
NCCLCHECK(ncclIbMakeVDeviceInternal(&vDev, &vProps));
ncclNIbDevs++;
nPorts++;
}
if (nPorts == 0 && ncclSuccess != wrap_ibv_close_device(context)) { ret = ncclInternalError; goto fail; }
}
@@ -779,6 +848,9 @@ ncclResult_t ncclIbGetPhysProperties(int dev, ncclNetProperties_t* props) {
props->ptrSupport |= NCCL_PTR_DMABUF; // GDR support via DMA-BUF
}
props->forceFlush = 0;
if (ibDev->capsProvider.mlx5.dataDirect) {
props->forceFlush = 1;
}
props->latency = 0; // Not set
props->port = ibDev->portNum + ibDev->realPort;
props->maxComms = ibDev->maxQp;
@@ -893,6 +965,7 @@ struct ncclProfilerInfo {
int qpIndex[MAX_QPS_PER_REQ];
int nEventHandles;
ncclProfilerNetIbDescr_v1_t data;
void* pHandle;
};
struct ncclIbRequest {
@@ -1312,23 +1385,27 @@ ib_recv_dev_list:
devInfo->gid.global.interface_id = commDev->base.gidInfo.localGid.global.interface_id;
// info logging
if (devInfo->link_layer == IBV_LINK_LAYER_INFINIBAND) { // IB
for (int q = 0; q < comm->base.nqps; q++) {
// Print just the QPs for this dev
if (comm->base.qps[q].devIndex == i)
for (int q = 0; q < comm->base.nqps; q++) {
// Print just the QPs for this dev
if (comm->base.qps[q].devIndex == i) {
if (devInfo->link_layer == IBV_LINK_LAYER_INFINIBAND) { // IB
INFO(NCCL_NET,"NET/IB: %s %d IbDev %d Port %d qpn %d mtu %d LID %d subnet-prefix %lu FLID %d fifoRkey=0x%x fifoLkey=0x%x",
comm->base.vProps.ndevs > 2 ? "NCCL MergedDev" : "NCCL Dev",
dev, commDev->base.ibDevN, ibDev->portNum, meta.qpInfo[q].qpn, devInfo->mtu, devInfo->lid,
devInfo->gid.global.subnet_prefix, ncclIbExtractFlid(&devInfo->gid), devInfo->fifoRkey, commDev->fifoMr->lkey);
}
} else { // RoCE
for (int q = 0; q < comm->base.nqps; q++) {
// Print just the QPs for this dev
if (comm->base.qps[q].devIndex == i)
INFO(NCCL_NET,"NET/IB: %s %d IbDev %d Port %d qpn %d mtu %d query_ece={supported=%d, vendor_id=0x%x, options=0x%x, comp_mask=0x%x} GID %ld (%lX/%lX) fifoRkey=0x%x fifoLkey=0x%x",
comm->base.vProps.ndevs > 2 ? "NCCL MergedDev" : "NCCL Dev", dev,
commDev->base.ibDevN, ibDev->portNum, meta.qpInfo[q].qpn, devInfo->mtu, meta.qpInfo[q].ece_supported, meta.qpInfo[q].ece.vendor_id, meta.qpInfo[q].ece.options, meta.qpInfo[q].ece.comp_mask, (int64_t)commDev->base.gidInfo.localGidIndex,
devInfo->gid.global.subnet_prefix, devInfo->gid.global.interface_id, devInfo->fifoRkey, commDev->fifoMr->lkey);
comm->base.vProps.ndevs > 2 ? "NCCL MergedDev" : "NCCL Dev",
dev, commDev->base.ibDevN, ibDev->portNum, meta.qpInfo[q].qpn, devInfo->mtu, devInfo->lid,
devInfo->gid.global.subnet_prefix, ncclIbExtractFlid(&devInfo->gid), devInfo->fifoRkey, commDev->fifoMr->lkey);
} else { // RoCE
INFO(NCCL_NET,"NET/IB: %s %d IbDev %d Port %d qpn %d mtu %d GID %ld (%lX/%lX) fifoRkey=0x%x fifoLkey=0x%x",
comm->base.vProps.ndevs > 2 ? "NCCL MergedDev" : "NCCL Dev", dev,
commDev->base.ibDevN, ibDev->portNum, meta.qpInfo[q].qpn, devInfo->mtu,
(int64_t)commDev->base.gidInfo.localGidIndex,
devInfo->gid.global.subnet_prefix, devInfo->gid.global.interface_id, devInfo->fifoRkey, commDev->fifoMr->lkey);
}
// Log ECE info
if (meta.qpInfo[q].ece_supported) {
INFO(NCCL_NET,"NET/IB: IbDev %d Port %d qpn %d query_ece={supported=%d, vendor_id=0x%x, options=0x%x, comp_mask=0x%x}",
commDev->base.ibDevN, ibDev->portNum, meta.qpInfo[q].qpn,
meta.qpInfo[q].ece_supported, meta.qpInfo[q].ece.vendor_id, meta.qpInfo[q].ece.options, meta.qpInfo[q].ece.comp_mask);
}
}
}
if (link_layer == IBV_LINK_LAYER_UNSPECIFIED) link_layer = devInfo->link_layer;
@@ -1406,8 +1483,14 @@ ib_connect:
ncclIbSendCommDev* commDev = comm->devs + devIndex;
struct ibv_qp* qp = comm->base.qps[q].qp;
if (remQpInfo->ece_supported)
if (remQpInfo->ece_supported) {
struct ncclIbQp* nqp = comm->base.qps + q;
int ibDevN = comm->devs[nqp->devIndex].base.ibDevN;
struct ncclIbDev* ibDev = ncclIbDevs + ibDevN;
INFO(NCCL_NET,"NET/IB: IbDev %d Port %d qpn %d set_ece={supported=%d, vendor_id=0x%x, options=0x%x, comp_mask=0x%x}",
ibDevN, ibDev->portNum, qp->qp_num, remMeta.qpInfo[q].ece_supported, remMeta.qpInfo[q].ece.vendor_id, remMeta.qpInfo[q].ece.options, remMeta.qpInfo[q].ece.comp_mask);
NCCLCHECKGOTO(wrap_ibv_set_ece(qp, &remQpInfo->ece, &remQpInfo->ece_supported), ret, fail);
}
ncclIbDev* ibDev = ncclIbDevs + commDev->base.ibDevN;
remDevInfo->mtu = std::min(remDevInfo->mtu, ibDev->portAttr.active_mtu);
@@ -1415,16 +1498,6 @@ ib_connect:
NCCLCHECKGOTO(ncclIbRtsQp(qp), ret, fail);
}
if (link_layer == IBV_LINK_LAYER_ETHERNET ) { // RoCE
for (int q = 0; q < comm->base.nqps; q++) {
struct ncclIbQp* qp = comm->base.qps + q;
int ibDevN = comm->devs[qp->devIndex].base.ibDevN;
struct ncclIbDev* ibDev = ncclIbDevs + ibDevN;
INFO(NCCL_NET,"NET/IB: IbDev %d Port %d qpn %d set_ece={supported=%d, vendor_id=0x%x, options=0x%x, comp_mask=0x%x}",
ibDevN, ibDev->portNum, remMeta.qpInfo[q].qpn, remMeta.qpInfo[q].ece_supported, remMeta.qpInfo[q].ece.vendor_id, remMeta.qpInfo[q].ece.options, remMeta.qpInfo[q].ece.comp_mask);
}
}
comm->base.nDataQps = std::max(comm->base.vProps.ndevs, comm->base.nRemDevs);
comm->base.ready = 1;
@@ -1750,9 +1823,8 @@ ncclResult_t ncclIbGetRequest(struct ncclIbNetCommBase* base, struct ncclIbReque
if (r->type == NCCL_NET_IB_REQ_UNUSED) {
r->base = base;
r->sock = NULL;
r->devBases[0] = NULL;
r->devBases[1] = NULL;
r->events[0] = r->events[1] = 0;
memset(r->devBases, 0, sizeof(r->devBases));
memset(r->events, 0, sizeof(r->events));
*req = r;
return ncclSuccess;
}
@@ -1789,7 +1861,11 @@ ncclResult_t ncclIbRegMrDmaBufInternal(ncclIbNetCommDevBase* base, void* data, s
if (ncclIbRelaxedOrderingEnabled) flags |= IBV_ACCESS_RELAXED_ORDERING;
if (fd != -1) {
/* DMA-BUF support */
NCCLCHECKGOTO(wrap_ibv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags), res, returning);
if (!ncclIbDevs[base->ibDevN].capsProvider.mlx5.dataDirect) {
NCCLCHECKGOTO(wrap_ibv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags), res, returning);
} else {
NCCLCHECKGOTO(wrap_mlx5dv_reg_dmabuf_mr(&mr, base->pd, offset, pages*pageSize, addr, fd, flags, MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT), res, returning);
}
} else {
if (ncclIbRelaxedOrderingEnabled) {
// Use IBVERBS_1.8 API - needed for IBV_ACCESS_RELAXED_ORDERING support
@@ -1897,7 +1973,7 @@ ncclResult_t ncclIbDeregMr(void* comm, void* mhandle) {
NCCL_PARAM(IbSplitDataOnQps, "IB_SPLIT_DATA_ON_QPS", 0);
ncclResult_t ncclIbMultiSend(struct ncclIbSendComm* comm, int slot, void* pHandle) {
ncclResult_t ncclIbMultiSend(struct ncclIbSendComm* comm, int slot) {
struct ncclIbRequest** reqs = comm->fifoReqs[slot];
volatile struct ncclIbSendFifo* slots = comm->fifo[slot];
int nreqs = slots[0].nreqs;
@@ -1989,19 +2065,21 @@ ncclResult_t ncclIbMultiSend(struct ncclIbSendComm* comm, int slot, void* pHandl
struct ibv_send_wr* bad_wr;
#ifdef NCCL_ENABLE_NET_PROFILING
// QP profiling loop
for (int r=0; r<nreqs && pHandle; r++) {
for (int r=0; r<nreqs; r++) {
// Store comm qpIndex for this request
int nEventHandles = reqs[r]->pInfo[0].nEventHandles;
reqs[r]->pInfo[0].qpIndex[nEventHandles%MAX_QPS_PER_REQ] = qpIndex;
assert(nEventHandles < MAX_QPS_PER_REQ);
reqs[r]->pInfo[0].qpIndex[nEventHandles] = qpIndex;
// Store info for profiler
int pluginId = NCCL_PROFILER_NET_TYPE_IB | NCCL_PROFILER_NET_IB_VER;
int64_t pluginId = NCCL_PROFILER_NET_TYPE_IB | NCCL_PROFILER_NET_IB_VER;
reqs[r]->pInfo[0].data.type = ncclProfileQp;
reqs[r]->pInfo[0].data.qp.device = devIndex;
reqs[r]->pInfo[0].data.qp.wr_id = comm->wrs[r].wr_id;
reqs[r]->pInfo[0].data.qp.opcode = comm->wrs[r].opcode;
reqs[r]->pInfo[0].data.qp.qpNum = qp->qp->qp_num;
reqs[r]->pInfo[0].data.qp.length = comm->sges[r].length;
NCCLCHECK(ncclProfilerFunction(&reqs[r]->pInfo[0].qpEventHandles[nEventHandles%MAX_QPS_PER_REQ], 0, pHandle, pluginId, &reqs[r]->pInfo[0].data));
void* pHandle = reqs[r]->pInfo[0].pHandle;
NCCLCHECK(ncclProfilerFunction(&reqs[r]->pInfo[0].qpEventHandles[nEventHandles], ncclProfilerNetEventStart, pHandle, pluginId, &reqs[r]->pInfo[0].data));
reqs[r]->pInfo[0].nEventHandles++;
}
#endif
@@ -2023,8 +2101,11 @@ ncclResult_t ncclIbMultiSend(struct ncclIbSendComm* comm, int slot, void* pHandl
ncclResult_t ncclIbIsend(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* phandle, void** request) {
struct ncclIbSendComm* comm = (struct ncclIbSendComm*)sendComm;
if (comm->base.ready == 0) { WARN("NET/IB: ncclIbIsend() called when comm->base.ready == 0"); return ncclInternalError; }
if (comm->base.ready == 0) { *request = NULL; return ncclSuccess; }
if (comm->base.ready == 0) {
WARN("NET/IB: ncclIbIsend() called when comm->base.ready == 0");
*request = NULL;
return ncclInternalError;
}
NCCLCHECK(ncclIbStatsCheckFatalCount(&comm->base.stats,__func__));
struct ncclIbMrHandle* mhandleWrapper = (struct ncclIbMrHandle*) mhandle;
@@ -2065,6 +2146,9 @@ ncclResult_t ncclIbIsend(void* sendComm, void* data, size_t size, int tag, void*
req->send.size = size;
req->send.data = data;
req->send.offset = 0;
#ifdef NCCL_ENABLE_NET_PROFILING
req->pInfo[0].pHandle = phandle;
#endif
// Populate events
int nEvents = ncclParamIbSplitDataOnQps() ? comm->base.nqps : comm->base.nDataQps;
@@ -2094,7 +2178,7 @@ ncclResult_t ncclIbIsend(void* sendComm, void* data, size_t size, int tag, void*
}
TIME_START(0);
NCCLCHECK(ncclIbMultiSend(comm, slot, phandle));
NCCLCHECK(ncclIbMultiSend(comm, slot));
// Clear slots[0]->nreqs, as well as other fields to help debugging and sanity checks
memset((void*)slots, 0, sizeof(struct ncclIbSendFifo));
@@ -2187,8 +2271,11 @@ ncclResult_t ncclIbPostFifo(struct ncclIbRecvComm* comm, int n, void** data, siz
ncclResult_t ncclIbIrecv(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** phandles, void** request) {
struct ncclIbRecvComm* comm = (struct ncclIbRecvComm*)recvComm;
if (comm->base.ready == 0) { WARN("NET/IB: ncclIbIrecv() called when comm->base.ready == 0"); return ncclInternalError; }
if (comm->base.ready == 0) { *request = NULL; return ncclSuccess; }
if (comm->base.ready == 0) {
WARN("NET/IB: ncclIbIrecv() called when comm->base.ready == 0");
*request = NULL;
return ncclInternalError;
}
if (n > NCCL_NET_IB_MAX_RECVS) return ncclInternalError;
NCCLCHECK(ncclIbStatsCheckFatalCount(&comm->base.stats,__func__));
@@ -2222,14 +2309,17 @@ ncclResult_t ncclIbIrecv(void* recvComm, int n, void** data, size_t* sizes, int*
ncclIbAddEvent(req, qp->devIndex, &comm->devs[qp->devIndex].base);
#ifdef NCCL_ENABLE_NET_PROFILING
// Start a QP event for every request in the multirecv and every qp
for (int r = 0; r < n && phandles; r++) {
for (int r = 0; r < n; r++) {
int nEventHandles = req->pInfo[r].nEventHandles;
assert(nEventHandles < MAX_QPS_PER_REQ);
req->pInfo[r].qpIndex[nEventHandles] = comm->base.qpIndex;
// Store info for profiler
int pluginId = NCCL_PROFILER_NET_TYPE_IB | NCCL_PROFILER_NET_IB_VER;
int64_t pluginId = NCCL_PROFILER_NET_TYPE_IB | NCCL_PROFILER_NET_IB_VER;
req->pInfo[r].data.type = ncclProfileQp;
req->pInfo[r].data.qp.device = qp->devIndex;
req->pInfo[r].data.qp.wr_id = wr.wr_id;
req->pInfo[r].data.qp.qpNum = qp->qp->qp_num;
NCCLCHECK(ncclProfilerFunction(&req->pInfo[r].qpEventHandles[i], 0, phandles[r], pluginId, &req->pInfo[r].data));
NCCLCHECK(ncclProfilerFunction(&req->pInfo[r].qpEventHandles[nEventHandles], ncclProfilerNetEventStart, phandles[r], pluginId, &req->pInfo[r].data));
req->pInfo[r].nEventHandles++;
}
#endif
@@ -2311,7 +2401,7 @@ ncclResult_t ncclIbTest(void* request, int* done, int* sizes) {
sizes[i] = r->recv.sizes[i];
#ifdef NCCL_ENABLE_NET_PROFILING
for (int j = 0; j < r->pInfo[i].nEventHandles; j++) {
NCCLCHECK(ncclProfilerFunction(&r->pInfo[i].qpEventHandles[j], 1, NULL, 0, NULL));
NCCLCHECK(ncclProfilerFunction(&r->pInfo[i].qpEventHandles[j], ncclProfilerNetEventStop, NULL, 0, NULL));
}
#endif
}
@@ -2320,7 +2410,7 @@ ncclResult_t ncclIbTest(void* request, int* done, int* sizes) {
sizes[0] = r->send.size;
#ifdef NCCL_ENABLE_NET_PROFILING
for (int j = 0; j < r->pInfo[0].nEventHandles; j++) {
NCCLCHECK(ncclProfilerFunction(&r->pInfo[0].qpEventHandles[j], 1, NULL, 0, NULL));
NCCLCHECK(ncclProfilerFunction(&r->pInfo[0].qpEventHandles[j], ncclProfilerNetEventStop, NULL, 0, NULL));
}
#endif
}
@@ -2368,20 +2458,21 @@ ncclResult_t ncclIbTest(void* request, int* done, int* sizes) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
TRACE(NCCL_NET, "Got completion from peer %s with status=%d opcode=%d len=%u wr_id=%lu r=%p type=%d events={%d,%d}, i=%d",
ncclSocketToString(&addr, line), wc->status, wc->opcode,wc->byte_len, wc->wr_id, req, req->type, req->events[0], req->events[1], i);
TRACE(NCCL_NET, "Got completion from peer %s with status=%d opcode=%d len=%u wr_id=%lu r=%p type=%d events={%d,%d,%d,%d}, i=%d",
ncclSocketToString(&addr, line), wc->status, wc->opcode,wc->byte_len, wc->wr_id, req, req->type, req->events[0], req->events[1], req->events[2], req->events[3], i);
#endif
if (req && req->type == NCCL_NET_IB_REQ_SEND) {
for (int j = 0; j < req->nreqs; j++) {
struct ncclIbRequest* sendReq = r->base->reqs+((wc->wr_id >> (j*8)) & 0xff);
if ((sendReq->events[i] <= 0)) {
WARN("NET/IB: sendReq(%p)->events={%d,%d}, i=%d, j=%d <= 0", sendReq, sendReq->events[0], sendReq->events[1], i, j);
WARN("NET/IB: sendReq(%p)->events={%d,%d,%d,%d}, i=%d, j=%d <= 0", sendReq, sendReq->events[0], sendReq->events[1], sendReq->events[2], sendReq->events[3], i, j);
return ncclInternalError;
}
sendReq->events[i]--;
#ifdef NCCL_ENABLE_NET_PROFILING
// Stop Qp event for sendReq
NCCLCHECK(ncclProfilerFunction(&sendReq->pInfo[j].qpEventHandles[getReqQpIndex(sendReq, j, wc->qp_num)], 1, NULL, 0, NULL));
int qpIndex = getReqQpIndex(sendReq, j, wc->qp_num);
NCCLCHECK(ncclProfilerFunction(&sendReq->pInfo[j].qpEventHandles[qpIndex], ncclProfilerNetEventStop, NULL, 0, NULL));
#endif
}
} else {
@@ -2398,7 +2489,8 @@ ncclResult_t ncclIbTest(void* request, int* done, int* sizes) {
#ifdef NCCL_ENABLE_NET_PROFILING
// Stop Qp event for workFifo
for (int j = 0; j < req->nreqs; j++) {
NCCLCHECK(ncclProfilerFunction(&req->pInfo[j].qpEventHandles[getReqQpIndex(req, j, wc->qp_num)], 1, NULL, 0, NULL));
int qpIndex = getReqQpIndex(req, j, wc->qp_num);
NCCLCHECK(ncclProfilerFunction(&req->pInfo[j].qpEventHandles[qpIndex], ncclProfilerNetEventStop, NULL, 0, NULL));
}
#endif
}
+64 -35
Просмотреть файл
@@ -45,7 +45,7 @@ ncclResult_t ncclNetSocketInit(ncclDebugLogger_t logFunction, ncclProfilerCallba
if (ncclNetIfs == -1) {
char names[MAX_IF_NAME_SIZE*MAX_IFS];
union ncclSocketAddress addrs[MAX_IFS];
ncclNetIfs = ncclFindInterfaces(names, addrs, MAX_IF_NAME_SIZE, MAX_IFS);
NCCLCHECK(ncclFindInterfaces(names, addrs, MAX_IF_NAME_SIZE, MAX_IFS, &ncclNetIfs));
if (ncclNetIfs <= 0) {
WARN("NET/Socket : no interface found");
pthread_mutex_unlock(&ncclNetSocketLock);
@@ -124,8 +124,9 @@ ncclResult_t ncclNetSocketGetProperties(int dev, ncclNetProperties_t* props) {
#define MAX_SOCKETS 64
#define MAX_THREADS 16
#define MAX_REQUESTS NCCL_NET_MAX_REQUESTS
#define MIN_CHUNKSIZE (64*1024)
NCCL_PARAM(SocketInlineSize, "SOCKET_INLINE", /*128 B=*/1 << 7);
NCCL_PARAM(SocketMinTaskSize, "SOCKET_MIN_TASKSIZE", /*64 kiB=*/1 << 16);
NCCL_PARAM(SocketNsocksPerThread, "NSOCKS_PERTHREAD", -2);
NCCL_PARAM(SocketNthreads, "SOCKET_NTHREADS", -2);
@@ -171,6 +172,7 @@ struct ncclNetSocketRequest {
int op;
void* data;
int size;
void* inlineData;
struct ncclSocket* ctrlSock;
int offset;
int used;
@@ -211,6 +213,7 @@ struct ncclNetSocketComm {
int nSocks;
int nThreads;
int nextSock;
void* inlineData;
struct ncclNetSocketRequest requests[MAX_REQUESTS];
pthread_t helperThread[MAX_THREADS];
struct ncclNetSocketThreadResources threadResources[MAX_THREADS];
@@ -241,13 +244,13 @@ void* persistentSocketThread(void *args_) {
data.sock.fd = r->sock->fd;
data.sock.op = r->op;
data.sock.length = r->size;
ncclProfilerFunction(&eHandle[i+j], 0, resource->pInfo->pHandle, NCCL_PROFILER_NET_TYPE_SOCK | 1, &data);
ncclProfilerFunction(&eHandle[i+j], ncclProfilerNetEventStart, resource->pInfo->pHandle, NCCL_PROFILER_NET_TYPE_SOCK | 1, &data);
}
#endif
r->result = ncclSocketProgress(r->op, r->sock, r->data, r->size, &r->offset);
if (r->result != ncclSuccess) {
#ifdef NCCL_ENABLE_NET_PROFILING
ncclProfilerFunction(&eHandle[i+j], 1, NULL, 0, NULL);
ncclProfilerFunction(&eHandle[i+j], ncclProfilerNetEventStop, NULL, 0, NULL);
eHandle[i+j] = NULL;
#endif
WARN("NET/Socket : socket progress error");
@@ -257,7 +260,7 @@ void* persistentSocketThread(void *args_) {
if (r->offset < r->size) repeat = 1;
#ifdef NCCL_ENABLE_NET_PROFILING
if (repeat == 0) {
ncclProfilerFunction(&eHandle[i+j], 1, NULL, 0, NULL);
ncclProfilerFunction(&eHandle[i+j], ncclProfilerNetEventStop, NULL, 0, NULL);
eHandle[i+j] = NULL;
}
#endif
@@ -360,6 +363,7 @@ fail:
goto exit;
}
#define SOCKET_CTRL_SIZE (sizeof(int))
ncclResult_t ncclNetSocketConnect(int dev, ncclNetCommConfig_t* config, void* opaqueHandle, void** sendComm, ncclNetDeviceHandle_t** /*sendDevComm*/) {
if (dev < 0 || dev >= ncclNetIfs) { // data transfer socket is based on specified dev
return ncclInternalError;
@@ -401,6 +405,7 @@ socket_send:
NCCLCHECK(ncclSocketProgress(NCCL_SOCKET_SEND, sock, &i, sizeof(uint8_t), &done));
if (done == 0) return ncclSuccess;
}
NCCLCHECK(ncclCalloc(&comm->inlineData, MAX_REQUESTS * (SOCKET_CTRL_SIZE + ncclParamSocketInlineSize())));
*sendComm = comm;
return ncclSuccess;
}
@@ -449,6 +454,7 @@ socket_recv:
memcpy(rComm->socks+sendSockIdx, sock, sizeof(struct ncclSocket));
free(sock);
}
NCCLCHECK(ncclCalloc(&rComm->inlineData, MAX_REQUESTS * (SOCKET_CTRL_SIZE + ncclParamSocketInlineSize())));
*recvComm = rComm;
/* reset lComm state */
@@ -470,6 +476,7 @@ ncclResult_t ncclNetSocketGetRequest(struct ncclNetSocketComm* comm, int op, voi
r->used = 1;
r->comm = comm;
r->nSubs = 0;
r->inlineData = (uint8_t*)comm->inlineData + i * (SOCKET_CTRL_SIZE + ncclParamSocketInlineSize());
*req = r;
return ncclSuccess;
}
@@ -520,6 +527,9 @@ ncclResult_t ncclNetSocketGetTask(struct ncclNetSocketComm* comm, struct ncclPro
return ncclInternalError;
}
// if the dataSize is smaller than the inline size, return the inline size; if not, return 0 to avoid the extra copy.
static int ncclNetSocketInlineSize(int dataSize) { return (dataSize <= ncclParamSocketInlineSize()) ? dataSize : 0; }
ncclResult_t ncclNetSocketTest(void* request, int* done, int* size) {
*done = 0;
struct ncclNetSocketRequest *r = (struct ncclNetSocketRequest*)request;
@@ -527,37 +537,55 @@ ncclResult_t ncclNetSocketTest(void* request, int* done, int* size) {
WARN("NET/Socket : test called with NULL request");
return ncclInternalError;
}
if (r->used == 1) { /* try to send/recv size */
int data = r->size;
int offset = 0;
NCCLCHECK(ncclSocketProgress(r->op, r->ctrlSock, &data, sizeof(int), &offset));
if (offset == 0) return ncclSuccess; /* Not ready -- retry later */
// Not sure we could ever receive less than 4 bytes, but just in case ...
if (offset < sizeof(int)) NCCLCHECK(ncclSocketWait(r->op, r->ctrlSock, &data, sizeof(int), &offset));
// Check size is less or equal to the size provided by the user
if (r->op == NCCL_SOCKET_RECV && data > r->size) {
char line[SOCKET_NAME_MAXLEN+1];
union ncclSocketAddress addr;
NCCLCHECK(ncclSocketGetAddr(r->ctrlSock, &addr));
WARN("NET/Socket : peer %s message truncated : receiving %d bytes instead of %d. If you believe your socket network is in healthy state, \
there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks",
ncclSocketToString(&addr, line), data, r->size);
return ncclInvalidUsage;
if (r->used == 1) { /* try to send/recv size (+ inline data if any) */
int msgSize;
uint8_t* msg = (uint8_t*)r->inlineData;
if (r->op == NCCL_SOCKET_SEND) {
// sender side has the right data size, copy size info + inline data to the buffer
int inlineSize = ncclNetSocketInlineSize(r->size);
msgSize = inlineSize + SOCKET_CTRL_SIZE;
memcpy(msg, &r->size, SOCKET_CTRL_SIZE);
if (inlineSize > 0) memcpy(msg + SOCKET_CTRL_SIZE, r->data, inlineSize);
} else {
// receiver side doesn't have the right data size, wait for the sender to send it
int sizeOffset = 0, senderSize = 0;
while (sizeOffset < SOCKET_CTRL_SIZE) {
NCCLCHECK(ncclSocketProgress(r->op, r->ctrlSock, msg, SOCKET_CTRL_SIZE, &sizeOffset));
if (sizeOffset == 0) return ncclSuccess; /* not ready yet*/
}
memcpy(&senderSize, msg, SOCKET_CTRL_SIZE);
if (senderSize > r->size) {
char line[SOCKET_NAME_MAXLEN + 1];
union ncclSocketAddress addr;
NCCLCHECK(ncclSocketGetAddr(r->ctrlSock, &addr));
WARN("NET/Socket : peer %s message truncated : receiving %d bytes instead of %d. If you believe your socket network is in a healthy state, "
"there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks",
ncclSocketToString(&addr, line), senderSize, r->size);
return ncclInvalidUsage;
}
// copy to the data buffer if we have received some inline data already
int receivedInline = sizeOffset - SOCKET_CTRL_SIZE;
if (receivedInline > 0) memcpy(r->data, msg + SOCKET_CTRL_SIZE, receivedInline);
// from the actual size, extract the remaining inline size to be received and redirect the msg buffer to the user data
r->size = senderSize;
msgSize = ncclNetSocketInlineSize(r->size) - receivedInline;
msg = (uint8_t*)r->data + receivedInline;
}
r->size = data;
r->offset = 0;
r->used = 2; // done exchanging size
// divide into subtasks
int chunkOffset = 0, i = 0;
int offset = 0;
while (offset < msgSize) {
NCCLCHECK(ncclSocketProgress(r->op, r->ctrlSock, msg, msgSize, &offset));
if (offset == 0) return ncclSuccess; /* not ready yet*/
}
// done exchanging sizes, r->size now contains the actual size
r->used = 2;
r->offset = ncclNetSocketInlineSize(r->size);
int chunkOffset = r->offset, i = 0;
if (r->comm->nSocks > 0) {
// each request can be divided up to nSocks tasks
int taskSize = std::max(MIN_CHUNKSIZE, DIVUP(r->size, r->comm->nSocks));
// each request can be divided up to nSocks tasks, we use the size left to transfer
int taskSize = std::max((int)ncclParamSocketMinTaskSize(), DIVUP(r->size - r->offset, r->comm->nSocks));
while (chunkOffset < r->size) {
int chunkSize = std::min(taskSize, r->size-chunkOffset);
NCCLCHECK(ncclNetSocketGetTask(r->comm, &r->pInfo, r->op, (char*)(r->data)+chunkOffset, chunkSize, r->tasks+i++));
int chunkSize = std::min(taskSize, r->size - chunkOffset);
NCCLCHECK(ncclNetSocketGetTask(r->comm, &r->pInfo, r->op, (char*)(r->data) + chunkOffset, chunkSize, r->tasks + i++));
chunkOffset += chunkSize;
}
}
@@ -588,7 +616,7 @@ ncclResult_t ncclNetSocketTest(void* request, int* done, int* size) {
data.sock.fd = r->ctrlSock->fd;
data.sock.op = r->op;
data.sock.length = r->size;
ncclProfilerFunction(&r->pInfo.eHandle, 0, r->pInfo.pHandle, NCCL_PROFILER_NET_TYPE_SOCK | 1, &data);
ncclProfilerFunction(&r->pInfo.eHandle, ncclProfilerNetEventStart, r->pInfo.pHandle, NCCL_PROFILER_NET_TYPE_SOCK | 1, &data);
}
#endif
if (r->offset < r->size) {
@@ -599,7 +627,7 @@ ncclResult_t ncclNetSocketTest(void* request, int* done, int* size) {
*done = 1;
r->used = 0;
#ifdef NCCL_ENABLE_NET_PROFILING
ncclProfilerFunction(&r->pInfo.eHandle, 1, NULL, 0, NULL);
ncclProfilerFunction(&r->pInfo.eHandle, ncclProfilerNetEventStop, NULL, 0, NULL);
r->pInfo.eHandle = NULL;
#endif
}
@@ -673,6 +701,7 @@ ncclResult_t ncclNetSocketClose(void* opaqueComm) {
NCCLCHECK(ncclSocketReady(&comm->socks[i], &ready));
if (ready) NCCLCHECK(ncclSocketClose(&comm->socks[i]));
}
if(comm->inlineData) free(comm->inlineData);
free(comm);
}
return ncclSuccess;
+211 -49
Просмотреть файл
@@ -13,6 +13,7 @@
#include "enqueue.h"
#include "register.h"
#include "transport.h"
#include "register_inline.h"
#if CUDART_VERSION >= 12010
@@ -109,7 +110,9 @@ ncclResult_t nvlsGroupUnbind(struct ncclComm *comm, size_t size, CUmemGenericAll
}
ncclResult_t ncclNvlsDeregBuffer(struct ncclComm* comm, CUmemGenericAllocationHandle *mcHandler, CUdeviceptr ptr, int dev, size_t ucsize, size_t mcsize) {
CUCHECK(cuMulticastUnbind(*mcHandler, dev, 0/*mcOffset*/, ucsize));
// unbind can trigger RM error if buffer is freed already by users
// however, it is safe to ignore the error, and unbind will succeed anyway
CUCALL(cuMulticastUnbind(*mcHandler, dev, 0/*mcOffset*/, ucsize));
CUCHECK(cuMemUnmap(ptr, mcsize));
CUCHECK(cuMemAddressFree(ptr, mcsize));
CUCHECK(cuMemRelease(*mcHandler));
@@ -143,9 +146,9 @@ ncclResult_t nvlsGroupUnmapMem(struct ncclComm *comm, size_t ucsize, void* ucptr
#define NVLS_MEM_ALIGN_SIZE (1 << 21)
#define NVLS_NCHANNELS_SM90 16
#define NVLS_NCHANNELS_SM100 32
#define NVLS_NCHANNELS_SM100_NVL 24
NCCL_PARAM(NvlsEnable, "NVLS_ENABLE", 2);
NCCL_PARAM(NvlsChannels, "NVLS_NCHANNELS", -2);
NCCL_PARAM(NvlsChunkSize, "NVLS_CHUNKSIZE", 128*1024);
ncclResult_t ncclNvlsInit(struct ncclComm* comm) {
@@ -171,12 +174,31 @@ ncclResult_t ncclNvlsInit(struct ncclComm* comm) {
comm->nvlsSupport = 1;
}
INFO(NCCL_INIT, "NVLS multicast support is %savailable on dev %d", comm->nvlsSupport ? "" : "not ", dev);
if (comm->nvlsSupport) {
int channels = (comm->compCap >= 100) ? NVLS_NCHANNELS_SM100 : NVLS_NCHANNELS_SM90;
if (ncclParamNvlsChannels() >= 0) channels = ncclParamNvlsChannels();
int channels;
if (comm->compCap >= 100) {
// Use a reduced number of channels for single node/MNNVL domain on Blackwell.
// comm->nNodes is not yet initialized at this point so we need to use other data.
bool multiNode;
if (comm->MNNVL) {
multiNode = (comm->clique.size < comm->nRanks);
} else {
int i;
for (i = 1; i < comm->nRanks; i++) {
if (comm->peerInfo[i].hostHash != comm->peerInfo[0].hostHash)
break;
}
multiNode = (i < comm->nRanks);
}
channels = (multiNode ? NVLS_NCHANNELS_SM100 : NVLS_NCHANNELS_SM100_NVL);
} else {
channels = NVLS_NCHANNELS_SM90;
}
if (comm->config.nvlsCTAs != NCCL_CONFIG_UNDEF_INT) channels = comm->config.nvlsCTAs;
comm->nvlsChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, channels));
}
INFO(NCCL_INIT, "NVLS multicast support is %savailable on dev %d (NVLS_NCHANNELS %d)",
comm->nvlsSupport ? "" : "not ", dev, comm->nvlsChannels);
return ncclSuccess;
}
@@ -242,16 +264,33 @@ static ncclResult_t nvlsAllocateMem(struct ncclComm* comm, const CUmemAccessDesc
CUCHECKGOTO(cuMemAddressReserve((CUdeviceptr*)ucptr, ucsize, ucgran, 0U, 0), ret, fail);
// Alloc local physical mem for this NVLS group
CUCHECKGOTO(cuMemCreate(ucHandle, ucsize, &ucprop, 0), ret, fail);
CUCHECKGOTO(cuMemMap((CUdeviceptr)*ucptr, ucsize, 0, *ucHandle, 0), ret, fail);
CUCHECKGOTO(cuMemSetAccess((CUdeviceptr)*ucptr, ucsize, desc, 1), ret, fail);
CUDACHECKGOTO(cudaMemset(*ucptr, 0, ucsize), ret, fail);
CUCHECKGOTO(cuMemCreate(ucHandle, ucsize, &ucprop, 0), ret, fail1);
CUCHECKGOTO(cuMemMap((CUdeviceptr)*ucptr, ucsize, 0, *ucHandle, 0), ret, fail2);
CUCHECKGOTO(cuMemSetAccess((CUdeviceptr)*ucptr, ucsize, desc, 1), ret, fail3);
CUDACHECKGOTO(cudaMemset(*ucptr, 0, ucsize), ret, fail3);
// intra-node barrier to mitigate the possible hang in cuMulticastBindMem during abort
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail3);
// Bind physical memory to the Multicast group
// NB: It will block until all ranks have been added to the Group
CUCHECKGOTO(cuMulticastBindMem(*mcHandle, 0/*mcOffset*/, *ucHandle, 0/*memOffset*/, ucsize, 0/*flags*/), ret, fail);
// This is where we normally see issues if the system NVLS/Multicast support is broken
{
CUresult err = CUPFN(cuMulticastBindMem(*mcHandle, 0/*mcOffset*/, *ucHandle, 0/*memOffset*/, ucsize, 0/*flags*/));
if (err != CUDA_SUCCESS) {
const char *errStr; \
(void) pfn_cuGetErrorString(err, &errStr); \
if (ncclParamNvlsEnable() == 1) {
// Fail the job as NVLS support is not available
WARN("Failed to bind NVLink SHARP (NVLS) Multicast memory of size %ld : CUDA error %d '%s'.\nThis is usually caused by a system or configuration error in the Fabric Manager or NVSwitches.\nDo not force-enable NVLS (NCCL_NVLS_ENABLE=1) if you wish to avoid this error in the future.", ucsize, err, errStr );
ret = ncclUnhandledCudaError;
} else {
// Continue without NVLS support (returns ncclSuccess)
INFO(NCCL_INIT|NCCL_NVLS, "Failed to bind NVLink SHARP (NVLS) Multicast memory of size %ld : CUDA error %d '%s'. Proceeding without NVLS support.", ucsize, err, errStr);
}
comm->nvlsSupport = comm->nvlsChannels = 0;
goto fail3;
}
}
// Map mc virtual address
CUCHECKGOTO(cuMemAddressReserve((CUdeviceptr*)mcptr, mcsize, mcgran, 0U, 0), ret, fail);
@@ -263,6 +302,12 @@ static ncclResult_t nvlsAllocateMem(struct ncclComm* comm, const CUmemAccessDesc
exit:
return ret;
fail3:
CUCHECK(cuMemUnmap((CUdeviceptr)*ucptr, ucsize));
fail2:
CUCHECK(cuMemRelease(*ucHandle));
fail1:
CUCHECK(cuMemAddressFree((CUdeviceptr)*ucptr, ucsize));
fail:
if (allocMcHandle && *mcptr == NULL && *ucptr == NULL) CUCHECK(cuMemRelease(*mcHandle));
goto exit;
@@ -291,8 +336,8 @@ ncclResult_t ncclNvlsBufferSetup(struct ncclComm* comm) {
nvlsPerRankSize = nChannels * 2 * buffSize;
nvlsTotalSize = nvlsPerRankSize * nHeads;
INFO(NCCL_INIT | NCCL_NVLS, "NVLS comm %p headRank %d nHeads %d buffSize %zu nvlsPerRankSize %zu nvlsTotalSize %zu",
comm, headRank, nHeads, buffSize, nvlsPerRankSize, nvlsTotalSize);
INFO(NCCL_INIT | NCCL_NVLS, "NVLS comm %p headRank %d nHeads %d nvlsRanks %d buffSize %zu nvlsPerRankSize %zu nvlsTotalSize %zu",
comm, headRank, nHeads, comm->localRanks, buffSize, nvlsPerRankSize, nvlsTotalSize);
NCCLCHECKGOTO(nvlsAllocateMem(comm, &resources->accessDesc, nvlsTotalSize, &resources->ucBuffHandle, &resources->mcBuffHandle, (void**)&resources->ucBuff, (void**)&resources->mcBuff, &resources->buffUCSize, &resources->buffMCSize), res, fail);
@@ -338,32 +383,10 @@ ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent) {
size_t typeSize;
char shmPath[sizeof("/dev/shm/nccl-XXXXXX")];
uintptr_t *nvlsShmem = NULL;
bool nvlsShare = parent && parent->nvlsSupport && parent->config.splitShare;
int nHeads = comm->channels[0].nvls.nHeads;
bool nvlsShare = parent && parent->nvlsSupport && parent->shareResources && parent->localRanks == comm->localRanks;
if (comm->nvlsSupport == 0 || comm->nvlsChannels == 0) return ncclSuccess;
if (nvlsShare && parent->channels[0].nvls.nHeads == nHeads) {
for (int ch = 0; ch < nHeads; ++ch) {
bool find = false;
for (int h = 0; h < parent->channels[0].nvls.nHeads; ++h) {
if (comm->nvlsHeads[ch] == parent->nvlsHeads[h]) {
// find the head
find = true;
break;
}
}
if (find == false) {
nvlsShare = false;
goto setup;
}
}
nvlsShare = true;
} else {
nvlsShare = false;
}
setup:
comm->nvlsChunkSize = ncclParamNvlsChunkSize();
if (nvlsShare) {
/* reuse NVLS resources */
@@ -387,9 +410,10 @@ setup:
comm->nvlsResources->inited = false;
comm->nvlsResources->refCount = 1;
comm->nvlsResources->nChannels = comm->nvlsChannels;
comm->nvlsResources->nHeads = nHeads;
resources = comm->nvlsResources;
if (parent && parent->nvlsSupport && parent->config.splitShare) {
if (parent && parent->nvlsSupport && parent->shareResources) {
/* ranks on other nodes might share the NVLS resources, we need to cap nvlsChannels
* to make sure nvlsChannels match for each rank. */
comm->nvlsChannels = std::min(comm->nvlsChannels, parent->nvlsResources->nChannels);
@@ -529,9 +553,9 @@ ncclResult_t tryRegisterBuffer(struct ncclComm *comm, uintptr_t userBuff, size_t
if (userBuff) {
NCCLCHECKGOTO(ncclRegFind(comm, (void*)userBuff, buffSize, &regRecord), ret, fail);
if (regRecord) {
CUDACHECKGOTO(cudaPointerGetAttributes(&attr, (void*)regRecord->addr), ret, fail);
CUDACHECKGOTO(cudaPointerGetAttributes(&attr, (void*)regRecord->begAddr), ret, fail);
if (attr.type == cudaMemoryTypeDevice) {
size_t regSize = regRecord->pages * comm->regCache.pageSize;
size_t regSize = regRecord->endAddr - regRecord->begAddr;
memset(&mcprop, 0, sizeof(CUmulticastObjectProp));
mcprop.numDevices = comm->localRanks;
mcprop.handleTypes = ncclCuMemHandleType;
@@ -546,8 +570,7 @@ ncclResult_t tryRegisterBuffer(struct ncclComm *comm, uintptr_t userBuff, size_t
ucprop.requestedHandleTypes = ncclCuMemHandleType;
CUCHECKGOTO(cuMemGetAllocationGranularity(&ucgran, &ucprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED), ret, fail);
CUCHECKGOTO(cuMemGetAddressRange((CUdeviceptr*)&regRecord->baseAddr, &regRecord->baseSize, (CUdeviceptr)regRecord->addr), ret, fail);
if (regRecord->addr % ucgran == 0) {
if (regRecord->begAddr % ucgran == 0) {
if (regSize % ucgran != 0) {
regRecord->regUCSize = ALIGN_SIZE(regSize, ucgran);
} else {
@@ -555,7 +578,7 @@ ncclResult_t tryRegisterBuffer(struct ncclComm *comm, uintptr_t userBuff, size_t
}
regRecord->state |= NVLS_REG_POSSIBLE;
memcpy(&regData[comm->localRank].reg, regRecord, sizeof(struct ncclReg));
regData[comm->localRank].offset = userBuff - regRecord->addr;
regData[comm->localRank].offset = userBuff - regRecord->begAddr;
}
}
@@ -595,7 +618,7 @@ ncclResult_t tryRegisterBuffer(struct ncclComm *comm, uintptr_t userBuff, size_t
// Coverity complains that regRecord could be NULL. That won't in practice be the case because we've already checked
// (regData[i].reg.state & NVLS_REG_POSSIBLE) of all local ranks, which would catch it and bail out.
// coverity[var_deref_op]
CUCHECKGOTO(cuMulticastBindAddr(mcHandle, 0, (CUdeviceptr)regRecord->addr, ucsize, 0), ret, fail);
CUCHECKGOTO(cuMulticastBindAddr(mcHandle, 0, (CUdeviceptr)regRecord->begAddr, ucsize, 0), ret, fail);
// Create a VA for the NVLS
CUCHECKGOTO(cuMemAddressReserve(&regPtr, mcsize, mcgran, 0U, 0), ret, fail);
@@ -610,7 +633,7 @@ ncclResult_t tryRegisterBuffer(struct ncclComm *comm, uintptr_t userBuff, size_t
regRecord->mcHandle = mcHandle;
regRecord->state |= NVLS_REG_COMPLETE;
/* get all buffer addresses */
regRecord->caddrs[comm->localRank] = regRecord->addr;
regRecord->caddrs[comm->localRank] = regRecord->begAddr;
NCCLCHECKGOTO(ncclShmemAllgather(comm, &comm->nvlsResources->nvlsShmem, regRecord->caddrs + comm->localRank, regRecord->caddrs, sizeof(uintptr_t)), ret, fail);
/* Although registration is done, we still need to check whether the offsets are same among ranks. */
@@ -642,23 +665,23 @@ static ncclResult_t nvlsRegisterBuffer(struct ncclComm *comm, const void *sendbu
if (sendRegRecord) {
memcpy(&regData[comm->localRank * 2].reg, sendRegRecord, sizeof(struct ncclReg));
regData[comm->localRank * 2].offset = (uintptr_t)sendbuff - sendRegRecord->addr;
regData[comm->localRank * 2].offset = (uintptr_t)sendbuff - sendRegRecord->begAddr;
}
if (recvRegRecord) {
memcpy(&regData[comm->localRank * 2 + 1].reg, recvRegRecord, sizeof(struct ncclReg));
regData[comm->localRank * 2 + 1].offset = (uintptr_t)recvbuff - recvRegRecord->addr;
regData[comm->localRank * 2 + 1].offset = (uintptr_t)recvbuff - recvRegRecord->begAddr;
}
NCCLCHECKGOTO(ncclShmemAllgather(comm, &comm->nvlsResources->nvlsShmem, regData + comm->localRank * 2, regData, sizeof(struct localRegData) * 2), ret, fail);
/* first check whether all local ranks find their registered buffer */
for (int i = 0; i < comm->localRanks; ++i) {
if ((regData[i * 2].reg.state & NVLS_REG_COMPLETE) == 0 || regData[comm->localRank * 2].reg.caddrs[i] != regData[i * 2].reg.addr) {
if ((regData[i * 2].reg.state & NVLS_REG_COMPLETE) == 0 || regData[comm->localRank * 2].reg.caddrs[i] != regData[i * 2].reg.begAddr) {
sendNeedReg = true;
}
if ((regData[i * 2 + 1].reg.state & NVLS_REG_COMPLETE) == 0 || regData[comm->localRank * 2 + 1].reg.caddrs[i] != regData[i * 2 + 1].reg.addr) {
if ((regData[i * 2 + 1].reg.state & NVLS_REG_COMPLETE) == 0 || regData[comm->localRank * 2 + 1].reg.caddrs[i] != regData[i * 2 + 1].reg.begAddr) {
recvNeedReg = true;
}
@@ -787,7 +810,7 @@ ncclResult_t ncclNvlsGraphRegisterBuffer(
NCCLCHECK(ncclCommGraphRegister(comm, baseRecv, baseRecvSize, (void**)&recvRegRecord));
}
NCCLCHECK(nvlsRegisterBuffer(comm, baseSend, baseRecv, baseSendSize, baseRecvSize, sendRegRecord, recvRegRecord, outRegBufUsed, outRegBufSend, outRegBufRecv));
NCCLCHECK(nvlsRegisterBuffer(comm, sendbuff, recvbuff, sendbuffSize, recvbuffSize, sendRegRecord, recvRegRecord, outRegBufUsed, outRegBufSend, outRegBufRecv));
if (*outRegBufUsed) {
if (sendRegRecord) {
@@ -815,6 +838,124 @@ ncclResult_t ncclNvlsGraphRegisterBuffer(
return ncclSuccess;
}
ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm) {
ncclResult_t ret = ncclSuccess;
if (comm && comm->nvlsSupport) {
CUmulticastObjectProp mcprop = {};
CUmemGenericAllocationHandle mcHandle;
char shareableHandle[NVLS_HANDLE_SIZE];
CUmemAccessDesc accessDesc = {};
mcprop.numDevices = comm->localRanks;
mcprop.handleTypes = ncclCuMemHandleType;
mcprop.flags = 0;
mcprop.size = comm->baseStride;
if (comm->localRank == 0) {
NCCLCHECKGOTO(nvlsGroupCreate(comm, &mcprop, comm->localRank, comm->localRanks, &mcHandle, shareableHandle), ret, fail);
NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
} else {
NCCLCHECKGOTO(bootstrapIntraNodeBroadcast(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, 0, shareableHandle, NVLS_HANDLE_SIZE), ret, fail);
NCCLCHECKGOTO(nvlsGroupConnect(comm, shareableHandle, comm->localRankToRank[0], &mcHandle), ret, fail);
}
CUCHECKGOTO(cuMulticastAddDevice(mcHandle, comm->cudaDev), ret, fail);
CUCHECKGOTO(cuMemAddressReserve((CUdeviceptr*)&comm->baseMCSymPtr, comm->baseStride, NCCL_MAX_PAGE_SIZE, 0, 0), ret, fail);
CUCHECKGOTO(cuMemMap((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride, 0, mcHandle, 0), ret, fail);
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = comm->cudaDev;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECKGOTO(cuMemSetAccess((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride, &accessDesc, 1), ret, fail);
comm->symMCHandle = mcHandle;
}
exit:
return ret;
fail:
goto exit;
}
ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm) {
ncclResult_t ret = ncclSuccess;
if (comm && comm->nvlsSupport && comm->baseMCSymPtr) {
CUCHECKGOTO(cuMemUnmap((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride), ret, fail);
CUCHECKGOTO(cuMemAddressFree((CUdeviceptr)comm->baseMCSymPtr, comm->baseStride), ret, fail);
CUCHECKGOTO(cuMemRelease(comm->symMCHandle), ret, fail);
}
exit:
return ret;
fail:
goto exit;
}
ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr) {
ncclResult_t ret = ncclSuccess;
assert((uintptr_t)ucaddr % NCCL_REC_PAGE_SIZE == 0 && ucsize % NCCL_REC_PAGE_SIZE == 0);
if (comm && comm->nvlsSupport && ucaddr && ucsize > 0) {
CUCHECKGOTO(cuMulticastBindAddr(comm->symMCHandle, offset, (CUdeviceptr)ucaddr, ucsize, 0), ret, fail);
INFO(NCCL_ALLOC, "NVLS symmetric alloc mc buffer ptr %p offset %ld UC addr %p UC size %ld symAllocHead %ld", comm->baseMCSymPtr + offset, offset, ucaddr, ucsize, comm->symAllocHead);
}
exit:
return ret;
fail:
goto exit;
}
ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr) {
ncclResult_t ret = ncclSuccess;
if (comm && comm->nvlsSupport && ucaddr && ucsize > 0) {
size_t offset = (size_t)ucaddr - ((size_t)comm->baseUCSymPtr + comm->localRank * comm->baseStride);
CUCHECKGOTO(cuMulticastUnbind(comm->symMCHandle, comm->cudaDev, offset, ucsize), ret, fail);
}
exit:
return ret;
fail:
goto exit;
}
ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels) {
int factor;
ncclResult_t ret = ncclSuccess;
if (comm->nNodes == 1) {
if (info->func == ncclFuncReduceScatter) {
factor = (comm->compCap >= 100 ? 6 : 5) * 8;
*recChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, DIVUP(factor, comm->nvlsResources->nHeads)));
} else if (info->func == ncclFuncAllGather) {
factor = 4 * 8;
*recChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, DIVUP(factor, comm->nvlsResources->nHeads)));
} else if (info->func == ncclFuncAllReduce) {
if (comm->compCap >= 100) {
factor = 8 * 8;
} else {
factor = 4 * 8;
}
*recChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, DIVUP(factor, comm->nvlsResources->nHeads)));
} else {
goto fail;
}
} else {
// Further tweaks for Blackwell with NVLS registered buffers
if (info->func == ncclFuncReduceScatter) {
factor = (comm->bandwidths[ncclFuncReduceScatter][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] > 400 ? 7 : 6) * 8;
*recChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, DIVUP(factor, comm->nvlsResources->nHeads)));
} else if (info->func == ncclFuncAllGather) {
factor = 6 * 8;
*recChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, DIVUP(factor, comm->nvlsResources->nHeads)));
} else if (info->func == ncclFuncAllReduce) {
factor = (comm->compCap >= 100 ? 7 : 6) * 8;
*recChannels = std::max(comm->config.minCTAs, std::min(comm->config.maxCTAs, DIVUP(factor, comm->nvlsResources->nHeads)));
} else {
goto fail;
}
}
exit:
return ret;
fail:
ret = ncclInvalidArgument;
goto exit;
}
#else
/*
@@ -860,4 +1001,25 @@ ncclResult_t ncclNvlsDeregBuffer(struct ncclComm* comm, CUmemGenericAllocationHa
return ncclSuccess;
}
ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm) {
return ncclSuccess;
}
ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr) {
return ncclSuccess;
}
ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr) {
return ncclSuccess;
}
ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm) {
return ncclSuccess;
}
ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels) {
*recChannels = 0;
return ncclSuccess;
}
#endif /* CUDA_VERSION >= 12010 */
+88 -5
Просмотреть файл
@@ -12,6 +12,7 @@
#include "transport.h"
#include <assert.h>
#include "shm.h"
#include "register_inline.h"
enum p2pType { P2P_DIRECT, P2P_INTERMEDIATE, P2P_IPC, P2P_CUMEM };
@@ -826,7 +827,7 @@ ncclResult_t ret = ncclSuccess;
// We already have IPC info for peerLocalRank, no need to register it, we can reuse it
*regBufFlag = 1;
if (isLegacyIpc) *isLegacyIpc = regRecord->ipcInfos[peerLocalRank]->impInfo.legacyIpcCap;
INFO(NCCL_REG, "rank %d - IPC reuse buffer %p size %ld (baseAddr %p size %ld) to peer %d regAddr %p", comm->rank, userbuff, buffSize, (void*)regRecord->addr, regRecord->pages * comm->regCache.pageSize, peerRank, regRecord->ipcInfos[peerLocalRank]->impInfo.rmtRegAddr);
INFO(NCCL_REG, "rank %d - IPC reuse buffer %p size %ld (baseAddr %p size %ld) to peer %d regAddr %p", comm->rank, userbuff, buffSize, (void*)regRecord->begAddr, regRecord->endAddr - regRecord->begAddr, peerRank, regRecord->ipcInfos[peerLocalRank]->impInfo.rmtRegAddr);
} else {
// Register buffer with peerLocalRank
struct ncclProxyConnector* proxyConn = NULL;
@@ -885,11 +886,11 @@ ncclResult_t ret = ncclSuccess;
void* rmtRegAddr = NULL;
ipcInfo.size = baseSize;
ipcInfo.offset = regRecord->addr - (uintptr_t)baseAddr;
ipcInfo.offset = regRecord->begAddr - (uintptr_t)baseAddr;
// Now ipcInfo contains all necessary registration info. Start to register buffer on proxy side
// and get the remote register address back.
if (proxyConn) {
INFO(NCCL_REG, "rank %d - IPC registering buffer %p size %ld (baseAddr %p size %ld) to peer %d", comm->rank, userbuff, buffSize, (void*)regRecord->addr, ipcInfo.size, peerRank);
INFO(NCCL_REG, "rank %d - IPC registering buffer %p size %ld (baseAddr %p size %ld) to peer %d", comm->rank, userbuff, buffSize, (void*)regRecord->begAddr, ipcInfo.size, peerRank);
NCCLCHECKGOTO(ncclProxyCallBlocking(comm, proxyConn, ncclProxyMsgRegister, &ipcInfo, sizeof(p2pIpcExpInfo), &rmtRegAddr, sizeof(void*)), ret, fail);
}
if (rmtRegAddr) {
@@ -909,7 +910,7 @@ ncclResult_t ret = ncclSuccess;
regRecord->regIpcAddrs.hostPeerRmtAddrs[peerLocalRank] = (uintptr_t)rmtRegAddr;
needUpdate = true;
*regBufFlag = 1;
INFO(NCCL_REG, "rank %d - IPC registered buffer %p size %ld (baseAddr %p size %ld) to peer %d regAddr %p offsetOut %ld", comm->rank, userbuff, buffSize, (void*)regRecord->addr, ipcInfo.size, peerRank, rmtRegAddr, (uintptr_t)userbuff - regRecord->addr);
INFO(NCCL_REG, "rank %d - IPC register buffer %p size %ld (baseAddr %p size %ld) to peer %d regAddr %p offsetOut %ld", comm->rank, userbuff, buffSize, (void*)regRecord->begAddr, ipcInfo.size, peerRank, rmtRegAddr, (uintptr_t)userbuff - regRecord->begAddr);
}
}
}
@@ -935,7 +936,7 @@ ncclResult_t ret = ncclSuccess;
// p2p always returns remote addr here since remote buffer addr is passed in ncclDevWorkP2p struct
peerRmtAddrs = (uintptr_t*)regRecord->regIpcAddrs.hostPeerRmtAddrs[peerLocalRank];
}
*offsetOut = (uintptr_t)userbuff - regRecord->addr;
*offsetOut = (uintptr_t)userbuff - regRecord->begAddr;
*peerRmtAddrsOut = peerRmtAddrs;
}
}
@@ -1117,6 +1118,88 @@ fail:
goto exit;
}
ncclResult_t ncclIpcSymmetricInit(struct ncclComm* comm) {
CUCHECK(cuMemAddressReserve((CUdeviceptr*)&comm->baseUCSymPtr, comm->baseStride * comm->localRanks, NCCL_MAX_PAGE_SIZE, 0, 0));
return ncclSuccess;
}
ncclResult_t ncclIpcSymmetricFinalize(struct ncclComm* comm) {
if (comm->baseUCSymPtr) {
CUCHECK(cuMemAddressFree((CUdeviceptr)comm->baseUCSymPtr, comm->baseStride * comm->localRanks));
}
return ncclSuccess;
}
ncclResult_t ncclIpcSymmetricMap(struct ncclComm* comm, size_t offset, size_t size, CUmemGenericAllocationHandle memHandle, void** symPtr) {
ncclResult_t ret = ncclSuccess;
CUmemGenericAllocationHandle impHandle;
int impFd = -1;
ncclCuDesc* desc = NULL;
CUmemAccessDesc accessDesc = {};
assert(offset % NCCL_REC_PAGE_SIZE == 0 && size % NCCL_REC_PAGE_SIZE == 0);
NCCLCHECKGOTO(ncclCalloc(&desc, comm->localRanks), ret, fail);
if (ncclCuMemHandleType == CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) {
memcpy(&desc[comm->localRank].data, &memHandle, sizeof(CUmemGenericAllocationHandle));
} else {
CUCHECKGOTO(cuMemExportToShareableHandle(&desc[comm->localRank].handle, memHandle, ncclCuMemHandleType, 0), ret, fail);
}
NCCLCHECKGOTO(bootstrapIntraNodeAllGather(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, desc, sizeof(ncclCuDesc)), ret, fail);
// start mapping
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = comm->cudaDev;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
for (int r = 0; r < comm->localRanks; ++r) {
CUdeviceptr maddr;
if (r == comm->localRank) {
impHandle = memHandle;
} else {
if (ncclCuMemHandleType == CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) {
impFd = -1;
NCCLCHECKGOTO(ncclProxyClientGetFdBlocking(comm, comm->localRankToRank[r], &desc[r].data, &impFd), ret, fail);
CUCHECKGOTO(cuMemImportFromShareableHandle(&impHandle, (void*)(uintptr_t)impFd, ncclCuMemHandleType), ret, fail);
SYSCHECKGOTO(close(impFd), "close", ret, fail);
} else {
CUCHECKGOTO(cuMemImportFromShareableHandle(&impHandle, (void*)&desc[r].handle, ncclCuMemHandleType), ret, fail);
}
}
maddr = (CUdeviceptr)(comm->baseUCSymPtr + (size_t)r * comm->baseStride + offset);
CUCHECKGOTO(cuMemMap(maddr, size, 0, impHandle, 0), ret, fail);
CUCHECKGOTO(cuMemSetAccess(maddr, size, &accessDesc, 1), ret, fail);
if (r == comm->localRank) {
*symPtr = (void*)maddr;
} else {
CUCHECKGOTO(cuMemRelease(impHandle), ret, fail);
}
}
INFO(NCCL_ALLOC, "IPC symmetric alloc buffer %p offset %ld size %ld symAllocHead %ld", *symPtr, offset, size, comm->symAllocHead);
exit:
free(desc);
return ret;
fail:
goto exit;
}
ncclResult_t ncclIpcSymmetricFree(struct ncclComm* comm, size_t size, void* symPtr) {
ncclResult_t ret = ncclSuccess;
if (comm && symPtr && size > 0) {
size_t offset = (size_t)symPtr - ((size_t)comm->baseUCSymPtr + comm->localRank * comm->baseStride);
for (int r = 0; r < comm->localRanks; ++r) {
CUdeviceptr peerAddr = (CUdeviceptr)(comm->baseUCSymPtr + r * comm->baseStride + offset);
CUCHECKGOTO(cuMemUnmap(peerAddr, size), ret, fail);
}
}
exit:
return ret;
fail:
goto exit;
}
struct ncclTransport p2pTransport = {
"P2P",
p2pCanConnect,
+7 -6
Просмотреть файл
@@ -6,6 +6,7 @@
#include "transport.h"
#include "proxy.h"
#include "profiler.h"
#include "device.h"
static ncclResult_t profilerProxyConnect(struct ncclProxyConnection* connection, struct ncclProxyState* proxyState, void* reqBuff, int reqSize, void* respBuff, int respSize, int* done) {
connection->proxyAppendPtr = &connection->proxyAppend;
@@ -29,15 +30,15 @@ static ncclResult_t profilerProxyProgress(struct ncclProxyState* proxyState, str
if (args->state == ncclProxyOpProgress) {
for (int s = 0; s < args->nsubs; s++) {
struct ncclProxySubArgs* sub = args->subs + s;
uint64_t* workStarted = (uint64_t *)sub->sendbuff;
uint64_t* workCompleted = (uint64_t *)sub->recvbuff;
if (sub->posted < sub->nsteps && sub->base <= workStarted[sub->channelId]) {
ncclProfilerStartKernelChEvent(args, s);
struct ncclDevProfiler* workStarted = (struct ncclDevProfiler *)sub->sendbuff;
struct ncclDevProfiler* workCompleted = (struct ncclDevProfiler *)sub->recvbuff;
if (sub->posted < sub->nsteps && sub->base <= workStarted[sub->channelId].data[sub->base%MAX_PROFILER_EVENTS_PER_CHANNEL].counter) {
ncclProfilerStartKernelChEvent(args, s, workStarted[sub->channelId].data[sub->base%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp);
sub->posted = sub->nsteps;
continue; // allow events on every channel to start
}
if (sub->transmitted < sub->nsteps && sub->base <= workCompleted[sub->channelId]) {
ncclProfilerStopKernelChEvent(args, s);
if (sub->transmitted < sub->nsteps && sub->base <= workCompleted[sub->channelId].data[sub->base%MAX_PROFILER_EVENTS_PER_CHANNEL].counter) {
ncclProfilerStopKernelChEvent(args, s, workCompleted[sub->channelId].data[sub->base%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp);
sub->transmitted = sub->nsteps;
args->done++;
}
+1 -1
Просмотреть файл
@@ -10,7 +10,7 @@
#include "transport.h"
#define SHM_PATH_MAX 128
#define SHM_HANDLE_TYPE CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR
#define SHM_HANDLE_TYPE ncclCuMemHandleType
struct shmBuffInfo {
void *hptr;