db3bfd118f
Network user buffer support for collectives
* Leverage user buffer registration to achieve zero-copy
inter-node communications for Ring, NVLS and Collnet
Add RAS subsystem
* Create a RAS thread keeping track of all NCCL communicators.
* Add a ncclras tool contacting the RAS thread and getting a
report.
Add fp8 support
* Add support for e5m2 and e4m3 8-bit floating point operations.
* Use Tree/PAT algorithms when possible for better numerical
stability.
Add NIC fusion
* Add a NET API to ask the network plugin to fuse a set of
interfaces together.
* Fuse multiple NICs under the same PCI switch as a single,
larger NIC.
Socket connection failure retry
* Retry in case of socket connection failure (unreachable host)
* Avoid "Software caused connection abort" errors on retries
QP connection failure retry
* Retry in case of IB QP connection failure during ibv_modify_qp.
NET API improvements
* Allow plugins to force a flush in case data and completion
ordering is not guaranteed.
* Indicate when completion is not needed (e.g. for the LL128
protocol), allowing plugins to skip generating a completion.
* Allow for full offload of allgather operations when using one
GPU per node.
NCCL_ALGO/NCCL_PROTO strict enforcement
* Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify
ALGO/PROTO filters for each collective operation.
* Strictly enforce the ALGO/PROTO filters, no longer fall back
on the ring algorithm when the filtering leaves no option and
error out instead.
Enable CUMEM host allocations
* Use cumem functions for host memory allocation by default.
Improved profiler plugin API
* Avoid dependencies with NCCL includes.
* Add information on whether the buffer is registered or not
Adjust PAT tuning
* Improve transition between PAT and ring at scale.
Fix hangs when running with different CPU architectures
* Detect when we use a mix of GPU architectures
* Ensure Algo/Proto decisions are made based on that unified
state.
Fix FD leak in UDS
* Fix a leak when mapping buffers intra-node with cumem IPCs.
Fix crash when mixing buffer registration and graph buffer registration.
* Separate local and graph registration to avoid crashes when we free
buffers.
Fix user buffer registration with dmabuf
* Make ncclSend/ncclRecv communication with buffer registration functional
on network plugins relying on dmabuf for buffer registration.
Fix crash in IB code caused by uninitialized fields.
Fix non-blocking ncclSend/ncclRecv
* Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking
mode even though the operation was not enqueued onto the stream.
* Issue #1495
Various compiler tweaks and fixes
* PR #758
Fix typo in ncclTopoPrintGraph
* Issue #1468
[ROCm/rccl commit: 6aae379278]
44 行
1.4 KiB
C
44 行
1.4 KiB
C
/*************************************************************************
|
|
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
|
|
*
|
|
* See LICENSE.txt for license information
|
|
************************************************************************/
|
|
|
|
#ifndef NCCL_INT_DEBUG_H_
|
|
#define NCCL_INT_DEBUG_H_
|
|
|
|
#include "nccl.h"
|
|
#include "nccl_common.h"
|
|
#include <stdio.h>
|
|
|
|
#include <pthread.h>
|
|
|
|
// Conform to pthread and NVTX standard
|
|
#define NCCL_THREAD_NAMELEN 16
|
|
|
|
extern int ncclDebugLevel;
|
|
extern FILE *ncclDebugFile;
|
|
|
|
void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *filefunc, int line, const char *fmt, ...) __attribute__ ((format (printf, 5, 6)));
|
|
|
|
// Let code temporarily downgrade WARN into INFO
|
|
extern thread_local int ncclDebugNoWarn;
|
|
extern char ncclLastError[];
|
|
|
|
#define VERSION(...) ncclDebugLog(NCCL_LOG_VERSION, NCCL_ALL, __FILE__, __LINE__, __VA_ARGS__)
|
|
#define WARN(...) ncclDebugLog(NCCL_LOG_WARN, NCCL_ALL, __FILE__, __LINE__, __VA_ARGS__)
|
|
#define INFO(FLAGS, ...) ncclDebugLog(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
|
|
#define TRACE_CALL(...) ncclDebugLog(NCCL_LOG_TRACE, NCCL_CALL, __func__, __LINE__, __VA_ARGS__)
|
|
|
|
#ifdef ENABLE_TRACE
|
|
#define TRACE(FLAGS, ...) ncclDebugLog(NCCL_LOG_TRACE, (FLAGS), __func__, __LINE__, __VA_ARGS__)
|
|
#else
|
|
#define TRACE(...)
|
|
#endif
|
|
|
|
void ncclSetThreadName(pthread_t thread, const char *fmt, ...);
|
|
|
|
void ncclResetDebugInit();
|
|
|
|
#endif
|