Граф коммитов

147 Коммитов

Автор SHA1 Сообщение Дата
Nilesh M Negi 329e13efff Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
* Revert "[SRC] Enable unroll=1 for gfx950 (#1602)"
This reverts commit 307bc10781.

* Update Changelog

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-04-30 23:33:08 -05:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 82afb2bcfe Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
2025-04-23 15:44:56 -04:00
Tim 9a55ff60a9 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT
2025-04-19 00:21:27 -04:00
Pedram Alizadeh e40ff4f84a all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627)
* Enabling LL128 by default on MI300

* Add missing CUDACHECK

* Adjust BW correction factors to fix the Tree->Ring switching point

* Refactor and add ll128 AR logarithmic factor to tuning models

* Move RCCL tuning changes to a separate file 

* Use enum for tunable indexing

* Use explicit indexing in tuning models to avoid mismatch issues

* Place rcclGetSizePerRank in a function

* Remove HIP ifdef for rccl-only call

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-04-10 11:43:54 -04:00
Mustafa Abduljabbar 4be06f04d8 Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618)
* Enable LL/LL128 cutoff points in tuning models

* Initializing ll/ll128 model cutoffs for MI300

* Use RCCL_LL_LIMITS_UNDEFINED

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
2025-04-02 16:26:23 -04:00
Bertan Dogancay 532f54c244 Merge pull request #1559 from BertanDogancay/2.23
[SYNC] 2.23.4-1
2025-03-28 17:06:56 -04:00
Nilesh M Negi 307bc10781 [SRC] Enable unroll=1 for gfx950 (#1602)
* [SRC] Enable unroll=1 for gfx950

* Fix typo from rebase in generate.py

* Support for unroll=1 and gfx90a when building for all GPU targets

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-27 18:21:35 -05:00
BertanDogancay 0b2062c560 Merge remote-tracking branch 'nccl/master' into develop 2025-03-27 12:53:04 -05:00
Mustafa Abduljabbar f67b2cc908 Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 (#1604)
* Add reduce_scatter LL and LL128 thresholds

* Always honor user choice for protocol
2025-03-17 11:21:01 -04:00
Wenkai Du 245c2de909 Enable LL128 on gfx942 (#1549) 2025-03-16 15:10:05 -07:00
Bertan Dogancay d88cca3098 [Transport] Fix IntraNet (#1582) 2025-03-04 13:30:36 -05:00
Bertan Dogancay 85eb1f16bc Use bit reversal based mapping for multi-node (#1572) 2025-02-26 09:48:03 -05:00
Bertan Dogancay 387c973b5d [P2P] Have connIdx for both send and recv (#1524) 2025-02-04 11:53:20 -05:00
Wenkai Du a5c6b547a2 Add back opCount and channel ID to debug trace (#1520) 2025-02-03 08:55:27 -08:00
Wenkai Du caba0bc049 Add HDP flush for gfx940 (#1434)
* Fix collective trace

* Use nontemporal for st_global

* Fix previous commit

* Add HDP flush to data receive path

* Fix previous commit

* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH

* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH

Both are on by default. Turn both off will skip all flush will likely
result in data error.

* Enable GDR copy by default

* Remove GDR flush env var because it is disabled by GDC flush

* Output kernel collective trace at comm destroy by default

* Limit kernel timeout messages to 100

* Use system relaxed atomic for loadInt

* Refine timeout messages and use atomic for setting offset from CPU

* Add kernel trace for barrier timeout

* Add backup barrier to avoid race in atomicAdd

* Use different counters for different warps

* Rework barrier implementation

* Fix for other GFX

* Use __hip_atomic_store and __hip_atomic_load

* Fix bug in previous commit

* Don't reset barrier values in running kernel

* Update trace format

* Fix typo

* Switch back to hip_atomic_fetch_add

* Use same barrier implementation for all GFX

* Remove extra threadfence

* Turn off HDP flush by default

Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush

* Remove unnecessary changes from alterative barrier implementation

* Added back __threadfence_block

* Revert back to threadfence for gfx other than gfx94x
2025-01-31 07:51:10 -08:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
Sylvain Jeaugey 6aae379278 2.24.3-1
Network user buffer support for collectives
 * Leverage user buffer registration to achieve zero-copy
   inter-node communications for Ring, NVLS and Collnet

Add RAS subsystem
 * Create a RAS thread keeping track of all NCCL communicators.
 * Add a ncclras tool contacting the RAS thread and getting a
   report.

Add fp8 support
 * Add support for e5m2 and e4m3 8-bit floating point operations.
 * Use Tree/PAT algorithms when possible for better numerical
   stability.

Add NIC fusion
 * Add a NET API to ask the network plugin to fuse a set of
   interfaces together.
 * Fuse multiple NICs under the same PCI switch as a single,
   larger NIC.

Socket connection failure retry
 * Retry in case of socket connection failure (unreachable host)
 * Avoid "Software caused connection abort" errors on retries

QP connection failure retry
 * Retry in case of IB QP connection failure during ibv_modify_qp.

NET API improvements
 * Allow plugins to force a flush in case data and completion
   ordering is not guaranteed.
 * Indicate when completion is not needed (e.g. for the LL128
   protocol), allowing plugins to skip generating a completion.
 * Allow for full offload of allgather operations when using one
   GPU per node.

NCCL_ALGO/NCCL_PROTO strict enforcement
 * Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify
   ALGO/PROTO filters for each collective operation.
 * Strictly enforce the ALGO/PROTO filters, no longer fall back
   on the ring algorithm when the filtering leaves no option and
   error out instead.

Enable CUMEM host allocations
 * Use cumem functions for host memory allocation by default.

Improved profiler plugin API
 * Avoid dependencies with NCCL includes.
 * Add information on whether the buffer is registered or not

Adjust PAT tuning
 * Improve transition between PAT and ring at scale.

Fix hangs when running with different CPU architectures
 * Detect when we use a mix of GPU architectures
 * Ensure Algo/Proto decisions are made based on that unified
   state.

Fix FD leak in UDS
 * Fix a leak when mapping buffers intra-node with cumem IPCs.

Fix crash when mixing buffer registration and graph buffer registration.
 * Separate local and graph registration to avoid crashes when we free
   buffers.

Fix user buffer registration with dmabuf
 * Make ncclSend/ncclRecv communication with buffer registration functional
   on network plugins relying on dmabuf for buffer registration.

Fix crash in IB code caused by uninitialized fields.

Fix non-blocking ncclSend/ncclRecv
 * Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking
   mode even though the operation was not enqueued onto the stream.
 * Issue #1495

Various compiler tweaks and fixes
 * PR #758

Fix typo in ncclTopoPrintGraph
 * Issue #1468
2025-01-07 02:01:15 -08:00
Bertan Dogancay dfe4a3ed81 Fix typo in ncclGetKernelIndex macro (#1424) 2024-11-18 10:40:05 -05:00
Bertan Dogancay cb175fb0b3 Template generic kernel for unroll factor (#1419)
* Template generic kernel for unroll factor
2024-11-12 18:27:29 -05:00
darren-amd ebf0417e90 remove undefined computeColl declaration 2024-11-04 13:42:01 -05:00
Bertan Dogancay 373f113524 Dynamically select unroll factor to build for when targeting local arch (#1371)
* Dynamically select unroll factor to build for when targeting local arch only
2024-10-21 10:53:11 -04:00
Wenkai Du 821d2e1f30 Allow zero byte sendrecv in alltoallv (#1349)
* Allow zero byte sendrecv in alltoallv

* Fix previous merge error
2024-10-11 10:40:32 -07:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Sylvain Jeaugey 68b542363f 2.23.4-1
Add scalable init API
 * Add new ncclCommInitRankScalable to allow for passing multiple
   unique IDs to the init function.
 * Spreads the load onto multiple bootstrap roots, allowing for
   constant bootstrap time.
 * Requires multiple ranks to create a unique ID, and the CPU-side
   ID exchange code to call allgather[v] instead of broadcast.

Accelerate init bootstrap operations
 * Reduce the number of calls to allgather.
 * Allow roots to reply early to ranks when information is already
   available.
 * Add an option to use ncclNet instead of sockets to perform
   bootstrap allgather operations.

Add PAT algorithms for Allgather and ReduceScatter
 * Parallel Aggregated Trees, variation of Bruck algorithm.
 * Logarithmic number of network steps for small sizes at scale.
 * Only supports one rank per node at the moment.

Add support for registered buffers for intra-node communication.
 * Allow registered user buffers to be accessed directly intra-node
 * Avoids extra copies in algorithms which permit it, saving
   memory bandwidth and helping with compute overlap.

Add profiler plugin API
 * New plugin API for profiling
 * Supports various levels of profiling, with a hierarchy.

Asynchronous graph allocation
 * Make calls to cudaMalloc and cudaMemcpy during graph allocation
   asynchronous.
 * Significantly speeds up graph capture.

Use fatal IB asynchronous events to stop network operation
 * Avoids many other error messages
 * Only fatal errors are affected; potentially transient errors
   (e.g. port down) do not cause an immediate stop.

Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node
 * P2P would cause a significant performance degradation when using
   many GPUs, and therefore many interleaved data flows.
 * Disable P2P through the CPU when we have 3+ GPUs per node; keep it
   enabled when we only have 2 GPUs.

Improve the init logs to report the real NCCL function.
 * Make the log report ncclCommInitRank or ncclCommSplit, rather than
   the generic ncclCommInitRankFunc.

Add a parameter to set the location of the user configuration file.
 * Add NCCL_CONF_FILE environment variable to set where the user's
   configuration file resides.

Increase default IB timeout
 * Increase IB timeout value from 18 to 20.
 * Should help avoid fatal errors on large RoCE systems.

Add new check for nvidia peermem
 * On linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
   present; check for /sys/module/nvidia_peermem/version instead.

Fix old performance regression when mixing small and large operations.
 * Improves distribution of work on channels.

Fix crash when NUMA IDs are equal to -1.
 * Can happen when a NIC is a virtual NIC, or when linux doesn't
   know which NUMA node a device is attached to
 * Issue NVIDIA/nccl-tests#233

Fix tree graph search when NCCL_CROSS_NIC is set to 1.
 * Would force NCCL to use the balanced_tree pattern, thereby
   disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
 * Would also try to use alternate rings even though it was not
   needed.

Compiler tweaks and fixes
 * PR #1177
 * PR #1228

Fix stack smash
 * PR #1325

Fixes for multi-node NVLink + IB operation

Coverity fixes and comments.
2024-09-16 23:41:17 -07:00
mberenjk db840f024e adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297)
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2024-08-22 12:36:07 -05:00
akolliasAMD d6c317d6ae removed hcc mentions (#1291) 2024-08-14 15:04:13 -06:00
akolliasAMD c246e25f8e gfx12 Disable ll protocol (#1268) 2024-07-26 08:59:55 -06:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
Sylvain Jeaugey 178b6b7590 2.22.3-1
Rework core for NVIDIA Trusted Computing
 * Compress work structs so that they are shared between channels
 * Utilize the full amount of kernel argument space permitted (4k)
   before resorting to work fifo.
 * Rework the task preprocessing phase.
 * Use a separate abortDevFlag which is kept in sync with abortFlag
   using cudaMemcpy operations.
 * Rename src/include/align.h to src/include/bitops.h

Add lazy connection establishment for collective operations
 * Move buffer allocation and connection establishment to the first
   collective operation using that algorithm.
 * Accelerate init time and reduce memory usage.
 * Avoid allocating NVLS buffers if all calls are registered.
 * Compute algo/proto in ncclLaunchCollTasksInfo early on.
 * Connect peers in ncclCollPreconnectFunc if not connected already.
 * Also move shared buffer creation to the first send/recv call.

Accelerate intra-node NVLink detection
 * Make each rank only detect NVLinks attached to its GPU.
 * Fuse XMLs to reconstruct the full NVLink topology

Add init profiling to report time spend in different init phases.
 * Report timings of bootstrap, allgather, search, connect, etc.
 * Add new "PROFILE" category for NCCL_DEBUG_SUBSYS.

Add support for PCI p2p on split PCI switches
 * Detect split PCI switches through a kernel module exposing
   switch information.
 * Update the topology XML and graph to add those inter-switch
   connections.

Add cost estimation API
 * Add a new ncclGroupEndSimulate primitive to return the estimated
   time a group would take.

Net/IB: Add separate traffic class for fifo messages
 * Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages
   independently from NCCL_IB_TC.
   Merges PR #1194

Net/IB: Add support for IB router
 * Use flid instead of lid if subnets do not match
 * Warn if flid is 0

Optimizations and fixes for device network offload (unpack)
 * Double the default number of channels
 * Cache netDeviceType
 * Fix save/increment head logic to enable Tree support.

Support ncclGroupStart/End for ncclCommAbort/Destroy
 * Allow Abort/Destroy to be called within a group when managing
   multiple GPUs with a single process.

Improve Tuner API
 * Provide to the plugin the original cost table so that the plugin
   can leave unknown or disabled algo/proto combinations untouched.
 * Remove nvlsSupport and collnetSupport.

Do not print version to stdout when using a debug file
 * Also print version from all processes with INFO debug level.
   Fixes issue #1271

Fix clang warnings in NVTX headers
 * Update NVTX headers to the latest version
   Fixes issue #1270

Disable port fusion in heterogeneous systems
 * Do not fuse ports if a mix of multi-port and single port are detected.

Fix NVLS graphs search for dual NICs.
 * Fix NVLS graph search when we have more than one NIC per GPU.

Fix crash with collnetDirect
 * Add separate graph search for collnetDirect, testing alltoall paths
   and working similarly to the NVLS search.

Fix hang when nodes have different CPU types
 * Add the CPU type to the rank peer info.
 * Align all ranks on the CPU type after the first allgather.
 * Only use the aligned CPU type for all tuning operations.
   Fixes issue #1136
   Fixes issue #1184

Fix performance of registered send/recv operations
 * Allow for single full size operations
 * Add INFO to confirm the registration of send/recv buffers.

Move all sync ops to finalize stage
 * Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has
   been called.

Improve error reporting during SHM segment creation

Improve support of various compilers
   Merges PR #1177
   Merges PR #1228

Allow net and tuner plugins to be statically linked
 * Search for ncclNet or ncclTuner symbols in the main binary.
   Merges PR #979

Plugin examples includes cleanup
 * Harmonize err.h and common.h usage.
 * Add mixed plugin with both net and tuner.
2024-06-19 01:57:16 -07:00
Nusrat Islam ef442f8f92 set MAXCHANNELS to 128 2024-06-03 13:05:05 -05:00
Nusrat Islam 506f16c506 add 256 channels support 2024-06-03 13:03:18 -05:00
Wenkai Du 73221b4230 Add ring simple chunk size tuning (#1180)
* Add ring simple chunk size tuning

* modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning

* modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
2024-05-29 07:59:47 -07:00
Wenkai Du eeea3b693b Report error when collective is not enabled in build (#1177)
* Report error when collective is not enabled in build

* Fix typo
2024-05-16 10:11:12 -07:00
Wenkai Du ecafc1969c Support WSL2 (#1173) 2024-05-10 07:31:12 -07:00
Wenkai Du cd6e840e0b Add back tree simple chunk size tuning (#1157) 2024-04-28 19:48:53 -07:00
Wenkai Du 9e0c9b4ed8 Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154) 2024-04-25 07:19:18 -07:00
BertanDogancay e1a835910e Merge remote-tracking branch 'nccl/master' into develop 2024-04-23 13:34:00 -07:00
mberenjk 428837ffe4 replacing rccl_bfloat16 with hip_bfloat16 (#1126)
Co-authored-by: mberenjk <mberenjk@amd.com>
2024-04-11 11:30:37 -05:00
Sylvain Jeaugey ab2b89c4c3 2.21.5-1
Add support for IB SHARP 1PPN operation with user buffers.
Improve support for MNNVL, add NVLS support and multi-clique support.
 * Detect the NVLS clique through NVML
 * Exchange XML between peers in the same NVLS clique and fuse XMLs
   before creating the topology graph.
 * Rework bootstrap allgather algorithms to allow for large allgather
   operations intra-node (XML exchange).
Net/IB: add support for dynamic GID detection.
 * Automatically select RoCEv2/IPv4 interface by default. Allow to
   select IPv6 or even the network/mask.
Reduce NVLS memory usage.
 * Add stepSize as property of a connection to allow for different
   sizes on different peers; set it to 128K for NVLink SHARP.
Improve tuner loading
 * Look for more paths, be more consistent with the network device
   plugin.
 * Also search for tuner support inside the net plugin.
Improve tuner API
 * Add context to support multi-device per process.
Add magic number around comm object to detect comm corruption.
 * Add some basic check around communicators so that we can report a
   problem when a communicator gets corrupted or a wrong comm pointer
   is passed to NCCL.
Fix net/IB error path. Github PR #1164
Fix collnet rail mapping with split comm.
Fix packet reordering issue causing bootstrap mismatch
 * Use a different tag in ncclTransportP2pSetup for the connectInfo
   exchange and the following barrier.
Fix hang when crossNic is inconsistent between ranks.
Fix minCompCap/maxCompCap computation. Github issue #1184
2024-04-02 01:53:21 -07:00
Andy li 6777e65c1d Enable fp8 support (#1101)
* initial checkin

* resolve cr comments

* resolve the build issue

* fix the data correctless issue

* update fp8 header file and update the unit test for fp8 support

* remove fp16 from fp8 headers

* fix ut issue and catch up the latest code from develop

* udate according to cr comments

* update ut according to cr comments

* update num floats for each SumPostDiv from 4 to 6

* update fp8 header file name

* fix the typo
2024-03-08 15:17:53 -08:00
Bertan Dogancay b275ed0b56 LL128 check if all XGMI (#1089) 2024-02-21 09:41:40 -07:00
Sylvain Jeaugey b6475625fb 2.20.3-1
Add support for alternating rings, allow for cross-nic rings without
cross-rail communication.
Add support for user buffer registration for network send/recv.
Optimize aggregated operations to better utilize all channels.
Add flattening for BCM PCI gen5 switches.
Add support for inter-node NVLink communication
Add support for port fusion in NET/IB.
Add support for ReduceScatter and AllGather using Collnet.
Update net API to v8.
Fix hang during A2A connection.
2024-02-13 04:22:38 -08:00
BertanDogancay 00fdb1ef51 Clean up 2024-01-31 17:27:15 -08:00
Wenkai Du 1a134b283b Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-01-31 11:53:10 -06:00
BertanDogancay 9ff53eeeae Merge remote-tracking branch 'nccl/master' into develop 2024-01-30 14:43:43 -08:00
Bertan Dogancay 01b359027b Include common.h in enqueue.cc instead (#1067) 2024-01-30 08:24:22 -08:00
BertanDogancay 81ddf9de89 Merge remote-tracking branch 'nccl/v2.19' into develop 2024-01-24 15:25:33 -08:00
Wenkai Du 7e25d5bc55 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case
2024-01-21 19:00:50 -08:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00