rocm-systems

Author	SHA1	Message	Date
Nilesh M Negi	329e13efff	Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667) * Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" This reverts commit `307bc10781`. * Update Changelog --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-04-30 23:33:08 -05:00
BertanDogancay	a6bf9bfc9e	Merge remote-tracking branch 'nccl/master' into develop	2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar	82afb2bcfe	Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628) * Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled * Algo/protocol/max channels can be obtained with the new RCCL API * Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc * Add usage example in topo-explorer tool	2025-04-23 15:44:56 -04:00
Tim	9a55ff60a9	RCCL Replayer update (#1603) RCCL recorder w/ suggested change and UT	2025-04-19 00:21:27 -04:00
Pedram Alizadeh	e40ff4f84a	all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627) * Enabling LL128 by default on MI300 * Add missing CUDACHECK * Adjust BW correction factors to fix the Tree->Ring switching point * Refactor and add ll128 AR logarithmic factor to tuning models * Move RCCL tuning changes to a separate file * Use enum for tunable indexing * Use explicit indexing in tuning models to avoid mismatch issues * Place rcclGetSizePerRank in a function * Remove HIP ifdef for rccl-only call --------- Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>	2025-04-10 11:43:54 -04:00
Mustafa Abduljabbar	4be06f04d8	Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618) * Enable LL/LL128 cutoff points in tuning models * Initializing ll/ll128 model cutoffs for MI300 * Use RCCL_LL_LIMITS_UNDEFINED --------- Co-authored-by: PedramAlizadeh <pmohamma@amd.com>	2025-04-02 16:26:23 -04:00
Bertan Dogancay	532f54c244	Merge pull request #1559 from BertanDogancay/2.23 [SYNC] 2.23.4-1	2025-03-28 17:06:56 -04:00
Nilesh M Negi	307bc10781	[SRC] Enable unroll=1 for gfx950 (#1602) * [SRC] Enable unroll=1 for gfx950 * Fix typo from rebase in generate.py * Support for unroll=1 and gfx90a when building for all GPU targets --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-03-27 18:21:35 -05:00
BertanDogancay	0b2062c560	Merge remote-tracking branch 'nccl/master' into develop	2025-03-27 12:53:04 -05:00
Mustafa Abduljabbar	f67b2cc908	Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 (#1604) * Add reduce_scatter LL and LL128 thresholds * Always honor user choice for protocol	2025-03-17 11:21:01 -04:00
Wenkai Du	245c2de909	Enable LL128 on gfx942 (#1549)	2025-03-16 15:10:05 -07:00
Bertan Dogancay	d88cca3098	[Transport] Fix IntraNet (#1582)	2025-03-04 13:30:36 -05:00
Bertan Dogancay	85eb1f16bc	Use bit reversal based mapping for multi-node (#1572)	2025-02-26 09:48:03 -05:00
Bertan Dogancay	387c973b5d	[P2P] Have connIdx for both send and recv (#1524)	2025-02-04 11:53:20 -05:00
Wenkai Du	a5c6b547a2	Add back opCount and channel ID to debug trace (#1520)	2025-02-03 08:55:27 -08:00
Wenkai Du	caba0bc049	Add HDP flush for gfx940 (#1434) * Fix collective trace * Use nontemporal for st_global * Fix previous commit * Add HDP flush to data receive path * Fix previous commit * Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH * Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH Both are on by default. Turn both off will skip all flush will likely result in data error. * Enable GDR copy by default * Remove GDR flush env var because it is disabled by GDC flush * Output kernel collective trace at comm destroy by default * Limit kernel timeout messages to 100 * Use system relaxed atomic for loadInt * Refine timeout messages and use atomic for setting offset from CPU * Add kernel trace for barrier timeout * Add backup barrier to avoid race in atomicAdd * Use different counters for different warps * Rework barrier implementation * Fix for other GFX * Use __hip_atomic_store and __hip_atomic_load * Fix bug in previous commit * Don't reset barrier values in running kernel * Update trace format * Fix typo * Switch back to hip_atomic_fetch_add * Use same barrier implementation for all GFX * Remove extra threadfence * Turn off HDP flush by default Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush * Remove unnecessary changes from alterative barrier implementation * Added back __threadfence_block * Revert back to threadfence for gfx other than gfx94x	2025-01-31 07:51:10 -08:00
BertanDogancay	36343be84f	Merge remote-tracking branch 'nccl/master' into develop	2025-01-23 12:08:46 -06:00
Sylvain Jeaugey	6aae379278	2.24.3-1 Network user buffer support for collectives * Leverage user buffer registration to achieve zero-copy inter-node communications for Ring, NVLS and Collnet Add RAS subsystem * Create a RAS thread keeping track of all NCCL communicators. * Add a ncclras tool contacting the RAS thread and getting a report. Add fp8 support * Add support for e5m2 and e4m3 8-bit floating point operations. * Use Tree/PAT algorithms when possible for better numerical stability. Add NIC fusion * Add a NET API to ask the network plugin to fuse a set of interfaces together. * Fuse multiple NICs under the same PCI switch as a single, larger NIC. Socket connection failure retry * Retry in case of socket connection failure (unreachable host) * Avoid "Software caused connection abort" errors on retries QP connection failure retry * Retry in case of IB QP connection failure during ibv_modify_qp. NET API improvements * Allow plugins to force a flush in case data and completion ordering is not guaranteed. * Indicate when completion is not needed (e.g. for the LL128 protocol), allowing plugins to skip generating a completion. * Allow for full offload of allgather operations when using one GPU per node. NCCL_ALGO/NCCL_PROTO strict enforcement * Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify ALGO/PROTO filters for each collective operation. * Strictly enforce the ALGO/PROTO filters, no longer fall back on the ring algorithm when the filtering leaves no option and error out instead. Enable CUMEM host allocations * Use cumem functions for host memory allocation by default. Improved profiler plugin API * Avoid dependencies with NCCL includes. * Add information on whether the buffer is registered or not Adjust PAT tuning * Improve transition between PAT and ring at scale. Fix hangs when running with different CPU architectures * Detect when we use a mix of GPU architectures * Ensure Algo/Proto decisions are made based on that unified state. Fix FD leak in UDS * Fix a leak when mapping buffers intra-node with cumem IPCs. Fix crash when mixing buffer registration and graph buffer registration. * Separate local and graph registration to avoid crashes when we free buffers. Fix user buffer registration with dmabuf * Make ncclSend/ncclRecv communication with buffer registration functional on network plugins relying on dmabuf for buffer registration. Fix crash in IB code caused by uninitialized fields. Fix non-blocking ncclSend/ncclRecv * Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking mode even though the operation was not enqueued onto the stream. * Issue #1495 Various compiler tweaks and fixes * PR #758 Fix typo in ncclTopoPrintGraph * Issue #1468	2025-01-07 02:01:15 -08:00
Bertan Dogancay	dfe4a3ed81	Fix typo in ncclGetKernelIndex macro (#1424)	2024-11-18 10:40:05 -05:00
Bertan Dogancay	cb175fb0b3	Template generic kernel for unroll factor (#1419) * Template generic kernel for unroll factor	2024-11-12 18:27:29 -05:00
darren-amd	ebf0417e90	remove undefined computeColl declaration	2024-11-04 13:42:01 -05:00
Bertan Dogancay	373f113524	Dynamically select unroll factor to build for when targeting local arch (#1371) * Dynamically select unroll factor to build for when targeting local arch only	2024-10-21 10:53:11 -04:00
Wenkai Du	821d2e1f30	Allow zero byte sendrecv in alltoallv (#1349) * Allow zero byte sendrecv in alltoallv * Fix previous merge error	2024-10-11 10:40:32 -07:00
BertanDogancay	84081064a0	Merge remote-tracking branch 'nccl/master' into develop	2024-10-02 09:31:25 -05:00
Sylvain Jeaugey	68b542363f	2.23.4-1 Add scalable init API * Add new ncclCommInitRankScalable to allow for passing multiple unique IDs to the init function. * Spreads the load onto multiple bootstrap roots, allowing for constant bootstrap time. * Requires multiple ranks to create a unique ID, and the CPU-side ID exchange code to call allgather[v] instead of broadcast. Accelerate init bootstrap operations * Reduce the number of calls to allgather. * Allow roots to reply early to ranks when information is already available. * Add an option to use ncclNet instead of sockets to perform bootstrap allgather operations. Add PAT algorithms for Allgather and ReduceScatter * Parallel Aggregated Trees, variation of Bruck algorithm. * Logarithmic number of network steps for small sizes at scale. * Only supports one rank per node at the moment. Add support for registered buffers for intra-node communication. * Allow registered user buffers to be accessed directly intra-node * Avoids extra copies in algorithms which permit it, saving memory bandwidth and helping with compute overlap. Add profiler plugin API * New plugin API for profiling * Supports various levels of profiling, with a hierarchy. Asynchronous graph allocation * Make calls to cudaMalloc and cudaMemcpy during graph allocation asynchronous. * Significantly speeds up graph capture. Use fatal IB asynchronous events to stop network operation * Avoids many other error messages * Only fatal errors are affected; potentially transient errors (e.g. port down) do not cause an immediate stop. Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node * P2P would cause a significant performance degradation when using many GPUs, and therefore many interleaved data flows. * Disable P2P through the CPU when we have 3+ GPUs per node; keep it enabled when we only have 2 GPUs. Improve the init logs to report the real NCCL function. * Make the log report ncclCommInitRank or ncclCommSplit, rather than the generic ncclCommInitRankFunc. Add a parameter to set the location of the user configuration file. * Add NCCL_CONF_FILE environment variable to set where the user's configuration file resides. Increase default IB timeout * Increase IB timeout value from 18 to 20. * Should help avoid fatal errors on large RoCE systems. Add new check for nvidia peermem * On linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer present; check for /sys/module/nvidia_peermem/version instead. Fix old performance regression when mixing small and large operations. * Improves distribution of work on channels. Fix crash when NUMA IDs are equal to -1. * Can happen when a NIC is a virtual NIC, or when linux doesn't know which NUMA node a device is attached to * Issue NVIDIA/nccl-tests#233 Fix tree graph search when NCCL_CROSS_NIC is set to 1. * Would force NCCL to use the balanced_tree pattern, thereby disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch. * Would also try to use alternate rings even though it was not needed. Compiler tweaks and fixes * PR #1177 * PR #1228 Fix stack smash * PR #1325 Fixes for multi-node NVLink + IB operation Coverity fixes and comments.	2024-09-16 23:41:17 -07:00
mberenjk	db840f024e	adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297) * adding all nccl apis to api_support to enable rccl tracing by rocprofv3 Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com> Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>	2024-08-22 12:36:07 -05:00
akolliasAMD	d6c317d6ae	removed hcc mentions (#1291)	2024-08-14 15:04:13 -06:00
akolliasAMD	c246e25f8e	gfx12 Disable ll protocol (#1268)	2024-07-26 08:59:55 -06:00
Wenkai Du	89349f2ce4	Template unroll for RCCL kernels (#1250) * Template unroll for RCCL kernels * Adding unroll template arg during CMake hipification * Reduce linking parallel jobs to avoid OOM in CI * Workaround issues with UT tests SWDEV-469533: register spill fix is needed for mainline build LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs Use -parallel-jobs=8 for linking * CI: do not use -j 16 when building * CI: use -j 8 when building * Only reduce parallel linking job for CI extended * Restore original jenkins command. Change parallel linking jobs in cmake * Disable MSCCLPP --------- Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>	2024-07-19 08:15:59 -07:00
Sylvain Jeaugey	178b6b7590	2.22.3-1 Rework core for NVIDIA Trusted Computing * Compress work structs so that they are shared between channels * Utilize the full amount of kernel argument space permitted (4k) before resorting to work fifo. * Rework the task preprocessing phase. * Use a separate abortDevFlag which is kept in sync with abortFlag using cudaMemcpy operations. * Rename src/include/align.h to src/include/bitops.h Add lazy connection establishment for collective operations * Move buffer allocation and connection establishment to the first collective operation using that algorithm. * Accelerate init time and reduce memory usage. * Avoid allocating NVLS buffers if all calls are registered. * Compute algo/proto in ncclLaunchCollTasksInfo early on. * Connect peers in ncclCollPreconnectFunc if not connected already. * Also move shared buffer creation to the first send/recv call. Accelerate intra-node NVLink detection * Make each rank only detect NVLinks attached to its GPU. * Fuse XMLs to reconstruct the full NVLink topology Add init profiling to report time spend in different init phases. * Report timings of bootstrap, allgather, search, connect, etc. * Add new "PROFILE" category for NCCL_DEBUG_SUBSYS. Add support for PCI p2p on split PCI switches * Detect split PCI switches through a kernel module exposing switch information. * Update the topology XML and graph to add those inter-switch connections. Add cost estimation API * Add a new ncclGroupEndSimulate primitive to return the estimated time a group would take. Net/IB: Add separate traffic class for fifo messages * Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages independently from NCCL_IB_TC. Merges PR #1194 Net/IB: Add support for IB router * Use flid instead of lid if subnets do not match * Warn if flid is 0 Optimizations and fixes for device network offload (unpack) * Double the default number of channels * Cache netDeviceType * Fix save/increment head logic to enable Tree support. Support ncclGroupStart/End for ncclCommAbort/Destroy * Allow Abort/Destroy to be called within a group when managing multiple GPUs with a single process. Improve Tuner API * Provide to the plugin the original cost table so that the plugin can leave unknown or disabled algo/proto combinations untouched. * Remove nvlsSupport and collnetSupport. Do not print version to stdout when using a debug file * Also print version from all processes with INFO debug level. Fixes issue #1271 Fix clang warnings in NVTX headers * Update NVTX headers to the latest version Fixes issue #1270 Disable port fusion in heterogeneous systems * Do not fuse ports if a mix of multi-port and single port are detected. Fix NVLS graphs search for dual NICs. * Fix NVLS graph search when we have more than one NIC per GPU. Fix crash with collnetDirect * Add separate graph search for collnetDirect, testing alltoall paths and working similarly to the NVLS search. Fix hang when nodes have different CPU types * Add the CPU type to the rank peer info. * Align all ranks on the CPU type after the first allgather. * Only use the aligned CPU type for all tuning operations. Fixes issue #1136 Fixes issue #1184 Fix performance of registered send/recv operations * Allow for single full size operations * Add INFO to confirm the registration of send/recv buffers. Move all sync ops to finalize stage * Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has been called. Improve error reporting during SHM segment creation Improve support of various compilers Merges PR #1177 Merges PR #1228 Allow net and tuner plugins to be statically linked * Search for ncclNet or ncclTuner symbols in the main binary. Merges PR #979 Plugin examples includes cleanup * Harmonize err.h and common.h usage. * Add mixed plugin with both net and tuner.	2024-06-19 01:57:16 -07:00
Nusrat Islam	ef442f8f92	set MAXCHANNELS to 128	2024-06-03 13:05:05 -05:00
Nusrat Islam	506f16c506	add 256 channels support	2024-06-03 13:03:18 -05:00
Wenkai Du	73221b4230	Add ring simple chunk size tuning (#1180) * Add ring simple chunk size tuning * modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning * modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning --------- Co-authored-by: PedramAlizadeh <pmohamma@amd.com>	2024-05-29 07:59:47 -07:00
Wenkai Du	eeea3b693b	Report error when collective is not enabled in build (#1177) * Report error when collective is not enabled in build * Fix typo	2024-05-16 10:11:12 -07:00
Wenkai Du	ecafc1969c	Support WSL2 (#1173)	2024-05-10 07:31:12 -07:00
Wenkai Du	cd6e840e0b	Add back tree simple chunk size tuning (#1157)	2024-04-28 19:48:53 -07:00
Wenkai Du	9e0c9b4ed8	Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154)	2024-04-25 07:19:18 -07:00
BertanDogancay	e1a835910e	Merge remote-tracking branch 'nccl/master' into develop	2024-04-23 13:34:00 -07:00
mberenjk	428837ffe4	replacing rccl_bfloat16 with hip_bfloat16 (#1126) Co-authored-by: mberenjk <mberenjk@amd.com>	2024-04-11 11:30:37 -05:00
Sylvain Jeaugey	ab2b89c4c3	2.21.5-1 Add support for IB SHARP 1PPN operation with user buffers. Improve support for MNNVL, add NVLS support and multi-clique support. * Detect the NVLS clique through NVML * Exchange XML between peers in the same NVLS clique and fuse XMLs before creating the topology graph. * Rework bootstrap allgather algorithms to allow for large allgather operations intra-node (XML exchange). Net/IB: add support for dynamic GID detection. * Automatically select RoCEv2/IPv4 interface by default. Allow to select IPv6 or even the network/mask. Reduce NVLS memory usage. * Add stepSize as property of a connection to allow for different sizes on different peers; set it to 128K for NVLink SHARP. Improve tuner loading * Look for more paths, be more consistent with the network device plugin. * Also search for tuner support inside the net plugin. Improve tuner API * Add context to support multi-device per process. Add magic number around comm object to detect comm corruption. * Add some basic check around communicators so that we can report a problem when a communicator gets corrupted or a wrong comm pointer is passed to NCCL. Fix net/IB error path. Github PR #1164 Fix collnet rail mapping with split comm. Fix packet reordering issue causing bootstrap mismatch * Use a different tag in ncclTransportP2pSetup for the connectInfo exchange and the following barrier. Fix hang when crossNic is inconsistent between ranks. Fix minCompCap/maxCompCap computation. Github issue #1184	2024-04-02 01:53:21 -07:00
Andy li	6777e65c1d	Enable fp8 support (#1101) * initial checkin * resolve cr comments * resolve the build issue * fix the data correctless issue * update fp8 header file and update the unit test for fp8 support * remove fp16 from fp8 headers * fix ut issue and catch up the latest code from develop * udate according to cr comments * update ut according to cr comments * update num floats for each SumPostDiv from 4 to 6 * update fp8 header file name * fix the typo	2024-03-08 15:17:53 -08:00
Bertan Dogancay	b275ed0b56	LL128 check if all XGMI (#1089)	2024-02-21 09:41:40 -07:00
Sylvain Jeaugey	b6475625fb	2.20.3-1 Add support for alternating rings, allow for cross-nic rings without cross-rail communication. Add support for user buffer registration for network send/recv. Optimize aggregated operations to better utilize all channels. Add flattening for BCM PCI gen5 switches. Add support for inter-node NVLink communication Add support for port fusion in NET/IB. Add support for ReduceScatter and AllGather using Collnet. Update net API to v8. Fix hang during A2A connection.	2024-02-13 04:22:38 -08:00
BertanDogancay	00fdb1ef51	Clean up	2024-01-31 17:27:15 -08:00
Wenkai Du	1a134b283b	Merge remote-tracking branch 'rccl/develop' into 2.19.4	2024-01-31 11:53:10 -06:00
BertanDogancay	9ff53eeeae	Merge remote-tracking branch 'nccl/master' into develop	2024-01-30 14:43:43 -08:00
Bertan Dogancay	01b359027b	Include common.h in enqueue.cc instead (#1067)	2024-01-30 08:24:22 -08:00
BertanDogancay	81ddf9de89	Merge remote-tracking branch 'nccl/v2.19' into develop	2024-01-24 15:25:33 -08:00
Wenkai Du	7e25d5bc55	Use new HIP graph API compatible with CUDA 11030 (#991) * Use new HIP graph API compatible with CUDA 11030 * Update dependency to ROCm 6.1 * Fix single stream use case	2024-01-21 19:00:50 -08:00
Bertan Dogancay	28d9b170c9	[DEV] Configure functions in RCCL (#986) * configure functions in rccl	2024-01-18 15:07:16 -07:00

147 Commits