rocm-systems

Author	SHA1	Message	Date
Bertan Dogancay	e96c8473a1	[DEVICE] Enable PAT algo for RCCL 1ppn (#1756 ) * Enable PAT algo for RCCL 1ppn	2025-07-04 13:45:18 -04:00
ryanhankins	9d35581d5e	Adding #include <dlfcn.h> in nccl_net.h to pass build (#1786 )	2025-07-02 19:21:53 -05:00
Wenkai Du	4640ab19b3	Add support for extended fine grained system memory pool (#1770 ) * Add support for extended fine-grained system memory pool * Use hipHostRegisterUncached * Add "sc0 sc1" flags for LL store on gfx950 * Update after HIP flag is changed to hipExtHostRegisterUncached	2025-07-01 16:38:49 -05:00
Bertan Dogancay	358dc1bc84	Switch to linear channel mapping for 2 nodes (#1777 )	2025-06-28 09:10:18 -05:00
mberenjk	5fb9d8f828	changing the HIP-VERSION to 6.3 to avoid using hip_fp8 for older ROCm versions (#1764 ) Co-authored-by: Marzieh Berenjkoub <mberenjk@.amd.com>	2025-06-26 11:15:01 -05:00
Dingming Wu	020dcf0a7c	Add proxyTrace (#1732 ) This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.	2025-06-25 23:01:34 -05:00
BertanDogancay	aaf023976a	Merge remote-tracking branch 'nccl/master' into develop	2025-06-20 07:54:49 -05:00
Sarat Kamisetty	fa0422f174	generic net plugin ctxt that is extensible for use in multiple APIs (#1735 ) Co-authored-by: Sarat Kamisetty <sakamiset@amd.com>	2025-06-16 14:48:08 -07:00
Tim	ba97c9c18b	replayer update v0 (#1733 ) * First version of new replayer, with comments on future TODOs * plus minor fixes for UT * Updated format of recorder, especially in binary department, according to replayer's need	2025-06-13 15:05:34 -04:00
Richard Barnes	4486d091b8	Enable `-Wdeprecated-copy-with-user-provided-copy` (#1643 )	2025-06-13 08:23:31 -07:00
Arm Patinyasakdikul	6c37ae9470	Added missing copyright message. (#1742 ) * Added missing copyright message. * addressed comments.	2025-06-12 09:58:01 -05:00
corey-derochie-amd	03fba66e71	Deprecated MSCCL API functions (#1740 )	2025-06-11 17:52:09 -06:00
Nilesh M Negi	9d72be7b2f	[DEVICE] Adding ability to choose unroll factor at runtime (#1734 ) * Adding runtime unroll factor selection via RCCL_UNROLL_FACTOR * [BUILD] Add support for user-defined UNROLL for debugging * Update CHANGELOG.md * Fix COLLTRACE errors in CI * Add debug statements for unroll and resolve warnings * Incorporate UNROLL into ONLY_FUNCS for debugging --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com> Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-06-11 00:07:59 -05:00
Arm Patinyasakdikul	ec6efa9b26	Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0. (#1720 ) * Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0. * Create ncclShmemScratchWarpSize on host side for enqueue.cc. * Update src/enqueue.cc Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> * address comments * fix number of threads --------- Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>	2025-06-06 07:34:43 -05:00
Pedram Alizadeh	3f7c08648f	Reapplying PR #1641 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1713 ) * Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)" This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c. * Decreasing NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8	2025-06-04 13:22:11 -04:00
Avinash	e94b360246	SPLITCOMM design fix in src/misc/msccl (#1715 ) * Fix TOC-TOU in mcclInit * Improving vector resize thread safety * Initial commit rank to comm change * Removing unwanted include header changes * Updated CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-06-01 21:00:38 -05:00
alex-breslow-amd	2f6b20c00a	Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681 ) for Single Node on Some GFX9 Systems Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.	2025-05-29 16:17:35 -07:00
Nilesh M Negi	12517a957e	Re-apply unroll=1 and 112 channels for gfx950 (#1706 ) * Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667) This reverts commit `329e13efff`. * Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620) This reverts commit `b17338d164`.	2025-05-28 14:58:10 -05:00
PedramAlizadeh	7f878baef0	Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641 )" This reverts commit `00c1eb098c`.	2025-05-21 20:21:27 -05:00
corey-derochie-amd	170acf3bda	Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues. (#1546 ) * Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …" This reverts commit `824b81c034`. * [UT] Modify max stack size to 496 * adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION * addressing the ci failure * Adding the device tag --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-05-14 15:33:03 -05:00
Mustafa Abduljabbar	00c1eb098c	[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641 ) * Update LL128 elems per thread * Precompute ix[g] in LL128 prim * Make Threadthreshold part of tuning models * Ignore channel tuning when channels are env controlled * Tune LL128 max limit for AG * Tune LL128 max limit for RS * Retune AR LL128 limits due to changes * Update CHANGELOG.md --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-05-14 14:35:54 -05:00
Avinash	5f6805b4f4	RCCL Multinode DMA Buffer crash fix (#1682 ) This commit handles DMABUF initialization and call appropriate handling function. This fixes crash in OS with no peermem support and relying on only DMABUF. * Initial test commit * Handling Dmabuf_fd opening and closing * Cleanup * Use DMABuff or Peermem as needed * Using user input for ibDmaBufSupportInitOnce * Revert all changes to rocmwrap.cc * Revert all changes to rocmwrap.cc * Changing to func definition braces * Reverting line removal in utils.h * useDmaBuf to calculate flushEnabled	2025-05-08 19:17:39 -05:00
Bertan Dogancay	590ad6acc2	Merge pull request #1662 from BertanDogancay/2.25 [SYNC] 2.25.1-1	2025-05-06 09:39:09 -04:00
Mustafa Abduljabbar	f3f3336468	Fix topo explorer's compatibility with NCCL 2.24 (#1671 ) * Fix build issues * Fix failure to find path remote rank	2025-05-05 15:26:29 -04:00
Nilesh M Negi	329e13efff	Revert "[SRC] Enable unroll=1 for gfx950 (#1602 )" (#1667 ) * Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" This reverts commit `307bc10781`. * Update Changelog --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-04-30 23:33:08 -05:00
BertanDogancay	cb6e23ae67	Merge remote-tracking branch 'nccl/master' into develop	2025-04-30 13:31:41 -05:00
BertanDogancay	a6bf9bfc9e	Merge remote-tracking branch 'nccl/master' into develop	2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar	82afb2bcfe	Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628 ) * Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled * Algo/protocol/max channels can be obtained with the new RCCL API * Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc * Add usage example in topo-explorer tool	2025-04-23 15:44:56 -04:00
Kamil Iskra	0524aef7a0	NCCL 2.26.3-1 Minimize the performance impact of the device kernel profiling support when the profiler plugin is not loaded. Reduce the overheads of CUDA graph capturing, which increased in NCCL 2.26.2 for large graphs. Fix the exchange of enhanced connection establishment (ECE) options to address potential slowdowns on networks utilizing RoCE. Test if cuMem host allocations work and if not, disable them. Enabled by default since NCCL 2.24 if the CUDA driver version is at least 12.6, such allocations rely on NUMA support, which is by default not available under Docker. We recommend invoking Docker with "--cap-add SYS_NICE" to enable it. Fix an initialization error when running with NCCL_NET_GDR_C2C=1 on multiple MNNVL domains with non-uniform network configurations across nodes. Fix the printing of sub-seconds in the debug log when using a custom NCCL_DEBUG_TIMESTAMP_FORMAT setting.	2025-04-22 13:50:40 -07:00
Bertan Dogancay	ac8ec4c08c	Fix NPKit for SendRecv (#1651 )	2025-04-21 12:34:47 -04:00
Tim	9a55ff60a9	RCCL Replayer update (#1603 ) RCCL recorder w/ suggested change and UT	2025-04-19 00:21:27 -04:00
Pedram Alizadeh	e40ff4f84a	all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627 ) * Enabling LL128 by default on MI300 * Add missing CUDACHECK * Adjust BW correction factors to fix the Tree->Ring switching point * Refactor and add ll128 AR logarithmic factor to tuning models * Move RCCL tuning changes to a separate file * Use enum for tunable indexing * Use explicit indexing in tuning models to avoid mismatch issues * Place rcclGetSizePerRank in a function * Remove HIP ifdef for rccl-only call --------- Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>	2025-04-10 11:43:54 -04:00
Mustafa Abduljabbar	4be06f04d8	Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618 ) * Enable LL/LL128 cutoff points in tuning models * Initializing ll/ll128 model cutoffs for MI300 * Use RCCL_LL_LIMITS_UNDEFINED --------- Co-authored-by: PedramAlizadeh <pmohamma@amd.com>	2025-04-02 16:26:23 -04:00
Bertan Dogancay	532f54c244	Merge pull request #1559 from BertanDogancay/2.23 [SYNC] 2.23.4-1	2025-03-28 17:06:56 -04:00
Nilesh M Negi	307bc10781	[SRC] Enable unroll=1 for gfx950 (#1602 ) * [SRC] Enable unroll=1 for gfx950 * Fix typo from rebase in generate.py * Support for unroll=1 and gfx90a when building for all GPU targets --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-03-27 18:21:35 -05:00
BertanDogancay	0b2062c560	Merge remote-tracking branch 'nccl/master' into develop	2025-03-27 12:53:04 -05:00
Wenkai Du	90ad586d94	Add fault injection of starting warps with random variations (#1593 ) * Add fault injection of starting warps with random variations This is done by inserting randomly delays after __syncthreads(). The feature can be turned off by FAULT_INJECTION=OFF in cmake. * Remove manually introduced bug for demo purpose * Use only one thread per warp for checking wall clock	2025-03-20 16:11:43 -07:00
corey-derochie-amd	6505639cf4	removed gfx940 and gfx941 (#1606 ) * removed gfx940 and gfx941 * removed gfx940 and gfx941 * Update "gfx94" to "gfx942" in init.cc * Updated remaining "gfx94" updates to "gfx942" * Update filenames and variables from gfx940 to gfx942 --------- Co-authored-by: akolliasAMD <akollias@amd.com>	2025-03-20 09:34:53 -06:00
Kamil Iskra	f44ac759fe	NCCL 2.26.2-1 Profiler improvements * Add events for CUDA kernel start and end. * Allow network plugins to generate profiling events * Enable profiling on a per-operation basis, rather than per-communicator. * Add support for graph capturing. Add implicit launch order * Allow to prevent deadlocks when using multiple NCCL communicators per device by implicitly ordering NCCL operations using the host program order. Disabled by default, set NCCL_LAUNCH_ORDER_IMPLICIT=1 to enable. * Add a complementary mechanism to detect host threads racing to launch to the same device. Enabled by default, set NCCL_LAUNCH_RACE_FATAL=0 to disable. Optimize the PAT algorithm * Separate the computation and execution of PAT steps on different warps, allowing to run up to 16 PAT steps in parallel to significantly accelerate PAT and reduce its linear part. Add support for setting QoS per communicator * Add a new trafficClass field to the communicator configuration, to allow the application to select a particular traffic class for a given communicator. The meaning of the traffic class is network-specific and should be set in accordance with the network configuration. * For the IB/RoCE plugin, existing config variables such as NCCL_IB_SL and NCCL_IB_TC take precedence. Allow to enable GPU Direct RDMA specifically on C2C platforms * Disabled by default, set NCCL_NET_GDR_C2C=1 to enable. Do not disable user buffer registration unless PXN is really used * Only disable UB when a communicator has more than one rank per node on any node. RAS subsystem improvements * Report operation counts separately for each collective operation type. * Provide details about missing communicator ranks and reliably distinguish ranks that are no longer a given communicator's members (now reported as NOCOMM) from those that failed to respond. 
Add support for timestamps to NCCL diagnostic messages * On by default for WARN messages; NCCL_DEBUG_TIMESTAMP_LEVELS can be used to enable them for other debug levels as well. * The format can be changed using the NCCL_DEBUG_TIMESTAMP_FORMAT config variable. Reduce the memory usage with NVLink SHARP (NVLS) * Potentially save hundreds of MBs of device memory, considering the multicast buffer size granularity separately from the address alignment. Update performance tuning for recent Intel CPUs * Improve algorithm/protocol selection on recent CPUs such as Emerald Rapids and Sapphire Rapids. Improve channel scheduling when mixing LL and Simple operations. * Make LL operations account for 4x more traffic to ensure LL and simple operations complete at the same time. Refactor the plugin code * Clean up and harmonize the support code across the network, tuner, and profiler plugins. Add support for comment lines (starting with #) in the nccl.conf file * Issue #1540. Make user buffer registration problems print an INFO instead of a WARN. Drop support for network plugin interface version 5. Fix a race condition with split-shared communicators * NCCL could hang during connection setup if multiple communicators were grouped together that share resources. Fix a performance regression when using NCCL_CROSS_NIC=1 * NCCL would unnecessarily alternate rings, breaking the GPU-NIC associations. Make GID index detection code more resilient * Dynamic GID detection code was giving up too soon if the detected index was not available (e.g., wasn't mapped to the container's sysfs). * Issues #1538, #1573. Fix a race condition with non-blocking operation * Fix issue when creating a non-blocking communicator after a non-blocking collective operation on another communicator. Fix shared memory usage on recent Blackwell GPUs. * Issues NVIDIA/nccl-tests#287, NVIDIA/nccl-tests#291, #1637.
Fix an error with NIC fusion and IB SHARP when recreating communicators * Disable the unloading of network plugins Make the auto-merge failures in the NIC fusion non-fatal * This could happen when trying to merge IB and RoCE devices. Fixes to ncclCommAbort * Fix hangs due to the progress thread spinning indefinitely on the network progress. * Reduce the abort time by up to two orders of magnitude. Fix a crash when libnccl.so was dynamically unloaded * The RAS subsystem was missing a clean-up handler. Fix a hang if the network plugin's test() call returns an error. Fix a hang on heterogeneous architectures * Ensure we harmonize the tuning to avoid different tuning choices, causing a hang. Fix double-free on failed ncclCommInitRank and ncclCommFinalize. Fix a potential list traversal bug during a group launch of multiple communicators * Issue #1599. Unify the handling of NCCL configuration variables * Under rare circumstances, some variables specified in the config file could be ignored.	2025-03-12 13:46:21 -07:00
Nusrat Islam	ac823818aa	misc/msccl: force use of mscclpp (#1581 )	2025-03-04 12:48:59 -06:00
Wenkai Du	f957c4fe22	NPKit: enable reduce scatter profiling (#1580 )	2025-03-04 10:03:56 -08:00
Bertan Dogancay	85eb1f16bc	Use bit reversal based mapping for multi-node (#1572 )	2025-02-26 09:48:03 -05:00
Pedram Alizadeh	f268553ee4	enable building rccl for gfx950 (#1571 )	2025-02-25 16:13:48 -05:00
Wenkai Du	32dc7ef47c	Enable GDRCopy only on gfx94x (#1550 ) * Enable GDRCopy only on gfx94x * Use cudaFree instead of hipFree * Add warning if failed to get device property * Remove extra return	2025-02-17 13:28:19 -08:00
Pedram Alizadeh	0e5f4d0662	reverting the (Reduce NPKit latency overhead in MSCCL kernel) PR #893 (#1525 )	2025-02-14 11:03:43 -05:00
corey-derochie-amd	824b81c034	Revert "replacing rccl_float8 with hip_fp8 and address compatibility issue (#…" (#1545 ) This reverts commit `d437d6e41c`.	2025-02-13 10:00:22 -07:00
mberenjk	d437d6e41c	replacing rccl_float8 with hip_fp8 and address compatibility issue (#1538 ) * replacing rccl_float8 with hip_fp8 and address compatibility issue with gfx942 --------- Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-02-13 10:34:17 -06:00
Wenkai Du	ebf7e2305e	Print KL/CL/KE events for all warps (#1544 ) * Print KL/CL/KE events for all warps * Fix count off-by-one issue * Fix opCount in KE and restore CPU thread option * Simplify count calculation	2025-02-12 13:36:31 -08:00
Bertan Dogancay	387c973b5d	[P2P] Have connIdx for both send and recv (#1524 )	2025-02-04 11:53:20 -05:00
Wenkai Du	a5c6b547a2	Add back opCount and channel ID to debug trace (#1520 )	2025-02-03 08:55:27 -08:00


311 commits