rocm-systems

Author	SHA1	Message	Date
alex-breslow-amd	2f6b20c00a	Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681 ) for Single Node on Some GFX9 Systems Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.	2025-05-29 16:17:35 -07:00
Nilesh M Negi	12517a957e	Re-apply unroll=1 and 112 channels for gfx950 (#1706 ) * Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667) This reverts commit `329e13efff`. * Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620) This reverts commit `b17338d164`.	2025-05-28 14:58:10 -05:00
Dingming Wu	51f87fbb43	Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683 ) * Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() For rocm older than 6.4, we need to set HSA_NO_SCRATCH_RECLAIM=1 to use LL128 protocol. This Env is set outside of RCCL, add the logging to detect whether its set during runtime. * check hip runtime ver via hipRuntimeGetVersion * move the detection to ncclinit func * correct rocm version integer * update warning message * avoid unnecessary info msg on hsa_no_scratch_reclaim detection	2025-05-14 10:12:45 -05:00
Avinash	c54a0c085a	collective trace improvements for debugging (#1661 )	2025-05-07 13:37:31 -05:00
Bertan Dogancay	590ad6acc2	Merge pull request #1662 from BertanDogancay/2.25 [SYNC] 2.25.1-1	2025-05-06 09:39:09 -04:00
Nilesh M Negi	329e13efff	Revert "[SRC] Enable unroll=1 for gfx950 (#1602 )" (#1667 ) * Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" This reverts commit `307bc10781`. * Update Changelog --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-04-30 23:33:08 -05:00
BertanDogancay	cb6e23ae67	Merge remote-tracking branch 'nccl/master' into develop	2025-04-30 13:31:41 -05:00
BertanDogancay	a6bf9bfc9e	Merge remote-tracking branch 'nccl/master' into develop	2025-04-23 20:47:43 -07:00
Tim	9a55ff60a9	RCCL Replayer update (#1603 ) RCCL recorder w/ suggested change and UT	2025-04-19 00:21:27 -04:00
Nusrat Islam	f20c33effd	Fix MSCCLPP accuracy issue for allreduce7 (#1634 ) * ext-src: fix a graph-mode bug in allreduce7 * change MSCCLPP threshold to 16MB * ext-src: change message size threshold for allreduce7 * ext-src: address review comments	2025-04-18 08:54:32 -05:00
Nilesh M Negi	b17338d164	Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596 )" (#1620 ) * Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" This reverts commit `1df73e209e`. * [DOC] Update Changelog * [DOC] Update CHANGELOG	2025-03-28 17:57:06 -05:00
Bertan Dogancay	532f54c244	Merge pull request #1559 from BertanDogancay/2.23 [SYNC] 2.23.4-1	2025-03-28 17:06:56 -04:00
Nilesh M Negi	307bc10781	[SRC] Enable unroll=1 for gfx950 (#1602 ) * [SRC] Enable unroll=1 for gfx950 * Fix typo from rebase in generate.py * Support for unroll=1 and gfx90a when building for all GPU targets --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-03-27 18:21:35 -05:00
BertanDogancay	0b2062c560	Merge remote-tracking branch 'nccl/master' into develop	2025-03-27 12:53:04 -05:00
isaki001	9dc23d9265	Disable mscclpp (#1614 ) * disable mscclpp by default	2025-03-25 15:21:16 -05:00
gilbertlee-amd	626dc50ab5	Removing the experimental clique kernel files (#1610 )	2025-03-20 18:10:01 -06:00
Wenkai Du	90ad586d94	Add fault injection of starting warps with random variations (#1593 ) * Add fault injection of starting warps with random variations This is done by inserting randomly delays after __syncthreads(). The feature can be turned off by FAULT_INJECTION=OFF in cmake. * Remove manually introduced bug for demo purpose * Use only one thread per warp for checking wall clock	2025-03-20 16:11:43 -07:00
corey-derochie-amd	6505639cf4	removed gfx940 and gfx941 (#1606 ) * removed gfx940 and gfx941 * removed gfx940 and gfx941 * Update "gfx94" to "gfx942" in init.cc * Updated remaining "gfx94" updates to "gfx942" * Update filenames and variables from gfx940 to gfx942 --------- Co-authored-by: akolliasAMD <akollias@amd.com>	2025-03-20 09:34:53 -06:00
Wenkai Du	bd0092e8f1	GDRCOPY support: Off by default (#1605 )	2025-03-18 08:17:01 -07:00
Nilesh M Negi	1df73e209e	[GRAPH] Increase default nChannels to 112 for gfx950 (#1596 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-03-14 14:47:03 -07:00
Nusrat Islam	ac823818aa	misc/msccl: force use of mscclpp (#1581 )	2025-03-04 12:48:59 -06:00
Bertan Dogancay	d88cca3098	[Transport] Fix IntraNet (#1582 )	2025-03-04 13:30:36 -05:00
Bertan Dogancay	85eb1f16bc	Use bit reversal based mapping for multi-node (#1572 )	2025-02-26 09:48:03 -05:00
Pedram Alizadeh	f268553ee4	enable building rccl for gfx950 (#1571 )	2025-02-25 16:13:48 -05:00
Wenkai Du	ebf7e2305e	Print KL/CL/KE events for all warps (#1544 ) * Print KL/CL/KE events for all warps * Fix count off-by-one issue * Fix opCount in KE and restore CPU thread option * Simplify count calculation	2025-02-12 13:36:31 -08:00
Wenkai Du	f5b15f27a9	Move collective trace to HBM and fix log issue (#1542 )	2025-02-11 11:40:14 -08:00
Bertan Dogancay	387c973b5d	[P2P] Have connIdx for both send and recv (#1524 )	2025-02-04 11:53:20 -05:00
Wenkai Du	a5c6b547a2	Add back opCount and channel ID to debug trace (#1520 )	2025-02-03 08:55:27 -08:00
Wenkai Du	caba0bc049	Add HDP flush for gfx940 (#1434 ) * Fix collective trace * Use nontemporal for st_global * Fix previous commit * Add HDP flush to data receive path * Fix previous commit * Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH * Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH Both are on by default. Turn both off will skip all flush will likely result in data error. * Enable GDR copy by default * Remove GDR flush env var because it is disabled by GDC flush * Output kernel collective trace at comm destroy by default * Limit kernel timeout messages to 100 * Use system relaxed atomic for loadInt * Refine timeout messages and use atomic for setting offset from CPU * Add kernel trace for barrier timeout * Add backup barrier to avoid race in atomicAdd * Use different counters for different warps * Rework barrier implementation * Fix for other GFX * Use __hip_atomic_store and __hip_atomic_load * Fix bug in previous commit * Don't reset barrier values in running kernel * Update trace format * Fix typo * Switch back to hip_atomic_fetch_add * Use same barrier implementation for all GFX * Remove extra threadfence * Turn off HDP flush by default Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush * Remove unnecessary changes from alterative barrier implementation * Added back __threadfence_block * Revert back to threadfence for gfx other than gfx94x	2025-01-31 07:51:10 -08:00
Sylvain Jeaugey	80f6bda437	NCCL 2.25.1-1 Add Blackwell/SM100 support * Add compilation for sm100 * Add graph search speeds for Blackwell * Optimize graph search to converge on large NVLink domains * Limit NVLS heads to 32 * Increase various limits to fit large NVLink domains * Add extra checks for IMEX setup, needed for MNNVL * Increase MAXCHANNELS to 64 Extend NVTX instrumentation to track NCCL communicators * Add communicator ID to NVTX traces to allow for correlation between ranks. RAS fixes	2025-01-27 03:33:57 -08:00
BertanDogancay	36343be84f	Merge remote-tracking branch 'nccl/master' into develop	2025-01-23 12:08:46 -06:00
Dingming Wu	69d0134ed2	improving kernel traces on opCount bits and adding channelId in ncclCollTrace (#1485 )	2025-01-10 07:57:46 -08:00
Sylvain Jeaugey	6aae379278	2.24.3-1 Network user buffer support for collectives * Leverage user buffer registration to achieve zero-copy inter-node communications for Ring, NVLS and Collnet Add RAS subsystem * Create a RAS thread keeping track of all NCCL communicators. * Add a ncclras tool contacting the RAS thread and getting a report. Add fp8 support * Add support for e5m2 and e4m3 8-bit floating point operations. * Use Tree/PAT algorithms when possible for better numerical stability. Add NIC fusion * Add a NET API to ask the network plugin to fuse a set of interfaces together. * Fuse multiple NICs under the same PCI switch as a single, larger NIC. Socket connection failure retry * Retry in case of socket connection failure (unreachable host) * Avoid "Software caused connection abort" errors on retries QP connection failure retry * Retry in case of IB QP connection failure during ibv_modify_qp. NET API improvements * Allow plugins to force a flush in case data and completion ordering is not guaranteed. * Indicate when completion is not needed (e.g. for the LL128 protocol), allowing plugins to skip generating a completion. * Allow for full offload of allgather operations when using one GPU per node. NCCL_ALGO/NCCL_PROTO strict enforcement * Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify ALGO/PROTO filters for each collective operation. * Strictly enforce the ALGO/PROTO filters, no longer fall back on the ring algorithm when the filtering leaves no option and error out instead. Enable CUMEM host allocations * Use cumem functions for host memory allocation by default. Improved profiler plugin API * Avoid dependencies with NCCL includes. * Add information on whether the buffer is registered or not Adjust PAT tuning * Improve transition between PAT and ring at scale. Fix hangs when running with different CPU architectures * Detect when we use a mix of GPU architectures * Ensure Algo/Proto decisions are made based on that unified state. 
Fix FD leak in UDS * Fix a leak when mapping buffers intra-node with cumem IPCs. Fix crash when mixing buffer registration and graph buffer registration. * Separate local and graph registration to avoid crashes when we free buffers. Fix user buffer registration with dmabuf * Make ncclSend/ncclRecv communication with buffer registration functional on network plugins relying on dmabuf for buffer registration. Fix crash in IB code caused by uninitialized fields. Fix non-blocking ncclSend/ncclRecv * Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking mode even though the operation was not enqueued onto the stream. * Issue #1495 Various compiler tweaks and fixes * PR #758 Fix typo in ncclTopoPrintGraph * Issue #1468	2025-01-07 02:01:15 -08:00
Mustafa Abduljabbar	e6b179d627	Remove unneeded highestTransportType (#1461 )	2024-12-16 13:28:47 -05:00
Bertan Dogancay	cb175fb0b3	Template generic kernel for unroll factor (#1419 ) * Template generic kernel for unroll factor	2024-11-12 18:27:29 -05:00
Bertan Dogancay	373f113524	Dynamically select unroll factor to build for when targeting local arch (#1371 ) * Dynamically select unroll factor to build for when targeting local arch only	2024-10-21 10:53:11 -04:00
Arm Patinyasakdikul	133ea201cf	Increase default number of channels for MI300A in multi-node scenario. (#1366 ) This commit changed the default of channels of MI300A from 8 upto 24. This helps bring up multi-node performance to the expected level.	2024-10-11 11:37:48 -05:00
Wenkai Du	b55b6be0cb	Fix crash when PXN is enabled on some platforms (#1369 )	2024-10-11 09:02:59 -07:00
BertanDogancay	84081064a0	Merge remote-tracking branch 'nccl/master' into develop	2024-10-02 09:31:25 -05:00
Sylvain Jeaugey	68b542363f	2.23.4-1 Add scalable init API * Add new ncclCommInitRankScalable to allow for passing multiple unique IDs to the init function. * Spreads the load onto multiple bootstrap roots, allowing for constant bootstrap time. * Requires multiple ranks to create a unique ID, and the CPU-side ID exchange code to call allgather[v] instead of broadcast. Accelerate init bootstrap operations * Reduce the number of calls to allgather. * Allow roots to reply early to ranks when information is already available. * Add an option to use ncclNet instead of sockets to perform bootstrap allgather operations. Add PAT algorithms for Allgather and ReduceScatter * Parallel Aggregated Trees, variation of Bruck algorithm. * Logarithmic number of network steps for small sizes at scale. * Only supports one rank per node at the moment. Add support for registered buffers for intra-node communication. * Allow registered user buffers to be accessed directly intra-node * Avoids extra copies in algorithms which permit it, saving memory bandwidth and helping with compute overlap. Add profiler plugin API * New plugin API for profiling * Supports various levels of profiling, with a hierarchy. Asynchronous graph allocation * Make calls to cudaMalloc and cudaMemcpy during graph allocation asynchronous. * Significantly speeds up graph capture. Use fatal IB asynchronous events to stop network operation * Avoids many other error messages * Only fatal errors are affected; potentially transient errors (e.g. port down) do not cause an immediate stop. Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node * P2P would cause a significant performance degradation when using many GPUs, and therefore many interleaved data flows. * Disable P2P through the CPU when we have 3+ GPUs per node; keep it enabled when we only have 2 GPUs. Improve the init logs to report the real NCCL function. 
* Make the log report ncclCommInitRank or ncclCommSplit, rather than the generic ncclCommInitRankFunc. Add a parameter to set the location of the user configuration file. * Add NCCL_CONF_FILE environment variable to set where the user's configuration file resides. Increase default IB timeout * Increase IB timeout value from 18 to 20. * Should help avoid fatal errors on large RoCE systems. Add new check for nvidia peermem * On linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer present; check for /sys/module/nvidia_peermem/version instead. Fix old performance regression when mixing small and large operations. * Improves distribution of work on channels. Fix crash when NUMA IDs are equal to -1. * Can happen when a NIC is a virtual NIC, or when linux doesn't know which NUMA node a device is attached to * Issue NVIDIA/nccl-tests#233 Fix tree graph search when NCCL_CROSS_NIC is set to 1. * Would force NCCL to use the balanced_tree pattern, thereby disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch. * Would also try to use alternate rings even though it was not needed. Compiler tweaks and fixes * PR #1177 * PR #1228 Fix stack smash * PR #1325 Fixes for multi-node NVLink + IB operation Coverity fixes and comments.	2024-09-16 23:41:17 -07:00
corey-derochie-amd	853a0586b4	Moved `mscclpp_ncclGetUniqueId` call into `ncclCommInitRankFunc` (#1332 ) * Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid. * Checking `mscclEnabled` for the process and the topology to gate MSCCL++. * Allowed `mscclForceEnable` to enable MSCCL++.	2024-09-16 16:41:40 -06:00
corey-derochie-amd	736a705875	Re-enabled MSCCL++ (#1325 ) * Added restrictions around calling MSCCL++ collectives (#1281) * Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather. * Renamed and refactored some mscclpp types. * Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging. * Disable MSCCL++ when using managed memory buffers as it isn't supported. * Added datatype and op constraints for MSCCL++ AllReduce. * Added documentation on MSCCL++ restrictions to the README. * [BUILD] Support custom CMake flags in MSCCLPP (#1275) * [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [BUILD] CMake flags to support build-id in MSCCLPP Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [BUILD] Fix CMake warnings in MSCCLPP build Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them. --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> Co-authored-by: Corey Derochie <corey.derochie@amd.com> * Link to libmscclpp_nccl statically (#1282) * Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions. * Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled. * `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt. * Removed IBVerbs dependency for integrating with MSCCL++ (#1313) * Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. 
Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294) * Include mscclpp as a git submodule (#1314) * Added the desired mscclpp commit as a git submodule. * Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively. * Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule. * Enabled MSCCL++ feature build. --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>	2024-09-11 09:55:16 -06:00
mberenjk	db840f024e	adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297 ) * adding all nccl apis to api_support to enable rccl tracing by rocprofv3 Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com> Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>	2024-08-22 12:36:07 -05:00
akolliasAMD	d6c317d6ae	removed hcc mentions (#1291 )	2024-08-14 15:04:13 -06:00
Tim	4200964202	Adding core binding in info (#1212 ) Signed-off-by: AtlantaPepsi <timhu102@amd.com>	2024-08-08 11:36:24 -04:00
corey-derochie-amd	b31b4082dd	Only initialize MSCCL++ when runtime-enabled. (#1266 )	2024-07-22 00:41:31 -06:00
corey-derochie-amd	9cbb3da224	Only enable MSCCL++ AllReduce for message sizes that are multiples 32 (#1253 ) * Only enable MSCCL++ AllReduce for message sizes that are multiples of 32. MSCCL++ does not handle these other sizes. * Sanitized MSCCL++ logging.	2024-07-12 17:04:23 -07:00
corey-derochie-amd	6dc47eecd7	Integrated RCCL with MSCCL++ for small message sizes (#1231 )	2024-07-12 15:32:58 -06:00
Rahul Vaidya	c755b9cf93	Improved version reporting in NCCL_DEBUG=VERSION (#1232 ) * Improved version reporting in NCCL_DEBUG=VERSION. Signed-off-by: rahulvaidya20 <ravaidya@amd.com> * Version reporting changes Signed-off-by: rahulvaidya20 <ravaidya@amd.com> * Versioning changes: Initialized char arrays to null and fixed typo. --------- Signed-off-by: rahulvaidya20 <ravaidya@amd.com>	2024-07-12 08:14:29 -05:00
corey-derochie-amd	0c36d571ea	Enable multi-threading for MSCCL (#1203 ) MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.	2024-07-04 09:34:38 -06:00

246 commits